Copyright laws shouldn't apply to AI training, proposes Google
Google has told regulators in Australia that its AI systems should be able to train on any data, unless the copyright owners expressly opt out.
If generative artificial intelligence (AI) is to be useful, it needs enormous amounts of data to train on, and if you're using that data, you should pay for it. Unless you're Google, in which case you think you're an exception -- and you have a record of trying to bully your way out of paying.
According to The Guardian, Google has presented a case to Australian regulators that it be allowed to do what it wants and, okay, maybe publishers should be able to say no. But that's on the publishers, not Google.
Google's submission, seen by The Guardian, calls for Australia to adopt "copyright systems that enable appropriate and fair use of copyrighted content to enable the training of AI models in Australia on a broad and diverse range of data while supporting workable opt-outs for entities that prefer their data not to be trained in using AI systems."
Reportedly, this is similar to arguments Google has presented to Australia before, except that it has added what appears to be a reasonable opt-out.
However, requiring publishers to explicitly opt out of AI training on their data means publishers would first have to know that their work is being mined at all. And since the regulators are drawing up rules for all AI providers, it may not even be possible to prove whether a given company has actually stopped mining the data.
What's more, an AI company could in theory drag out the process so that, by the time it stops mining, it has already used all of the data for its training.
According to The Guardian, Google has not specifically said how it believes such a system could work.
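By analogy with how search crawlers already honor robots.txt, one conceivable shape for such an opt-out is a crawl directive that AI trainers agree to respect. Here is a minimal sketch using Python's standard library; the "AI-Training" user-agent token is purely hypothetical, since neither Google nor the regulators have specified any mechanism:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt-style opt-out. The "AI-Training" token is
# invented for illustration; no such standard token exists yet.
robots_lines = [
    "User-agent: AI-Training",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# A compliant AI trainer would be refused; an ordinary crawler would not.
print(rp.can_fetch("AI-Training", "https://example.com/article"))  # False
print(rp.can_fetch("SearchBot", "https://example.com/article"))    # True
```

Even then, this only works if every trainer checks the file and every publisher knows to write it -- which is exactly the objection above.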
Separately, Google is one of a consortium of Big Tech firms based in the US that has recently pledged to establish best practices for the AI industry.
Comments
Anyone can be influenced by images, sound, text. So let AI do so.
If AI "copies" like a human, than sue the owners of the AI like a human. Fair use still applies. Sell it for profit, get sued.
No shields are needed. No restricting laws.
Maybe the problem will be that AI can flood the world with "inspirations" and derivative works, thus diluting a unique artist's style or technique? That's probably an infringement.
Force credit where credit is due. Maybe that's the only law needed. Credit to human artists when AI is trained on them.
When AI generates something original, I'll be interested. Otherwise, AI is just a copying app.
In order for a human to get to where we are today, we had to read Dr Seuss books (or something similar) to learn English, but we don't have to pay Dr Seuss' estate every time we do something that makes money by exploiting our knowledge of the English language. We might even forget the words to the Seuss books, but the principles of the language that we learned from those books remain in our heads. "Language principles" extracted from any data should not be copyrightable, while specific data should be copyrightable.
If you want to argue that the Google search engine violates copyright because it actually requires 10 exabytes of data (stored on Google's server farms) that it obtained from crawling the Internet, I could probably agree with that. But I can't see how a puny 50 GB file or anything that small could be a violation of anyone's copyright. You can't compress the entire Internet into a file that puny, so it can't violate anyone's copyright.
The reason most people can't run large language models on their local computers is that the "small 50 GB file" has to fit in local memory (RAM), and most users don't have that much RAM. The reason it needs to fit in memory is that every byte of the file has to be accessed about 50 times in order to generate the answer to any question. If the file was stored on disk, it could take hours or weeks to calculate a single answer.
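To put rough numbers on that, here is a back-of-the-envelope sketch; the bandwidth figures are typical assumed values, not benchmarks:

```python
# Time to generate one answer from 50 GB of weights, by storage medium.
# Bandwidths below are illustrative assumptions, not measurements.
WEIGHTS_BYTES = 50e9        # the "small 50 GB file"
PASSES_PER_ANSWER = 50      # every byte read ~50 times, per the figure above

media = {
    "RAM": 50e9,            # ~50 GB/s DRAM (assumed)
    "SSD": 2e9,             # ~2 GB/s NVMe (assumed)
    "HDD": 150e6,           # ~150 MB/s spinning disk (assumed)
}

for name, bandwidth in media.items():
    seconds = PASSES_PER_ANSWER * WEIGHTS_BYTES / bandwidth
    print(f"{name}: {seconds:,.0f} s per answer")
```

On those assumptions, a disk-bound model needs roughly 4.6 hours per answer versus 50 seconds from RAM, which is where the "hours" part of the estimate comes from.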
10,000,000,000,000,000,000 = The number of bytes of data on Google's servers
00,000,000,050,000,000,000 = The number of bytes an LLM file requires, which is practically nothing
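For scale, the ratio between those two figures is easy to check, using the same numbers quoted above:

```python
# Ratio of the two byte counts above.
google_corpus = 10_000_000_000_000_000_000   # 10 EB on Google's servers
model_file    = 50_000_000_000               # the 50 GB model file

print(f"{google_corpus / model_file:,.0f}")  # 200,000,000
# The model file is about one 200-millionth the size of the corpus.
```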
I'd be happy if someone can correct me, although evidence would be appreciated.
I had not thought of it the way you have. Yours is a well-reasoned post. Thanks.
Where it gets a little bit pricklier is with the content that is created as a result of these technologies.
For example, many organisations have access to vast content libraries through paid subscription services. These libraries are consulted at will, and new content is often born as a result.
With AI models capable of absorbing data in vast quantities (all of the data held within a subscription, for example), should the results produced by the AI be subject to further fees (on top of what was paid for the initial subscription)?
Is there a difference between a person using the organisation's paid subscription service to consult specific reference papers for a research document, and an AI service drawing off everything to produce content?
https://www.theverge.com/2023/2/6/23587393/ai-art-copyright-lawsuit-getty-images-stable-diffusion
More like an unstable diffusion.
https://www.sciencedirect.com/science/article/pii/S2667325821002193
All of Wikipedia (English) uncompressed is 86GB (19GB compressed):
https://en.wikipedia.org/wiki/Wikipedia:Database_download
It doesn't store the data directly, but it does store patterns in the data. This is much smaller than the source; GPT-3 seems to be around 300-800GB. With the right parameters, it can produce the same output it has scanned. It has to, or it wouldn't generate any correct answers.
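That size range lines up with GPT-3's published parameter count, depending on the number format used to store the weights; a quick back-of-the-envelope check:

```python
# GPT-3 has roughly 175 billion parameters (Brown et al., 2020).
# Weight file size depends on bytes per parameter:
params = 175e9
print(f"fp16: {params * 2 / 1e9:.0f} GB")  # 350 GB at 2 bytes/param
print(f"fp32: {params * 4 / 1e9:.0f} GB")  # 700 GB at 4 bytes/param
# Both land inside the 300-800GB range mentioned above.
```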
https://www.reddit.com/r/ChatGPT/comments/15aarp0/in_case_anybody_was_doubting_that_chatgpt_has/
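The same memorization effect is easy to demonstrate at toy scale. Here is a minimal sketch of a character-level Markov chain; it is nothing like a real transformer, but it shows how a model that stores only statistics about its training text can still emit that text verbatim once the context is long enough:

```python
import random
from collections import defaultdict

# Toy character-level Markov chain: it stores only "what follows what"
# counts -- patterns, never the source text itself.
corpus = "one fish two fish red fish blue fish"

def train(order):
    table = defaultdict(list)
    for i in range(len(corpus) - order):
        table[corpus[i:i + order]].append(corpus[i + order])
    return table

def generate(order, length):
    table = train(order)
    out = corpus[:order]                  # seed with the opening characters
    while len(out) < length:
        choices = table.get(out[-order:])
        if not choices:
            break
        out += random.choice(choices)
    return out

random.seed(1)
print(generate(order=3, length=len(corpus)))  # short context: usually remixes the fish
print(generate(order=8, length=len(corpus)))  # long context: verbatim copy
```

With order=8, every context in this tiny corpus is unique, so the "patterns" reproduce the source word for word -- the same behavior the link above demonstrates for ChatGPT.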
If it's asked directly to print copyrighted text, it says "I'm sorry, but I can't provide verbatim excerpts from copyrighted texts," but that's because some text has been flagged as copyrighted; it still knows what the text is. It can sometimes be forced by tricking it:
https://www.reddit.com/r/ChatGPT/comments/12iwmfl/chat_gpt_copes_with_piracy/
https://kotaku.com/chatgpt-ai-discord-clyde-chatbot-exploit-jailbreak-1850352678
The learning I achieve has value to me; the cost in my time to achieve that result can be reduced by paying someone for their expertise -- that's the choice I get to make. That expertise was gained at a cost to that other person/entity, and I don't have the right to deny them the chance to extract value from it.