Copyright laws shouldn't apply to AI training, proposes Google

AppleInsider · August 9, 2023 3:48PM

Google has told regulators in Australia that its AI systems should be able to train on any data, unless the copyright owners expressly opt out.

If generative artificial intelligence (AI) is to be useful, it needs enormous sources of data to train on, and if you're using that data, you should pay for it. Unless you're Google, in which case you think you're an exception -- and you have a record of trying to bully your way out of paying.

According to The Guardian, Google has presented a case to Australian regulators that it be allowed to do what it wants and, okay, maybe publishers should be able to say no. But that's on the publishers, not Google.

Google's submission, seen by The Guardian, calls for Australia to adopt "copyright systems that enable appropriate and fair use of copyrighted content to enable the training of AI models in Australia on a broad and diverse range of data while supporting workable opt-outs for entities that prefer their data not to be trained in using AI systems."

Reportedly, this is similar to arguments Google has presented to Australia before, except that it has added what appears to be a reasonable opt-out.

However, requiring publishers to explicitly opt out of any AI training on their data means the publishers have to know that their work is being mined. And since the regulators are making plans for all AI providers, it may not be possible to prove whether a company has ceased mining the data or not.

Moreover, an AI company could in theory delay the process so that by the time it stops mining, it has already used all of the data for its training.

According to The Guardian, Google has not specifically said how it believes such a system could work.

Separately, Google is one of a consortium of Big Tech firms based in the US that has recently pledged to establish best practices for the AI industry.


robin huber · August 9, 2023 3:52PM

If EU regulators want to REALLY do us all a solid, make “opt out” illegal. Opt in only should be the standard.

hydrogen · August 9, 2023 3:57PM

Don't be evil : what is yours is mine

xed · August 9, 2023 4:05PM

hydrogen said:

Don't be evil : what is yours is mine

I'd like to note that Google dropped the "don't be evil" motto in 2015 when they renamed themselves Alphabet. The new motto is "do the right thing" which probably isn't a reference to the Spike Lee movie from 1989.

danox · August 9, 2023 4:07PM

Google wanting to steal (vacuum) all information at their leisure, no surprise whatsoever……


lam92103 · August 9, 2023 4:34PM

What bullshit. Either get rid of copyright or stop whining.

But no. Google is like: Ohh, AI is bad. It will threaten humanity. But we can't keep our grubby hands out of it.


mikethemartian · August 9, 2023 4:35PM

So is it OK for someone to train their AI by querying Bard?

gatorguy · August 9, 2023 4:38PM

mikethemartian said:

So is it OK for someone to train their AI by querying Bard?

Of course it is.

bart y · August 9, 2023 4:45PM

As always take before copyright owners (your data, track everything) can opt out. Same old Google, what’s yours is ours, unless you say no, and even then maybe we’ll take it anyway.

How about nothing is yours, and it’s your job to ask permission and then get a decision from the IP owners??? Because that’s the way it’s done by the law, and morally right.

Too hard for a huge company like yours? Too ethical? What happened to “do no evil?” Oh right, rhetorical question.

chadbag · August 9, 2023 5:07PM

robin huber said:

If EU regulators want to REALLY do us all a solid, make “opt out” illegal. Opt in only should be the standard.

I came here to say the same thing. “Opt-out” for almost any law should not be allowed. Everything should be “opt-in”.

eriamjh · August 9, 2023 5:17PM

If a person can read it for free, then so should AI. Just because people can't regurgitate it, doesn't mean that since AI can it somehow should be prevented.

Anyone can be influenced by images, sound, text. So let AI do so.

If AI "copies" like a human, then sue the owners of the AI like a human. Fair use still applies. Sell it for profit, get sued.

No shields are needed. No restricting laws.

Maybe the problem will be AI can flood the world with "inspirations" and derivative works, thus diluting a unique artist's style or technique? That's probably an infringement.

Force credit where credit is due. Maybe that's the only law needed. Credit to human artists when AI is trained on them.

When AI generates something original, I'll be interested. Otherwise, AI is just a copying app.

22july2013 · August 9, 2023 5:52PM

AI has multiple data processing stages that could have different legal rules apply. It seems that Google is talking only about the training stage here. The training stage needs to examine large data sets, true, but the language model file that results from that training is surprisingly small (smaller than many large computer games, like World of Warcraft.) I'm no AI expert, or information theory expert, but I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright. If you ask a chatbot something like "how many basketballs can fit in an average house?" the file itself doesn't contain the answer in its data, it still has to go to the Internet to get the data (eg, size of a basketball, size of a house) needed to generate the answer. If there is a copyright issue, it's probably at the answer-generation stage, not at the model-training stage, because the model that gets created does not contain any of the data that the training-stage had to read to generate it.

In order for a human to get to where we are today, we had to read Dr Seuss books (or something similar) to learn English, but we don't have to pay Dr Seuss' estate every time we do something that makes money that exploits our knowledge of the English language. We might even forget the words to the Seuss books, but the principles of the language that we learned from those books remain in our heads. "Language principles" extracted from any data should not be copyrightable, while specific data should be copyrightable.

If you want to argue that the Google search engine violates copyright because it actually requires 10 exabytes of data (stored on Google's server farms) that it obtained from crawling the Internet, I could probably agree with that. But I can't see how a puny 50 GB file or anything that small could be a violation of anyone's copyright. You can't compress the entire Internet into a puny file like that, therefore it can't violate anyone's copyright.

The reason most people can't run large language models on their local computers is that the "small 50 GB file" has to fit in local memory (RAM), and most users don't have that much RAM. The reason it needs to fit in memory is that every byte of the file has to be accessed about 50 times in order to generate the answer to any question. If the file was stored on disk, it could take hours or weeks to calculate a single answer.

10,000,000,000,000,000,000 = The number of bytes of data on Google's servers
00,000,000,050,000,000,000 = The number of bytes that an LL model file requires, which is practically nothing
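
The size comparison above can be checked with a few lines of arithmetic. This is only a sketch that takes the two figures as quoted in the post (10 exabytes and 50 GB are the poster's estimates, not measured values); the function name `size_ratio` is made up for illustration.

```python
# Figures quoted in the post above (the poster's estimates, not measured values).
GOOGLE_INDEX_BYTES = 10 * 10**18   # 10 exabytes
MODEL_FILE_BYTES = 50 * 10**9      # 50 GB


def size_ratio(small: int, large: int) -> float:
    """Return how large `small` is as a fraction of `large`."""
    return small / large


ratio = size_ratio(MODEL_FILE_BYTES, GOOGLE_INDEX_BYTES)
times_larger = GOOGLE_INDEX_BYTES // MODEL_FILE_BYTES

print(f"model file / index = {ratio:.1e}")
print(f"the index is {times_larger:,} times larger than the model file")
```

On these numbers the model file is about five billionths of the size of the index. That says something about how much could be stored verbatim, not about whether any particular work can be reproduced.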

I'd be happy if someone can correct me, although evidence would be appreciated.

gatorguy · August 9, 2023 6:08PM

22july2013 said:

AI has multiple data processing stages that could have different legal rules apply. It seems that Google is talking only about the training stage here. The training stage needs to examine large data sets, true, but the language model file that results from that training is surprisingly small (smaller than many large computer games, like World of Warcraft.) I'm no AI expert, or information theory expert, but I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright. If you ask a chatbot something like "how many basketballs can fit in an average house?" the file itself doesn't contain the answer in its data, it still has to go to the Internet to get the data (eg, size of a basketball, size of a house) needed to generate the answer. If there is a copyright issue, it's probably at the answer-generation stage, not at the model-training stage, because the model that gets created does not contain any of the data that the training-stage had to read to generate it.

I'd be happy if someone can correct me, although evidence would be appreciated.

I had not thought of it the way you have. Yours is a well-reasoned post. Thanks.

goofy1958 · August 9, 2023 6:33PM

22july2013 said:

AI has multiple data processing stages that could have different legal rules apply. It seems that Google is talking only about the training stage here. The training stage needs to examine large data sets, true, but the language model file that results from that training is surprisingly small (smaller than many large computer games, like World of Warcraft.) I'm no AI expert, or information theory expert, but I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright. If you ask a chatbot something like "how many basketballs can fit in an average house?" the file itself doesn't contain the answer in its data, it still has to go to the Internet to get the data (eg, size of a basketball, size of a house) needed to generate the answer. If there is a copyright issue, it's probably at the answer-generation stage, not at the model-training stage, because the model that gets created does not contain any of the data that the training-stage had to read to generate it.

In order for a human to get to where we are today, we had to read Dr Seuss books (or something similar) to learn English, but we don't have to pay Dr Seuss' estate every time we do something that makes money that exploits our knowledge of the English language. We might even forget the words to the Seuss books, but the principles of the language that we learned from those books remain in our heads. "Language principles" extracted from any data should not be copyrightable, while specific data should be copyrightable.

If you want to argue that the Google search engine violates copyright because it actually requires 10 exabytes of data (stored on Google's server farms) that it obtained from crawling the Internet, I could probably agree with that. But I can't see how a puny 50 GB file or anything that small could be a violation of anyone's copyright. You can't compress the entire Internet into a puny file like that, therefore it can't violate anyone's copyright.

The reason most people can't run large language models on their local computers is that the "small 50 GB file" has to fit in local memory (RAM), and most users don't have that much RAM. The reason it needs to fit in memory is that every byte of the file has to be accessed about 50 times in order to generate the answer to any question. If the file was stored on disk, it could take hours or weeks to calculate a single answer.

10,000,000,000,000,000,000 = The number of bytes of data on Google's servers
00,000,000,050,000,000,000 = The number of bytes that an LL model file requires, which is practically nothing

I'd be happy if someone can correct me, although evidence would be appreciated.

Where did you get the 50GB number? I don't see that anywhere in the article.

foregoneconclusion · August 9, 2023 6:52PM

22july2013 said:

AI has multiple data processing stages that could have different legal rules apply. It seems that Google is talking only about the training stage here. The training stage needs to examine large data sets, true, but the language model file that results from that training is surprisingly small (smaller than many large computer games, like World of Warcraft.) I'm no AI expert, or information theory expert, but I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright. If you ask a chatbot something like "how many basketballs can fit in an average house?" the file itself doesn't contain the answer in its data, it still has to go to the Internet to get the data (eg, size of a basketball, size of a house) needed to generate the answer. If there is a copyright issue, it's probably at the answer-generation stage, not at the model-training stage, because the model that gets created does not contain any of the data that the training-stage had to read to generate it.

You're trying to equate the AI program with a human being when it comes to copyright issues. Not the same thing. Not the same capabilities. For example, why didn't Google simply use 100% public domain material (text or images) for the training? Why would they choose to use copyrighted material if the copyright part of it didn't matter? Using public domain material if you want to avoid copyright/payment is really basic stuff legally. Google could have easily chosen that route.

August 9, 2023 7:08PM

22july2013 said:

But I can't see how a puny 50 GB file or anything that small could be a violation of anyone's copyright. You can't compress the entire Internet into a puny file like that, therefore it can't violate anyone's copyright.

Your personal feeling about issues with compressing the whole internet... "therefore it can't violate anyone's copyright" is a non sequitur and irrelevant. A copyright protects an individual work, and it's property, and Google (or whoever) doesn't get to use it for its own AI purposes without the consent of the owner. Official Google Apologist Gatorguy will support any Googlization of anything.

avon b7 · August 9, 2023 7:28PM

foregoneconclusion said:

22july2013 said:

AI has multiple data processing stages that could have different legal rules apply. It seems that Google is talking only about the training stage here. The training stage needs to examine large data sets, true, but the language model file that results from that training is surprisingly small (smaller than many large computer games, like World of Warcraft.) I'm no AI expert, or information theory expert, but I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright. If you ask a chatbot something like "how many basketballs can fit in an average house?" the file itself doesn't contain the answer in its data, it still has to go to the Internet to get the data (eg, size of a basketball, size of a house) needed to generate the answer. If there is a copyright issue, it's probably at the answer-generation stage, not at the model-training stage, because the model that gets created does not contain any of the data that the training-stage had to read to generate it.

You're trying to equate the AI program with a human being when it comes to copyright issues. Not the same thing. Not the same capabilities. For example, why didn't Google simply use 100% public domain material (text or images) for the training? Why would they choose to use copyrighted material if the copyright part of it didn't matter? Using public domain material if you want to avoid copyright/payment is really basic stuff legally. Google could have easily chosen that route.

I think that's a very reasonable stance and hope the authorities see it that way too.

Where it gets a little bit pricklier is with the content that is created as a result of these technologies.

For example, many organisations have access to databases of vast content libraries through paid subscription services. It is consulted at will and new content is often born as a result.

With AI models capable of absorbing data in vast quantities (all of the data held within a subscription for example), should the results, produced by the AI, be subject to further fees (on top of what was paid for the initial subscription)?

Is there a difference between someone producing a document for a research paper and using the organisation's paid subscription service to consult specific reference papers and an AI service drawing off everything to produce content?

strangedays · August 9, 2023 8:45PM

22july2013 said:

AI has multiple data processing stages that could have different legal rules apply. It seems that Google is talking only about the training stage here. The training stage needs to examine large data sets, true, but the language model file that results from that training is surprisingly small (smaller than many large computer games, like World of Warcraft.) I'm no AI expert, or information theory expert, but I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright. If you ask a chatbot something like "how many basketballs can fit in an average house?" the file itself doesn't contain the answer in its data, it still has to go to the Internet to get the data (eg, size of a basketball, size of a house) needed to generate the answer. If there is a copyright issue, it's probably at the answer-generation stage, not at the model-training stage, because the model that gets created does not contain any of the data that the training-stage had to read to generate it.

Dunno man, tell that to Getty Images, who had their actual watermarks rendered into Stable Diffusion’s output:

https://www.theverge.com/2023/2/6/23587393/ai-art-copyright-lawsuit-getty-images-stable-diffusion

xed · August 9, 2023 8:59PM

StrangeDays said:

22july2013 said:

AI has multiple data processing stages that could have different legal rules apply. It seems that Google is talking only about the training stage here. The training stage needs to examine large data sets, true, but the language model file that results from that training is surprisingly small (smaller than many large computer games, like World of Warcraft.) I'm no AI expert, or information theory expert, but I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright. If you ask a chatbot something like "how many basketballs can fit in an average house?" the file itself doesn't contain the answer in its data, it still has to go to the Internet to get the data (eg, size of a basketball, size of a house) needed to generate the answer. If there is a copyright issue, it's probably at the answer-generation stage, not at the model-training stage, because the model that gets created does not contain any of the data that the training-stage had to read to generate it.

Dunno man, tell that to Getty Images, who had their actual watermarks rendered into Stable Diffusion’s output:

https://www.theverge.com/2023/2/6/23587393/ai-art-copyright-lawsuit-getty-images-stable-diffusion

More like an unstable diffusion.

Image: https://forums.appleinsider.com/uploads/editor/85/cd06kgns75vj.png

marvin · August 9, 2023 11:10PM

22july2013 said:

I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright.

If you want to argue that the Google search engine violates copyright because it actually requires 10 exabytes of data (stored on Google's server farms) that it obtained from crawling the Internet, I could probably agree with that. But I can't see how a puny 50 GB file or anything that small could be a violation of anyone's copyright. You can't compress the entire Internet into a puny file like that, therefore it can't violate anyone's copyright.

The reason most people can't run large language models on their local computers is that the "small 50 GB file" has to fit in local memory (RAM), and most users don't have that much RAM. The reason it needs to fit in memory is that every byte of the file has to be accessed about 50 times in order to generate the answer to any question. If the file was stored on disk, it could take hours or weeks to calculate a single answer.

10,000,000,000,000,000,000 = The number of bytes of data on Google's servers
00,000,000,050,000,000,000 = The number of bytes that an LL model file requires, which is practically nothing

GPT-3 was trained on 45TB of uncompressed data; GPT-2 on 40GB:

https://www.sciencedirect.com/science/article/pii/S2667325821002193

All of Wikipedia (English) uncompressed is 86GB (19GB compressed):

https://en.wikipedia.org/wiki/Wikipedia:Database_download

It doesn't store direct data but it stores patterns in the data. This is much smaller than the source; GPT-3 seems to be around 300-800GB. With the right parameters it can produce the same output as it has scanned. It has to or it wouldn't generate any correct answers.

https://www.reddit.com/r/ChatGPT/comments/15aarp0/in_case_anybody_was_doubting_that_chatgpt_has/
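
As a rough sanity check on the sizes quoted above, the sketch below compares the 45TB training figure with the 300-800GB model estimate. It also shows where a range like that could come from, assuming the publicly reported 175 billion parameter count for GPT-3 stored at 2 or 4 bytes per parameter. That parameter count and the storage widths are assumptions added here, not figures from the post, and `weights_size_bytes` is a made-up helper name.

```python
# Figures quoted in the post above; treat them as the poster's estimates.
TRAINING_DATA_BYTES = 45 * 10**12                      # 45 TB of uncompressed text
MODEL_SIZE_RANGE_BYTES = (300 * 10**9, 800 * 10**9)    # 300-800 GB

# Assumption (not from the post): publicly reported GPT-3 parameter count.
PARAMETER_COUNT = 175 * 10**9


def weights_size_bytes(parameters: int, bytes_per_parameter: int) -> int:
    """Size of the raw weights if every parameter uses the same storage width."""
    return parameters * bytes_per_parameter


# 2 bytes = 16-bit floats, 4 bytes = 32-bit floats (assumed storage formats).
for label, width in (("16-bit", 2), ("32-bit", 4)):
    size_gb = weights_size_bytes(PARAMETER_COUNT, width) / 10**9
    print(f"{label} weights: {size_gb:.0f} GB")

# How many times larger the training text is than the model, at each end of the range.
for model_bytes in MODEL_SIZE_RANGE_BYTES:
    factor = TRAINING_DATA_BYTES / model_bytes
    print(f"training data is {factor:.2f}x a {model_bytes / 10**9:.0f} GB model")
```

Under these assumptions the training text is somewhere between roughly 56 and 150 times larger than the model, far too much for the model to hold all of it verbatim, though that does not rule out individual passages being memorized.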

If it's asked directly to print copyrighted text, it says "I'm sorry, but I can't provide verbatim excerpts from copyrighted texts" but this is because some text has been flagged as copyrighted; it still knows what the text is. It can be forced sometimes by tricking it:

https://www.reddit.com/r/ChatGPT/comments/12iwmfl/chat_gpt_copes_with_piracy/
https://kotaku.com/chatgpt-ai-discord-clyde-chatbot-exploit-jailbreak-1850352678

Image: https://forums.appleinsider.com/uploads/editor/df/fptrfffj1qyy.jpg

filemakerfeller · August 9, 2023 11:36PM

22july2013 said:

AI has multiple data processing stages that could have different legal rules apply. It seems that Google is talking only about the training stage here. The training stage needs to examine large data sets, true, but the language model file that results from that training is surprisingly small (smaller than many large computer games, like World of Warcraft.) I'm no AI expert, or information theory expert, but I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright. If you ask a chatbot something like "how many basketballs can fit in an average house?" the file itself doesn't contain the answer in its data, it still has to go to the Internet to get the data (eg, size of a basketball, size of a house) needed to generate the answer. If there is a copyright issue, it's probably at the answer-generation stage, not at the model-training stage, because the model that gets created does not contain any of the data that the training-stage had to read to generate it.

If I, as a human being, want to be trained on a particular skill, specialty or subject I can choose to learn from source materials that are made available for no cost or I can choose materials that are available for a fee. It is the right of the publisher to charge a fee; it is not my right to take something from them and then tell them "so many things are given away for free, I just assumed your stuff was free as well."

The learning I achieve has value to me, the cost in my time to achieve that result can be reduced by paying someone for their expertise - that's the choice I get to make. That expertise was gained at a cost to that other person/entity and I don't have the right to deny them the chance to extract value from it.
