Copyright laws shouldn't apply to AI training, proposes Google


Comments

  • Reply 21 of 28
    dewme Posts: 5,261 member
    eriamjh said:
    If a person can read it for free, then so should AI. Just because people can't regurgitate it doesn't mean that AI, which can, should somehow be prevented.

    Anyone can be influenced by images, sound, text.  So let AI do so.

    If AI "copies" like a human, than sue the owners of the AI like a human.   Fair use still applies.   Sell it for profit, get sued.

    No shields are needed.  No restricting laws.

    Maybe the problem will be that AI can flood the world with "inspirations" and derivative works, thus diluting a unique artist's style or technique? That's probably an infringement.

    Force credit where credit is due.  Maybe that's the only law needed.  Credit to human artists when AI is trained on them.

    When AI generates something original, I'll be interested. Otherwise, AI is just a copying app.
    Thank you. You’ve captured my exact take on this. If a human can do something, albeit at absurdly slow speeds and volumes compared to a machine, then why should we disallow it? 

    The whole point of automation is to serve its owners by vastly improving the speed, efficiency, productivity, volume, repeatability, etc., of tasks and processes that provide value to humans. 

    The only intelligence in artificial intelligence is the intelligence that humans apply when they build the mathematical models and algorithms that are executed by machines. No machine has ever had a single “thought,” much less an original thought. 

    The value behind the creation of intellectual property (IP) is derived from human thought. Invention is rooted in human-driven original thought. Innovation is rooted in humans thinking up ways to derive productive value by exploiting human invention, domain expertise, experience, practicality, marketability, knowledge, learning, profitability, etc. 

    I don’t think Google is engaged in anything nefarious here. They are simply applying learning and discovery principles and mechanisms that have been employed by human researchers, innovators, and everyone seeking to extend the state of the art beyond what is currently known and understood. Human knowledge is cumulative. 

    To build a better mouse trap you have to take a look at what are currently considered the best mouse traps. Otherwise, why bother? 
  • Reply 22 of 28
    22july2013 Posts: 3,553 member
    Marvin said:
    I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright.

    If you want to argue that the Google search engine violates copyright because it actually requires 10 exabytes of data (stored on Google's server farms) that it obtained from crawling the Internet, I could probably agree with that. But I can't see how a puny 50 GB file or anything that small could be a violation of anyone's copyright. You can't compress the entire Internet into a puny file like that, therefore it can't violate anyone's copyright.

    The reason most people can't run large language models on their local computers is that the "small 50 GB file" has to fit in local memory (RAM), and most users don't have that much RAM. The reason it needs to fit in memory is that every byte of the file has to be accessed about 50 times in order to generate the answer to any question. If the file was stored on disk, it could take hours or weeks to calculate a single answer.

    10,000,000,000,000,000,000 = The number of bytes of data on Google's servers
    00,000,000,050,000,000,000 = The number of bytes that an LL model file requires, which is practically nothing
    ChatGPT 3 is trained on 45TB of uncompressed data, GPT2 was 40GB:

    https://www.sciencedirect.com/science/article/pii/S2667325821002193

    All of Wikipedia (English) uncompressed is 86GB (19GB compressed):

    https://en.wikipedia.org/wiki/Wikipedia:Database_download

    It doesn't store direct data but it stores patterns in the data. This is much smaller than the source, GPT 3 seems to be around 300-800GB. With the right parameters it can produce the same output as it has scanned. It has to or it wouldn't generate any correct answers.

    https://www.reddit.com/r/ChatGPT/comments/15aarp0/in_case_anybody_was_doubting_that_chatgpt_has/

    If it's asked directly to print copyrighted text, it says "I'm sorry, but I can't provide verbatim excerpts from copyrighted texts" but this is because some text has been flagged as copyright, it still knows what the text is. It can be forced sometimes by tricking it:

    https://www.reddit.com/r/ChatGPT/comments/12iwmfl/chat_gpt_copes_with_piracy/
    https://kotaku.com/chatgpt-ai-discord-clyde-chatbot-exploit-jailbreak-1850352678
    If Wikipedia requires 19GB uncompressed, as you claim, then you believe that a 50 GB language model (barely twice the size) can contain the ENTIRE Internet including every book ever written? That's what you seem to be saying.
  • Reply 23 of 28
    22july2013 Posts: 3,553 member
    goofy1958 said:
    AI has multiple data processing stages that could have different legal rules apply. It seems that Google is talking only about the training stage here. The training stage needs to examine large data sets, true, but the language model file that results from that training is surprisingly small (smaller than many large computer games, like World of Warcraft.) I'm no AI expert, or information theory expert, but I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright. If you ask a chatbot something like "how many basketballs can fit in an average house?" the file itself doesn't contain the answer in its data, it still has to go to the Internet to get the data (eg, size of a basketball, size of a house) needed to generate the answer. If there is a copyright issue, it's probably at the answer-generation stage, not at the model-training stage, because the model that gets created does not contain any of the data that the training-stage had to read to generate it.

    In order for a human to get to where we are today, we had to read Dr Seuss books (or something similar) to learn English, but we don't have to pay Dr Seuss' estate every time we do something that makes money that exploits our knowledge of the English language. We might even forget the words to the Seuss books, but the principles of the language that we learned from those books remain in our heads. "Language principles" extracted from any data should not be copyrightable, while specific data should be copyrightable.

    If you want to argue that the Google search engine violates copyright because it actually requires 10 exabytes of data (stored on Google's server farms) that it obtained from crawling the Internet, I could probably agree with that. But I can't see how a puny 50 GB file or anything that small could be a violation of anyone's copyright. You can't compress the entire Internet into a puny file like that, therefore it can't violate anyone's copyright.

    The reason most people can't run large language models on their local computers is that the "small 50 GB file" has to fit in local memory (RAM), and most users don't have that much RAM. The reason it needs to fit in memory is that every byte of the file has to be accessed about 50 times in order to generate the answer to any question. If the file was stored on disk, it could take hours or weeks to calculate a single answer.

    10,000,000,000,000,000,000 = The number of bytes of data on Google's servers
    00,000,000,050,000,000,000 = The number of bytes that an LL model file requires, which is practically nothing

    I'd be happy if someone can correct me, although evidence would be appreciated.
    Where did you get the 50GB number?  I don't see that anywhere in the article.
    I watched some videos about large language models and picked one size from them. If you don't like that number, pick an LLM with a different size. The exact size really doesn't matter. Multiply my number by 5 or 10 and my point is the same.
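    For scale, here's a rough back-of-the-envelope check using the figures quoted in this thread (they're assumptions, not measurements):

    ```python
    # Rough scale comparison using the numbers quoted above: ~10 exabytes of data
    # on Google's servers vs. a model file of ~50 GB. Both are assumed figures.
    GOOGLE_DATA_BYTES = 10_000_000_000_000_000_000   # 10 EB, as quoted above
    MODEL_FILE_BYTES = 50_000_000_000                # 50 GB, the size I picked

    ratio = GOOGLE_DATA_BYTES / MODEL_FILE_BYTES
    print(f"source data is ~{ratio:,.0f}x larger than the model file")  # ~200,000,000x

    # Even a 10x larger model (500 GB) leaves a ratio of ~20,000,000:1, so the point
    # about the model being tiny relative to the source data holds either way.
    print(f"with a 500 GB model: ~{GOOGLE_DATA_BYTES / (10 * MODEL_FILE_BYTES):,.0f}x")
    ```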
  • Reply 24 of 28
    larryjw Posts: 1,031 member
    Copyright protects against copying the work of others, with the idea that one is generating a verbatim copy of a significant portion of that work. Since an LLM merely takes those words from copyrighted works and updates its database of conditional next-word probabilities, there is little chance that an author's set of words will be copied into the LLM's response to a prompt.

    Copyright also protects derivative works. This is where it can be argued that the LLM is creating a derivative work. I think that is a stretch however. 

    In any case, I don't believe copyright is in any way a law that prevents the wholesale processing of copyrighted works for the purpose of building conditional next-word probability databases. The harms to authors that copyright law protects against are not at all implicated by building these databases.
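    To make "conditional next-word probabilities" concrete, here's a toy sketch of my own (real LLMs learn these relationships in neural-network weights, not a literal lookup table):

    ```python
    # Toy "conditional next-word probability database": count which word follows
    # which in a corpus, then normalize the counts into probabilities.
    from collections import Counter, defaultdict

    def next_word_probabilities(text: str) -> dict[str, dict[str, float]]:
        counts: defaultdict[str, Counter] = defaultdict(Counter)
        words = text.lower().split()
        for current, nxt in zip(words, words[1:]):
            counts[current][nxt] += 1
        return {
            word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
            for word, followers in counts.items()
        }

    if __name__ == "__main__":
        corpus = "the cat sat on the mat and the cat slept"  # stand-in for training text
        probs = next_word_probabilities(corpus)
        print(probs["the"])  # {'cat': 0.666..., 'mat': 0.333...}
    ```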
  • Reply 25 of 28
    Xed Posts: 2,461 member
    Marvin said:
    I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright.

    If you want to argue that the Google search engine violates copyright because it actually requires 10 exabytes of data (stored on Google's server farms) that it obtained from crawling the Internet, I could probably agree with that. But I can't see how a puny 50 GB file or anything that small could be a violation of anyone's copyright. You can't compress the entire Internet into a puny file like that, therefore it can't violate anyone's copyright.

    The reason most people can't run large language models on their local computers is that the "small 50 GB file" has to fit in local memory (RAM), and most users don't have that much RAM. The reason it needs to fit in memory is that every byte of the file has to be accessed about 50 times in order to generate the answer to any question. If the file was stored on disk, it could take hours or weeks to calculate a single answer.

    10,000,000,000,000,000,000 = The number of bytes of data on Google's servers
    00,000,000,050,000,000,000 = The number of bytes that an LL model file requires, which is practically nothing
    ChatGPT 3 is trained on 45TB of uncompressed data, GPT2 was 40GB:

    https://www.sciencedirect.com/science/article/pii/S2667325821002193

    All of Wikipedia (English) uncompressed is 86GB (19GB compressed):

    https://en.wikipedia.org/wiki/Wikipedia:Database_download

    It doesn't store direct data but it stores patterns in the data. This is much smaller than the source, GPT 3 seems to be around 300-800GB. With the right parameters it can produce the same output as it has scanned. It has to or it wouldn't generate any correct answers.

    https://www.reddit.com/r/ChatGPT/comments/15aarp0/in_case_anybody_was_doubting_that_chatgpt_has/

    If it's asked directly to print copyrighted text, it says "I'm sorry, but I can't provide verbatim excerpts from copyrighted texts" but this is because some text has been flagged as copyright, it still knows what the text is. It can be forced sometimes by tricking it:

    https://www.reddit.com/r/ChatGPT/comments/12iwmfl/chat_gpt_copes_with_piracy/
    https://kotaku.com/chatgpt-ai-discord-clyde-chatbot-exploit-jailbreak-1850352678
    If Wikipedia requires 19GB uncompressed, as you claim, then you believe that a 50 GB language model (barely twice the size) can contain the ENTIRE Internet including every book ever written? That's what you seem to be saying.
    Marvin wrote "Wikipedia (English) uncompressed is 86GB (19GB compressed)." That means that all the Wikipedia entries for all languages, as well other versions in Enlightening (like Simplified English) would increase that size, as well as all those pages and databases of data from Wikipedia (like their discussion and revision secants for all entries).

    I don't understand your other comment about the "ENTIRE internet," as the entire internet encompasses far more data than could be contained in the storage capacity of a 12-year-old iPhone 4S. I asked ChatGPT how much data it would take to store the Library of Congress if encoded with Unicode-16:

    That’s an interesting question! The Library of Congress is one of the largest and most diverse libraries in the world, with over 100 million items in various formats and languages. To estimate how much data these books would need in Unicode-16, we need to make some assumptions and calculations.

    First, let’s assume that the average book has about 300 pages, and each page has about 500 words. That means each book has about 150,000 words. Next, let’s assume that each word has about 5 characters on average, and each character takes 2 bytes in Unicode-16. That means each word takes about 10 bytes, and each book takes about 1.5 megabytes (MB) in Unicode-16.

    Now, let’s multiply the average size of a book by the number of books in the Library of Congress. According to Wikipedia, the Library of Congress had about 32 million books as of 2018. If we assume that this number has not changed much since then, we can estimate that the total size of the books in the Library of Congress is about 48 terabytes (TB) in Unicode-16.

    Of course, this is a very rough estimate, and it does not take into account other factors such as illustrations, metadata, compression, encoding schemes, etc. But it gives us a sense of how much data the Library of Congress books would need in Unicode-16. To put this number in perspective, it is equivalent to about 12,000 DVDs or 96,000 CDs. That’s a lot of data!
    I hope this answer was helpful and informative. If you have any follow-up questions or want to learn more about the Library of Congress, feel free to ask me.

    That's considerably more than 50 GiB and doesn't include audio, images, or video, which take up a lot more space and are clearly part of the "ENTIRE internet" along with your banking websites, social media accounts, and countless other types of data being pushed over the "ENTIRE internet."
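    For what it's worth, the arithmetic in that quoted estimate checks out. Here's the same calculation as a quick sketch, where every input is an assumption carried over from the quote rather than a measured figure:

    ```python
    # Reproduce the back-of-the-envelope estimate from the quoted answer above.
    # All inputs are the quote's assumptions, not measured values.
    PAGES_PER_BOOK = 300
    WORDS_PER_PAGE = 500
    CHARS_PER_WORD = 5
    BYTES_PER_CHAR = 2            # UTF-16 uses 2 bytes for most common characters
    BOOKS_IN_LOC = 32_000_000     # book count cited in the quote

    bytes_per_book = PAGES_PER_BOOK * WORDS_PER_PAGE * CHARS_PER_WORD * BYTES_PER_CHAR
    total_bytes = bytes_per_book * BOOKS_IN_LOC
    print(f"per book: {bytes_per_book / 1e6:.1f} MB")   # 1.5 MB
    print(f"all books: {total_bytes / 1e12:.0f} TB")    # 48 TB, i.e. far more than 50 GB
    ```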
  • Reply 26 of 28
    Marvin Posts: 15,269 moderator
    Marvin said:
    I can't see any way that the language model itself (eg, a 50 GB file) "contains a copy of the Internet" therefore the model by itself probably isn't violating anyone's copyright.

    If you want to argue that the Google search engine violates copyright because it actually requires 10 exabytes of data (stored on Google's server farms) that it obtained from crawling the Internet, I could probably agree with that. But I can't see how a puny 50 GB file or anything that small could be a violation of anyone's copyright. You can't compress the entire Internet into a puny file like that, therefore it can't violate anyone's copyright.

    The reason most people can't run large language models on their local computers is that the "small 50 GB file" has to fit in local memory (RAM), and most users don't have that much RAM. The reason it needs to fit in memory is that every byte of the file has to be accessed about 50 times in order to generate the answer to any question. If the file was stored on disk, it could take hours or weeks to calculate a single answer.

    10,000,000,000,000,000,000 = The number of bytes of data on Google's servers
    00,000,000,050,000,000,000 = The number of bytes that an LL model file requires, which is practically nothing
    ChatGPT 3 is trained on 45TB of uncompressed data, GPT2 was 40GB:

    https://www.sciencedirect.com/science/article/pii/S2667325821002193

    All of Wikipedia (English) uncompressed is 86GB (19GB compressed):

    https://en.wikipedia.org/wiki/Wikipedia:Database_download

    It doesn't store direct data but it stores patterns in the data. This is much smaller than the source, GPT 3 seems to be around 300-800GB. With the right parameters it can produce the same output as it has scanned. It has to or it wouldn't generate any correct answers.

    https://www.reddit.com/r/ChatGPT/comments/15aarp0/in_case_anybody_was_doubting_that_chatgpt_has/

    If it's asked directly to print copyrighted text, it says "I'm sorry, but I can't provide verbatim excerpts from copyrighted texts" but this is because some text has been flagged as copyright, it still knows what the text is. It can be forced sometimes by tricking it:

    https://www.reddit.com/r/ChatGPT/comments/12iwmfl/chat_gpt_copes_with_piracy/
    https://kotaku.com/chatgpt-ai-discord-clyde-chatbot-exploit-jailbreak-1850352678
    If Wikipedia requires 19GB uncompressed, as you claim, then you believe that a 50 GB language model (barely twice the size) can contain the ENTIRE Internet including every book ever written? That's what you seem to be saying.
    Text doesn't take up a lot of space. You can see examples of books here; they are a few hundred KB each, about the same as a single image:

    https://www.gutenberg.org/ebooks/2701
    https://www.gutenberg.org/ebooks/11
    https://www.gutenberg.org/browse/scores/top

    The Chat GPT 3 model is hundreds of GBs in size and it was trained on 45TB of data (in 2021), which is not the whole internet. 45TB -> 500GB = 90:1 compression.

    It also doesn't store the text directly; it's encoded, similar to how file compression like .zip works, which can losslessly compress text around 5:1. AV1 lossy compression can compress a 50GB Blu-ray video down to 1GB and it looks like the original.
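    As a rough illustration of the lossless-compression point, here's a sketch of my own for measuring the ratio on a plain-text book (for example one downloaded from the Project Gutenberg links above); the exact ratio depends on the text and the compressor:

    ```python
    # Measure the lossless compression ratio of a text file with zlib (the deflate
    # algorithm used by .zip). "book.txt" is a placeholder path, not a real file here.
    import zlib
    from pathlib import Path

    def compression_ratio(path: str, level: int = 9) -> float:
        raw = Path(path).read_bytes()
        compressed = zlib.compress(raw, level)
        return len(raw) / len(compressed)

    if __name__ == "__main__":
        ratio = compression_ratio("book.txt")  # substitute any downloaded plain-text book
        print(f"lossless compression: {ratio:.1f}:1")  # English prose lands in the low single digits
    ```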



    It stores data that allows it to generate a pattern similar to the data it was trained on. Modify the parameters and it will modify the pattern. If the right parameters are used, it will be able to generate a pattern that is very close to the original material. There are billions of parameter variations, so it would be rare to get an outcome that reproduces the original material, but the pattern is in there.
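    A toy way to see that "the pattern is in there" (again, just an analogy of my own, not how a transformer actually works): even a simple character-level model with a long enough context will spit its training text back out verbatim if you always pick the most likely continuation.

    ```python
    # Toy character-level model: with a long context and greedy sampling it simply
    # regurgitates its training text, which is the memorization concern in miniature.
    from collections import Counter, defaultdict

    ORDER = 12  # context length in characters; long enough that every context here is unique

    def train(text: str) -> dict[str, Counter]:
        model: defaultdict[str, Counter] = defaultdict(Counter)
        for i in range(len(text) - ORDER):
            model[text[i:i + ORDER]][text[i + ORDER]] += 1
        return model

    def generate(model: dict[str, Counter], seed: str, length: int = 80) -> str:
        out = seed
        for _ in range(length):
            context = out[-ORDER:]
            if context not in model:
                break  # ran past everything seen in training
            out += model[context].most_common(1)[0][0]  # greedy: most likely next character
        return out

    if __name__ == "__main__":
        training_text = "It was the best of times, it was the worst of times."  # public-domain line
        model = train(training_text)
        print(generate(model, training_text[:ORDER]))  # prints the training line back verbatim
    ```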

    Some content creators aren't happy that AI software uses their work this way. It allows people to copy the style of famous artists, musicians and writers. People have been doing it with AI voices:



    With text, someone could ask it to write a new Stephen King novel. Even though it would be a new, original work, people would see the familiar style in the text. The technology is very powerful and there's not really a way to stop it now, but it's clear why original creators would be uneasy about it, especially seeing how capable it is at such an early stage. Server hardware manufacturers are talking about orders-of-magnitude jumps in performance for next-gen AI processing hardware. Part of the recent writers' strike was about regulating the use of AI in writing:

    https://time.com/6277158/writers-strike-ai-wga-screenwriting/
    https://www.newscientist.com/article/2373382-why-use-of-ai-is-a-major-sticking-point-in-the-ongoing-writers-strike/

    It's not hard to imagine a time in the near future when a studio exec could ask an AI to write a blockbuster movie screenplay and generate a sample movie with famous actors acting out the parts, all in minutes on a studio render farm. Then they could greenlight the best movie they generate. This can cut out the writers even though the AI was trained on their original work.

    AI sits in a bit of a grey area with copyright just now. It's like a karaoke singer doing a cover of a famous song, which is legal. But it amounts to giving everyone the ability to be that karaoke singer and to start using that work commercially. Original artists can take legal action against another artist if they notice similarities:

    https://www.youtube.com/watch?v=R5sO9dhPK8g
    https://www.youtube.com/watch?v=0kt1DXu7dlo

    They couldn't take on a million people using AI tools trained on their work. That's why regulators are trying to put rules in place before it gets out of control. I think they are already too late, but there would be the possibility of retroactively tagging training data as used without consent in an AI generator. Content creators have a right to say how their work is used.
  • Reply 27 of 28
    Google is an internet plague! Such thievery!
  • Reply 28 of 28

    There needs to be a universal law that all people own the copyright to their personal data and likeness from the day of birth. If Google or anyone else wants that data, they have to get a signed release and pay a royalty for its use.