Apple's generative AI may be the only one that was trained legally & ethically

Posted in iOS, edited April 24

As copyright concerns plague the field of generative AI, Apple seeks to preserve privacy and legality through innovative training methods for its large language models, all while avoiding controversy.

An iMac displaying hands in handcuffs. Apple's AI may be the only legally-trained one on the market



In recent years, the question of how generative AI relates to copyright law has become an increasingly important and complex issue. As large language models (LLMs) and generative AI apps grow in popularity, copyright disputes have continued to pile up without any meaningful resolution.

Problems arise when companies use copyrighted works in training their generative AI software, and when the outputs of said AI software contain sections of works under copyright protection.

Copying copyrighted works in their entirety, or using significant sections of them, to train generative AI software is copyright infringement. There is no "fair use" carve-out for AI training, despite what the companies training the models say or believe.

Generative AI and copyright infringement lawsuits



In late December of 2023, OpenAI and Microsoft were sued by The New York Times for copyright infringement. The lawsuit claimed that the two companies trained their generative AI software on millions of articles published by The New York Times.

This was not the first time OpenAI faced a lawsuit over model training. In September 2023, the company was also sued by several prominent authors, including George R. R. Martin, Michael Connelly, and Jonathan Franzen.

The history of generative AI and copyright issues goes back even further: in July of 2023, over 15,000 authors signed an open letter addressed to several prominent companies, including Alphabet, OpenAI, Meta, and Microsoft.

The letter requested that the authors be properly credited and compensated for their work, which had been used in the training of generative AI and large language models.

Another, similar class-action lawsuit alleging copyright infringement was filed against OpenAI in January of 2024, this time by non-fiction authors Nicholas Basbanes and Nicholas Gage.

In late April of 2024, another AI-related lawsuit was filed, this time against Amazon. The lawsuit alleges that an Amazon employee was instructed to deliberately ignore and violate copyright law so that Amazon could compete against rival products and services more effectively.

In the lawsuit, a former Amazon employee claims she was told by a supervisor, regarding copyright-violating AI training, that "everyone else is doing it," implying that people at rival companies were knowingly engaging in copyright infringement.

And, it's pretty clear that they are.

AI and publishers' concerns about reproduction of copyrighted content



AI models have been caught reproducing copyrighted content on multiple occasions, and the severity of the problem has prompted companies to analyze how often it happens.

To gain a better understanding of the rate at which AI chatbots generate copyright-protected content, Patronus AI decided to look into the matter. The company, which evaluates generative AI models, compared four major models: OpenAI's GPT-4, Meta's Llama 2, Mistral's Mixtral, and Anthropic's Claude 2.1.

Patronus AI found that the rate at which the models generated copyrighted content varied by model, but that it was high across the board. The company also released its own tool, known as CopyrightCatcher, which detects potential copyright violations in LLM output.
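For a sense of how such detection can work, here is a minimal sketch of one common technique: checking how many word-level n-grams of a model's output appear verbatim in a protected reference text. This is illustrative only; Patronus AI has not published CopyrightCatcher's internals, and everything below is our own invention.

```python
# Minimal sketch of verbatim-overlap detection, a common way to flag
# possible reproduction of protected text. NOT CopyrightCatcher's
# actual method, which Patronus AI has not made public.

def ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(output: str, protected: str, n: int = 6) -> float:
    """Fraction of the output's n-grams that appear verbatim in the
    protected reference text. Higher scores suggest copying."""
    out_grams = ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & ngrams(protected, n)) / len(out_grams)

# A long shared phrase pushes the score well above zero.
reference = "it was the best of times it was the worst of times it was the age of wisdom"
generated = "as the novel says it was the best of times it was the worst of times"
print(f"overlap: {overlap_score(generated, reference):.2f}")
```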

While the generation of copyrighted content has serious implications, publishers are also concerned about the use of copyrighted material in training large language models.

An Adobe Firefly-generated image of a wizard mouse. Definitely not Mickey from Disney's 'Fantasia'



In March of 2024, The Wall Street Journal reported that prominent publishers were investigating the use of their copyrighted works in the training of generative AI models, and that they wanted to be paid for that use.

Given the number of lawsuits related to generative AI and copyright and the seriousness of the concerns expressed by publishers, it makes sense that a company like Apple would try its best to avoid any potential legal issues.

Apple's unique approach to generative AI, large language models, and copyright issues



To avoid similar copyright issues in training its own generative AI software, Apple has reportedly been licensing the works of major news publishers.

In December of 2023, it was reported that Apple was seeking to license works from Condé Nast, the publisher of Vogue and The New Yorker. The company had also spoken to IAC and NBC News in an attempt to strike deals worth approximately $50 million.

While Apple developed its large language model, known internally as Ajax, with basic on-device functionality, the company took a different approach to more advanced features: it considered licensing software such as Google Gemini for more complex tasks requiring an internet connection.

By employing this strategy, Apple clearly intended to avoid copyright issues. With paid licensing, Apple would not be responsible for copyright infringement caused by software such as Google Gemini.

In a research paper published in March of 2024, Apple revealed that it used a carefully curated mixture of image, image-text, and text-only data to train its in-house LLM. Apple's method improved image captioning and multi-step reasoning while preserving privacy.
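Apple's paper does not disclose the exact data recipe, but the general idea of a weighted mixture of training sources can be sketched in a few lines of Python. The source names and weights below are invented placeholders, not Apple's actual mixture.

```python
# Hedged sketch of mixture-weighted sampling across data sources.
# All sources and weights here are made up for illustration.
import random

sources = {
    "licensed_text":  ["text sample 1", "text sample 2"],
    "image_captions": ["caption 1", "caption 2"],
    "interleaved":    ["image-text doc 1", "image-text doc 2"],
}
weights = {"licensed_text": 0.5, "image_captions": 0.3, "interleaved": 0.2}

def sample_batch(batch_size: int) -> list:
    """Draw a batch whose per-example source follows the mixture weights."""
    names = list(sources)
    probs = [weights[name] for name in names]
    picks = random.choices(names, weights=probs, k=batch_size)
    return [random.choice(sources[src]) for src in picks]

print(sample_batch(4))
```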

An example of an image from an Apple generative AI graphic tool.



Industry sources told us that Apple's Ajax LLM preserves privacy because it does not require an internet connection for basic text analysis. This means the on-device LLM cannot connect to a database and identify copyrighted content while offline, although more advanced features, like text generation, would likely include such checks and connections.
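Here is a rough sketch of that split, with every function name hypothetical; this is not Apple's actual architecture, only an illustration of the reported behavior.

```python
# Illustrative only: basic analysis runs fully offline with no
# reference lookup, while an online path could screen generated
# text before returning it. All names below are hypothetical.

def looks_like_protected_text(text: str) -> bool:
    """Hypothetical server-side check; a real one might compare
    n-grams against a corpus of protected works (see earlier sketch)."""
    return False  # placeholder: no reference corpus in this demo

def analyze_on_device(text: str) -> dict:
    """Stand-in for offline analysis. With no connection, there is
    no database against which to check for copyrighted content."""
    return {"word_count": len(text.split()), "copyright_checked": False}

def generate_with_server(prompt: str, online: bool) -> str:
    """Stand-in for an advanced feature that, when online, screens
    its draft output before returning it."""
    draft = f"generated response to: {prompt}"  # placeholder output
    if online and looks_like_protected_text(draft):
        return "[output withheld: possible copyrighted content]"
    return draft

print(analyze_on_device("Summarize my notes"))
print(generate_with_server("Write a short poem", online=True))
```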

Reporting and documented projects aside, guardrails and licensing are only as effective as their enforcement. Sources familiar with Apple's AI test environments have told AppleInsider that there were seemingly little to no restrictions preventing someone from using copyrighted material as input in on-device test environments.

Our source wasn't clear about the rules inside Apple meant to prevent copyright-violating training. The output, however, is likely more tightly regulated, to avoid word-for-word reproduction of copyrighted material.

Apple should debut its generative AI technology at WWDC, which starts on June 10.





Comments

  • Reply 1 of 31
Draco Posts: 44 member
    This is what I like about Apple: They have a track record of not releasing products until they are fully baked and actually work the way you expect, while other companies treat their customers like beta testers. 

    I'll believe AI is real when Apple releases an AI. 
  • Reply 2 of 31
    Looks like excuses have already started.
  • Reply 3 of 31
Xed Posts: 2,679 member
    Looks like excuses have already started.
    Ethics are an excuse for what exactly? Being responsible?
  • Reply 4 of 31
22july2013 Posts: 3,620 member
    Apple could brand their AI as "Apple Intelligence."
  • Reply 5 of 31
AppleZulu Posts: 2,074 member
    This just reinforces my thought that Apple will be rolling out an AI implementation that will allow Siri to provide you with a morning news summary sourced from your Apple News+ app, avoiding copyright issues entirely. It could also include information from other sources to which you have subscribed. It will verbally give you the news summary, naming sources, and then offer to drop links to any items of particular interest that you would like to read in full later. 

    Such a summary could be an interactive conversation. You would be able to ask Siri what’s the news about a given subject, Siri would search your Apple News+ app for new information on that subject, summarize it for you, and then offer to provide the sources for you to read later.

    This would be yet another example of Apple entering a product category “late,” but only because they have taken the time to create something of quality, that avoids things like theft of intellectual property, and that is actually useful.
  • Reply 6 of 31
igorsky Posts: 760 member
I’m old enough to remember when Apple was so far behind in the AI race and would never catch up.
  • Reply 7 of 31
Xed Posts: 2,679 member
    I’m old enough to remember when Apple was so far behind on the AI race that they built it into their chips for the iPhone and iPad in a previous decade.
  • Reply 8 of 31
eriamjh Posts: 1,680 member
    I don’t buy the whole concept of “legally and ethically” trained.   

I learned a lot of things from copyrighted books and movies, etc. There's no such thing as protected concepts, thoughts, ideas, or words. Even IP is only protected from being "used" illegally, and learning and understanding it isn't "use".

    I read a book.  I paid for it.  Maybe that’s the issue.  Patents are protected from being used, not being read or understood.  

If it's on the internet and not behind a paywall, it's fair use to learn from.

    If I write a book report, is that infringement?   If I am inspired by a work of art and I paint something, is that infringement?   No.  

    Similar to is not the same as “copied”.  Authors and artists are just upset a computer does it and there’s no carve out in law for that… yet.  
  • Reply 9 of 31
    This site should be called applerumors.com or something similar. 

Here, we get a lot of news containing "may", "might", etc.

    Apple is searching for an excuse here. 
  • Reply 10 of 31
Xed Posts: 2,679 member
    This site should be called applerumors.com or something similar. 

Here, we get a lot of news containing "may", "might", etc.

    Apple is searching for an excuse here. 
    An excuse for what exactly?
  • Reply 11 of 31
    eriamjh said:
    I don’t buy the whole concept of “legally and ethically” trained.   

I learned a lot of things from copyrighted books and movies, etc. There's no such thing as protected concepts, thoughts, ideas, or words. Even IP is only protected from being "used" illegally, and learning and understanding it isn't "use".

    I read a book.  I paid for it.  Maybe that’s the issue.  Patents are protected from being used, not being read or understood.  

If it's on the internet and not behind a paywall, it's fair use to learn from.

    If I write a book report, is that infringement?   If I am inspired by a work of art and I paint something, is that infringement?   No.  

    Similar to is not the same as “copied”.  Authors and artists are just upset a computer does it and there’s no carve out in law for that… yet.  
    "Training" is just a marketing term. Generative AI programs, like any computer program, require a database in order to work. No data = no output. If copyrighted works are part of the database, then the company that owns the program needs to have paid to license those works. Think of it like a generative video game similar to No Man's Sky. Yes, the program can procedurally generate things but it's still dependent on the database of assets that the game designers built for it to use. 
  • Reply 12 of 31
mknelson Posts: 1,128 member
    foregoneconclusion said:
    "Training" is just a marketing term. Generative AI programs, like any computer program, require a database in order to work. No data = no output. If copyrighted works are part of the database, then the company that owns the program needs to have paid to license those works. Think of it like a generative video game similar to No Man's Sky. Yes, the program can procedurally generate things but it's still dependent on the database of assets that the game designers built for it to use. 
    That's not terribly accurate.

    The training data goes through a Neural Net - it doesn't result in anything resembling a database.

  • Reply 13 of 31
Xed Posts: 2,679 member
    eriamjh said:
    I don’t buy the whole concept of “legally and ethically” trained.   

I learned a lot of things from copyrighted books and movies, etc. There's no such thing as protected concepts, thoughts, ideas, or words. Even IP is only protected from being "used" illegally, and learning and understanding it isn't "use".

    I read a book.  I paid for it.  Maybe that’s the issue.  Patents are protected from being used, not being read or understood.  

If it's on the internet and not behind a paywall, it's fair use to learn from.

    If I write a book report, is that infringement?   If I am inspired by a work of art and I paint something, is that infringement?   No.  

    Similar to is not the same as “copied”.  Authors and artists are just upset a computer does it and there’s no carve out in law for that… yet.  
    "Training" is just a marketing term. Generative AI programs, like any computer program, require a database in order to work. No data = no output. If copyrighted works are part of the database, then the company that owns the program needs to have paid to license those works. Think of it like a generative video game similar to No Man's Sky. Yes, the program can procedurally generate things but it's still dependent on the database of assets that the game designers built for it to use. 
They still need training on how to interpret the data properly. It's like saying that people don't need to be trained how to read because the dictionary has all the words in it.
  • Reply 14 of 31
    eriamjh said:
    I don’t buy the whole concept of “legally and ethically” trained.   

I learned a lot of things from copyrighted books and movies, etc. There's no such thing as protected concepts, thoughts, ideas, or words. Even IP is only protected from being "used" illegally, and learning and understanding it isn't "use".

    I read a book.  I paid for it.  Maybe that’s the issue.  Patents are protected from being used, not being read or understood.  

If it's on the internet and not behind a paywall, it's fair use to learn from.

    If I write a book report, is that infringement?   If I am inspired by a work of art and I paint something, is that infringement?   No.  

    Similar to is not the same as “copied”.  Authors and artists are just upset a computer does it and there’s no carve out in law for that… yet.  
    "Training" is just a marketing term. Generative AI programs, like any computer program, require a database in order to work. No data = no output. If copyrighted works are part of the database, then the company that owns the program needs to have paid to license those works. Think of it like a generative video game similar to No Man's Sky. Yes, the program can procedurally generate things but it's still dependent on the database of assets that the game designers built for it to use. 
Training isn't a marketing term; it's a computer science term. It's gradient descent (or some other kind of) regression that learns the parameters of a neural network. It's similar to how human neurons change synaptic strength during the process of learning. The model *doesn't* have a stored copy of the training data, though it may be able to recall fragments of it, similar to a human who has read the same articles.

    Secondly, procedural generation as in No Man's Sky, while also under the umbrella of the term AI, is very different from data-driven methods like neural networks (which are what the media calls AI these days). Procedural generation can also be purely mathematical and doesn't have to have hand-designed assets. But of course it has hand-designed math at the very least.
  • Reply 15 of 31
danox Posts: 3,069 member
The strange thing is that tech companies, or anyone, can feed in all human knowledge from before 1920-1929, all of which is already in the public domain. Yet some tech companies still want to steal the last one hundred and four years. :smile:

https://en.wikipedia.org/wiki/Public_domain No mercy should be shown; pay up. That's basically access to 99.99% of all human knowledge.
  • Reply 16 of 31
That is exactly why Siri has always been behind, and many other online services too.

Apple does things right. Do you think Microsoft and Google asked before training their AI on all the info they have?

YouTube, Google News, Gmail, Google Photos, etc.
  • Reply 17 of 31
bala1234 Posts: 147 member
Frankly, I am unconvinced by the current-day AI applications. Maybe Apple will do its usual thing and give it a new purpose. Many of the examples quoted above are good ideas. And hopefully the applications Apple chooses don't need the general-purpose/copyrighted data. Siri is not a great example, but hopefully Apple learned from it.
  • Reply 18 of 31
9secondkox2 Posts: 2,862 member
I’m shocked the current AI jive has been allowed to go on. It’s been such a rip-off of actual creators’ hard work. So far, “training” = stealing.


  • Reply 19 of 31
radarthekat Posts: 3,867 moderator
    eriamjh said:
    I don’t buy the whole concept of “legally and ethically” trained.   

I learned a lot of things from copyrighted books and movies, etc. There's no such thing as protected concepts, thoughts, ideas, or words. Even IP is only protected from being "used" illegally, and learning and understanding it isn't "use".

    I read a book.  I paid for it.  Maybe that’s the issue.  Patents are protected from being used, not being read or understood.  

If it's on the internet and not behind a paywall, it's fair use to learn from.

    If I write a book report, is that infringement?   If I am inspired by a work of art and I paint something, is that infringement?   No.  

    Similar to is not the same as “copied”.  Authors and artists are just upset a computer does it and there’s no carve out in law for that… yet.  
I think there’s a difference between generative AI and what your brain does. Generative AI merely draws from a large data model by predicting the next word of each sentence it builds based on the model. It’s not going off and dreaming on its own like your brain allows you to do. What you produce might be influenced by what you read, but it's not generated by literally drawing each and every word from those works. You’re adding your unique perspective and your unique experiences from your life and memories, which are not drawn from any data store other than the one your brain generated over your entire unique lifetime.
  • Reply 20 of 31
Legal & ethical? It's so alarming. A "legal" LLM from the CCP washes 1.4 billion brains; a "balanced" LGBTQ ethics would reckon a trans woman a woman; and whose god is "they" (because you cannot use he or she anymore)? "What is a man and what is a woman" becomes an Apple-controlled answer. The best and only way is to give iPhone users the choice of what to install.

Besides, "upgrade" means replacement, and LLM is being shifted from "Large Language Model" to "Language Learning Model" in Apple's terminology. Can we trust Apple?