Supplier used controversial sources for training Apple Intelligence

Posted:
in iOS

Apple has made a big deal out of paying for the data used to train Apple Intelligence, but one firm it relied on is accused of scraping YouTube videos without permission.

Apple Intelligence may have been trained less legally and ethically than Apple believed



All generative AI works by training Large Language Models (LLMs) on enormous datasets, and very often, the source of that data is controversial. So much so that Apple has repeatedly claimed that its sources are ethical: it's known to have paid millions to publishers and to have licensed images from photo library firms.

According to Wired, however, one firm whose data Apple has used appears to have been less scrupulous about its sources. EleutherAI reportedly created a dataset it calls the Pile, which Apple has acknowledged using for its LLM training.

Part of the Pile, though, is called YouTube Subtitles, which consists of subtitles downloaded from YouTube videos without permission. That's apparently also a breach of YouTube's terms and conditions, though it may be more of a gray area than it should be.

Alongside Apple, firms that have used the Pile include Anthropic, whose spokesperson claimed that there is a difference between using YouTube subtitles and using the videos themselves.

"The Pile includes a very small subset of YouTube subtitles," said Jennifer Martinez. "YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset."

"On the point about potential violations of YouTube's terms of service," she continued, "we'd have to refer you to the Pile authors."

Salesforce also confirmed that it had used the Pile in its building of an AI model for "academic and research purposes." Salesforce's vice president of AI research stressed that the Pile's dataset is "publicly available."

Reportedly, developers at Salesforce also found that the Pile dataset includes profanity, plus "biases against gender and certain religious groups."

Salesforce and Anthropic are so far the only firms that have commented on their use of the Pile. Apple, Nvidia, Bloomberg, and Databricks are known to have used it, but none has responded.

Apple Intelligence is Apple's version of AI



The organization Proof News claims to have found that subtitles from 173,536 YouTube videos from over 48,000 channels were used in the Pile. The videos used include seven by Marques Brownlee (MKBHD) and 337 from PewDiePie.

Proof News has produced an online tool to help YouTubers see whether their work has been used.

However, it's not only YouTube subtitles that have been gathered without permission. It's claimed that Wikipedia has been used, as has documentation from the European Parliament.

Academics and even mathematicians have previously used thousands of Enron staff emails for statistical analysis. Now, it's claimed that the Pile used the text of those emails for its training.

It's previously been argued that Apple's generative AI might be the only one trained legally and ethically. But despite Apple's intentions, Apple Intelligence has seemingly been trained on YouTube subtitles it had no right to use.




Comments

  • Reply 1 of 3
    DAalseth Posts: 2,915 member
    In March Apple added a feature to Podcasts: automatic transcriptions. The podcaster doesn't even have to request it, it's just done automatically. Of course by doing this Apple was training its AI. Plus the transcripts are open for anyone to copy and paste out, so they can be ripped off by anyone else as well. 

    Don’t talk to me about how Apple’s AI systems are ‘legal and ethical’. 
    edited July 16
  • Reply 2 of 3
    tht Posts: 5,581 member
    As long as there aren't any changes to copyright laws that make it illegal for companies to ingest content creators' data for the purpose of LLM training, this will just happen over and over. 

    It's basically like how it is illegal for anyone to copy a video and redistribute it. I imagine all they have to do is include words like "transform it" in the copyright statement. 

    The AI companies should want it. In the future, it is going to be an Ouroboros of shit, where LLMs are just feeding their output to other LLMs, and it is going to spiral into shit. 

    No confidence that any government can move at speed to make laws for this or even do the right thing with this. 
  • Reply 3 of 3
    carisma Posts: 7 member
    All we do when we go to school is digest material someone else produced, who probably 'ripped off' information from someone else, and so on.
    Art school teaches people everything about existing art so that they can come up, hopefully, with something 'original'.
    Isn't education, be it for humans or AI, absorbing existing information to learn and, in some cases, produce original ideas?
    Why do we discriminate against AI entities for doing the same? /s