
MacRumors

macrumors bot
Original poster
Apr 12, 2001


Apple on Thursday addressed concerns about its use of AI training data, following an investigation that revealed Apple, along with other major tech companies, had used YouTube subtitles to train their artificial intelligence models.


The investigation by Wired earlier this week reported that over 170,000 videos from popular content creators were part of a dataset used to train AI models. Apple specifically used this dataset in the development of its open-source OpenELM models, which were made public in April.

However, Apple has now confirmed to 9to5Mac that OpenELM does not power any of its AI or machine learning features, including the company's Apple Intelligence system. Apple clarified that OpenELM was created solely for research purposes, with the aim of advancing open-source large language model development.

Upon releasing OpenELM on the Hugging Face Hub, a community for sharing AI code, Apple researchers described it as a "state-of-the-art open language model" designed to "empower and enrich the open research community." The model is also available through Apple's Machine Learning Research website. Apple has stated that it has no plans to develop new versions of the OpenELM model.

The company emphasized that since OpenELM is not integrated into Apple Intelligence, the "YouTube Subtitles" dataset is not being used to power any of its commercial AI features. Apple reiterated its previous statement that Apple Intelligence models are trained on "licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler."

The Wired report detailed how companies including Apple, Anthropic, and NVIDIA had used the "YouTube Subtitles" dataset for AI model training. This dataset is part of a larger collection known as "The Pile," which is compiled by the non-profit organization EleutherAI.

Article Link: Apple Intelligence Not Trained on YouTube Content, Says Apple
 
How do we truly know what they have been trained on?

Like a person, it could have been exposed to anything out in the wild, and we don’t walk around with a list of references. But we treat this software differently from people… you wouldn’t let just anyone off the street onto your iPhone or laptop… something similar goes for AI.
 
Thank god for that. Training on YouTube videos from popular content creators would render Apple Intelligence pretty unintelligent.
Most of the subtitles come from “educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI.”

It also includes subtitles from some of the most popular channels on YouTube: MrBeast, PewDiePie, etc.

Having some “normal” people’s talk in the mix is a good source to learn from. The goal is more human-like output, and that can’t be achieved by ignoring how people actually talk and write, even if not everything they say or write is correct. Too much garbage in gets you garbage out, but these large models include only a small portion of “garbage”: the scientists and engineers developing them don’t want them to be worthless or to just output garbage, so they aren’t going to include too much of it.
 
Like a person, it could have been exposed to anything out in the wild, and we don’t walk around with a list of references. But we treat this software differently from people… you wouldn’t let just anyone off the street onto your iPhone or laptop… something similar goes for AI.
I think you're humanizing the AI too much. It's not a person searching knowledge "in the wild". It is a large file that has been created by a training algorithm which is given a lot of crawled data as the input. It doesn't learn anything outside of what its creators are passing along. And crucially, once training is complete, it's no longer acquiring knowledge. (Every interaction you have with it starts with a blank slate or explicit "context" given from your previous sessions/personal data.)

So the model's creators know absolutely what has been used to train it. They're generally just cagey about it, because they don't want to be sued once they admit whose copyrighted content they've used.
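A crude way to picture the point above (a purely hypothetical sketch, not any real model's API — all names here are made up): a trained model is a fixed set of parameters, and every response is a function only of those frozen parameters plus whatever context is passed into that single call. Nothing persists between calls.

```python
# Toy illustration: once training ends, the "model" is just frozen
# parameters, and inference is stateless. Names are hypothetical.

FROZEN_WEIGHTS = {"token_bias": 0.5}  # fixed at the end of training

def generate(prompt: str, context: str = "") -> str:
    """Stateless inference: the output depends only on the frozen
    weights and the text supplied in this one call."""
    seen = context + prompt
    # Stand-in for real decoding: report how much context was given.
    return f"response based on {len(seen)} chars of context"

# Two identical calls give identical answers -- no learning happened
# in between, because nothing was written back to FROZEN_WEIGHTS.
a = generate("What did I just tell you?")
b = generate("What did I just tell you?")
assert a == b
```

The only way such a system "remembers" a conversation is if the application re-sends the earlier messages as `context` on every call, which is exactly the blank-slate behavior described above.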
 
Maybe that's for the best... imagine if Apple Intelligence had been trained on, say, the subtitles of Mr Beast videos. :p

"Today, we're going to compare large language models... from LLMs that cost 1 dollar to train, to LLMs that cost ten BILLION dollars to train!"
 
Not everything available to a web crawler can be used for AI training purposes. I doubt Apple reads the license agreements on any of the websites it crawls.
 
For a moment, I considered what an AI trained on YouTube’s comment section would be like.

I am censoring myself of every colorful description of it I can think up.
 
I think you're humanizing the AI too much. It's not a person searching knowledge "in the wild". It is a large file that has been created by a training algorithm which is given a lot of crawled data as the input. It doesn't learn anything outside of what its creators are passing along. And crucially, once training is complete, it's no longer acquiring knowledge. (Every interaction you have with it starts with a blank slate or explicit "context" given from your previous sessions/personal data.)

So the model's creators know absolutely what has been used to train it. They're generally just cagey about it, because they don't want to be sued once they admit whose copyrighted content they've used.
That’s precisely the human aspect I’m referring to… AI is made by humans… it’s not the AI we should be concerned about, but the human oversight across its creation, training, deployment, and restrictions.

When we use a regular piece of software… it’s easy to forget that a human, in most cases, created it directly. It’s just performing operations that have been created and defined by its creator.
 
Such a lie.

It is well known that ALL tech companies are training their AI on data from the internet without permission. If not, how would they even obtain such humongous amounts of data? It's too obvious. That's how AI works, since it needs tons of data in order to learn and improve.

Since AI-generated art and images have already been restricted due to copyright, it's just a matter of time before governments restrict AI-based technology as well. They can't hide and ignore it forever.
 
How do they have access to train even their open-source model on YouTube subtitles? Aren't those the copyright of the people who made the videos? And even if those people signed away some rights in the YouTube T&Cs, wouldn't that mean Google controls them? I assume Apple wouldn't be paying Google to access that data just to use it in their open-source models.

I've always thought Apple should set up its own video-sharing platform. Not to act as a serious competitor to YouTube, but just to give people the option of hosting videos somewhere that provides a better UX, especially on iPad.
 
Such a lie.

It is well known that ALL tech companies are training their AI on data from the internet without permission. If not, how would they even obtain such humongous amounts of data? It's too obvious. That's how AI works, since it needs tons of data in order to learn and improve.

Since AI-generated art and images have already been restricted due to copyright, it's just a matter of time before governments restrict AI-based technology as well. They can't hide and ignore it forever.

I guess if you educated yourself, you’d know that Apple Intelligence is mainly centered around you and your devices, not “worldly” data, as they’ve stated, which is why they’ve decided to add third-party LLM services as optional extras.

The web is considered ”public,” and reading from it, i.e. gaining knowledge, is the entire point of it. When you read articles from this site and learn something, did you pay MacRumors for the knowledge? Do you keep that knowledge to yourself, or do you tell other people about it? Do you give them the link so MacRumors can get the ad revenue?

Apple, like EVERY search engine on the planet, uses web crawlers to “read” and index websites and the data on them… if a site owner does not want their data mined, there’s a standard method (robots.txt) that every web admin should know about for restricting crawlers.
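For anyone unfamiliar, robots.txt is just a plain-text file at the site root. A minimal example (the directory path is illustrative; Apple's documented opt-out agent for AI training is, to my knowledge, Applebot-Extended, distinct from the regular Applebot search crawler):

```text
# robots.txt -- served at e.g. https://example.com/robots.txt

# Opt out of Apple using this site's content for AI training,
# while still allowing normal search indexing by Applebot:
User-agent: Applebot-Extended
Disallow: /

# Block all well-behaved crawlers from one directory:
User-agent: *
Disallow: /private/
```

Note that robots.txt is honored voluntarily by well-behaved crawlers; it is a convention, not an enforcement mechanism.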
 