
MacRumors

macrumors bot
Original poster
Apr 12, 2001


Apple on Thursday addressed concerns about its use of AI training data, following an investigation that revealed Apple, along with other major tech companies, had used YouTube subtitles to train their artificial intelligence models.


The investigation by Wired earlier this week reported that over 170,000 videos from popular content creators were part of a dataset used to train AI models. Apple specifically used this dataset in the development of its open-source OpenELM models, which were made public in April.

However, Apple has now confirmed to 9to5Mac that OpenELM does not power any of its AI or machine learning features, including the company's Apple Intelligence system. Apple clarified that OpenELM was created solely for research purposes, with the aim of advancing open-source large language model development.

Upon releasing OpenELM on the Hugging Face Hub, a community for sharing AI code, Apple researchers described it as a "state-of-the-art open language model" designed to "empower and enrich the open research community." The model is also available through Apple's Machine Learning Research website. Apple has stated that it has no plans to develop new versions of the OpenELM model.

The company emphasized that since OpenELM is not integrated into Apple Intelligence, the "YouTube Subtitles" dataset is not being used to power any of its commercial AI features. Apple reiterated its previous statement that Apple Intelligence models are trained on "licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler."

The Wired report detailed how companies including Apple, Anthropic, and NVIDIA had used the "YouTube Subtitles" dataset for AI model training. This dataset is part of a larger collection known as "The Pile," which is compiled by the non-profit organization EleutherAI.

Article Link: Apple Intelligence Not Trained on YouTube Content, Says Apple
 
How do we truly know what they have been trained on?

Like a person, it could have been exposed to anything out in the wild, and we don’t walk around with a list of references. But we treat this software differently from people… you wouldn’t let just anyone off the street onto your iPhone or laptop… something similar goes for AI.
 
Thank god for that. Training on YouTube videos from popular content creators would render Apple Intelligence pretty unintelligent.
Most of the subtitles come from “educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI.”

It also includes subtitles from some of the most popular channels on YouTube: MrBeast, PewDiePie, etc.

Having some “normal” people’s talk in the mix is a good source to learn from. The goal is more human-like output, and that can’t be achieved by ignoring how people actually talk and write, even if not everything they say or write is correct. Too much garbage in gets you garbage out, but these large models include only a small portion of “garbage”: the scientists and engineers developing them don’t want them to be worthless or to just output garbage, so they aren’t going to include too much of it.
 
Like a person, it could have been exposed to anything out in the wild, and we don’t walk around with a list of references. But we treat this software differently from people… you wouldn’t let just anyone off the street onto your iPhone or laptop… something similar goes for AI.
I think you're humanizing the AI too much. It's not a person searching knowledge "in the wild". It is a large file that has been created by a training algorithm which is given a lot of crawled data as the input. It doesn't learn anything outside of what its creators are passing along. And crucially, once training is complete, it's no longer acquiring knowledge. (Every interaction you have with it starts with a blank slate or explicit "context" given from your previous sessions/personal data.)

So the model's creators know absolutely what has been used to train it. They're generally just cagey about it, because they don't want to be sued once they admit whose copyrighted content they've used.
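A crude way to picture the point above (a purely hypothetical sketch, not any real model's API — all names here are made up): a trained model is a fixed set of parameters, and every response is a function only of those frozen parameters plus whatever context is passed into that single call. Nothing persists between calls.

```python
# Toy illustration: once training ends, the "model" is just frozen
# parameters, and inference is stateless. Names are hypothetical.

FROZEN_WEIGHTS = {"token_bias": 0.5}  # fixed at the end of training

def generate(prompt: str, context: str = "") -> str:
    """Stateless inference: the output depends only on the frozen
    weights and the text supplied in this one call."""
    seen = context + prompt
    # Stand-in for real decoding: report how much context was given.
    return f"response based on {len(seen)} chars of context"

# Two identical calls give identical answers -- no learning happened
# in between, because nothing was written back to FROZEN_WEIGHTS.
a = generate("What did I just tell you?")
b = generate("What did I just tell you?")
assert a == b
```

The only way such a system "remembers" a conversation is if the application re-sends the earlier messages as `context` on every call, which is exactly the blank-slate behavior described above.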
 
Maybe that's for the best... imagine if Apple Intelligence had been trained on, say, the subtitles of Mr Beast videos. :p

"Today, we're going to compare large language models... from LLMs that cost 1 dollar to train, to LLMs that cost ten BILLION dollars to train!"
 
Not everything available to a web crawler can be used for AI training purposes. I doubt Apple reads the license agreements on any of the websites it crawls.
 
For a moment, I considered what an AI trained on YouTube’s comment section would be like.

I am censoring myself of every colorful description of it I can think up.
 
I think you're humanizing the AI too much. It's not a person searching knowledge "in the wild". It is a large file that has been created by a training algorithm which is given a lot of crawled data as the input. It doesn't learn anything outside of what its creators are passing along. And crucially, once training is complete, it's no longer acquiring knowledge. (Every interaction you have with it starts with a blank slate or explicit "context" given from your previous sessions/personal data.)

So the model's creators know absolutely what has been used to train it. They're generally just cagey about it, because they don't want to be sued once they admit whose copyrighted content they've used.
That’s precisely the human aspect I’m referring to… AI is made by humans… it’s not the AI we should be concerned about, but the human oversight across its creation, training, deployment, and restrictions.

When we use a regular piece of software… it’s easy to forget that a human, in most cases, created it directly. It’s just performing operations that have been created and defined by its creator.
 
Such a lie.

It is well known that ALL tech companies are training their AI on data from the internet without permission. If not, how would they even obtain such humongous amounts of data? It's too obvious. That's how AI works, since it needs tons of data in order to learn and improve.

Since AI-generated art and images have already been restricted due to copyright, it's just a matter of time before governments restrict AI-based technology as well. They can't hide and ignore it forever.
 
How do they have access to train even their open-source model on YouTube subtitles? Aren't those the copyright of the people who made the videos? And even if those people signed away some rights in the YouTube T&Cs, wouldn't that mean Google controls them? I assume Apple wouldn't be paying Google to access that data just to use it in their open-source models.

I've always thought Apple should set up its own video-sharing platform. Not to act as a serious competitor to YouTube, but just to give people the option of hosting videos somewhere that provides a better UX, especially on iPad.
 
Such a lie.

It is well known that ALL tech companies are training their AI on data from the internet without permission. If not, how would they even obtain such humongous amounts of data? It's too obvious. That's how AI works, since it needs tons of data in order to learn and improve.

Since AI-generated art and images have already been restricted due to copyright, it's just a matter of time before governments restrict AI-based technology as well. They can't hide and ignore it forever.

I guess if you educated yourself, you’d know that Apple Intelligence is mainly centered around you and your devices, not “worldly” data, as they’ve stated, which is why they’ve decided to add third-party LLM services as optional extras.

The web is considered ”public,” and reading from it, i.e. gaining knowledge, is the entire point of it. When you read articles from this site and learn something, did you pay MacRumors for the knowledge? Do you keep that knowledge to yourself, or do you tell other people about it? Do you give them the link so MacRumors can get the ad revenue?

Apple, like EVERY search engine on the planet, uses web crawlers to “read” and index websites and the data on them… if a site owner does not want their data mined, there’s a standard method (robots.txt) that every web admin should know about for restricting crawlers.
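For anyone unfamiliar, robots.txt is just a plain-text file at the site root. A minimal example (the directory path is illustrative; Apple's documented opt-out agent for AI training is, to my knowledge, Applebot-Extended, distinct from the regular Applebot search crawler):

```text
# robots.txt -- served at e.g. https://example.com/robots.txt

# Opt out of Apple using this site's content for AI training,
# while still allowing normal search indexing by Applebot:
User-agent: Applebot-Extended
Disallow: /

# Block all well-behaved crawlers from one directory:
User-agent: *
Disallow: /private/
```

Note that robots.txt is honored voluntarily by well-behaved crawlers; it is a convention, not an enforcement mechanism.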
 