Not everything available to a web crawler can be used for AI training purposes. I doubt Apple reads the license agreements on any of the websites it crawls.
"I guess if you educated yourself, you'd know that Apple Intelligence is mainly centered around you and your devices, not "worldly" data, as they've stated, and that's the reason why they've decided to add 3rd party LLM services as optional services."

It's very clear that you don't know how AI works. Without training, AI won't work, even if it's only for your devices. How do they even get mass amounts of data to train with?
The web is considered "public", and reading from it, i.e. gaining knowledge, is the entire point of it. When you read articles from this site and learn something, did you pay MacRumors for the knowledge? Do you keep that knowledge to yourself, or do you tell other people about it? Do you give them the link so MacRumors can get the ad revenue?
Apple, as with EVERY search engine on the planet, uses web crawlers to "read" and index websites and the data on them… if a site owner does not want their data mined, there's a standard method (robots.txt) every web admin should know about to restrict crawlers.
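For anyone who hasn't dealt with it, robots.txt is just a plain-text file served from the root of the site. As a rough sketch (the crawler names below are the user-agent tokens Apple and OpenAI document for AI-training opt-outs, Applebot-Extended and GPTBot; check the vendors' current documentation before relying on them), a site that wants to stay in search results but out of AI training datasets might publish something like:

# https://example.com/robots.txt
# Opt out of Apple Intelligence training, while regular Applebot search indexing continues
User-agent: Applebot-Extended
Disallow: /

# Opt out of OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Everyone else may crawl normally (an empty Disallow blocks nothing)
User-agent: *
Disallow:

Worth remembering that robots.txt is a request, not an enforcement mechanism - each crawler chooses whether to honor it.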
"The Wired report detailed how companies including Apple, Anthropic, and NVIDIA had used the "YouTube Subtitles" dataset for AI model training. This dataset is part of a larger collection known as "The Pile," which is compiled by the non-profit organization EleutherAI."How do they have access to train even their open source model on youTube subtitles? Aren't those copyright of the people who made the video? And even if those people signed away some rights in the YouTube T&Cs, wouldn't that mean Google control them? I assume Apple wouldn't be paying Google to access that data just to use it in their open source models.
I've always thought Apple should set up their own video sharing platform. Not to act as a serious competitor to YouTube, but just to give people the option of hosting videos somewhere that provides a better UX, especially on iPad.
"It's very clear that you don't know how AI works. Without training, AI won't work, even if it's only for your devices. How do they even get mass amounts of data to train with?"

It's very clear that you don't even know how AI works. "AI" is used incorrectly to refer to Machine Learning (Apple is the only company that still talks about Machine Learning); there are also other "AI" technologies. Have you ever heard of deep learning, diffusion models, supervised learning, weak AI, etc.? (I'm mixing types with technologies.)
"Aren't YouTube subtitles AI generated? I have to turn CC on because of the gawd awful muzak🤮 these content creators keep piping in the background.😒 It feels like I have aphasia trying to understand the YouTube captioning.😑"

The bigger creators generally generate their own subtitles (or have a company do it for them) and don't rely on the auto-generated ones, since those are usually garbage.
"Thank god for that. Training on YouTube videos from popular content creators would render Apple Intelligence pretty unintelligent."

Fully agree.
"The Wired report detailed how companies including Apple, Anthropic, and NVIDIA had used the "YouTube Subtitles" dataset for AI model training. This dataset is part of a larger collection known as "The Pile," which is compiled by the non-profit organization EleutherAI."

That doesn't really answer the question, though - how does EleutherAI get the rights to redistribute those works? Presumably either the creators or Google own those rights, and I can't imagine either of them is happy to give it all away for free.
Apple and others don't access the data directly; they just get the data from a "processor/compiler".
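In practice, "getting it from a compiler" usually just means downloading a pre-packaged corpus rather than crawling YouTube or anything else yourself. A rough sketch of what that looks like with the Hugging Face datasets library - the dataset id "some-org/pile-style-corpus" and the "text" field are illustrative placeholders, not a real published dataset:

from datasets import load_dataset

# Stream a pre-compiled corpus instead of scraping the original sources.
# "some-org/pile-style-corpus" is a made-up id used only for illustration.
corpus = load_dataset("some-org/pile-style-corpus", split="train", streaming=True)

for i, record in enumerate(corpus):
    # Pile-style records typically carry raw text plus provenance metadata.
    print(record["text"][:200])
    if i == 2:
        break

By the time a model team touches the data this way, it is several steps removed from whoever actually wrote the subtitles, which is exactly why the rights question above gets murky.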
"It's very clear that you don't even know how AI works. "AI" is used incorrectly to refer to Machine Learning (Apple is the only company that still talks about Machine Learning); there are also other "AI" technologies. Have you ever heard of deep learning, diffusion models, supervised learning, weak AI, etc.? (I'm mixing types with technologies.)"

What you mentioned makes it very clear how AI works: without data, no AI. You just proved my point after all.
So, if you don't know that, please do not state that someone else doesn't know. Just learn that AI is more than a single thing.
I think you're humanizing the AI too much. It's not a person searching knowledge "in the wild". It is a large file that has been created by a training algorithm which is given a lot of crawled data as the input. It doesn't learn anything outside of what its creators are passing along. And crucially, once training is complete, it's no longer acquiring knowledge. (Every interaction you have with it starts with a blank slate or explicit "context" given from your previous sessions/personal data.)
So the model's creators know absolutely what has been used to train it. They're generally just cagey about it, because they don't want to be sued once they admit whose copyrighted content they've used.
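To make the "static artifact" point concrete, here's a deliberately silly toy sketch in Python (a word-count dict standing in for billions of learned weights - purely illustrative, nothing like a real LLM): training bakes the crawled corpus into a frozen object, and every later "answer" is computed only from that frozen object plus whatever text you pass in at the time.

def train(crawled_corpus):
    # Stand-in for the real training run: boil the corpus down to a frozen artifact.
    model = {}
    for doc in crawled_corpus:
        for word in doc.split():
            model[word] = model.get(word, 0) + 1
    return model  # after this returns, the "model" never changes

def answer(model, prompt, context=""):
    # Inference reads only the frozen model and the text supplied right now.
    words = (context + " " + prompt).split()
    return max(words, key=lambda w: model.get(w, 0))

frozen = train(["the pile contains youtube subtitles", "subtitles are just text"])
print(answer(frozen, "what do subtitles contain?"))  # picks the word it "saw" most often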
Yep. The only way they don't know where they got their training data from is if they told their engineers, "Go ahead and trawl whatever you think is useful, but don't keep logs, because we don't want to get sued. We'll pretend the data fell off a truck!"