MR Forums used in Google dataset for training AI’s

KaliYoni · Apr 19, 2023

According to this Washington Post analysis, the forums on MacRumors are part of a Google-created dataset that is used to train AI products:

“…we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA.”
(forums.macrumors.com is listed under the sources for Technology, as the #4 site)

Inside the secret list of websites that make AI like ChatGPT sound smart

An analysis of a chatbot data set by The Washington Post reveals the proprietary, personal, and often offensive websites that go into an AI’s training data.

www.washingtonpost.com

So, I’d say anybody here who is highly concerned about privacy or does not want their future posts used to train AI’s should review how they use MacRumors’ forums.

Nermal · Apr 19, 2023

I see that Reddit wants to be paid for "excessive" use of its content (via API).

My personal opinion is that there should be pushback against this sort of thing; for example my own posts were made with the expectation that they would be readable by the public, but not with the expectation that they would be copied wholesale into another system.

Analog Kid · Apr 19, 2023

If a Google AI wasn’t already biased against Apple, it will be now…

Apple_Robert · Apr 19, 2023

Nermal said:
I see that Reddit wants to be paid for "excessive" use of its content (via API).

My personal opinion is that there should be pushback against this sort of thing; for example my own posts were made with the expectation that they would be readable by the public, but not with the expectation that they would be copied wholesale into another system.

Seeing how we are on a public forum on the internet, there can be no expectation of privacy.

JayAgostino · Apr 20, 2023

Analog Kid said:
If a Google AI wasn’t already biased against Apple, it will be now…

Also, expect it to act extremely cranky… 😉

Scepticalscribe · Apr 20, 2023

Apple_Robert said:
Seeing how we are on a public forum on the internet, there can be no expectation of privacy.

Perhaps.

Although I would dispute the "no expectation" - as this depends on circumstances - and I would imagine that this may well change in the future.

And I would argue that there can be an expectation that while one's own posts can be readable on this forum - which is what we did indeed sign up for - there is no expectation (let alone permission) - as @Nermal has already pointed out - that they would - or could - be copied wholesale into another system.

Nermal · Apr 20, 2023

Scepticalscribe said:
And I would argue that there can be an expectation that while one's own posts can be readable on this forum - which is what we did indeed sign up for - there is no expectation (let alone permission) - as @Nermal has already pointed out - that they would - or could - be copied wholesale into another system.

Indeed, and my comment was more from a copyright perspective than a privacy one.

KaliYoni · Apr 20, 2023

Now that I've thought about this more, my main concerns are:

A third party (parties?) has archived our posts and is using our thoughts and words to generate derivative content without any ability to opt-in or opt-out.
Did MacRumors know about this? Did Google ask permission?
How else are our posts being used by organizations not connected to MacRumors without our knowledge? Not all AIs are used to simulate human writing like Google Bard or OpenAI's ChatGPT. For example, what if a hacker group is training an identity theft AI? Or a state-sponsored intelligence agency is building an AI whose purpose is to find and track citizens who are living abroad?

I've always assumed that anything we post here would be crawled by search engines and made available in searches. But for me, having the entire MacRumors forums corpus hoovered up, archived, and repeatedly used to train black-box AI processes goes way beyond that.

Scepticalscribe · Apr 20, 2023

KaliYoni said:
Now that I've thought about this more, my main concerns are:

A third party (parties?) has archived our posts and is using our thoughts and words to generate derivative content without any ability to opt-in or opt-out.

Did MacRumors know about this? Did Google ask permission?

How else are our posts being used by organizations not connected to MacRumors without our knowledge? Not all AIs are used to simulate human writing like Google Bard or OpenAI's ChatGPT. For example, what if a hacker group is training an identity theft AI? Or a state-sponsored intelligence agency is building an AI whose purpose is to find and track citizens who are living abroad?

I've always assumed that anything we post here would be crawled by search engines and made available in searches. But for me, having the entire MacRumors forums corpus hoovered up, archived, and repeatedly used to train black-box AI processes goes way beyond that.

Thank you for starting this thread and for raising this topic.

This subject is something I hadn't at all been aware of, let alone given any thought to - and, following your thread, I realised that the Guardian are also covering this story.

Your questions are very good ones, timely and apt and necessary.

As with so much else in the tech revolution, the extraordinary advances mean that our attempts to deal with their consequences and effects are always lagging several steps behind.

But that is no reason not to try to address this.

KaliYoni · Apr 20, 2023

Scepticalscribe said:
Thank you for starting this thread and for raising this topic.

This subject is something I hadn't at all been aware of, let alone given any thought to - and, following your thread, I realised that the Guardian are also covering this story.

Your questions are very good ones, timely and apt and necessary.

As with so much else in the tech revolution, the extraordinary advances mean that our attempts to deal with their consequences and effects are always lagging several steps behind.

But that is no reason not to try to address this.

I’m glad others here, particularly somebody who takes the time to write clearly and thoughtfully, are thinking about this issue too. In many ways, what’s happening with AI technology right now seems like the period when social media companies first were able to integrate all the necessary components that enabled them to become financial, social, and political juggernauts. And of course, as has been the case with humans throughout history, it looks like a lot of the lessons learned earlier are being ignored now.

Is this the Guardian article you saw?

Fresh concerns raised over sources of training material for AI systems

Investigations reveal limited efforts to ‘clean’ datasets of fascist, pirated and malicious material

www.theguardian.com

unrigestered · Apr 20, 2023

i'm glad to help make Google AI a useless mess! 👍

Pakaku · Apr 21, 2023

If you thought your posts were safe on a public forum before the AI takeover, well, sorry for the harsh awakening

unrigestered · Apr 21, 2023

especially on a website that has like... 50 trackers or something! 😂

laptech · Apr 21, 2023

There is a misconception that the internet is a 'public place'. It is not, regardless of what anyone or any expert says. The internet is made up of a lot of private companies, private businesses and private individuals who 'allow' others to see their work for free. What they provide still belongs to them and thus permission must be granted if others want to use it. You cannot just come in, hoover up what you want and then claim 'it's public therefore I can do as I wish'.

These private companies, private businesses and private individuals make their stuff freely available to the public, but that does not mean the internet is public domain. Google is wrong in doing what it is doing, and thus website owners should complain to Google that it does not have the right to take stuff from their websites.

maflynn · Apr 21, 2023

KaliYoni said:
So, I’d say anybody here who is highly concerned about privacy or does not want their future posts used to train AI’s should review how they use MacRumors’ forums.

How is this any different than what Google does now (track and monetize)?

laptech said:
There is a misconception that the internet is a 'public place'. It is not, regardless of what anyone or any expert says. The internet is made up of a lot of private companies, private businesses and private individuals who 'allow' others to see their work for free.

So what you're saying is that the information from private companies is made public and free, i.e., the internet is a public place.

laptech said:
Google is wrong in doing what it is doing

So then, Google and every other search engine should shut down, and sites like the Internet Archive should also be shut down (https://archive.org/web/).

adrianlondon · Apr 21, 2023

I wondered why ChatGPT kept telling me to drink an Ethiopian espresso each time I asked it a question.

laptech · Apr 21, 2023

Because MR is heavily biased towards everything Apple, does this mean that every time someone asks the AI an Apple-related question, it is always going to give a response favorable to Apple?

It would be such a good way for Apple to get the AI loaded in its favor.

timber · Apr 21, 2023

That explains a lot about Google's AI intelligence.

I take full responsibility for my part 😀

laptech · Apr 21, 2023

I wonder if this will turn out to be another Cambridge Analytica type scandal, where freely available information is harvested for the financial gain of others but without the consent of those who provided it. Just because information is freely available does not make it free for others to do with as they wish.

Abazigal · Apr 21, 2023

Is it possible for a website to opt out of being scanned in this manner? But then again, it’s difficult to request to not be a part of something you were never aware of in the first place.

maflynn · Apr 21, 2023

Abazigal said:
Is it possible for a website to opt out of being scanned in this manner? But then again, it’s difficult to request to not be a part of something you were never aware of in the first place.

When you sign up to use Google, you sign away the rights to your data, and as such they are free to use it as they wish. I don't know if that's the same when you use Google's AdSense and interact with Google Search, but I suspect that Google has a lot of really good lawyers to ensure that if you do business with them, they gain access to your website.

Abazigal · Apr 21, 2023

maflynn said:
When you sign up to use Google, you sign away the rights to your data, and as such they are free to use it as they wish. I don't know if that's the same when you use Google's AdSense and interact with Google Search, but I suspect that Google has a lot of really good lawyers to ensure that if you do business with them, they gain access to your website.

This makes me wonder if it is possible for a web service provider to make it so that content hosted on their servers can’t be scanned for the purpose of improving LLMs. Something like ATT, but for online content. Either nothing of value is obtained or the data extracted is rubbish.

This thought was in part inspired by another thread which mentioned how far behind Siri was in comparison to ChatGPT, and it made me wonder whether, instead of trying to catch up with the competition, Apple might find a way to hobble their progress in the name of privacy.

maflynn · Apr 21, 2023

Abazigal said:
This makes me wonder if it is possible for a web service provider to make it so that content hosted on their servers can’t be scanned for the purpose of improving LLMs. Something like ATT, but for online content. Either nothing of value is obtained or the data extracted is rubbish.

I don't believe there is a way to distinguish a crawl/data scrape that will be used for LLMs vs. search engine indexing. You can block Google from crawling your site but that stops them completely, no index, no nothing.
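For what it's worth, robots.txt rules are written per crawler, so a site can at least turn away one named crawler while leaving search indexing alone. C4 was built from Common Crawl's scrape, and Common Crawl's crawler identifies itself as CCBot. Below is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents, the example.com URL and the helper name is_allowed are made up for illustration, and it only helps with crawlers that choose to honor robots.txt. It does not separate search use from LLM use within a single crawler such as Googlebot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: turn away Common Crawl's
# crawler (CCBot) and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com stands in for a real forum URL.
    for agent in ("CCBot", "Googlebot"):
        print(agent, is_allowed(agent, "https://example.com/threads/1"))
```

This is purely advisory: a well-behaved crawler reads these rules and stays away, but nothing technically stops one that ignores them.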

Abazigal · Apr 21, 2023

maflynn said:
I don't believe there is a way to distinguish a crawl/data scrape that will be used for LLMs vs. search engine indexing. You can block Google from crawling your site but that stops them completely, no index, no nothing.

Shucks, so much for that idea then. Thank you for taking the time to respond. 🙂

laptech · Apr 21, 2023

The problem with the internet is that EVERY company or business or individual that has a presence on the internet has the view 'What is mine is mine and what is on the internet is mine'. Why is this so? As soon as you open your web browser, your email, your IP, the name of the web browser you use, which website you visited and which website you went to when you left, and the time you arrived and the time you left are all collected by the websites you visit. ALL these entities feel they have an automatic right to our information and thus they collect it. Even MR collects this type of information, and they ALL do it automatically. Some countries have introduced laws to prevent this data collection, so websites there have to show 'consent' windows when a person visits. But not every country does this, and in the countries that do not, people need to ask themselves why their information is being collected without their consent.

Google most probably created crawler software whose purpose is to trawl the internet for 'free' content, harvest it and then report back to Google so Google can feed it into its AI creation.
