Exactly. People forget about the dirty little secrets that got this stuff started. Cryptocurrency had its value inflated through ransomware payments. Without ransomware, Bitcoin would be worthless.
ChatGPT plagiarized, for free, the work of millions of people and trillions of hours. But it can only do that once. Now the AI companies want to pull the ladder up and not let anyone else get high-quality training data.
So yes, you're exactly right. If we just let AI generate everything, it will quickly degrade, and our culture will stagnate even more than it has over the last fifteen years.
Yeah, this is a good point - 'competent' AI/ML depends quite strongly on the quality of the data being ingested. The 'wow, look at this' flood from OpenAI, ChatGPT and the like was, AFAIK, all built on 'public internet data', which includes, for example, generative output from Midjourney and the like as well as textual data.
Even before the LLM 'explosion' it was already problematic to get reasonably good data for specific use cases - you need a large pool of speech data for speech-to-text to accommodate different accents and pronunciations, you need large amounts of data for various automotive or smart-city applications, etc., or your baseline 'general' AI/ML model/application will perform fairly poorly, more so as environmental conditions change (e.g. training a vehicle-model-recognition model in Central America and then deploying it in China).
Places like Kaggle have a random collection of data ranging from total trash to moderately useful for a proof of concept, but rarely anything qualitatively complete or thorough enough to build something 'v1-like'/beyond a proof of concept. Synthetic data has been used and is still in use, although in general producing it isn't all that fast (fun fact: engines like Unity and others are used to generate synthetic data for AI/ML).
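To make the synthetic-data idea concrete, here's a toy sketch of the core pattern: procedurally sweeping the variation axes (weather, camera angle, etc.) so every permutation gets labeled coverage - the thing a renderer like Unity does with actual images, and the thing scraped data rarely gives you. All the axis names and values here are made up for illustration.

```python
import itertools
import random

# Hypothetical variation axes you'd sweep in a render engine:
WEATHER = ["sunny", "overcast", "rain", "night"]
ANGLE = [0, 15, 30, 45]               # camera pitch, degrees
MODELS = ["sedan", "suv", "truck"]    # the labels we want to learn

def synth_samples(n_per_combo=5, seed=0):
    """Yield (params, label) pairs covering every permutation of the
    variation axes. In a real pipeline each 'params' dict would drive
    a renderer and produce an image; here it is just the parameters
    plus a little jitter."""
    rng = random.Random(seed)
    for weather, angle, model in itertools.product(WEATHER, ANGLE, MODELS):
        for _ in range(n_per_combo):
            params = {"weather": weather,
                      "angle": angle + rng.uniform(-2, 2)}
            yield params, model

data = list(synth_samples())
print(len(data))  # 240 = 4 weathers * 4 angles * 3 labels * 5 each
```

The point isn't the toy code, it's the guarantee: unlike scraped data, you control the joint distribution, so no (weather, angle, label) combination is accidentally missing.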
Two inherent rules apply:
1. The best data will always be specific to the use case at hand (location, resolution, camera perspective, sensor data types and positioning, spoken enunciation and local colloquialisms, etc.), and the wider the audience you're attempting to serve, the more coverage of all of those variations is needed. 'Best data' is ideally provided directly by the 'customer' of the use case. When we get to LLMs - well, they're expected to do an awful lot in most people's heads, and even handling input in N languages requires quite a lot of data per language for effective training.
2. Even in cases where 'very good' data is used for training, the variables will change over time. Visually, for example, vision models trained for car-model recognition may have leveraged a 'mostly sunny' environment in the training data - guess what? Weather changes and will impact things. Likewise if camera positioning changes, or zoom levels/perspective, etc. There are a LOT of variables (and a lot of data) needed in general to go from 'good for a proof of concept' to 'very good across many permutations.' You could also consider this part of 'fine-tuning over time', where <some amount of data> flowing back to <whomever manages the core model/app(s)> is almost required - e.g. some data making it back to Apple, all data to Amazon, Google, etc.
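The 'variables change over time' problem in point 2 is usually caught with a drift metric comparing what the model was trained on against what it sees in production. A minimal sketch using the Population Stability Index (a standard drift measure), with synthetic stand-ins for a single feature before and after the 'weather changed':

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training ('expected')
    sample and a production ('actual') sample of one feature.
    Rule of thumb: < 0.1 stable, > 0.25 significant drift."""
    # Bin edges come from the training distribution
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins to avoid log(0)
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)   # e.g. 'mostly sunny' training data
same = rng.normal(0.0, 1.0, 5000)    # production, conditions unchanged
shift = rng.normal(1.5, 1.0, 5000)   # production, 'weather changed'

print(psi(train, same))   # small: no meaningful drift
print(psi(train, shift))  # large: time to collect data and fine-tune
```

This is the cheap early-warning side of the loop; the expensive side is exactly the 'data flowing back to whoever manages the core model' described above.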
There are clever mechanisms and evolving model architectures that do reduce the amount of data needed for some purposes, but the need for more and more data for training and fine-tuning never goes away completely.
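One of those data-reducing mechanisms is few-shot adaptation on top of a frozen, pretrained feature extractor: the extractor already absorbed the huge dataset, so the new task only needs a handful of labels. A toy sketch - the 'pretrained' extractor here is just a fixed random projection standing in for a real backbone, and the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a frozen, pretrained feature extractor (the part that
# already consumed enormous amounts of data). Fixed random projection.
W = rng.normal(size=(64, 16))
def embed(x):
    return np.tanh(x @ W)

def fit_centroids(X, y):
    """Nearest-centroid 'head': one mean embedding per class."""
    return {c: embed(X[y == c]).mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    E = embed(X)
    classes = list(centroids)
    d = np.stack([np.linalg.norm(E - centroids[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

# Two toy classes, only 5 labeled examples each
X0 = rng.normal(-1.0, 0.3, size=(5, 64))
X1 = rng.normal(+1.0, 0.3, size=(5, 64))
X = np.vstack([X0, X1])
y = np.array([0] * 5 + [1] * 5)

cents = fit_centroids(X, y)
test_X = np.vstack([rng.normal(-1.0, 0.3, (20, 64)),
                    rng.normal(+1.0, 0.3, (20, 64))])
acc = (predict(cents, test_X) == np.array([0] * 20 + [1] * 20)).mean()
print(acc)  # high accuracy from just 10 labeled examples
```

With a good embedding, ten labels go a long way - but note the fine print: someone still had to feed the extractor mountains of data first, which is the point of the paragraph above.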
There are companies out there trying to serve the needs above via synthetic data, and things like stock video are a growing part of places like Shutterstock and the like - all in the name of 'monetizing their data', much of which is sold for AI/ML training purposes.
Midjourney and the like bring this into real focus: various 'styles' become apparent in the output, as does the question of how the model even knows what trademarked properties are (Batman, for example, or any movie/game/etc.), and whether the data ingestion was 'pull in everything possible' or actually followed fair-use practices. It takes another step forward - this one, IMO, a completely legitimate complaint - when various actors' likenesses or voices are replicated by AI for for-profit use (such as in advertising) without paying the source of that data (the actor). The same arguments are being made about the generative visual models, where you can, for example, give a prompt 'in the style of Van Gogh' - or where, even unprompted, various artists' styles were used in the training.
All of the above is just the start of the 'full mess' of AI/ML: the data needed merely to train a given type of model, and, in general, the need for ongoing data <from somewhere> either to improve its general outcomes/inferences or to make them baseline-reasonable from the start. And that's all before we get to 'what can it do from there?' It's a fun field, but separating the hype from reality can get complicated quickly.