My caving group (exploration and mapping) has 40+ years of mapping and trip data for certain cave systems that only we have been in. It was all posted for anyone to read, if you knew where to look, but we have taken it all offline since AI services started scraping similar sites.

Now it lives completely off the internet, and will stay that way for the foreseeable future. Thanks Altman.

I'm sorry to hear that. My own work & livelihood have been dramatically affected by IP theft.
 
Distillation is, as I understand it, pretty much using one model to help teach another model.
That is correct. Another example: I run distills of DeepSeek on my Mac using LM Studio. It's a distill that is small enough to load in 16GB of memory, so it's nothing like the real thing, but it's still interesting to do my own direct comparisons between distills of various models.
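For anyone curious what "one model teaching another" looks like mechanically, here's a rough sketch of a standard soft-label distillation loss in PyTorch. The temperature and vocabulary size are placeholder assumptions, and this is the textbook technique, not DeepSeek's actual recipe.

```python
# Sketch of soft-label distillation: a small "student" is trained to match the
# output distribution of a larger "teacher". Values below are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 scaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 token positions over a 32k-entry vocabulary.
teacher_logits = torch.randn(4, 32000)                       # from the big teacher model
student_logits = torch.randn(4, 32000, requires_grad=True)   # from the small student model
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # in a real run, this gradient updates the student's weights
```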
 
I just want to know how much data storage you need once you've scraped the entire internet, and then another company comes and scrapes your data and has to store that.

Wow I just looked this up from this website:

"Recent analyses indicate that approximately 90% of the world’s data has been generated within the past two years, and according to IDC, the volume of data stored globally is doubling approximately every four years...

...As of 2024, the global data volume stands at 149 zettabytes."
 
Actually, what DeepSeek did was far from copy-paste. They leveraged RL to cut through labeling and intensive training. In fact, one of the key founding scientists of OpenAI called it the beginning of a new direction.
OpenAI was touting how others can't train without a huge compute advantage and resources. DeepSeek showed you need a few million to train instead of billions.
They stole the training models from OpenAI (regardless of how OpenAI got their training data); it's always a lot cheaper to steal than to start from scratch.
Although I believe OpenAI and others are conning the markets and investors alike!
 
  • Haha
Reactions: TechnoMonk
I just want to know how much data storage you need once you've scraped the entire internet, and then another company comes and scrapes your data and has to store that.

Wow I just looked this up from this website:

"Recent analyses indicate that approximately 90% of the world’s data has been generated within the past two years, and according to IDC, the volume of data stored globally is doubling approximately every four years...

...As of 2024, the global data volume stands at 149 zettabytes."

The quality and usefulness of said data, however, is seriously in question.
 
  • Like
Reactions: Luke MacWalker
I just want to know how much data storage you need once you've scraped the entire internet, and then another company comes and scrapes your data and has to store that.

LLMs do not store data. Instead, they train on that data to produce a model that can then produce outputs based on that training. The model does not store all of the data on the internet.
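To put rough numbers on that, here's a back-of-envelope comparison (the ~671B parameter count is DeepSeek's publicly reported figure; bytes per parameter depend on precision, so treat these as ballpark assumptions):

```python
# Back-of-envelope: a trained model's weights are vastly smaller than the data
# they were trained on. All figures are approximate assumptions.
params = 671e9            # DeepSeek-V3/R1 total parameters (publicly reported ~671B)
bytes_per_param = 2       # assumes 16-bit weights; 4- or 8-bit quantization is smaller
model_size_tb = params * bytes_per_param / 1e12
print(f"Model weights: ~{model_size_tb:.1f} TB")              # ~1.3 TB

global_data_zb = 149      # the 2024 estimate quoted above
global_data_tb = global_data_zb * 1e9                         # 1 ZB = one billion TB
print(f"Estimated global data: ~{global_data_tb:.1e} TB")     # ~1.5e+11 TB
print(f"The model is roughly {global_data_tb / model_size_tb:.0e}x smaller")
```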
 
  • Like
Reactions: navaira
I just want to know how much data storage you need once you've scraped the entire internet, and then another company comes and scrapes your data and has to store that.

Wow I just looked this up from this website:

"Recent analyses indicate that approximately 90% of the world’s data has been generated within the past two years, and according to IDC, the volume of data stored globally is doubling approximately every four years...

...As of 2024, the global data volume stands at 149 zettabytes."
Yeah, every new IoT device that goes online is sending telemetry data. It adds up.
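Rough math on how fast that adds up (device count, payload size, and report rate are all made-up assumptions, just to show the scale):

```python
# Hypothetical scale estimate for IoT telemetry -- illustrative numbers only.
devices = 15e9             # assumed number of connected IoT devices worldwide
bytes_per_report = 500     # assumed average telemetry payload
reports_per_day = 24 * 60  # assume one report per minute
daily_pb = devices * bytes_per_report * reports_per_day / 1e15
print(f"~{daily_pb:.0f} PB of telemetry per day under these assumptions")  # ~11 PB/day
```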
 
I downloaded the app and I was using it a bit last night. I went to fiddle with it this morning and it got brain freeze. Couldn’t even answer basic questions.

I smell a lot of hype. The short sellers probably made giant bags of money the past few days. They’ll throw those winnings back into the market and ride the rebound.

I think it’s a flash in the pan, much the same as AI in general. Until it’s smarter than the cat, it’s just a bunch of hype.

I gave it an interview question:

Assume for this next question you are a human about the size of a #2 Pencil (or 6.5 inches tall), you don't have any tools on hand. You are trapped inside the glass pitcher of a blender. There is no lid on the top of the pitcher. You have 10 seconds before the blades start whirling, and you need to get out of it. Detail the possible ways you could get out.

This was the way it answered:
As a 6.5-inch-tall human trapped in a glass blender pitcher with 10 seconds to escape, here’s a prioritized plan leveraging your size, environment, and physics:

1. Blade-Assisted Jump (Most Feasible)
  • Step 1: Immediately scramble onto the stationary blades (even if sharp) to gain ~1–2 inches of height.
  • Step 2: Leap vertically with maximum force. Assuming proportional human jumping ability (3–4 inches) and arm reach (3 inches), you might reach a total of 7–9 inches from the blades.
  • Step 3: Grab the pitcher’s rim and pull yourself over. Glass is smooth, but adrenaline and grip strength (using clothing/friction) could help. Roll out and drop to safety.
2. Wall-Shimmy (If Jump Fails)
  • Technique: Press hands and feet against opposing walls (“stemming”) to inch upward. Glass is slick, but angled pressure creates friction.
  • Execution: Move quickly in a diagonal “X” pattern, shifting weight side-to-side. Prioritize speed over stability—10 seconds is barely enough, but possible with focus.
3. Tip the Pitcher (If Unstable)
  • Assess Stability: If the pitcher isn’t locked into a base, throw your weight against the wall in a rhythmic motion to rock it.
  • Escape: Once it topples, the impact might crack the glass or spill you out. Risky but viable if other methods fail.
4. Break the Glass (Last Resort)
  • Target Weak Points: Strike the rim or corners (stress points) with elbows/feet. Tiny force may not suffice, but repeated blows to a single spot could fracture thin glass.

Key Considerations:
  • Time Management: Prioritize blade-assisted jump first (5 seconds max). If unsuccessful, pivot to shimmying or tipping.
  • Risk vs. Reward: Stepping on blades risks cuts, but outweighs blending. Tipping may cause injury from a fall, but survival is the goal.
  • Physics: Use your small mass to advantage—faster acceleration for jumps, easier tipping if leverage is applied.
Final Move: Execute the blade jump immediately. If the rim is just out of reach, kick off the wall mid-jump for extra inches. Escape hinges on speed, precision, and exploiting every millimeter of height.

This is a more elaborate answer than most people have given in the past, with maybe a few exceptions.

I have also used it, with proper RAG and KAG setups, for software development and debugging, and it works exceptionally well.
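For anyone unfamiliar with RAG, here's a bare-bones sketch of the pattern: retrieve the most relevant project docs, then prepend them to the prompt. The embed() and generate() functions are hypothetical stand-ins for whatever embedding model and LLM endpoint you actually use.

```python
# Bare-bones RAG: embed docs, retrieve the ones closest to the question,
# stuff them into the prompt. embed() and generate() are hypothetical
# placeholders, not any specific library's API.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in your embedding model here")

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k docs most similar to the query by cosine similarity."""
    q = embed(query)
    doc_vecs = [embed(d) for d in docs]
    scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
              for v in doc_vecs]
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def answer_with_rag(question: str, docs: list[str]) -> str:
    context = "\n\n".join(retrieve(question, docs))
    prompt = f"Use this project context to answer.\n\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```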
 
I haven't used any AIs aside from what's built into the iPhone. But if it can't handle basic word definitions, how can I trust it for more complex things? It seems to me that any answer would need to be verified by other means.
and that's the problem with AI in a nutshell!
 
They stole the training models from OpenAI (regardless of how OpenAI got their training data); it's always a lot cheaper to steal than to start from scratch.
Although I believe OpenAI and others are conning the markets and investors alike!
Say what? They did not steal the training model. GPT-2 is open source, and what they did was demonstrate and use distillation and RL during training. Meta will do the same, though they may not distill GPT-4. Pretty much all of the open-source LLMs and others are modifications and evolutions of GPT-2.
 
I haven't used any AIs aside from what's built into the iPhone. But if it can't handle basic word definitions, how can I trust it for more complex things? It seems to me that any answer would need to be verified by other means.

and that's the problem with AI in a nutshell!

As a software developer, I use AI all day every day for extremely complex tasks. The results border on miraculous. The more experienced I get with AI development, the better it gets. I'm an experienced late-career dev, and this is better than having multiple junior developers working for me, right now, today. I can only imagine what things will be like a year or two from now. It's amazing.
 
As a software developer, I use AI all day every day for extremely complex tasks. The results border on miraculous. The more experienced I get with AI development, the better it gets. I'm an experienced late-career dev, and this is better than having multiple junior developers working for me, right now, today. I can only imagine what things will be like a year or two from now. It's amazing.

Eliminated the junior devs, eh?
Sounds like your job will be next to go!

Our "future" of unemployment sounds interesting to say the least
😬🫠
 
In a way this feels like Robin Hood if you think about it at its most basic level. OpenAI the Sheriff steals from the people, then Robin Hood steals from the Sheriff and makes AI free and open source.
They didn't really steal, though. They used OpenAI's models to distill their own. OpenAI stole a lot more to train theirs.

So why doesn't OpenAI post this evidence instead of saying they have evidence?

They won't; this is just a face-saving narrative. OpenAI is fighting its own copyright-infringement lawsuits and is on thin ice.
 
Eliminated the junior devs, eh?
Sounds like your job will be next to go!

Our "future" of unemployment sounds interesting to say the least
😬🫠

My job is to employ the devs. If anything, this should be better for me. But I don't analyze things solely through the prism of whether they directly benefit me or not.
 
  • Like
Reactions: TechnoMonk
I just want to know how much data storage you need once you've scraped the entire internet, and then another company comes and scrapes your data and has to store that.

Wow I just looked this up from this website:

"Recent analyses indicate that approximately 90% of the world’s data has been generated within the past two years, and according to IDC, the volume of data stored globally is doubling approximately every four years...

...As of 2024, the global data volume stands at 149 zettabytes."
If you look hard enough, there are plenty of data sets used for training large language models. The ones I have seen range from 72 TB to 250 TB. Those data sets have pretty much most of Stack Exchange, Reddit, online content, books from fiction to non-fiction, science, math, programming, philosophy, religion and so on. The problem is classifying the data to make it meaningful. OpenAI outsourced a lot of this to Africa and other low-cost centers for $4-5 per hour to get the data ready for training.
The only difference is that the Chinese companies trained on the data dumped on the internet.
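As a rough illustration of that classification step, something like the following (the thresholds and the keyword heuristic are arbitrary, not anyone's actual pipeline):

```python
# Sketch of the "classify / clean the scrape" step: deduplicate, drop tiny
# fragments, and attach a crude domain label. Thresholds and the keyword
# heuristic are arbitrary illustrations, not anyone's real pipeline.
import hashlib

def clean_corpus(docs: list[str], min_chars: int = 200) -> list[dict]:
    seen, kept = set(), []
    for text in docs:
        digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if digest in seen or len(text) < min_chars:
            continue  # skip exact duplicates and very short fragments
        seen.add(digest)
        # Crude topic tag so downstream sampling can balance domains.
        label = "code" if ("def " in text or "#include" in text) else "prose"
        kept.append({"text": text, "label": label})
    return kept
```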
 
Good luck having AI do the design or find problems. Folks with good design, data structure, and algorithmic knowledge will be fine. Junior developers who write laborious code can be replaced.

Exactly. This is why I'm benefitting. I can do those things well enough for my projects that I don't need to hire anyone to write the laborious parts of the code. I'm spending far more time doing design and architecture now than I ever did coding, to an enormous degree. The productivity gain blows my mind every time I stop and think about it.
 
I think the statement isn't accurate.
If they used "distillation," which is using the output of ChatGPT or Llama or any other AI to train DeepSeek, then the resources and cost they claim for training are a complete lie.

The aggregated cost is whatever was used to train the model they got output from + whatever development they had to pay for.

You can't claim it only took x amount of time and resources to train a model when you used the other model to feed yours.

I'm not saying right or wrong to use the other model to train, but don't claim you had this ground breaking efficiency in resources when you didn't.
I am very much a novice when it comes to this stuff, but it's my understanding that literally everyone in the industry does this, even OpenAI, and the impressive bit here is the quality of the model given the hardware that the model was trained on. It is absolutely a groundbreaking efficiency in resources (which is why NVIDIA's stock tanked). It was previously assumed you had to have billions of dollars worth of NVIDIA's most advanced chips with sufficient inter-chip bandwidth; this was creating a rival to ChatGPT on chips that the US didn't even bother banning from sale to China. This has shown that literally anyone can do it for (relative) peanuts - you don't need hundreds of billions of dollars of data centers full of state-of-the-art NVIDIA chips. That's going to drive the price down astronomically.

If anyone is interested, Ben Thompson posted a FAQ that really helped me understand why this was such a big deal.
 
  • Like
Reactions: TechnoMonk
How is that any different from OpenAI using user data?
That's easy... it's less expensive, because OpenAI has already done the hard work for them.

To paraphrase the late Steve Jobs: AI, as it exists today, is a bag of hurt. Just like Blu-ray (to which Jobs was referring), AI licensing is an absolute mess. OpenAI stole from a bunch of people, other companies are stealing from OpenAI, different companies are offering licenses to other companies for content to which they may or may not have clear and unarguable legal rights... and everyone everywhere is complaining about one thing or another -- while the very rich get that much richer, because there's a ****ton of money to be made.

Should we really be surprised that some random Chinese company is now getting in on the act, by standing on the shoulders of giants? That's precisely the nature of innovation. AMD did it to Intel by reverse engineering the x86 chipset, Microsoft did it to Apple by copying their desktop operating system and Google did it to Apple (again) by copying their mobile operating system. And there have been plenty of lawsuits, many of which didn't exactly go the way of the original innovator... and so, these days essentially everybody copies everybody; even Apple has been shown to copy some pages out of Google's playbook. Legal or not, it's going to continue to be the nature of the beast.

All of that aside, though: personally, I'm just not a fan of the thing that these companies are currently calling "AI" -- largely because it's not even remotely "artificial intelligence." Multiple generations have now grown up with sci-fi showing us robots and computers that could talk to you exactly as any person would, and make independent (ahem) intelligent decisions for themselves... but that's not what this is. Simply put, Large Language Models are not going to get us from here to there -- and anyone who tells you differently is trying to sell you something.

Mind, I have no doubt that we will continue to find decent commercial uses for LLMs, and the price of entry will continue to come down as others build upon existing successes, just as DeepSeek has done. I really only quibble with calling these things "intelligent."
 
I am very much a novice when it comes to this stuff, but it's my understanding that literally everyone in the industry does this, even OpenAI, and the impressive bit here is the quality of the model given the hardware that the model was trained on. It is absolutely a groundbreaking efficiency in resources (which is why NVIDIA's stock tanked). It was previously assumed you had to have billions of dollars worth of NVIDIA's most advanced chips with sufficient inter-chip bandwidth; this was creating a rival to ChatGPT on chips that the US didn't even bother banning from sale to China. This has shown that literally anyone can do it for (relative) peanuts - you don't need hundreds of billions of dollars of data centers full of state-of-the-art NVIDIA chips. That's going to drive the price down astronomically.

If anyone is interested, Ben Thompson posted a FAQ that really helped me understand why this was such a big deal.
Hopefully Nvidia will focus on on-device inference and not restrict their consumer GPUs to 24-32 GB. They thought the server training market was much larger. It will still be huge, but not at those astronomical projections.
 
  • Like
Reactions: surferfb