The foremost problem is that LLMs are judged purely on the illusion of competence. Apple and Google were well aware of the issues with next-token prediction from the mountains of data they have from their small language models (like autocomplete). It is impossible to know exactly what someone intends. Even the best models today require a lot of re-prompting to get satisfactory results, and only someone who already has an idea of what the correct response should be can tell when they have gotten one. "Sounds true enough" is not a satisfactory answer for general queries, and it is unacceptable as a basis for taking real-world action. Natural language is simply not a sufficiently expressive user interface, yet an alarmingly large portion of the public believes it can be.