It's an interesting discussion. With enough will and effort it should be possible to come up with some (ideally industry-wide) standard set of activities to run a phone through and measure how long it lasts until it powers itself off. I can't remember which site it is, but at least one reviewer does exactly that when testing battery life: a well-defined mixture of video, browsing, email, messaging, playing games and probably other stuff I've forgotten, looped repeatedly until the phone shuts itself off.
It would take effort because you'd have to precisely define that sequence in terms of the web sites browsed (and you'd probably need to set up a dedicated in-house web server for the test so the results weren't skewed by how busy a public server was at any given time). The same goes for the email and messaging, and you'd need some automated test robot to play some defined game (or maybe a test game written specifically for the task if it was an industry-wide standard), etc.
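To make the idea a bit more concrete, here's a rough sketch of what "a defined loop of activities, repeated until shutdown" could look like as a test harness. Everything in it is made up for illustration: the activity mix, the minutes per step, the `FakePhone` stand-in and its drain rates are all my assumptions, not anything a real reviewer or standard uses. A real harness would drive an actual device instead of a fake one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    activity: str   # e.g. "video", "browsing"
    minutes: int    # how long this step runs for


# Hypothetical activity mix; a real standard would pin down every detail
# (which video, which pages on the in-house server, which game, and so on).
LOOP: List[Step] = [
    Step("video", 30),
    Step("browsing", 20),
    Step("email", 10),
    Step("messaging", 10),
    Step("game", 20),
]


def run_until_shutdown(loop: List[Step], run_step: Callable[[Step], bool]) -> int:
    """Repeat the loop until run_step returns False (device powered off).

    Returns the minutes of fully completed steps, so the step during which
    the device died is not counted.
    """
    elapsed = 0
    while True:
        for step in loop:
            if not run_step(step):
                return elapsed
            elapsed += step.minutes


class FakePhone:
    """Stand-in for a real device. Drain rates are invented numbers."""

    DRAIN_PCT_PER_MINUTE = {
        "video": 0.1,
        "browsing": 0.15,
        "email": 0.1,
        "messaging": 0.1,
        "game": 0.3,
    }

    def __init__(self) -> None:
        self.charge_pct = 100.0

    def run(self, step: Step) -> bool:
        """Drain the battery for one step; return True if still powered on."""
        self.charge_pct -= self.DRAIN_PCT_PER_MINUTE[step.activity] * step.minutes
        return self.charge_pct > 0


if __name__ == "__main__":
    phone = FakePhone()
    minutes = run_until_shutdown(LOOP, phone.run)
    print(f"Lasted {minutes} minutes ({minutes / 60:.1f} hours)")
```

The point of keeping the loop as plain data is that the same list can be run against every phone, which is the whole value of the test: the absolute number matters less than knowing every device went through an identical sequence.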
I'd say it's unlikely to happen, but on a far simpler scale there is precedent. For instance, in the sound insulation industry there is a standard test signal defined: a well-defined frequency/amplitude mix intended to simulate typical traffic noise. Again, though, that's an approximation, because the mix of small cars, vans and heavy trucks going past your window might not be the exact traffic mix simulated by the industry-standard test signal.
In theory, Apple giving the video numbers for every model does at least allow someone to compare between models and generations. Admittedly that doesn't give you much idea on its own about how it will transfer to your real-world battery life, but it does give some hints: if one phone is coming in at 20 hours and another at 30, then the second one has a pretty good chance of delivering the better real-world battery life.
In general I look at the percentages. If I see Apple claiming a 10% (one year it was over 30%!) increase in video playback time, my initial expectation is that I will see that sort of percentage in my real-life use. On getting a new phone I keep an eye on how it actually performs to see if that expectation is met, and if it's not I scale it back for the next time. (It's always been a scale-back; I don't think I've ever got a real-world increase that exceeded Apple's increase in the video playback metric.)
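For anyone who wants the arithmetic spelled out, this is roughly how that "expect the claimed percentage, then scale back" habit works. The function names and the real-world hours are invented for the example; the 20 and 30 hour video figures are just the illustrative ones from above, not real specs.

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage increase going from old to new."""
    return (new - old) / old * 100


def expected_hours(current_hours: float, claimed_pct: float, scale: float = 1.0) -> float:
    """Real-world hours expected from the new phone.

    scale = 1.0 means "I expect the full claimed percentage";
    anything below 1.0 is a scale-back learned from earlier upgrades.
    """
    return current_hours * (1 + scale * claimed_pct / 100)


def new_scale(claimed_pct: float, observed_pct: float) -> float:
    """Fraction of the claimed increase that actually showed up in real use."""
    return observed_pct / claimed_pct


# Video playback claim goes from 20 to 30 hours (illustrative figures).
claimed = pct_increase(20, 30)

# Hypothetical: the old phone gave me 10 hours of real-world use.
first_guess = expected_hours(10, claimed)

# Hypothetical: the new phone actually gives 13 hours in real use.
observed = pct_increase(10, 13)
scale = new_scale(claimed, observed)

print(f"claimed +{claimed:.0f}%, expected {first_guess:.1f} h, "
      f"observed +{observed:.0f}%, scale for next time {scale:.2f}")
```

Next upgrade, you'd pass that `scale` into `expected_hours` instead of the default 1.0, so the expectation starts out already discounted.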
I suppose it should also be said that perhaps even a simple "how long will it play video?" test is only useful, even as a model-to-model comparison, if we trust the tester to keep the conditions the same. It must be exactly the same video used every time across all the tests, presumably played in a loop long enough to expire any cached video data. If the compression algorithm was changed, that would invalidate any comparison. Even a change of video content could change the results, if more or less fast wide-area movement in one video vs its predecessor resulted in changes in the sizes of the intermediate delta frames that are part of many (most?) compression algorithms.
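A toy illustration of that last point, for what it's worth. This is not how any real codec works (real ones use motion compensation and a lot more besides); it's plain frame differencing on made-up one-dimensional "frames", just to show that the same object moving faster changes more pixels per frame, which is the sort of thing that would make delta frames bigger.

```python
def make_frames(width: int, n_frames: int, step: int) -> list:
    """Build 1-D frames with a 4-pixel bright block moving `step` pixels per frame."""
    frames = []
    for i in range(n_frames):
        start = (i * step) % width
        frame = [0] * width
        for x in range(start, start + 4):
            frame[x % width] = 255
        frames.append(frame)
    return frames


def delta_size(prev: list, curr: list) -> int:
    """Count the pixels that differ between two equally sized frames."""
    return sum(p != c for p, c in zip(prev, curr))


def total_delta(frames: list) -> int:
    """Sum of changed pixels across every consecutive pair of frames."""
    return sum(delta_size(a, b) for a, b in zip(frames, frames[1:]))


static = total_delta(make_frames(64, 10, 0))  # nothing moves
slow = total_delta(make_frames(64, 10, 1))    # block creeps along
fast = total_delta(make_frames(64, 10, 8))    # block jumps each frame

print(f"changed pixels: static={static}, slow={slow}, fast={fast}")
```

Same block, same number of frames, but the fast-moving version has several times as many changed pixels to encode, so swapping the test video for one with more movement could shift the results even with the codec left alone.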