It's as much psychoacoustics (the way the brain processes sound) as physics. Whether you're fooling the eye or fooling the ear, it's all happening in the brain. I had a long career in audio recording and broadcast, and nothing about this violates physics or the principles of sound recording and reproduction.
The best a recording engineer (or photographer) can do is create the impression of "natural" (or explore the possibilities of the unnatural).
If you start from the premise that two-speaker stereo is "natural," then there's room to be skeptical. But stereo is only a rough approximation of the acoustical environment. In a live musical venue, sound emanates from multiple sources in three dimensions (each instrument and voice in the ensemble) and those sounds radiate in many directions, bouncing off every surface in the room. You hear a mixture of direct and reflected sound, and precise localization of a particular sound element can be difficult without visual confirmation of its location (and vice versa - we also use sound localization to direct our visual attention).
It's a well-established principle of psychoacoustics that low-frequency (bass) sounds are hard to localize, whereas higher frequencies (midrange and treble) are easier to localize. That's why many speaker systems have a single subwoofer that can be placed fairly indiscriminately in the room, while a pair (or more) of additional speakers covers the higher frequencies. HomePod uses a similar approach: one woofer and seven highly directional tweeters. Presuming the stereo signal is distributed in varying proportions among those tweeters, a listener will still get a multi-dimensional, multi-directional sound field. As with the original Bose Direct/Reflecting speaker systems, which debuted in 1968 and are still popular today, sound is intentionally directed toward nearby walls to add reflected sound to the mix, which gives a closer approximation of the live sound environment.
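To make the "one woofer, several tweeters" idea concrete, here is a minimal sketch in Python. It is not Apple's processing, which isn't public; the function names, the one-pole filter, the crossover setting (alpha) and the sine-based panning rule are all assumptions chosen for illustration. It shows the general pattern: the bass is summed to mono for the single woofer, and the highs keep their left/right identity and are shared out among tweeters on a ring.

```python
import math

def one_pole_lowpass(samples, alpha):
    """Very simple low-pass filter; smaller alpha means a lower cutoff."""
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out

def split_stereo(left, right, alpha=0.05):
    """Return (bass, left_high, right_high).

    bass is the low-passed mono sum (for the single woofer); the highs
    are whatever is left of each channel after its lows are removed.
    alpha is an illustrative value, not a real crossover setting.
    """
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    bass = one_pole_lowpass(mono, alpha)
    left_low = one_pole_lowpass(left, alpha)
    right_low = one_pole_lowpass(right, alpha)
    left_high = [l - lo for l, lo in zip(left, left_low)]
    right_high = [r - lo for r, lo in zip(right, right_low)]
    return bass, left_high, right_high

def tweeter_weights(count=7):
    """Hypothetical (left, right) mix weights for tweeters evenly spaced on a ring.

    Tweeter 0 sits on the centre line and gets an equal share of both
    channels; tweeters further around the ring lean toward one channel.
    """
    weights = []
    for i in range(count):
        angle = 2 * math.pi * i / count
        pan = math.sin(angle)  # +1 = all left channel, -1 = all right channel
        weights.append(((1 + pan) / 2, (1 - pan) / 2))
    return weights
```

Each tweeter would then play left_weight * left_high + right_weight * right_high, while the woofer plays bass. A real product would use proper crossover filters and far more sophisticated beamforming; the point is only that the highs carry the directional cues and the lows don't need to.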
The HomePod also comes equipped with six microphones. While one purpose is to listen for Siri commands, another is to measure the sound environment and tune the output to the conditions in the room, raising or lowering the level sent to each of those seven tweeters so that there's the right balance of direct and reflected sound (if the nearest surface is highly reflective, less sound energy has to be sent in that direction).
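The parenthetical above (less energy toward a highly reflective surface) can be sketched as a simple gain rule. This is a guess at the general shape of such a calculation, not a description of what HomePod actually does: the function name, the 0-to-1 reflectivity estimates, and the floor value are all hypothetical.

```python
def tweeter_gains(reflectivity, floor=0.1):
    """Hypothetical per-tweeter gains from estimated surface reflectivity.

    reflectivity: one value per tweeter direction, 0 (absorbs everything)
    to 1 (reflects everything), assumed to come from microphone measurements.
    floor: keeps very absorbent directions from demanding unlimited gain.
    Gains are scaled so the loudest tweeter is at 1.0.
    """
    raw = [1.0 / max(r, floor) for r in reflectivity]
    peak = max(raw)
    return [g / peak for g in raw]

# A tweeter facing a hard wall (0.9) gets less drive than two facing
# softer surfaces (0.3), so the reflected sound comes out more even.
gains = tweeter_gains([0.9, 0.3, 0.3])
```

With those example numbers, the tweeter aimed at the hard wall ends up at roughly a third of the level of the other two.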
Overall, as long as the execution is up to Apple's usual standards, I have no doubt that this can work.