It's plausible that this camera is not even a camera as you understand the term.
You think of a camera as something that:
1. gathers light
2. to form an image
But you can build a device that performs the first task while doing something different with that light: a system based on a light sensor but no lens. Instead, the role of the lens is played by a "mask", a pattern of holes. (This is not a Fresnel lens, though it shares with a Fresnel lens the property of requiring less space.)
The mask casts "patterns" on the sensor that look nothing like an image, but which encode the same data as an image.
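To make "encodes the same data" concrete, here is a toy sketch. It assumes the separable model that, as I understand it, the FlatCam paper linked below uses: the sensor reading is the scene multiplied on each side by a matrix determined by the mask. The sizes, the random 0/1 mask matrices, and all the names are made up for illustration, and noise is ignored entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 32x32 scene and a 48x48 sensor
# (more measurements than unknowns along each axis).
n, m = 32, 48

# Toy scene: a bright rectangle on a dark background.
scene = np.zeros((n, n))
scene[8:20, 12:26] = 1.0

# Assumed separable mask model: random 0/1 matrices stand in
# for the pattern of holes (not a real mask design).
phi_left = rng.integers(0, 2, size=(m, n)).astype(float)
phi_right = rng.integers(0, 2, size=(m, n)).astype(float)

# What the sensor records: each sensor pixel sums light from
# many scene points, so this looks nothing like the scene.
measurement = phi_left @ scene @ phi_right.T

# Because the mask matrices are known, the scene can be recovered
# by (pseudo-)inverting them. This works cleanly only without noise.
recovered = (
    np.linalg.pinv(phi_left) @ measurement @ np.linalg.pinv(phi_right).T
)

print("max reconstruction error:", np.abs(recovered - scene).max())
```

A real device would need a carefully designed mask, calibration, and a regularized solver to cope with noise. The point is only that nothing is lost by skipping the lens.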
What's the point?
1. It can be crazy thin (and cheap, and robust)
2. What do you want to do with the signal? If the only point is to feed it to an AI for situational awareness and to answer questions, then no image is necessary. The AI can train directly on this encoded signal!
3. It can probably run at lower power if what you feed to the AI is the raw sensor signal, with none of the computational pipeline that turns that signal into a human-recognizable image.
So always-on is reasonable.
Recording what's around the user, possibly only the last minute or so, with constant overwriting.
And quite possibly in a form that's not even human-interpretable.
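The "last minute with constant overwriting" idea is just a ring buffer. A minimal sketch follows; the frame rate, the window length, and the `record` helper are hypothetical and not anything Apple has described.

```python
from collections import deque

FRAMES_PER_SECOND = 10   # hypothetical capture rate
WINDOW_SECONDS = 60      # "the last minute or so"

# A deque with maxlen silently drops the oldest entry once it is full.
recent_frames = deque(maxlen=FRAMES_PER_SECOND * WINDOW_SECONDS)


def record(frame):
    """Store one raw sensor frame, displacing the oldest if the buffer is full."""
    recent_frames.append(frame)


# Simulate two minutes of capture, using integers as stand-in frames.
# Only the most recent minute survives.
for i in range(2 * FRAMES_PER_SECOND * WINDOW_SECONDS):
    record(i)

print(len(recent_frames), recent_frames[0], recent_frames[-1])
```

Memory use stays fixed no matter how long the device runs, which is what makes always-on plausible on a small power and storage budget.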
I've no idea what Apple have in mind, but tech like this exists:
https://imagesci.ece.cmu.edu/files/paper/2017/flatcam_tci17.pdf