I disagree that that's the only way. Setting aside the link to the previous post with the component image provided by the leaker, and acknowledging that I haven't yet seen any rumors to this effect: it might also be possible to shrink it by merging the optical camera and the IR camera. If I were researching this kind of tech, merging discrete components like that is definitely a direction I'd be interested in exploring, since it has value even in an under-screen implementation: more space for other stuff.
True, it's not literally the only way. Merging the selfie/RGB camera with the infrared Face ID camera is a future possibility, but the technology isn't yet mature enough to do it without noticeably compromising both cameras, in ways Apple probably couldn't resolve to its satisfaction.
With the technology Apple currently uses in its front-facing cameras, the IR camera's lens needs a more pronounced, IR-optimized anti-reflective coating that would interfere with the clarity of the selfie camera. There's also a dark red translucent filter panel above the IR camera's lens that minimizes how much non-IR light reaches the IR sensor, and that would interfere with normal photos even more, tinting all of them dark red.
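Just to make the dark-red point concrete, here's a toy sketch of what a visible-blocking, IR-passing window does to an ordinary pixel; the per-channel transmission values are made up for illustration, not measured from any real filter:

```python
# Toy model of what a dark red IR-pass window over a normal camera would do
# to a photo: it passes deep red and near-IR while strongly attenuating green
# and blue. The per-channel transmission values are invented for illustration.

IR_PASS_TRANSMISSION = {"R": 0.30, "G": 0.03, "B": 0.01}  # assumed values

def through_ir_pass_filter(rgb):
    """Attenuate an (R, G, B) pixel roughly the way an IR-pass filter might."""
    r, g, b = rgb
    return (round(r * IR_PASS_TRANSMISSION["R"]),
            round(g * IR_PASS_TRANSMISSION["G"]),
            round(b * IR_PASS_TRANSMISSION["B"]))

neutral_pixel = (200, 180, 160)                 # a well-lit, near-neutral pixel
print(through_ir_pass_filter(neutral_pixel))    # -> (60, 5, 2): dark and red
```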
If you optimize a lens for good selfie quality, IR sharpness suffers, and vice versa, largely because visible and IR light come to focus at slightly different points (the lens material's refractive index changes with wavelength), and computational fixes can't entirely compensate for that.
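For a sense of scale, here's a back-of-envelope sketch using the thin-lens lensmaker's equation with approximate refractive indices for a BK7-like crown glass; the lens radii are invented:

```python
# Back-of-envelope chromatic focal shift between visible and near-IR light,
# using the thin-lens lensmaker's equation. Refractive indices are approximate
# values for a BK7-like crown glass; the lens radii are made up.

def focal_length_mm(n: float, r1_mm: float, r2_mm: float) -> float:
    """Thin lens: 1/f = (n - 1) * (1/R1 - 1/R2)."""
    return 1.0 / ((n - 1.0) * (1.0 / r1_mm - 1.0 / r2_mm))

R1, R2 = 10.0, -10.0        # symmetric biconvex element, radii in mm (assumed)
n_visible = 1.519           # ~550 nm (green), approximate
n_ir = 1.509                # ~940 nm, approximate

f_vis = focal_length_mm(n_visible, R1, R2)
f_ir = focal_length_mm(n_ir, R1, R2)
print(f"visible focus: {f_vis:.2f} mm, 940 nm focus: {f_ir:.2f} mm, "
      f"shift: {f_ir - f_vis:+.3f} mm")
```

A focal shift on the order of a couple hundred microns is enormous next to the roughly one-micron pixels in a phone's front camera, which is why a single lens stack tuned for both bands is such a hard ask.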
A normal RGB sensor also sits behind a filter that blocks most IR, since that's needed for the best-quality RGB images, while an IR camera wants high transmission of infrared light (around 940 nm) and rejects visible light. If you let a normal RGB sensor see IR, you get color contamination and focus shifts.
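A toy illustration of the color contamination: silicon photodiodes under all three color filters respond to near-IR, so stray IR adds an offset to every channel and washes the color out. The response weights and IR level here are invented:

```python
# Toy model of IR contamination on an RGB sensor. The per-channel near-IR
# response weights and the IR level are assumed values for illustration only.

def add_ir_leak(rgb, ir_level, ir_response=(0.9, 0.7, 0.6)):
    """Add a near-IR offset to an (R, G, B) pixel, clamped to 0..255."""
    return tuple(min(255, round(c + ir_level * w))
                 for c, w in zip(rgb, ir_response))

saturated_green = (40, 120, 60)
print(add_ir_leak(saturated_green, ir_level=0))    # (40, 120, 60): IR-cut filter doing its job
print(add_ir_leak(saturated_green, ir_level=120))  # (148, 204, 132): lifted and washed out
```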
To make one sensor and one lens work for both IR and visible light, you'd need a switchable filter system: a mechanical IR-cut filter like the ones many DSLR/mirrorless cameras use, an electrically switchable one, or some other approach. Another option would be a beam-splitter prism directing light to two separate sensors, but that would reduce the light reaching each sensor and probably wouldn't shrink the camera module's footprint anyway.
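For the beam-splitter option, the light cost is easy to put a number on, assuming a hypothetical 50/50 split:

```python
import math

# Rough light budget for the beam-splitter idea: splitting the incoming light
# between an RGB sensor and an IR sensor means each one sees only part of it.
# The 50/50 split ratio is an assumption, not a known design.

split_to_rgb = 0.5
split_to_ir = 1.0 - split_to_rgb

print(f"RGB sensor gets {split_to_rgb:.0%} of the light "
      f"(~{-math.log2(split_to_rgb):.1f} stop lost)")
print(f"IR sensor gets {split_to_ir:.0%} of the light "
      f"(~{-math.log2(split_to_ir):.1f} stop lost)")
```

Losing a stop on the selfie camera alone would be a noticeable hit to low-light quality.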
Also, all of these methods would add complexity, and thus reduce reliability, compared with the simpler two-camera approach Apple uses today, which it would probably prefer to keep. They would also add significant thickness to the camera module, which is already thick enough to be one of the reasons iPhones have a "plateau" on the rear.