Thanks for your input. I'll see what I can come up with.
I am a bit frustrated that I am being told by my superiors that some forms of latency are okay while others are not. Presumably the distance-from-the-speakers latency is okay because the speakers are always in the same position in the sound booth, and the baby is always sitting in the same spot.
I still can't tell if your superiors are accounting for operator latency, i.e. the amount of time it takes the person operating the switch to sense a response in the infant and either press or release the switch. Since that's on the order of 200 ms, even in high attention situations (e.g. a trained runner at the start of a race), I honestly don't understand how that could be left unaccounted for while worrying over things in the 2-10 ms range.
Back when I was working on video games, we ran some tests just to get an idea of what kind of switch debouncing the software should do, and the relationship of audio to video. With 1 ms resolution using a real-time dedicated microcontroller, we saw pretty high variation (often over 100 ms) between operators, depending on age, attentiveness, multiple stimuli (on-screen targets), etc. There was also moderately high variation for a single operator, around 50 ms, even with training runs and focused attention.
We also found we could "trick" the operator by triggering a sound (audio stimulus) either before or after the on-screen target's appearance (visual stimulus), and cause a lag or a lead of about 50 ms. That is, if the video came before the audio, we would reliably see a slower response time than if the audio was concurrent with or preceded the video. With a suitable gap between video and audio, we could reliably make response time worse than if there were no audio at all.
In the end, we did two things:
1. Make sure audio started regardless of what was on-screen; leading audio was better than lagging. (We concluded that both vertical retrace and visual-sensing lag by the human were contributing factors.)
2. Sample the switches faster than vertical retrace, and register a press or release on first change, then debounce for at least 1 complete vertical retrace interval.
Vertical retrace was 60 Hz or 50 Hz (NTSC vs. PAL). I think we ended up using horizontal retrace for switch sampling, but I don't recall with certainty. There may have been another timer source we used, but it was definitely faster than vertical retrace.
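For what it's worth, here is a rough Python sketch of the scheme in item 2: register the first change immediately, then ignore the switch for at least one full vertical retrace interval. This is not the original firmware, which I no longer have. The function names, the 1 ms sample period in the usage example, and the sample data are all made up for illustration.

```python
def retrace_interval_us(refresh_hz):
    """One vertical retrace interval in microseconds, rounded up.

    60 Hz (NTSC) gives 16667 us; 50 Hz (PAL) gives 20000 us.
    """
    return -(-1_000_000 // refresh_hz)  # ceiling division


def debounce(samples, sample_period_us, lockout_us):
    """Report a press/release on the first change, then lock out.

    samples: raw switch readings (0 or 1) taken every sample_period_us.
    lockout_us: time to ignore the switch after each reported change
                (hypothetical parameter; at least one retrace interval).
    Returns a list of (time_us, new_state) events.
    """
    events = []
    state = samples[0]
    locked_until = 0
    for i, raw in enumerate(samples[1:], start=1):
        t = i * sample_period_us
        if t < locked_until:
            continue  # still inside the debounce window; ignore bounces
        if raw != state:
            state = raw
            events.append((t, state))       # registered on first change
            locked_until = t + lockout_us   # then debounce
    return events


# Usage: a press with contact bounce, held ~35 ms, then a bouncy release.
# Sampled every 1 ms (an assumed rate, well above the 60 Hz retrace).
samples = [0, 0] + [1, 0, 1, 0, 1] + [1] * 30 + [0, 1, 0] + [0] * 30
events = debounce(samples, 1000, retrace_interval_us(60))
print(events)
```

The point of reporting on the first change, rather than waiting for the contacts to settle, is that the debounce window adds no latency to the timestamp of the press or release. The cost is that any real change shorter than the lockout window gets swallowed.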