I reckon it's either multiple cameras or a single camera with multiple takes (easier to sync with music videos, as there are auditory cues).
Then a 3D scene is created in AE, with a 3D camera that just tracks through the various scenes at various points.
Simple in concept, but probably quite difficult to construct with all the rendering power required!