I guess if you had 10 cameras and then keyframed some horizontal blur for the effect of motion... it should work. The video just seems a little too old for that.
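To make the "keyframed horizontal blur" idea concrete, here's a rough sketch in plain Python. It is only an illustration under my own assumptions, not how the video was actually made: the function names (`lerp_keyframes`, `horizontal_box_blur`, `blur_for_frame`) are made up, the image is just a list of rows of grey values, and a simple box blur stands in for whatever blur a real compositor would use.

```python
def lerp_keyframes(keyframes, frame):
    """Linearly interpolate a value from (frame, value) keyframes.

    Assumes keyframes are sorted with strictly increasing frame numbers.
    """
    if frame <= keyframes[0][0]:
        return keyframes[0][1]
    if frame >= keyframes[-1][0]:
        return keyframes[-1][1]
    for (f0, v0), (f1, v1) in zip(keyframes, keyframes[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)
            return v0 + t * (v1 - v0)


def horizontal_box_blur(image, radius):
    """Average each pixel with its neighbours along the row only."""
    if radius <= 0:
        return [row[:] for row in image]  # no blur: return a copy
    blurred = []
    for row in image:
        width = len(row)
        out = []
        for x in range(width):
            lo = max(0, x - radius)          # clamp window at the left edge
            hi = min(width, x + radius + 1)  # clamp window at the right edge
            window = row[lo:hi]
            out.append(sum(window) / len(window))
        blurred.append(out)
    return blurred


def blur_for_frame(image, keyframes, frame):
    """Blur one frame using the radius interpolated from the keyframes."""
    radius = round(lerp_keyframes(keyframes, frame))
    return horizontal_box_blur(image, radius)


# Hypothetical keyframes: no blur at frame 0, peak blur at frame 10
# (mid-switch between cameras), back to no blur at frame 20.
keyframes = [(0, 0), (10, 4), (20, 0)]
frame_image = [[0, 0, 9, 0, 0]]
softened = blur_for_frame(frame_image, keyframes, 5)
```

The point is just that the blur amount ramps up as you cut from one camera to the next and back down afterwards, which is what would hide the jump between angles.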
I reckon it's either multiple cameras or a single camera with multiple takes (easier with music videos, as there are auditory cues).
Then a 3D scene is created in AE, with a 3D camera that just tracks through the various scenes at various points.
Simple in concept, but probably quite difficult to construct with all the rendering power required!
Doubtful. That seems more complicated than necessary. I'm sure it's much like the method discussed in mBox's link. An array of cameras capturing the same take from multiple angles, then composited and morphed in post.
It might be more complicated, but it would give more accurate motion blur, and it would be easier to adjust angles once all the frames had been set up.