I think it's because video uses a 'sensor crop' instead of sampling the whole sensor. That is, it takes the middle 1920x1080 pixels of the camera sensor as the video frame, rather than downsampling from every pixel on the sensor, so the video's field of view ends up narrower than the still photo's.
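If that's right, the narrowing is easy to estimate: the crop factor is just the ratio of the full sensor width to the cropped width. A rough sketch, assuming the iPhone 4's 5 MP rear sensor is 2592x1936 (an assumption on my part, not something I've measured):

```python
# Rough estimate of how much narrower a center 1920x1080 "sensor crop"
# is compared with using the full sensor width.
SENSOR_W, SENSOR_H = 2592, 1936  # assumed iPhone 4 rear sensor resolution
CROP_W, CROP_H = 1920, 1080      # 1080p video crop

def crop_factor(full: int, crop: int) -> float:
    """Linear crop factor: how much narrower the cropped view is."""
    return full / crop

horizontal = crop_factor(SENSOR_W, CROP_W)
print(f"horizontal crop factor: {horizontal:.2f}")
```

That works out to roughly a 1.35x narrower horizontal view, which is the kind of difference the dot-drawing test below should show up clearly.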
So, I did some (highly scientifical) testing to check this out for myself.
I put my iPhone on the desk and taped some paper to the wall it was facing. Then I switched between photo and video mode, making sure the camera stayed focussed on the wall, and drew dots on the paper at the edges of the view in each mode. Then I joined the dots and measured the resulting rectangles.
Maybe it'll help?
EDIT: The iPhone 4 doesn't have video stabilisation, and I don't have a 4S to repeat the drawings with. Maybe someone who has one can do it for comparison.