AI generative image models are trained on billions of images, and the weights in the models reflect it. One of the biggest problems is the art datasets used in training: human anatomy in paintings is often far from accurate, and the models absorb those distortions. There are other technical reasons too.
The models have problems with things that can appear in innumerable configurations. Shoelaces and hands, for example, can be arranged in countless ways, so the model cannot reliably predict them and errors appear in the output. It is easy to predict where arms are when they hang at a person's sides, but ask these image generators to produce people walking on their hands or waving their arms in the air and the results are very poor.
But they also have problems with compositionality. Try either of these simple strings of coloured primitives, in any order, and the output is almost guaranteed to be unsatisfactory:
'A yellow triangle inside a green square inside a red circle inside a blue hexagon'
'A yellow pyramid on top of a red ball on top of a green cube'
Or any combination of the above. A child could draw these in a few minutes, yet these systems running on hundreds of GPUs struggle badly.
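The "any order, any combination" test above can be automated. A minimal sketch (the pair list and helper name are illustrative, not from any specific benchmark) that permutes the colour/shape pairs from the first example into nested prompts, ready to feed to any text-to-image model:

```python
from itertools import permutations

# Colour/shape pairs from the nested-shapes example above.
PAIRS = [("yellow", "triangle"), ("green", "square"),
         ("red", "circle"), ("blue", "hexagon")]

def nested_prompt(pairs):
    """Join colour/shape pairs into 'A X inside a Y inside a Z ...' form."""
    parts = [f"{colour} {shape}" for colour, shape in pairs]
    return "A " + " inside a ".join(parts)

# Every ordering of the four primitives yields a distinct test prompt.
prompts = [nested_prompt(p) for p in permutations(PAIRS)]

print(len(prompts))   # → 24 orderings
print(prompts[0])     # → the original prompt from the article
```

Checking how many of the 24 resulting images nest the shapes correctly gives a rough, repeatable measure of a model's compositional failure rate, rather than a single anecdotal prompt.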