What makes video so hard on a DSLR? ....zoom, focusing, camera shake and steadiness.
There are many reasons, some already listed. First you must define what you mean by "video...on a DSLR". Casual video on a low-end true DSLR in full programmed auto? ENG (electronic news gathering) video on a DSLR-like camera such as a GH4? Feature film 2nd unit video or serious documentary work on a fully-rigged high-end DSLR?
Casual video to non-exacting standards on a modern DSLR with video AF, or on a similar mirrorless camera, is easy -- you just hit record. The results often won't be great, but with a totally novice operator the results from a cell phone or low-end camcorder are also frequently poor. Look how many cell phone videos are shot vertically. You could argue this type of DSLR video isn't hard, but the results aren't good either.
Moving up a notch to ENG but non-cinematic work, it's somewhat harder, but TV news departments are increasingly using lower-end DSLRs instead of camcorders. The reporters are not highly trained but manage to get decent results. Here is a three-camera DSLR shoot by ABC News in front of the White House:
https://joema.smugmug.com/Photography/ABC-News-Using-DSLRs/n-BsScJC
Moving up further to serious documentary or dramatic work, DSLR video is quite difficult. We generally must shoot in full manual mode, which means manual zoom, manual focus, and manual exposure (which typically means a manual ND filter and often manual ISO). The 180 degree shutter rule says the exposure time should be locked at half the frame interval, i.e. a shutter speed of 1/(2x the frame rate), e.g., 1/60th sec for 30 frames/sec. The aperture is kept fairly wide for shallow depth of field, else why use a DSLR. With shutter and aperture fixed, that leaves only ISO to balance the exposure, and ISO can only go so low. In bright light that in turn requires a manual ND filter, either fixed or variable.
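To put rough numbers on that, here is a small Python sketch of the arithmetic. It is only an illustration: the function names are made up for this example, and the scene brightness of EV 15 at ISO 100 is the usual rule-of-thumb figure for full sun, not a measured value.

```python
import math

def shutter_time(fps, shutter_angle=180.0):
    """Exposure time in seconds for a frame rate and shutter angle.
    A 180 degree shutter exposes for half of each frame interval."""
    return (shutter_angle / 360.0) / fps

def nd_stops_needed(scene_ev100, f_number, fps, iso=100, shutter_angle=180.0):
    """Stops of ND required so the locked shutter and aperture do not overexpose.
    scene_ev100 is the scene brightness as an exposure value at ISO 100
    (assumed here, e.g. about 15 for bright sun)."""
    t = shutter_time(fps, shutter_angle)
    # Exposure value the camera settings are "asking for"
    camera_ev = math.log2(f_number ** 2 / t)
    # Raising ISO above 100 makes the scene effectively brighter
    effective_scene_ev = scene_ev100 + math.log2(iso / 100)
    # Negative would mean underexposure; ND cannot help there
    return max(0.0, effective_scene_ev - camera_ev)

if __name__ == "__main__":
    fps = 30
    print("shutter: 1/%d sec" % round(1 / shutter_time(fps)))
    print("ND stops in bright sun at f/2.8, ISO 100: %.1f"
          % nd_stops_needed(15, 2.8, fps))
```

Under those assumptions, 30 frames/sec at f/2.8 and ISO 100 in full sun comes out to roughly 6 stops of ND, which is why an ND filter is not optional for outdoor work at wide apertures.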
Of course using a similarly sophisticated large-sensor camcorder like a Canon C100 is also not easy, but at least it has variable ND built in, plus some other aids. But you can't just stick a DSLR or a C100 in the hands of a novice and expect good results.
True DSLRs (vs mirrorless EVF cameras) add another complication: they often require a viewfinder loupe or an external EVF. That in turn often requires brackets to accommodate both the EVF and an on-camera shotgun mic. HDMI-connected accessories like EVFs or recorders are additional battery-powered devices, and HDMI was never designed for field use.
DSLR lenses are not designed for video use and often have non-linear zoom rates. It is often very difficult to get a smooth zoom, so this requires yet more strap-on aids like velcro sticks, gears or a follow focus system.
We put up with DSLRs because, if properly used, the quality is very lush and cinematic and the price is relatively low. When the 5D Mark II was introduced in 2008 it was revolutionary -- it previously took a $50,000 cinema camera to get that look. Things have changed and large-sensor camcorders now exist, but they are still more expensive than a lower-end DSLR. Yet that lower-end DSLR, if carefully employed, can produce impressive footage.
Newer mirrorless EVF cameras like the GH4 and Sony A7 series partially bridge the complexity gap but they are still harder to use than a prosumer camcorder.
It's harder to optically stabilize a DSLR, since any compensation mechanism must be larger due to the larger diameter optical path. Lens-based stabilization can work very well in some cases, even though it was not designed for video. However it's not as good as the stabilization in late-generation smaller-sensor camcorders.
Ergonomically DSLRs face another challenge since they are designed for brief, momentary use at eye level, not sustained use. They have no EVF so must be held away to view the LCD panel. You can add a big EVF like the Zacuto Z-Finder EVF Pro, and this really transforms the camera. Mounted on the camera hot shoe, the 3rd contact point against your eye makes the camera very stable and allows holding it at chest height while keeping the upper arms vertical. I can easily hand hold a 5 min. interview with a 70-200 lens that way. However it's a lot more complexity and expense, and is quite fragile in a rough field environment.