Assuming the memory positioning hasn't changed since the original Mac Pros, the best config would be:
Riser A: A1, A2 - 2GB each; A3, A4 - 1GB each
Riser B: B1, B2 - 2GB each
The order of install is:
Riser A slots 1/2, then Riser B slots 1/2, then Riser A slots 3/4, then Riser B slots 3/4. It's probably better to have the larger DIMMs in slots 1 and 2 of each riser rather than 3 and 4, but there is no real requirement for this. You do always want to fill slots 1 and 2 before 3 and 4, though I'm not sure whether anything will break if you don't. Most systems (and I assume Mac Pros are included) don't need DIMMs installed in a specific order; they just prefer it for optimal performance.
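As a rough sketch of that fill order, here is a small Python helper that assigns matched DIMM pairs to slots, largest pairs first. The names `INSTALL_ORDER` and `place_pairs` are made up for illustration and have nothing to do with any Apple tool; the sketch just restates the order described above.

```python
# Hypothetical helper: names and data layout are illustrative only.
# Slot pairs in the install order described above.
INSTALL_ORDER = [("A", (1, 2)), ("B", (1, 2)), ("A", (3, 4)), ("B", (3, 4))]


def place_pairs(pair_sizes_gb):
    """Assign matched DIMM pairs to slot pairs, largest pairs first.

    Each entry in pair_sizes_gb is the size in GB of one DIMM in a pair.
    Returns a dict mapping slot name (e.g. "A1") to DIMM size in GB.
    """
    if len(pair_sizes_gb) > len(INSTALL_ORDER):
        raise ValueError("at most four DIMM pairs fit in two risers")
    layout = {}
    ordered = sorted(pair_sizes_gb, reverse=True)
    for size, (riser, slots) in zip(ordered, INSTALL_ORDER):
        for slot in slots:
            layout[f"{riser}{slot}"] = size
    return layout


# Two 2GB pairs and one 1GB pair, as in the config at the top.
layout = place_pairs([1, 2, 2])
```

With two 2GB pairs and one 1GB pair, this puts 2GB DIMMs in A1, A2, B1 and B2, and 1GB DIMMs in A3 and A4, which is the config suggested at the top.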
The reason you want to fill each riser with 2 DIMMs, rather than one riser with 4 DIMMs, is that each riser belongs to one of the CPUs. For CPU B to access Riser A, it has to go through CPU A's bus, which adds latency to each memory access. Data for each CPU in a multithreaded application will be placed in the appropriate riser's RAM for faster access, so you'll get significantly better overall performance by alternating risers. Also, since CPU A will get load before CPU B, if you have an uneven amount of memory you'll generally want to put the extra on Riser A.
Although it doesn't apply to your situation, I think about the only exception to placing the larger DIMMs on Riser A would be if you had 4x1GB and 2x2GB memory sets. In that case you'll want to keep the 4x1GB on the same riser. The choice of riser probably doesn't matter as much here. However, 1GB DIMMs usually have slightly faster internal addressing timings than 2GB DIMMs, so I'd put the 4x1GB on Riser A and the 2x2GB on Riser B.
Generally you want to balance the amount of memory on each riser, not the number of DIMMs (beyond keeping them in pairs). So if you have 2x1GB, 2x2GB, and 2x4GB, put the 2x4GB on Riser A and the other four DIMMs on Riser B, so that you have 8GB optimally placed for CPU A and 6GB optimally placed for CPU B. While this isn't an "endorsed" configuration by Apple (because you filled all of Riser B before filling Riser A), it should be more efficient than the "endorsed" configuration.
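To make the "balance by quantity, not DIMM count" idea concrete, here is a minimal brute-force sketch in Python. The function name `balance_risers` and the pairs-per-riser limit of 2 are my own assumptions for illustration. It only evens out the GB totals and puts the larger share on Riser A; it ignores the timing consideration above and does not pick a winner between equal splits.

```python
from itertools import combinations


def balance_risers(pair_sizes_gb, pairs_per_riser=2):
    """Split DIMM pairs across two risers so the GB totals are as even as possible.

    Each entry is the size in GB of one DIMM in a matched pair, so a pair
    contributes 2 * size. Returns (riser_a, riser_b) as lists of pair sizes,
    with the larger (or equal) total on Riser A.
    """
    n = len(pair_sizes_gb)
    if n > 2 * pairs_per_riser:
        raise ValueError("too many DIMM pairs for two risers")
    best = None
    indices = range(n)
    # Try every way of choosing which pairs go on Riser A.
    for k in range(n + 1):
        for chosen in combinations(indices, k):
            a = [pair_sizes_gb[i] for i in chosen]
            b = [pair_sizes_gb[i] for i in indices if i not in chosen]
            if len(a) > pairs_per_riser or len(b) > pairs_per_riser:
                continue  # does not physically fit
            total_a, total_b = 2 * sum(a), 2 * sum(b)
            if total_a < total_b:
                continue  # keep the larger share on Riser A
            gap = total_a - total_b
            if best is None or gap < best[0]:
                best = (gap, a, b)
    return best[1], best[2]


# 2x1GB, 2x2GB and 2x4GB, as in the example above.
riser_a, riser_b = balance_risers([1, 2, 4])
```

For the 2x1GB, 2x2GB, 2x4GB case this returns the 4GB pair for Riser A and the 1GB and 2GB pairs for Riser B, i.e. 8GB against 6GB, matching the split described above.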