New Mac Pro sacrificed second CPU for better RAM performance?

Discussion in 'Mac Pro' started by goMac, Jun 25, 2013.

  1. macrumors 603

    Joined:
    Apr 15, 2004
    #1
    Interesting post on Hacker News:
    https://news.ycombinator.com/item?id=5923300

    To summarize the post:
    On dual processor machines, not all RAM is shared equally. Each bank of RAM is attached to a particular CPU, which is why the dual processor Mac Pro supports double the RAM. But things start to get messy once one CPU needs data from the other CPU's RAM bank: you take a performance hit because the two CPUs need to synchronize in order for the information to move between them.

    Other operating systems have better support for something called NUMA (Non-Uniform Memory Access). A full NUMA implementation basically tries to pair memory and CPU in the most efficient way, keeping a program and its data on the same CPU. This reduces the amount of synchronization the CPUs must do. But Mac OS X does not have a robust NUMA implementation.
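    To make the cost concrete, here is a toy model (my own illustration, not anything from Apple or the linked post; the latency numbers are assumptions) of how average memory latency on a dual-socket box depends on how often a core's data lives on the remote socket's RAM bank:

    ```python
    # Toy NUMA cost model. LOCAL_NS and REMOTE_NS are assumed figures for
    # local vs. cross-socket access latency, not measured values.
    LOCAL_NS = 70    # latency to the local socket's RAM, in ns
    REMOTE_NS = 130  # latency when the request crosses to the other CPU

    def avg_latency(remote_fraction):
        """Average access latency given the fraction of remote accesses."""
        return (1 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS

    # A NUMA-aware OS keeps the remote fraction low; naive interleaving
    # pins it near 0.5 on a two-socket machine.
    print(avg_latency(0.1))  # mostly-local placement -> 76.0 ns
    print(avg_latency(0.5))  # interleaved placement  -> 100.0 ns
    ```

    The point is that the OS's placement policy, not the hardware alone, decides how close you get to the local-access number.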

    This gets worse under the i7/newer Xeon architecture, since each PCIe slot is also associated with a particular CPU. So if a program on the wrong CPU is trying to talk to the GPU, you'll see diminished performance. One commenter noted an almost 30% loss in GPU performance on a dual CPU machine vs. a single CPU machine.

    I would think this could also cause problems as disks become faster. Either a PCIe disk or a SATA disk would be linked to a particular CPU, and would perform worse for a program running on the other CPU.

    This isn't to diminish workflows that are CPU-core bound. But I thought this might provide some insight into why Apple might be willing to sacrifice dual CPU configurations if doing so netted them better GPU and memory performance.

    Microsoft's article on NUMA:
    http://msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx
     
  2. macrumors 604

    Digital Skunk

    Joined:
    Dec 23, 2006
    Location:
    In my imagination
    #2
    Yes indeed. Good read, and it seems that is an issue with the older Mac Pro. The article says it may also be a hardware issue, and something Apple may have had a hard time fixing.

    I could imagine a really big gain from a single socket system once the machine starts getting taxed.

    From what I am gathering and reading, PC workstations don't seem to have that problem. The issue has been addressed with NUMA support in the BIOS and Windows on systems going as far back as 2006.

    Not taking a potshot at Apple, but this does seem to reinforce some users' speculation that Apple just didn't have the time, resources, or desire to compete in the high-end workstation game. I am glad they were smart enough to give us an upgrade path from an iMac and/or MacBook Pro.
     
  3. thread starter macrumors 603

    Joined:
    Apr 15, 2004
    #3
    The most interesting thing was Apple hardcoding interleaved memory access into other OSes on the Mac Pro, basically ensuring that you can't benchmark the difference between a Windows NUMA implementation and OS X on the same machine.

    Apple could possibly have reworked Mach to support a more robust NUMA implementation, but think of the risk that would introduce, set against the small number of dual processor Mac users out there. It probably wasn't worth the stability risk.
     
  4. macrumors 6502a

    Joined:
    Jun 26, 2010
    #4
    That's really fascinating. Thank you for posting that.
     
  5. macrumors 603

    VirtualRain

    Joined:
    Aug 1, 2008
    Location:
    Vancouver, BC
    #5
    Interesting. I've always wondered how this works. Thanks for sharing.
     
  6. macrumors 6502a

    Joined:
    Jun 13, 2013
    #6
    Multiple CPUs were a solution from before we had multiple cores. Now that we can push up to 12 cores onto a single die, symmetric multi-processing doesn't make much sense. Supercomputers take an entirely different approach, in that the problems given to them are already segmented for distribution.

    SMP is very probably going away and not coming back; for general computing it seems that 4 cores is about optimal, and 8 is as much as the vast majority can effectively use.
     
  7. deconstruct60, Jun 26, 2013
    Last edited: Jun 26, 2013

    macrumors 603

    Joined:
    Mar 10, 2009
    #7
    If he was on to something it might be interesting. He is not, though.

    So is the point that robust NUMA can't be added to the Mach kernel, or that Apple just isn't capable, or what? He is pointing at "look, other people did it and Apple doesn't have it yet"... so instead of adding it to the software, they bend the hardware backwards to paper over the software deficiencies?

    Eh?

    As opposed to the factor of how many complicated x86 cores are enough?

    This also completely ignores any discussion of whether QPI is effective enough at delivering the required bandwidth and latency. It pretty much is: good enough to be quite competitive.

    Apple's CPU workload allocation has been (and still is) crappy about optimizing in the context of Intel Turbo Boost technology, too. Does that mean they should drop Intel, or shift to processors that don't have it? No.



    First, it really isn't i7. It is only the subset of the i7s that share the Xeon E5 core implementation.

    Second, the new architecture has TWICE the QPI inter-socket bandwidth of before. There are still two QPI links in dual socket E5 designs and products. Sure, there are more PCIe lanes on the "other side," but there is also another QPI link to carry all of that traffic (plus any memory traffic). There is no QPI link to the Northbridge anymore; that has moved inside the package, so both links are used to join CPU packages. The link to the I/O hub/Southbridge is a separate DMI link.

    Is any of that taken into account? Nope.

    Diminished how much? An 8.0 GT/s QPI link is 32 GB/s of bandwidth (a 7.2 GT/s QPI link is ~28 GB/s). An x16 PCIe v3.0 link is 15.75 GB/s.

    Really, the QPI link substantially throttles the x16 PCIe v3.0 one even though QPI is substantially faster? Probably not. (Corner cases that are also coupled with horrible CPU/memory placements could perhaps depress bandwidth low enough to cause problems.)

    I'd be looking for another possible root cause (such as the drivers convincing themselves to go into a throttling mode because their preconceived notions of timing, or the gimmick they use to measure timing, are just that: a gimmick/hack).

    It could be a flawed hardware implementation of how the fused PCIe landscape is presented and interfaced to. But from a basic bandwidth/latency provisioning standpoint, that doesn't seem to be a big problem.




    Memory speeds increasing to levels that swamp QPI are probably a bigger threat than any of these.

    I suspect it is much simpler than that. Single package Xeon E5 1600s are much more affordable. Couple that with the desktop implementations being stuck at four x86 cores, and the vast majority of software isn't going to try to scale much past 4 cores. For 6-12 cores, the 4-core-targeted solutions will probably still work; for 20-30, they won't.

    Software vendors tend to optimize software for the hardware most people have. If 20 complicated x86 cores are never going mainstream, then there is zero reason to jump onto a future track going in that direction.

    ----------

    Yes, it does, because 12 identical cores on a single die talking to a set of memory controlled by that die is symmetric multi-processing (SMP).

    SMP vs. NUMA has very little to do with the number of cores and far more to do with how the memory and the slower I/O controller(s) are attached.


    Most modern supercomputers are neither SMP nor NUMA. They are multiple computer instances bound together into a fixed cluster. Not only is the memory segmented; the OS is segmented also.


    SMP is not going anywhere.

    As CPUs and GPGPUs more often share the same memory, you'll probably see adjustments to OpenCL that either merge or minimize the computations requiring large amounts of memory copying/transfers.
     
  8. thread starter macrumors 603

    Joined:
    Apr 15, 2004
    #8
    I don't think the bandwidth is the biggest problem so much as the latency. But bandwidth could be a problem when you have two processors both hammering the same bank of memory. QPI might give the processors the bandwidth to move bits back and forth (ignoring the latency), but if they're both hammering the same memory bus you're going to see performance diminish.

    I think that also has something to do with it. Cores have diminishing returns, so there is going to be less and less emphasis on core count. Last I heard, 72 cores was about the max before there would be no performance improvement due to the overhead, but that could have changed by now. That's a problem not shared by GPUs, since the units of work are so small per core.

    But a weak NUMA implementation would also be a knock against a dual processor configuration, especially if Apple is valuing GPU performance.
     
  9. macrumors 603

    Joined:
    Mar 10, 2009
    #9
    QPI's latency isn't worse than PCI-e latency. Whacked firmware or software is most likely the issue.

    Frankly, folks use GPU PCIe cards on switched PCIe lanes all the time in mainstream PCs.


    Actually, not. Hammering on a single memory controller will effectively serialize the accesses. However, 5-6 cores on CPU 1 each targeting the 4 memory controllers on CPU 2 could start to saturate the links.
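    A rough sketch of that contention effect (my illustration; the channel bandwidth figure is an assumption in the DDR3-1600 ballpark, and real controllers interleave rather than divide perfectly):

    ```python
    # Contention sketch: cores splitting evenly over memory controllers.
    # CHANNEL_GBS is an assumed per-channel figure, not a measurement.
    CHANNEL_GBS = 12.8  # roughly DDR3-1600, one channel

    def per_core_bw(cores, controllers):
        """GB/s each core sees if traffic splits evenly over controllers."""
        total = CHANNEL_GBS * controllers
        return total / cores

    print(per_core_bw(6, 1))  # six cores hammering one controller
    print(per_core_bw(6, 4))  # the same six cores spread over four
    ```

    Piling everyone onto one controller divides its bandwidth six ways; spreading the same cores over four controllers quadruples what each core sees.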


    QPI doesn't ignore latency issues.

    "... Minimizes lead off latency without sacrificing link utilization ... "

    http://www.hotchips.org/wp-content/...orial-Epub/HC21.23.120.Safranek-Intel-QPI.pdf


    Latency varies though.

    http://software.intel.com/en-us/forums/topic/280235



    Complicated general purpose cores have diminishing returns. Cores purely focused on computational grunt work will likely continue to increase in count long after the general purpose cores plateau.

    Chopping problems up into smaller pieces and doing the smaller pieces concurrently is going to continue to pay off for a while, as long as memory bandwidth can keep up. eDRAM is going to help greatly with that.


    That's primarily because they get starved. Crank up the memory I/O and it gets uncorked again.


    GPUs have the same fundamental problem. They have just moved to much faster memory, wider paths, and lower latencies at a quicker pace than general purpose cores have.


    It is the combination of weak NUMA that needs work and the low number of users who require it. At the common kernel level, this is probably one of the places where the iOS devices do have some impact. The iOS devices are all SMP with no NUMA even remotely in sight (generally the GPUs already share the same memory pool, so even GPGPU work is uniform). However, 4-12 cores is in sight. So which one do you spend time on? (Not NUMA.)


    And before anyone starts arm-flapping about iOS-ification: it is not iOS-ification. It is the exact limitation that OS X had before iOS even existed, so blaming it on iOS is a humongous pile of BS. It is pruning a possible evolutionary path rather than neutering something OS X already had.
     
  10. macrumors 65816

    Joined:
    May 1, 2012
    #10
    I think it's clear that a single 12-core CPU would produce better results than dual 6-core CPUs.

    But I think the point is that some people would want the option of dual 12-core CPUs. :eek:
     
  11. macrumors 6502a

    Joined:
    Mar 16, 2009
    #11
    Yup.

    After that it'll probably be dual 16 cores.

    Then dual 20's or 24's.

    Etc.
     
  12. macrumors 603

    Joined:
    Nov 17, 2008
    Location:
    Hollywood, CA
    #12
    Looks like they'll be getting a Dell, dude.
     
  13. macrumors 601

    derbothaus

    Joined:
    Jul 17, 2010
    #13
    TB interconnect should be just fine, right? TB is the answer to everything recently. Apple thinks of everything. :rolleyes:
     
  14. macrumors 603

    thekev

    Joined:
    Aug 5, 2010
    #14
    You can generally get higher-clocked 6-core CPUs if you're matching 2 x 6 against 1 x 12 on price.
     
  15. macrumors 65816

    phoenixsan

    Joined:
    Oct 19, 2012
    #15
    I echo....

    the posts/comments about Apple not being fully committed to high end computing. Back in the day, a company struggling to survive like SGI had a robust implementation of NUMA built into their hardware and IRIX, so go figure....


    Good article. Thanks for posting it....:D


    :):apple:
     
  16. macrumors 65816

    Joined:
    May 1, 2012
    #16
    But will the higher clock compensate for the additional overhead of inter-CPU communication?

    For single-threaded tasks, sure-- but single-threaded tasks are not why you'd buy a Mac Pro.
     
  17. macrumors 6502a

    Joined:
    Mar 16, 2009
    #17
    We've had HP, BOXX, and built our own. We'll see what makes sense at the time.
     
  18. macrumors 603

    Joined:
    Mar 10, 2009
    #18
    Not necessarily, if they are equal microarchitecture implementations.

    Two 6-core packages, each with a quad-channel memory interface (eight channels total), running at the same speed as one 12-core capped at a single quad-channel interface: the 12-core has half the memory bandwidth.

    You can wave hands about some of that memory being remote (QPI isn't that slow in latency or bandwidth). It is about the same argument as 3 DIMMs vs. 4 DIMMs in the current Mac Pro: if the problem solution can leverage more memory, the additional DIMM is worth the overhead of the multiple-rank connection.

    The problem with the single package 12-core is that it has a 3:1 ratio of cores to memory channels. A dual 6-core setup would have a 3:2 ratio.
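    Those ratios follow directly if each E5 package carries a quad-channel memory interface (the standard E5 configuration, assumed here):

    ```python
    # Core-to-memory-channel ratios for the two configurations discussed.
    # Assumes 4 memory channels per package, as on Xeon E5.
    from fractions import Fraction

    def core_channel_ratio(cores_per_pkg, packages, channels_per_pkg=4):
        """Cores per memory channel, as an exact fraction."""
        return Fraction(cores_per_pkg * packages, channels_per_pkg * packages)

    print(core_channel_ratio(12, 1))  # 3   -> 3:1 for the single 12-core
    print(core_channel_ratio(6, 2))   # 3/2 -> 3:2 for dual 6-cores
    ```

    Same total core count, but the dual-socket configuration has twice the memory channels feeding them.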

    Run 12 streams of realtime AES crypto from RAM (not localized to a single DIMM) and the dual 6s would turn in better times.

    The question is how large that group is. With the single package 12-core, the upcoming model can peel off those whose workload is not RAM constrained (past what is reasonably affordable with 4 DIMM slots) and/or isn't memory I/O bound. That is likely a sizable subgroup.
     
  19. macrumors 68040

    Joined:
    Feb 2, 2008
    #19
    The problem with this thread is that it's based on the speculation of one guy. Adding to that, the subject matter is entirely over the heads of anyone here, leading to a bunch of empty technobabble.
     
  20. macrumors regular

    tomvos

    Joined:
    Jul 7, 2005
    Location:
    In the Nexus.
    #20
    Actually, NUMA and ccNUMA were among the few good things that survived at SGI. Today it is called NUMAlink, and it is one of the reasons the NSA likes to buy Altix clusters instead of Mac Pro clusters. :p

    The real mistakes SGI made were more like abandoning MIPS for Itanium, stuff like Fahrenheit, and generally not recognizing that computing was moving to more computers at lower price points.


    But back to topic:
    I don't think that Apple switched to a single socket in the new Mac Pro because of their lack of NUMA. I guess it is mainly related to available space in the new Mac Pro. A second CPU might fit, but adding up to four additional memory banks looks like it just would not fit any more.
     
  21. macrumors 603

    thekev

    Joined:
    Aug 5, 2010
    #21
    In a given application some things can be poorly threaded compared to others. There's a lot of software that still calls bits of old code or relies on older approaches for specific tasks, so it still matters to some degree. Anyway I think the overhead issue was one of the reasons they got rid of the old FSB.


    I think they're betting on Xeon E5 1600 variants eventually hitting a point of good enough there. The 12-core really seems like a desire not to drop in core count from the prior generation, as the real performance gains have come from core count increases, assuming all cores can be fully utilized.
     
  22. macrumors 603

    Joined:
    Mar 10, 2009
    #22
    No. It is the cores-to-memory-controller ratio. With one front side bus that everyone shares: 2-4 cores can probably get away with it, 6 is sketchy, 8 is probably getting throttled, and >8 is definitely getting throttled if the cores are any good.

    In short, it doesn't scale. Serializing kills off parallel performance.

    With the memory bandwidth coupled to the CPU package, as you add more packages, bandwidth scales right along with the cores. Access isn't completely uniform, but it can be made good enough.


    It is more of a bet that you don't need homogeneous cores. The system's total number of cores here more than doubled at roughly equivalent price points. They are just not all large general purpose x86 cores.

    Besides, most current single package users are not exactly hobbled by the current 6-core option. The price is more often the complaint than the performance.

    The E5 v2 12 core is likely a real performance gain. Not some revolutionary jump or one to spur a major design shift, but incremental progress.
     
  23. thread starter macrumors 603

    Joined:
    Apr 15, 2004
    #23
    No, certainly not. Thread title was probably badly worded in that way.

    But I think at some point Apple sat down and had to decide how serious they were about their highest end customers. If they wanted to stay competitive with other vendors in that space (as other posters have noted), they'd need a good NUMA implementation; without one, they couldn't stay competitive there.

    So it likely contributed to Apple's decision to drop out of the dual processor space.

    From Apple's perspective, they could ship a 24 core box, but if everyone is going to complain about the lack of a proper NUMA architecture and buy better performing boxes from the competition, what's the point?
     
  24. macrumors 603

    Joined:
    Mar 10, 2009
    #24
    In turn....
    buying MIPS in the first place, as opposed to working with them in an open, multi-vendor, ARM-like fashion with multiple implementers. If there had been multiple big-system vendors using MIPS, SGI wouldn't have had to turn to Itanium when Intel decided to try to put all the big-iron RISC implementations out of business. What MIPS needed was a lot of capital to keep up, and being inside SGI didn't provide that.

    Fahrenheit: classic Microsoft "Embrace, Extend, Extinguish" done in not quite so many steps.

    When you have a sales force that works on commission, they tend to gravitate to the higher price tags that generate bigger paychecks and fewer closings. It's the same trap Sun fell into. Neither of them figured out how to navigate from high end workstations to high end PCs well.


    It's a chicken-and-egg kind of thing. Did punting the second CPU open up more space reduction, or did they start small and stop once they had packed the "small" with the maximum amount of stuff?

    If you're trying to maximize the number of cores in the device, the second CPU package got tossed because it can't keep up. For the same space on the second side of the triangle, you can get more cores with another GPU than with a CPU. (The additional 4 DIMMs are another problem, though, at the same radius.)

    IMHO, I think it is much simpler than that. They didn't sell enough and didn't have enough system integration differentiation. Same rationale that axed the Xserve RAID, the Xserve, and other products Apple has crossed off the list.

    For the big-box dual package systems, a large fraction of buyers just want a box they can jam a lot of stuff into so it looks neater, because they can close the lid and make it all disappear. Couple that with very low unit sales numbers and corner case OS X extensions, and it was probably going to lose Apple's interest, since it is almost as much closet/storage chest as computer system.
     
  25. thread starter macrumors 603

    Joined:
    Apr 15, 2004
    #25
    Pushing OpenCL is also very smart, since OpenCL runs on many, many devices (GPUs, CPUs, Xeon Phi). It gives them a lot of flexibility in future designs. Second GPU isn't working out? They can put a CPU or a Xeon Phi in instead and users' apps will continue performing well.

    A lot of people are fixated on the second CPU because they have workflows that will eventually become legacy. With OpenCL, computing resources are generalized, so it doesn't matter as much what specific processors are in the machine. Some types of processors are better at OpenCL than others, but at least the code doesn't have to care whether it's running on a GPU or a CPU.

    OpenCL would also open the door to external processing nodes via Thunderbolt. You could connect a bunch of Xeon Phis to a machine over Thunderbolt and gain the extra OpenCL power.
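    The device-agnostic idea can be illustrated with a toy sketch (plain Python standing in for the OpenCL runtime; the `Device` class and `square` kernel are invented for illustration, not real OpenCL API): the application expresses the work once, and the runtime decides which device executes it.

    ```python
    # Toy analogue of OpenCL's platform/device model: the same "kernel"
    # runs unchanged whichever device the runtime hands it to.
    class Device:
        def __init__(self, name):
            self.name = name

        def run(self, kernel, data):
            # A real OpenCL runtime would compile the kernel for this
            # device; here every device just applies the same function.
            return [kernel(x) for x in data]

    def square(x):  # stand-in for an OpenCL kernel
        return x * x

    for dev in [Device("GPU"), Device("CPU"), Device("Xeon Phi")]:
        print(dev.name, dev.run(square, [1, 2, 3]))  # same result on each
    ```

    Swapping the device list changes where the work runs, not the application code, which is the flexibility being argued for above.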
     
