If you define "hardly any" as at least a 10% performance loss, true.

Can you really scale drive performance like that though?

Marketing numbers seldom reflect actual real-world performance - so the 20 Gbps raw number is marketed, and nobody mentions that the payload cannot be more than 16 Gbps.

So is 500 MB/s per lane of PCIe 2.0 marketed or real-world performance?


T-Bolt 1 was PCIe 1.0 - because it wasn't any faster than PCIe 1.0 (in a 4 lane to 4 lane bridge).

We have very little info on T-Bolt 2, but its speed does match with a PCIe 2.0 4 lane to 4 lane bridge.

I thought TB 2.0 was merely the combination of the two 10 Gb/s channels of TB 1.0 into a single, dynamic channel, instead of reserving one channel for DisplayPort and one for data. My question was simply: if there is nothing more than 4 lanes of PCIe 2.0 going into the TB controller (I could be out of my league here in terms of technicality), how can they claim anything more than 16 Gb/s even at a theoretical level?

In theory, T-Bolt 2.0 could provide a PCIe 3.0 x1 or x2 interface. Or, it could provide a PCIe 3.0 x4 with pauses between packets - but there would be little actual advantage to doing that other than possibly being able to run at 20 Gbps instead of 16 Gbps depending on the encoding.

In truth, though, T-Bolt is an opaque, proprietary interface that is very difficult to get any facts on unless you are a licensee.

Why would there be pauses between packets?
So, is it known how many lanes each controller/port consumes? I assume on the new Mac Pro all 40 lanes are in use. Presumably at least 16 of them are reserved for the GPUs (I assume each one is x8), which leaves 24 lanes. If each port is an x4 bridge (or whatever it's called), then that would account for the rest of the lanes. Or, as I seem to recall, are there only 3 controllers for all 6 ports? I'm just a little foggy on the facts here and how it works.
 
But what about 2 years from now? 3? 5? It used to be you could just swap out your old cards and carry on from there. Now, you have to upgrade the whole machine.

So what makes you think that two brand-new high-end workstation graphics cards, enough faster than two FirePro W9000s to matter, will be that much cheaper than buying a new (even faster) Mac Pro two or three years from now?

We haven't seen the pricing yet...
 
So is 500 MB/s per lane of PCIe 2.0 marketed or real-world performance?

See the chart at https://forums.macrumors.com/posts/17431618/ - "raw bit rate" is marketing, "interconnect bandwidth" is peak real world.

(500 MB/s for 2.0 is peak real world).
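
To make the chart's distinction concrete, here's a minimal sketch (Python; per-lane signaling rates and line encodings as published for each PCIe generation) that derives the usable payload from the raw bit rate:

```python
# Raw bit rate vs. peak usable payload, per PCIe lane, per direction.
GT = 1e9  # giga-transfers per second

pcie = {
    # generation: (transfer rate, payload bits per line bit)
    "1.0": (2.5 * GT, 8 / 10),     # 8b/10b encoding
    "2.0": (5.0 * GT, 8 / 10),     # 8b/10b encoding
    "3.0": (8.0 * GT, 128 / 130),  # 128b/130b encoding
}

for gen, (rate, encoding) in pcie.items():
    payload_mb_s = rate * encoding / 8 / 1e6
    print(f"PCIe {gen}: raw {rate / GT:.1f} GT/s, "
          f"peak payload {payload_mb_s:.0f} MB/s per lane")

# PCIe 2.0 works out to 500 MB/s per lane per direction -- the "peak
# real world" number.  Four such lanes give 20 GT/s raw but only
# 16 Gbps (2 GB/s) of payload, which is the 20-vs-16 discrepancy above.
```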


I thought TB 2.0 was merely the combination of the two 10 Gb/s channels of TB 1.0 into a single, dynamic channel, instead of reserving one channel for DisplayPort and one for data. My question was simply: if there is nothing more than 4 lanes of PCIe 2.0 going into the TB controller (I could be out of my league here in terms of technicality), how can they claim anything more than 16 Gb/s even at a theoretical level?

Long-standing tradition of marketing using the biggest number that they can find, regardless of whether it's usable payload or not.


Why would there be pauses between packets?

If your input pipe is sending you packets at 4 Gbps/lane and your output pipe runs at 8 Gbps/lane, you have to pause between output packets while the next one buffers up.
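
A toy illustration of that pacing (Python; the 4 and 8 Gbps figures are just the example rates above, and the packet size is arbitrary):

```python
# A slower input feeding a faster output: the output link must sit idle
# between packets while the next packet arrives and buffers.
input_rate = 4e9       # bits/sec per lane (example figure above)
output_rate = 8e9      # bits/sec per lane (example figure above)
packet_bits = 256 * 8  # an arbitrary 256-byte packet

arrival = packet_bits / input_rate    # time for one packet to arrive
transmit = packet_bits / output_rate  # time to forward it

duty = transmit / arrival
print(f"output link busy {duty:.0%} of the time, pausing {1 - duty:.0%}")
# -> output link busy 50% of the time, pausing 50%
```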


So, is it known how many lanes each controller/port consumes? I assume on the new Mac Pro all 40 lanes are in use. Presumably at least 16 of them are reserved for the GPUs (I assume each one is x8), which leaves 24 lanes. If each port is an x4 bridge (or whatever it's called), then that would account for the rest of the lanes. Or, as I seem to recall, are there only 3 controllers for all 6 ports? I'm just a little foggy on the facts here and how it works.

No, it is not known. Intel has said almost nothing about the T-Bolt 2 architecture, and Apple says even less (except for the 72-point font touting "20 Gbps").

The bandwidth of T-Bolt is roughly PCIe 3.0 x2. The fact that three 4K video outputs are possible suggests that there are three dual-port T-Bolt controllers. (It also suggests that running 3 4K displays and accessing your T-Bolt disks at the same time could be a problem.)

We have no idea if there are bottlenecks (for example, each pair of T-Bolt ports having 20 Gbps total rather than 40 Gbps total).
 
Can you really scale drive performance like that though?

Yes - you can. Read/write performance on RAID-0 or JBOD, and read performance on RAID-5, scales nearly linearly until a bus or controller bandwidth limitation is met.

And the point that I've been making is that, unlike the fans who think that T-Bolt is infinitely fast, T-Bolt is actually quite slow compared to PCIe on the motherboard.
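
A minimal sketch of that scaling claim (Python; the drive speed and bus ceiling are illustrative numbers, not measurements):

```python
# Idealized striped-array scaling: linear in drive count until a bus or
# controller bandwidth ceiling clips it.
def striped_throughput(n_drives, per_drive_mb_s, ceiling_mb_s):
    """Sequential RAID-0/JBOD throughput in MB/s, ignoring overheads."""
    return min(n_drives * per_drive_mb_s, ceiling_mb_s)

for n in range(1, 9):
    print(n, striped_throughput(n, 200, 1200))  # assumed: 200 MB/s drives,
                                                # 1200 MB/s bus ceiling
# Linear through 6 drives, flat after that.
```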
 
T-Bolt 1 was PCIe 1.0 - because it wasn't any faster than PCIe 1.0 (in a 4 lane to 4 lane bridge).

We have very little info on T-Bolt 2, but its speed does match with a PCIe 2.0 4 lane to 4 lane bridge.

In theory, T-Bolt 2.0 could provide a PCIe 3.0 x1 or x2 interface. Or, it could provide a PCIe 3.0 x4 with pauses between packets - but there would be little actual advantage to doing that other than possibly being able to run at 20 Gbps instead of 16 Gbps depending on the encoding.

In truth, though, T-Bolt is an opaque, proprietary interface that is very difficult to get any facts on unless you are a licensee. We don't know if the host side is a PCIe 2.0 or 3.0 part, nor if the device side presents a PCIe 1.0, 2.0 or 3.0 bus.

Another answer to the 20 Gbps T-Bolt 2 limitation is that this is the maximum bandwidth carried over the DMI 2.0 link between the Northbridge or combined Northbridge/CPU package and the Southbridge I/O or Platform Controller Hub (PCH) used on Sandy Bridge, Ivy Bridge, and newer architectures.
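
For what it's worth, the 20 Gbps figure falls out of the DMI 2.0 signaling (a sketch, assuming the link is electrically four 5 GT/s lanes with 8b/10b encoding, the same signaling as PCIe 2.0):

```python
# DMI 2.0: four lanes at 5 GT/s, 8b/10b encoded (PCIe 2.0 signaling).
lanes, rate_per_lane = 4, 5e9

raw = lanes * rate_per_lane  # the marketed "raw" figure
payload = raw * 8 / 10       # after 8b/10b line encoding

print(f"raw {raw / 1e9:.0f} Gbps, payload {payload / 1e9:.0f} Gbps "
      f"({payload / 8 / 1e9:.0f} GB/s)")
# -> raw 20 Gbps, payload 16 Gbps (2 GB/s)
```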
 
And the point that I've been making is that, unlike the fans who think that T-Bolt is infinitely fast, T-Bolt is actually quite slow compared to PCIe on the motherboard.

Especially when people consider that single and multi-CPU Xeon systems using QuickPath Interconnect and PCH architecture have the PCIe slots very closely coupled to the CPU package.

...which of course means it makes no sense to have two high-performance graphics cards route their output through DMI and T-bolt which is shared with all other I/O traffic.

I'd also add that even though Apple (I believe) is saying there are multiple independent T-Bolt buses (bridges) in the tube, as long as Apple is using the Intel chipset architecture, it's all being fed through a single shared 20 Gbps DMI 2.0 bus.

----------

I hope that you're wearing your flak jacket if you dare to say "T-Bolt 2 limitation" here.... ;)

The fans don't like reality to intrude.

How true. LOL. :D
 
Yes - you can. Read/write performance on RAID-0 or JBOD, and read performance on RAID-5, scales nearly linearly until a bus or controller bandwidth limitation is met.

And the point that I've been making is that, unlike the fans who think that T-Bolt is infinitely fast, T-Bolt is actually quite slow compared to PCIe on the motherboard.

edit: just looking up some scaling tests over at Tom's Hardware. They're old, so scaling performance may have increased, but in sequential reads/writes going from 1 drive to 5 in RAID 0, you lose about 17%. So, the first drive performed at 235 MB/s (yes, faster drives exist now), while a 5-drive array maxed out right around 1000 MB/s. With those small losses, you probably wouldn't get much above 2000 MB/s with 4 drives, I'd imagine. But that's beside the point -- I understand that your overall point is about the limitations of TB vs. the much faster PCIe.
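
A quick check of that scaling, using the rounded figures above (the article's exact numbers may differ a little):

```python
# Observed RAID-0 scaling efficiency from the quoted Tom's Hardware test.
single_drive = 235.0  # MB/s, one drive
five_drives = 1000.0  # MB/s, five drives in RAID 0

ideal = 5 * single_drive
efficiency = five_drives / ideal
print(f"ideal {ideal:.0f} MB/s, measured {five_drives:.0f} MB/s, "
      f"efficiency {efficiency:.0%} (about {1 - efficiency:.0%} loss)")
# -> roughly 85% efficiency, i.e. a loss in the 15-17% ballpark.
```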

Fantastic. Thanks for sharing your expertise.

One final question: what are the latency differences between PCIe and TB (if known), and in what types of situations does this become a critical factor?


Especially when people consider that single and multi-CPU Xeon systems using QuickPath Interconnect and PCH architecture have the PCIe slots very closely coupled to the CPU package.

...which of course means it makes no sense to have two high-performance graphics cards route their output through DMI and T-bolt which is shared with all other I/O traffic.

I'd also add that even though Apple (I believe) is saying there are multiple independent T-Bolt buses (bridges) in the tube, as long as Apple is using the Intel chipset architecture, it's all being fed through a single shared 20 Gbps DMI 2.0 bus.

There must be a way to get around this. A 4K display takes ~15 Gb/s bandwidth, depending on refresh rate. Apple may like to use the highest "theoretical" numbers on their keynotes, but they can't outright lie. With only 20 Gb/s bandwidth, it would be impossible to run 3 x 4K displays simultaneously, even at a very low refresh rate.
 
There must be a way to get around this. A 4K display takes ~15 Gb/s bandwidth, depending on refresh rate. Apple may like to use the highest "theoretical" numbers on their keynotes, but they can't outright lie. With only 20 Gb/s bandwidth, it would be impossible to run 3 x 4K displays simultaneously, even at a very low refresh rate.

I'm afraid that's impossible to answer unless Apple publishes an architectural diagram of the system.

Edit: been doing some speculating on how it might work -- which shouldn't be taken or rumored to be how it will.

Starting from the Wikipedia article on 4K display resolution, a 4K display can display anywhere from 3840x2160 to 4096x3112 pixels. Multiplied by either 24 or 32 bits per pixel and a 60 Hz or 75 Hz refresh rate, I'm getting a maximum possible bandwidth usage of 15.82 Gbit/s on each T-bolt port.
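
A quick back-of-the-envelope for a few of those combinations (Python; raw pixel data only, ignoring blanking intervals and protocol overhead):

```python
# Uncompressed display bandwidth = width * height * bits/pixel * refresh.
def display_gbps(width, height, bpp, hz):
    return width * height * bpp * hz / 1e9

for w, h, bpp, hz in [(3840, 2160, 24, 60),
                      (3840, 2160, 32, 60),
                      (4096, 2160, 24, 60)]:
    print(f"{w}x{h} @ {hz} Hz, {bpp} bpp: "
          f"{display_gbps(w, h, bpp, hz):.2f} Gbit/s")
# The 3840x2160 / 32 bpp / 60 Hz case lands at ~15.9 Gbit/s, in the same
# ballpark as the 15.82 Gbit/s figure above.
```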

With Aiden's post showing 8 Gbit/s each way on every PCIe 3 lane and T-bolt 2 topping out at 16 Gbps usable bandwidth each way, it's possible that the two graphics cards each have two T-bolt 2 bridges/adapters on them that are connected via internal cabling to the T-bolt 2 ports on the back of the tube. Two T-bolt 2 adapters per card = 4 bi-directional PCIe 3 lanes per card as a bare minimum, and allowing a PCIe3 x8 connector per card gives a total of 16 PCIe3 lanes for graphics/T-bolt. Assume one graphics connector is electrically equivalent to a PCIe3 x16 slot, to accommodate an extra 2 lanes for the SSD blade attached to one of the graphics cards and another 2 PCIe3 lanes for bridging to the PCH and its downstream USB3, GbE, and audio in/out ports, and everything fits within the maximum 20 PCIe3 lanes allowed by a single CPU.
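
Tallying that speculated budget (all allocations are assumptions from the scenario above, not an actual Apple schematic):

```python
# Speculative PCIe3 lane budget for a single-CPU package with 20 lanes.
budget = {
    "graphics card #1 (x8 electrical)": 8,
    "graphics card #2 (x8 electrical)": 8,
    "SSD blade via card #1's wider slot": 2,
    "bridge to PCH (USB3, GbE, audio)": 2,
}

for part, lanes in budget.items():
    print(f"  x{lanes}  {part}")
print(f"total: {sum(budget.values())} of 20 lanes")  # -> exactly 20
```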

So it might work, but given that each graphics card would have a different configuration (one an x16 with flash drive and one an x8 without flash drive), I wouldn't like to speculate on upgradeability or maintainability, and certainly not on price. Custom configs always drive up the cost because they can't be sourced in bulk and take advantage of economies of scale. Probably best to think of the tube as a modern equivalent of the Twentieth Anniversary Macintosh.

Edit #2: I also wouldn't like to speculate on real-world T-bolt2/display performance if each T-bolt port was being used for a 4K display and external storage I/O at the same time.
 
I'm afraid that's impossible to answer unless Apple publishes an architectural diagram of the system.

Edit: been doing some speculating on how it might work -- which shouldn't be taken or rumored to be how it will.

Starting from the Wikipedia article on 4K display resolution, a 4K display can display anywhere from 3840x2160 to 4096x3112 pixels. Multiplied by either 24 or 32 bits per pixel and a 60 Hz or 75 Hz refresh rate, I'm getting a maximum possible bandwidth usage of 15.82 Gbit/s on each T-bolt port.

With Aiden's post showing 8 Gbit/s each way on every PCIe 3 lane and T-bolt 2 topping out at 16 Gbps usable bandwidth each way, it's possible that the two graphics cards each have two T-bolt 2 bridges/adapters on them that are connected via internal cabling to the T-bolt 2 ports on the back of the tube. Two T-bolt 2 adapters per card = 4 bi-directional PCIe 3 lanes per card as a bare minimum, and allowing a PCIe3 x8 connector per card gives a total of 16 PCIe3 lanes for graphics/T-bolt. Assume one graphics connector is electrically equivalent to a PCIe3 x16 slot, to accommodate an extra 2 lanes for the SSD blade attached to one of the graphics cards and another 2 PCIe3 lanes for bridging to the PCH and its downstream USB3, GbE, and audio in/out ports, and everything fits within the maximum 20 PCIe3 lanes allowed by a single CPU.

So it might work, but given that each graphics card would have a different configuration (one an x16 with flash drive and one an x8 without flash drive), I wouldn't like to speculate on upgradeability or maintainability, and certainly not on price. Custom configs always drive up the cost because they can't be sourced in bulk and take advantage of economies of scale. Probably best to think of the tube as a modern equivalent of the Twentieth Anniversary Macintosh.

Edit #2: I also wouldn't like to speculate on real-world T-bolt2/display performance if each T-bolt port was being used for a 4K display and external storage I/O at the same time.

So, if I understand you right, both the data and DisplayPort portions of the TB signal would go through the x16 graphics card lanes? So those lanes are doing "double duty"?

Second, on a normal motherboard, do connections like FW, USB etc. take PCIe lanes, or is that an entirely different protocol?

And finally, I was under the impression that each CPU had 40 total lanes of PCIe that come directly off of it, not 20.

edit: After doing some research: I don't think the majority of the PCIe lanes are routed through the PCH (and thus DMI). Sandy Bridge-E processors and the LGA 2011 socket have 40 lanes that come directly off of the CPU, and 8 lanes that presumably connect via the PCH for things like SATA and USB 3.0. So, I'm sure (especially with the new generation of Ivy Bridge-EP) that all of the Thunderbolt ports are using the faster lanes that come directly from the CPU. Not sure if that means they go through the graphics cards, although I would presume this is the case, since Thunderbolt is multiplexing the data + DisplayPort signal.
 
edit: just looking up some scaling tests over at Tom's Hardware. They're old, so scaling performance may have increased, but in sequential reads/writes going from 1 drive to 5 in RAID 0, you lose about 17%. So, the first drive performed at 235 MB/s (yes, faster drives exist now), while a 5-drive array maxed out right around 1000 MB/s. With those small losses, you probably wouldn't get much above 2000 MB/s with 4 drives, I'd imagine.

Without looking at the test rig specs, it's hard to speculate on why there was a slowdown (main memory bandwidth, PCIe bus, controller, CPU, OS, ...). Do you have the link?


One final question: what are the latency differences between PCIe and TB (if known), and in what types of situations does this become a critical factor?

It's unlikely to be a factor for disk I/O, since data will stream: a handful of nanoseconds of additional T-Bolt latency to set up the DMA transfer, then a large number of pipelined transactions.
 
Without looking at the test rig specs, it's hard to speculate on why there was a slowdown (main memory bandwidth, PCIe bus, controller, CPU, OS, ...). Do you have the link?

http://www.tomshardware.com/reviews/ssd-raid-iops,2848.html


It's unlikely to be a factor for disk I/O, since data will stream: a handful of nanoseconds of additional T-Bolt latency to set up the DMA transfer, then a large number of pipelined transactions.

So in what circumstances would the extra latency be problematic?
 
edit: just looking up some scaling tests over at Tom's Hardware. They're old, so scaling performance may have increased, but in sequential reads/writes going from 1 drive to 5 in RAID 0, you lose about 17%. So, the first drive performed at 235 MB/s (yes, faster drives exist now), while a 5-drive array maxed out right around 1000 MB/s.


I didn't see any obvious reasons - but note that for the streaming benchmarks you seem to be referencing, even Tom's considered the results to be "almost linear". The IOPS tests were even closer to linear.

One thing that I did notice was that the size of the IOs used for the streaming benchmarks wasn't mentioned, nor was the "stripe size" of the RAID controller. If your IO is less than the "full stripe" of the array, you won't hit all the drives. (For example, if the stripe is 256 KiB, then an IO of 512 KiB will only hit two or three drives, depending on the alignment. To hit all five drives, issue aligned IOs of 1280 KiB.)
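
A small sketch of that alignment arithmetic (Python; drives_hit is a hypothetical helper, using 256 KiB chunks on a 5-drive RAID-0 as in the example):

```python
# How many drives one IO touches, given the per-drive chunk size.
def drives_hit(offset_kib, size_kib, chunk_kib=256, n_drives=5):
    """Count distinct drives touched by a single IO on a RAID-0 array."""
    first = offset_kib // chunk_kib
    last = (offset_kib + size_kib - 1) // chunk_kib
    return len({chunk % n_drives for chunk in range(first, last + 1)})

print(drives_hit(0, 512))    # aligned 512 KiB IO    -> 2 drives
print(drives_hit(128, 512))  # misaligned 512 KiB IO -> 3 drives
print(drives_hit(0, 1280))   # aligned full stripe   -> all 5 drives
```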


So in what circumstances would the extra latency be problematic?

Small, synchronous (ask for a small bit of data, wait for it) IOs.

And note that light travels about 30 cm per nanosecond - so a 2 meter cable adds roughly 7 ns of latency, and that 30 m optical cable adds roughly 100 ns. (Roughly, since signals in copper and optical fibre travel somewhat slower than light in a vacuum.) See Grace Hopper's "nanosecond".
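
Reproducing those figures with the 30 cm/ns rule of thumb:

```python
# One-way propagation delay at roughly the speed of light.
cm_per_ns = 30  # Grace Hopper's "nanosecond" of wire

for meters in (2, 30):
    print(f"{meters} m cable: ~{meters * 100 / cm_per_ns:.0f} ns one-way")
# -> 2 m: ~7 ns, 30 m: ~100 ns (and a bit more in real copper or fibre,
#    where signals travel somewhat slower than c)
```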
 
I didn't see any obvious reasons - but note that for the streaming benchmarks you seem to be referencing, even Tom's considered the results to be "almost linear". The IOPS tests were even closer to linear.

One thing that I did notice was that the size of the IOs used for the streaming benchmarks wasn't mentioned, nor was the "stripe size" of the RAID controller. If your IO is less than the "full stripe" of the array, you won't hit all the drives. (For example, if the stripe is 256 KiB, then an IO of 512 KiB will only hit two or three drives, depending on the alignment. To hit all five drives, issue aligned IOs of 1280 KiB.)

Sure. I noticed the article used the same wording as you did, "almost linear." And a 17% total loss in sequential reads and writes over 5 drives isn't much overhead -- only a few percent per added drive. I'd say, in perspective, it scaled pretty damn well (especially compared with mechanical drives, it seems), and is, as you say, "almost linear". All to say (and it's a minor point without a doubt), you could have at least 3 and probably 4 full-speed SSDs running on a single TB channel in RAID 0 without the interface bottlenecking performance.

Small, synchronous (ask for a small bit of data, wait for it) IOs.

Would that be similar to the types of frequent random reads/writes that servers typically perform?
 
Sure. I noticed the article used the same wording as you did, "almost linear." And a 17% total loss in sequential reads and writes over 5 drives isn't much overhead -- only a few percent per added drive. I'd say, in perspective, it scaled pretty damn well (especially compared with mechanical drives, it seems), and is, as you say, "almost linear". All to say (and it's a minor point without a doubt), you could have at least 3 and probably 4 full-speed SSDs running on a single TB channel in RAID 0 without the interface bottlenecking performance.

Currently the top SATA 6 Gbps SSDs are around 550 MB/sec read, so 3 of them (1.65 GB/sec) would probably stay reasonably linear, but adding the 4th (needing 2.2 GB/sec) would definitely suffer due to the 2 GB/sec theoretical peak of T-Bolt2. I wouldn't be surprised if the 4th added very little - since in practice the theoretical peak performance is seldom reached.
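
The same min() model as before, with those SSD numbers plugged in (assumed: 550 MB/s per drive, a ~2000 MB/s theoretical T-Bolt 2 ceiling):

```python
# Marginal gain from each added SSD under the T-Bolt 2 payload ceiling.
per_ssd, tb2_cap = 550.0, 2000.0  # MB/s (assumed figures from above)

prev = 0.0
for n in range(1, 5):
    total = min(n * per_ssd, tb2_cap)
    print(f"{n} SSDs: {total:.0f} MB/s (+{total - prev:.0f})")
    prev = total
# The 4th SSD can add at most 350 MB/s even in theory, and less in
# practice, since the theoretical peak is seldom reached.
```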


Would that be similar to the types of frequent random reads/writes that servers typically perform?

Even with a 4 KiB transfer at 6 Gbps, it would take about 5.5 microseconds to transfer the data from the disk (assuming it is in cache or on a "wire speed" SSD). Adding a dozen nanoseconds wouldn't be noticeable (especially since typical servers have dozens or hundreds of active threads - so the latency is almost unimportant for throughput statistics). If you have spinning hard drives, a single head movement is orders of magnitude slower than any bus/T-Bolt latency.
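
Checking that comparison (Python; the SATA line rate is taken at face value, 8b/10b overhead ignored):

```python
# Transfer time of a small IO vs. a dozen nanoseconds of bus latency.
bits = 4 * 1024 * 8  # one 4 KiB transfer
line_rate = 6e9      # SATA 6 Gbps line rate
added_ns = 12        # "a dozen nanoseconds" of extra T-Bolt latency

transfer_ns = bits / line_rate * 1e9
print(f"transfer ~{transfer_ns:.0f} ns; "
      f"added latency is {added_ns / transfer_ns:.2%} of that")
# -> transfer ~5461 ns (about 5.5 us); 12 ns is well under 1% on top.
```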

You could probably concoct a "T-Bolt latency sucks" case by doing small RDMA transfers over InfiniBand - but since I don't think that there are any InfiniBand drivers for Apple's OS X, that's a moot point ;) .
 
So, if I understand you right, both the data and DisplayPort portions of the TB signal would go through the x16 graphics card lanes? So those lanes are doing "double duty"?

Yes, that's the only way I can see this working -- though if it's an Ivy Bridge or Haswell Xeon, it's probably two x8 graphics-card links.


Second, on a normal motherboard, do connections like FW, USB etc. take PCIe lanes, or is that an entirely different protocol?

Not directly. They would go through the PCH, but the PCH itself takes 2 x PCIe3 lanes for the DMI 2.0 link to the CPU package with its on-die Northbridge functions.



And finally, I was under the impression that each CPU had 40 total lanes of PCIe that come directly off of it, not 20.

Sandy Bridge Xeons such as the E5-1650 have 40 x PCIe3 lanes, but Ivy Bridge Xeons such as the E3-1290 v2 have 20 x PCIe3 lanes, and the newest Haswell Xeons such as the E3-1280 v3 have just 16 x PCIe3 lanes -- and I was under the impression the tube used an Ivy Bridge or Haswell Xeon.



edit: After doing some research: I don't think the majority of the PCIe lanes are routed through the PCH (and thus DMI). Sandy Bridge-E processors and the LGA 2011 socket have 40 lanes that come directly off of the CPU, and 8 lanes that presumably connect via the PCH for things like SATA and USB 3.0. So, I'm sure (especially with the new generation of Ivy Bridge-EP) that all of the Thunderbolt ports are using the faster lanes that come directly from the CPU. Not sure if that means they go through the graphics cards, although I would presume this is the case, since Thunderbolt is multiplexing the data + DisplayPort signal.

Yes, the PCIe lanes come directly off the CPU. Based on the bandwidth requirements, the DMI 2.0 link to the PCH uses either 2 x PCIe3 lanes or, in older systems, 4 x PCIe2 lanes. This leaves the rest of the PCIe lanes for use by the graphics cards -- which in my speculative scenario pull double duty as bridges to the T-bolt ports as well as to the flash drive connected through one of the cards.
 
As a music studio engineer looking to upgrade our current Mac Pro, I find the new Mac Pro might be an economic turn-off.

Our system needs two PCI Express slots and several hard drives for projects, sample libraries, etc. Buying a Mac Pro, AND a PCI Express chassis, AND an external hard-drive chassis might be waaaaaay too much money to spend.

Note that I wrote "might be". I sure do hope that, e.g., Magma chassis get a lot cheaper, along with other Thunderbolt devices.

The Mac Pro might drop in price a little as well... Here's to hoping.
 
Sandy Bridge Xeons such as the E5-1650 have 40 x PCIe3 lanes, but Ivy Bridge Xeons such as the E3-1290 v2 have 20 x PCIe3 lanes, and the newest Haswell Xeons such as the E3-1280 v3 have just 16 x PCIe3 lanes -- and I was under the impression the tube used an Ivy Bridge or Haswell Xeon.

The problem seems to be that you are comparing E3s (from Ivy Bridge and Haswell, both of which have launched, I believe) with E5s (the latest generation here is Sandy Bridge -- neither the Ivy Bridge nor the Haswell version of this chip has launched). It's apples and oranges, so to speak. Speculation points towards a September or October launch of the appropriate Xeons for the Mac Pro. I'm not terribly familiar off the top of my head with the E3 vs. E5 differences, but I can't imagine Intel decreasing the number of lanes coming off the CPU for similar processor models from Sandy Bridge to Ivy Bridge. Apple's slide saying "40 GB/s of total I/O" makes it very likely, in my opinion, that there are indeed 40 PCIe lanes coming directly off of the CPU and that all of them are in use here.

edit:

Confirmation: the old (or current, I guess) Mac Pros are based on the old Xeon naming scheme, using the Bloomfield (W3565 -- yes, this CPU is 4+ years old now) and Gulftown (W3680, E56xx, and X56xx) CPUs. The 5xxx series are capable of dual-socket configurations, while the 3xxx series are not.

However, in the new naming scheme, the E5-16xx series replaces the previous high-end single-socket CPUs (35xx and 36xx series), while the E5-24xx, E5-26xx, and E5-46xx lines can handle 2, 2, and 4 socket configurations respectively (the first digit is the number of sockets).

All to say, the E3 line (or equivalent) has never been used in the Mac Pros. The E5s are currently 2 generations behind (soon to be 1), while the E7s are actually still on Westmere-EX (3 generations behind Haswell). Those are never used in Mac Pros though -- they're overkill, supporting up to 8-socket configurations.

Perhaps someone more familiar with the Xeon models could enlighten us on why on earth there's such a generation gap between the E3, E5, and E7s.



----------

Yes, the PCIe lanes come directly off the CPU. Based on the bandwidth requirements, the DMI 2.0 link to the PCH uses either 2 x PCIe3 lanes or, in older systems, 4 x PCIe2 lanes. This leaves the rest of the PCIe lanes for use by the graphics cards -- which in my speculative scenario pull double duty as bridges to the T-bolt ports as well as to the flash drive connected through one of the cards.

Again, I'm probably a little out of my league in terms of the electrical technology here, but I'd imagine the current Retina MacBook Pro would answer the question for us in terms of lane placement and logical ordering, although TB 2.0 could change that up slightly. Since there is essentially now a single Thunderbolt channel at 20 Gb/s (which is bi-directional), you couldn't just route 2 lanes direct for data and 2 lanes through the GPU to carry the DisplayPort signal. So, I'm not sure if it's changed since TB 1 and its implementation (I'd imagine not), or what the connection order is, but I'd speculate it goes from the CPU to the GPU or the Redwood Ridge controller (not sure which comes first), the two signals (data and DP) get multiplexed at some point, and then come out of the TB port.
 
The problem seems to be that you are comparing E3s (from Ivy Bridge and Haswell, both of which have launched, I believe) with E5s (the latest generation here is Sandy Bridge -- neither the Ivy Bridge nor the Haswell version of this chip has launched). It's apples and oranges, so to speak. Speculation points towards a September or October launch of the appropriate Xeons for the Mac Pro. I'm not terribly familiar off the top of my head with the E3 vs. E5 differences, but I can't imagine Intel decreasing the number of lanes coming off the CPU for similar processor models from Sandy Bridge to Ivy Bridge. Apple's slide saying "40 GB/s of total I/O" makes it very likely, in my opinion, that there are indeed 40 PCIe lanes coming directly off of the CPU and that all of them are in use here.

I think what has to be clarified here is what "total I/O" actually means. By convention, bandwidth numbers are expressed in terms of the half-duplex value, but I've heard some vendors talk about total I/O in terms of full-duplex throughput.

Now if Apple is claiming 40 GB/s of total I/O in terms of total system throughput, then we could well be talking about 20 PCIe3 lanes -- each of which provides 1 GB/s bandwidth but 2 GB/s total throughput, assuming data and/or display transfers are saturating the lanes in both directions.
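
Both readings arrive at the same headline number (a sketch; the ~1 GB/s per-lane figure is the usual PCIe 3.0 approximation):

```python
# "40 GB/s of total I/O": half-duplex over 40 lanes, or full-duplex
# over 20 lanes -- the marketing number is the same either way.
gb_per_lane_per_dir = 1.0  # GB/s, approximate PCIe 3.0 payload per lane

forty_lanes_one_way = 40 * gb_per_lane_per_dir
twenty_lanes_both_ways = 20 * gb_per_lane_per_dir * 2

print(forty_lanes_one_way, twenty_lanes_both_ways)  # -> 40.0 40.0
```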



Perhaps someone more familiar with the Xeon models could enlighten us on the differences between the E3 and E5 lineup and why on earth there's such a generation gap between them. The current E5s are two generations behind the E3s, assuming (as I believe I've heard) the Haswell E3s have indeed launched.

As has been reported elsewhere, the simple answer is that Intel is pushing mobile and desktop processors out the door much faster than it does its workstation and server CPU lines.

Edit: In any case, I'm getting my numbers from the Wikipedia list of currently available Xeon CPUs and following the links on that page to ark.intel.com for the detailed specs.


Again, I'm probably a little out of my league in terms of the electrical technology here, but I'd imagine the current Retina MacBook Pro would answer the question for us in terms of lane placement and logical ordering, although TB 2.0 could change that up slightly. Since there is essentially now a single Thunderbolt channel at 20 Gb/s (which is bi-directional), you couldn't just route 2 lanes direct for data and 2 lanes through the GPU to carry the DisplayPort signal. So, I'm not sure if it's changed since TB 1 and its implementation (I'd imagine not), or what the connection order is, but I'd speculate it goes from the CPU to the GPU or the Redwood Ridge controller (not sure which comes first), the two signals (data and DP) get multiplexed at some point, and then come out of the TB port.

Most likely it's PCIe3 straight to the graphics cards, and they handle the multiplexing of data and display as well as bridging PCIe3 to TB. To do this, each graphics card would require a minimum PCIe3 x4 connector, but an x8 connector to each card is more likely, considering that there are six rather than four TB ports on the back of the box and one of the graphics cards has a flash-drive slot, which I'm assuming would consume the equivalent of two or four lanes.

i.e. graphics card #1 bridges 4 lanes to two TB outputs plus 4 lanes to the flash-drive slot while graphics card #2 bridges 8 lanes to four TB ports -- and each graphics card can drive a maximum of two 4K displays.
 
I think what has to be clarified here is what "total I/O" actually means. By convention, bandwidth numbers are expressed in terms of the half-duplex value, but I've heard some vendors talk about total I/O in terms of full-duplex throughput.

Now if Apple is claiming 40 GB/s of total I/O in terms of total system throughput, then we could well be talking about 20 PCIe3 lanes -- each of which provides 1 GB/s bandwidth but 2 GB/s total throughput, assuming data and/or display transfers are saturating the lanes in both directions.

The lowest current baseline E5 CPU (Sandy Bridge based) offers 40 lanes. It won't go down for Ivy Bridge. See the edit on my previous post.


As has been reported elsewhere, the simple answer is that Intel is pushing mobile and desktop processors out the door much faster than it does its workstation and server CPU lines.

Sure, but to have a platform 3 generations behind is pretty, well, behind.

Edit: In any case, I'm getting my numbers from the Wikipedia list of currently available Xeon CPUs and following the links on that page to ark.intel.com for the detailed specs.

Yep, same here. Just cross-referencing with previous part numbers of Mac Pro processors.

----------

Most likely it's PCIe3 straight to the graphics cards, and they handle the multiplexing of data and display as well as bridging PCIe3 to TB. To do this, each graphics card would require a minimum PCIe3 x4 connector, but an x8 connector to each card is more likely, considering that there are six rather than four TB ports on the back of the box and one of the graphics cards has a flash-drive slot, which I'm assuming would consume the equivalent of two or four lanes.

i.e. graphics card #1 bridges 4 lanes to two TB outputs plus 4 lanes to the flash-drive slot while graphics card #2 bridges 8 lanes to four TB ports -- and each graphics card can drive a maximum of two 4K displays.

Where does the controller enter into the equation then?
 
DisplayPort is not PCIe

Since there is essentially now a single Thunderbolt channel at 20 Gb/s (which is bi-directional), you couldn't just route 2 lanes direct for data and 2 lanes through the GPU to carry the DisplayPort signal. So, I'm not sure if it's changed since TB 1 and its implementation (I'd imagine not), or what the connection order is, but I'd speculate it goes from the CPU to the GPU or the Redwood Ridge controller (not sure which comes first), the two signals (data and DP) get multiplexed at some point, and then come out of the TB port.

DisplayPort does not go over PCIe, it's a completely different protocol - the T-Bolt1 architectural diagrams clearly showed the DP signals going to the T-Bolt bridge alongside the PCIe lanes.
 
DisplayPort does not go over PCIe, it's a completely different protocol - the T-Bolt1 architectural diagrams clearly showed the DP signals going to the T-Bolt bridge alongside the PCIe lanes.

Yeah, you're right. But I suppose it effectively gets "transformed" along the way, if you think about the whole path from CPU to display. PCIe lanes run from the CPU to the GPU, where they effectively "dead-end"; DisplayPort then exits the GPU and is presumably multiplexed with raw PCIe lanes into "Thunderbolt."
 