Questions about Throughput

Posted in Future Apple Hardware, edited January 2014
Here are some questions that I have thought of:



I keep hearing on here that Apple should put DDR 333, 400, or even 533 on their computers. Is that not an easy proposition, as Apple has yet to do it? Is it harder than we think, or are they just lazy?



Also, why is the DDR in the xServe considered a "hack" and not really DDR?



And why does this help throughput so much???



I just don't understand this stuff! Thanks for the help guys.

Comments

  • Reply 1 of 14
    programmer Posts: 3,467 member
    Anything faster than DDR333 is not yet widely available in the large quantities that Apple needs -- at least not anywhere near current SDRAM prices. It's also not clear that these speeds will ever have an "official" standard, and the days of Apple setting its own memory standards seem to be over. Next year DDR-II should arrive and bring speeds of 400+ with it, at which time Apple will consider adopting it.



    The bandwidth benefits aren't as easy and obvious as many people seem to think. Without increasing the burst length on the bus, DDR doesn't help as much as it seems like it should because the bursts are half as long and thus the overhead per byte is doubled. This is why many PC DDR chipsets don't do much better than Apple's SDRAM motherboards. We still don't know what the true throughput of the Xserve's memory controller is -- Apple's PR materials have only mentioned its theoretical throughput.
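    A rough way to see the burst-overhead effect described above (the burst length and per-burst overhead cycles below are illustrative assumptions, not real chipset figures):

```python
def effective_bandwidth(clock_mhz, bytes_per_transfer, transfers_per_cycle,
                        burst_transfers, overhead_cycles):
    """Peak bandwidth discounted by a fixed per-burst overhead."""
    data_cycles = burst_transfers / transfers_per_cycle  # cycles spent moving data
    busy_cycles = data_cycles + overhead_cycles          # plus setup/turnaround
    bytes_per_burst = burst_transfers * bytes_per_transfer
    return clock_mhz * 1e6 * bytes_per_burst / busy_cycles

# Same 8-transfer burst on a 64-bit (8-byte) bus, 3 cycles of overhead per burst:
sdr = effective_bandwidth(133, 8, 1, 8, 3)  # data takes 8 cycles, 11 total
ddr = effective_bandwidth(133, 8, 2, 8, 3)  # data takes 4 cycles, 7 total
print(sdr / 1e6, ddr / 1e6)  # DDR comes out ~1.57x SDR here, not 2x
```

    With the data phase halved but the overhead unchanged, doubling the transfer rate buys noticeably less than double the throughput, which is exactly the burst-length point above.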



    The Xserve is often referred to as a "hack" because the CPU can't use the full bandwidth of the memory subsystem. I don't agree with calling it a "hack", but from a processor-performance standpoint it is not ideal. Fortunately the Xserve is a server and thus needs lots of bandwidth for things other than the processor, so this setup makes sense and is a positive thing, not the negative thing that people make it out to be. In a PowerMac (or something positioned as a compute server), however, the same solution becomes a little more dubious. The basic problem is that the G4 (at least up to the current version, the 7455) has an interface to the system which only supports approximately 850 MBytes/sec maximum. This means that even if your memory system supported 20 GBytes/sec, the processor(s) in the system could only use 850 MBytes/sec of it. Since the G4 is largely memory bound, especially when using AltiVec, its performance is seriously capped regardless of what Apple does to the system. Hopefully a new G4 will change that, and certainly any future "G5" chip (i.e. the chip Apple will use after the G4) will need to address this issue first.
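    The bottleneck arithmetic is just a min(): whatever the memory subsystem supplies, the CPU sees no more than its bus interface allows. A sketch (the DDR supply figure is an assumed theoretical peak for illustration):

```python
FSB_LIMIT_MB_S = 850    # approximate MPX bus ceiling quoted above
MEMORY_MB_S = 2_128     # e.g. DDR266 on a 64-bit bus (assumed theoretical peak)

cpu_share = min(FSB_LIMIT_MB_S, MEMORY_MB_S)  # the smaller figure wins for the CPU
io_headroom = MEMORY_MB_S - cpu_share         # left over for DMA, AGP, and I/O
print(cpu_share, io_headroom)  # 850 1278
```

    The leftover headroom is why the design still makes sense for a server: everything that isn't the CPU gets to use it.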
  • Reply 2 of 14
    davegee Posts: 2,765 member
    While this topic has been talked about many times, here is the short answer.



    The PPC that we have today can't 'talk' to the memory any faster than 1:1 (133MHz bus / 850 MB/s max). If Apple were to build a system with DDR (2:1), the CPU wouldn't be able to take advantage of the extra speed.



    The xServe using DDR isn't exactly a 'total waste', since I/O functions such as Ethernet, ATA/IDE and the like do benefit from DDR, and for a server that's A LOT of what it spends its time doing... But since the CPU is still stuck accessing the memory at 1:1, people say it's a "HACK".



    Is it? I don't think so... Apple is doing the best it can with a CPU (the G4) and a CPU provider (MOT) that it chose to get into bed with.



    Back in the '90s Apple had a choice as to what direction the G4 would take. Both IBM and MOT had their own designs... MOT sold Apple on AltiVec, and now Apple is paying the price... Who knows what would have happened had Apple chosen IBM... it could have been worse, but after being stuck in the MOT-G4 dumps for oh so many years (three, but doesn't it seem longer?) I can't imagine it would have been.



    MOT came out with the G4 in Aug '99 with a top speed of 500MHz... Over the following three years, the only thing MOT has been able to give us is a 500MHz speed bump! No DDR, no nothing... Sorry, but if that is the best MOT can do IN THREE YEARS, then they have clearly shown what they are able (or willing) to do for us. (rant off)



    Rumor has it MOT is out when it comes to the 'G5' that Apple will use...



    [ edit: Please refer to Programmer's post above for much better details on DDR and bandwidth issues ]



    Dave



    [ 08-07-2002: Message edited by: DaveGee ]
  • Reply 3 of 14
    hmurchison Posts: 12,437 member
    Thanks Programmer. I had been wondering about that same question as well.
  • Reply 4 of 14
    matsu Posts: 6,558 member
    Question?



    Is digital video not one of Apple's targeted markets? Wouldn't improved disk access and I/O help out quite a bit with video? I mean, regardless of the state of the G4, DDR and multiple ATA100 channels wouldn't hurt. And in light of QE, I'm sure more memory bandwidth could keep the GPU happy without affecting the CPU(s).



    Even with 4X AGP (that's ~1GB/s) and G4s (another GB/s or so), the two devices together can suck up all the bandwidth of SDR133. I think a video pro could get the PM into a situation where both the video card and the CPU demand maximum bandwidth, so with DDR at least the demands of one wouldn't affect the performance of the other in those situations. I think.
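    A back-of-the-envelope version of that budget, using theoretical peak figures rather than anything measured:

```python
# Theoretical peak demands and supply, in MB/s (not measured numbers).
agp_4x = 1_066   # AGP 4x: ~266 MT/s x 32-bit
g4_bus = 1_064   # MPX bus: 133 MHz x 64-bit
sdr133 = 1_064   # PC133 SDRAM supply: 133 MHz x 64-bit

demand = agp_4x + g4_bus
print(demand, sdr133, demand > sdr133)  # 2130 1064 True
```

    Even on paper, GPU plus CPU demand is roughly twice what SDR133 can supply, so either one saturating the bus starves the other.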



    And doesn't the Xserve have 64-bit/66MHz PCI slots (533MB/s)? Some pro video boards wouldn't mind the increased PCI bandwidth either.



    At the very least it would mean that someone buying an Xserve-type PM could save a bundle by opting out of SCSI-RAID, and would get marginally better video performance under heavy loads, no?
  • Reply 5 of 14
    programmer Posts: 3,467 member
    Yes, the Xserve-style system in a PowerMac would be a lot faster in practice than people seem to acknowledge. Quartz Extreme + GPU alone would consume all the "excess" bandwidth, never mind disk I/O, network I/O, FireWire, and USB. Personally, I'll buy such a machine if that's what Apple ships.



    The problem with such a machine is that if you are using it for G4-intensive, memory-bound tasks, it won't be any faster. The benchmarks won't look any better. Well, actually, compared to another machine that has a heavy I/O load but an inferior memory subsystem, it will look better, since less of the G4's memory bandwidth will be sucked away by "other tasks". But in the case of a machine that is only running this memory-bound G4 task, it won't be any faster. Compared to a Pentium4 2.5 GHz w/ QDR FSB and RAMBus, such a G4 machine would be noticeably inferior at memory-bound computations. And the really frustrating thing is that so much of what runs on the G4 is memory bound. A 1.4 GHz G4 can do over 24 flops per non-cached floating point value it reads from memory... and most software isn't set up to do anything close to that much work per memory access, which just exacerbates the problem (sometimes there just isn't that much work to do, either).
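    One way to arrive at a figure in that ballpark (the 4 flops/cycle rate and the 850 MB/s bus ceiling are assumptions for illustration, not official numbers):

```python
clock_hz = 1.4e9
flops_per_cycle = 4        # e.g. one 4-wide single-precision AltiVec op per cycle
bus_bytes_per_s = 850e6    # the MPX ceiling quoted earlier
bytes_per_float = 4

peak_flops = clock_hz * flops_per_cycle           # 5.6 GFLOPS
floats_per_s = bus_bytes_per_s / bytes_per_float  # ~212M uncached reads/sec
print(peak_flops / floats_per_s)                  # ~26 flops per float read
```

    Any code that does less arithmetic than that per float it streams in will be waiting on the bus, not the execution units.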



    So an Xserve-style machine would be a big improvement over the current PowerMac, but we can always hope for an even bigger improvement. We may have to wait until next year for that, though. Not me, however; I need to replace my current machine now.
  • Reply 6 of 14
    mr. me Posts: 3,221 member
    [quote]Originally posted by dxp4acu:

    Here are some questions that I have thought of:

    I keep hearing on here that Apple should put DDR 333, 400, or even 533 on their computers. Is that not an easy proposition, as Apple has yet to do it? Is it harder than we think, or are they just lazy?

    Also, why is the DDR in the xServe considered a "hack" and not really DDR?

    And why does this help throughput so much???

    I just don't understand this stuff! Thanks for the help guys.[/quote]



    This article in LinuxWorld would seem relevant to your question: http://www.linuxworld.com/site-stories/2002/0805.macx-p3.html
  • Reply 7 of 14
    powerdoc Posts: 8,123 member
    [quote]Originally posted by Programmer:

    Yes, the Xserve-style system in a PowerMac would be a lot faster in practice than people seem to acknowledge. Quartz Extreme + GPU alone would consume all the "excess" bandwidth, never mind disk I/O, network I/O, FireWire, and USB. Personally, I'll buy such a machine if that's what Apple ships.

    The problem with such a machine is that if you are using it for G4-intensive, memory-bound tasks, it won't be any faster. The benchmarks won't look any better. Well, actually, compared to another machine that has a heavy I/O load but an inferior memory subsystem, it will look better, since less of the G4's memory bandwidth will be sucked away by "other tasks". But in the case of a machine that is only running this memory-bound G4 task, it won't be any faster. Compared to a Pentium4 2.5 GHz w/ QDR FSB and RAMBus, such a G4 machine would be noticeably inferior at memory-bound computations. And the really frustrating thing is that so much of what runs on the G4 is memory bound. A 1.4 GHz G4 can do over 24 flops per non-cached floating point value it reads from memory... and most software isn't set up to do anything close to that much work per memory access, which just exacerbates the problem (sometimes there just isn't that much work to do, either).

    So an Xserve-style machine would be a big improvement over the current PowerMac, but we can always hope for an even bigger improvement. We may have to wait until next year for that, though. Not me, however; I need to replace my current machine now.[/quote]



    I fear that we will only see an Xserve-like PowerMac.
  • Reply 8 of 14
    cinder Posts: 381 member
    If we're only getting an Xserve-like machine, then why put it off for so incredibly long?



    The Xserve has been done for a while now . . .



    They could've started earlier to clear the channels of PowerMacs . . .



    I think it's a given that we will get something different . . .



    But it's anyone's guess HOW different and HOW powerful.



    Thanks for the info, though.

    I wasn't 100% clear on the 'hack' issue, as everyone so rudely calls it.

    =)
  • Reply 9 of 14
    nevyn Posts: 360 member
    [quote]Originally posted by Programmer:

    The problem with such a machine is that if you are using it for G4-intensive, memory-bound tasks, it won't be any faster. The benchmarks won't look any better. Well, actually, compared to another machine that has a heavy I/O load but an inferior memory subsystem, it will look better, since less of the G4's memory bandwidth will be sucked away by "other tasks". But in the case of a machine that is only running this memory-bound G4 task, it won't be any faster. Compared to a Pentium4 2.5 GHz w/ QDR FSB and RAMBus, such a G4 machine would be noticeably inferior at memory-bound computations. And the really frustrating thing is that so much of what runs on the G4 is memory bound. A 1.4 GHz G4 can do over 24 flops per non-cached floating point value it reads from memory... and most software isn't set up to do anything close to that much work per memory access, which just exacerbates the problem (sometimes there just isn't that much work to do, either).[/quote]



    ...and one of the key benefits of the G4 over the x86 variants is that the integer unit, the floating point unit, and the vector unit are basically completely separate. What this means is that _all_ of these units can consume memory bandwidth simultaneously. That is, assume one 32-bit int unit, one 64-bit FPU, and one 128-bit vector unit, where each unit can start one calculation every cycle (which doesn't quite mesh with reality). That's 224 bits/cycle consumed, or 28 bytes/cycle -> on a 1 GHz machine that's 28 GB/s of CPU-to-memory bandwidth. Do we want to consider a Dual? A Quad?
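    That arithmetic, spelled out (same simplifying assumption that every unit issues one operation per cycle):

```python
int_bits, fpu_bits, vec_bits = 32, 64, 128        # operand widths of the three units
bits_per_cycle = int_bits + fpu_bits + vec_bits   # 224 bits consumed per cycle
bytes_per_cycle = bits_per_cycle // 8             # 28 bytes
clock_hz = 1e9
print(bytes_per_cycle * clock_hz / 1e9)           # 28.0 GB/s of demand at 1 GHz
```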



    Just on the face of it, Apple should have been the FIRST company to switch over to either something like DDR/RDR or something more exotic like a NUMA backplane. Their CPUs outstripped the available bandwidth before x86s did, even with the MHz discrepancy.



    The early performance of the G4s has generated some interest in science/engineering circles, simply from the single precision floating point theoretical max FLOPS (using both the FPU and the vector unit). A really solid computer with decent memory bandwidth would probably cement it in a lot of places.



    Witness how the dual 1GHz G4 is holding its own vs. dual 2GHz Athlons in the RC5 calculation.
  • Reply 10 of 14
    cthulu Posts: 20 member
    OK, I have to say I don't know what on earth you are talking about. Some here seem confused about FSB speed and how DDR works. I see this "hack" nonsense on every rumor site but precious little evidence.



    1) No commercially available bus runs faster than 133MHz right now. If Apple releases a system with a 166MHz FSB, it will be the first.



    2) The Athlon has neither a 200 nor a 266MHz FSB; it is either 100 or 133. The P4 has neither a 400 nor a 533MHz bus; it is 100 or 133.



    3) A bus is clocked at a particular frequency and can transfer data with each tick of the clock. SDRAM can transfer one piece of 64-bit (8-byte) data per cycle. DDR can transfer two such pieces, one as the clock peaks and another as it falls. As a result its theoretical throughput is twice SDRAM's at the same clock rate, so some clever dicks at AMD decided to double their FSB clock specs for marketing purposes. But this is important... THE CLOCK IS STILL 100 or 133.

    RDRAM is quad-pumped, so they multiply their FSB numbers by four. These numbers sound impressive but are not usable except for some simple memory operations. In addition, RDRAM suffers from ridiculously high access times, on the order of 50-60ns.
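    The clock-versus-transfers distinction in point 3 reduces to one formula (the 64-bit bus width is the case from the post; RDRAM's narrower bus is ignored here for simplicity):

```python
def theoretical_mb_s(clock_mhz, bus_bytes, transfers_per_cycle):
    # Peak throughput = clock rate x bus width x transfers per clock cycle.
    return clock_mhz * bus_bytes * transfers_per_cycle

print(theoretical_mb_s(133, 8, 1))  # SDR at 133 MHz: 1064 MB/s
print(theoretical_mb_s(133, 8, 2))  # DDR at the same 133 MHz clock: 2128 MB/s
```

    The marketing "266" and "533" numbers come from quoting clock times transfers-per-cycle while the physical clock stays at 133MHz.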



    4) So in what way is the Xserve a hack? It uses DDR in the same way the Athlon does. It has a 133MHz bus with DDR memory. The only basis I can find for this is on Apple's Xserve spec page under the G4 stuff. It lists the FSB as 133 and says it has over 1 gigabyte per second of throughput. Problem is, this comes straight from the PowerMac spec page. Looks like a goof to me.



    5) I hear people claiming the G4 is limited in bandwidth in some way, but that makes no sense. The processor communicates only with its level 1 cache at full core speed; for the dual-gig machine that is 1 billion cycles a second, far beyond any memory speed. The level 1 cache is fed from main memory, governed by the system controller. The G4 has no onboard memory controller, except for its DDR level 3 cache. If the controller didn't recognize DDR and its function, it couldn't even be on the motherboard.



    6) The performance boost offered by DDR is moderate at best. This is mainly because so much of the processing takes place between the processor and its cache system. Have a look at http://www17.tomshardware.com/mainboard/01q3/010808/index.html



    7) Has anyone used an Xserve to verify that it is no faster than the PowerMac at processor-intensive stuff? Or is all this a lot of hot air?

    Warning: it's called "arse" technica for a reason.
  • Reply 11 of 14
    amorph Posts: 7,112 member
    [quote]Originally posted by cthulu:

    4) So in what way is the Xserve a hack? It uses DDR in the same way the Athlon does. It has a 133MHz bus with DDR memory.[/quote]



    The difference with the XServe - which is called a "hack" by some - is that the bus to the CPU isn't double-pumped. It's 133MHz SDR, which is where the 1GB/s throughput number comes from. So the CPU's access to RAM is throttled to 1GB/s theoretical (about 850MB/s actual) instead of the RAM's theoretical 2GB/s maximum throughput. Apple has no choice in this matter: The MPC7455 CPU expects to be hooked up to either a 60x or an MPX bus - neither of which can be double-pumped.



    [quote]I hear people claiming the G4 is limited in bandwidth in some way, but that makes no sense. The processor communicates only with its level 1 cache at full core speed; for the dual-gig machine that is 1 billion cycles a second, far beyond any memory speed. The level 1 cache is fed from main memory, governed by the system controller. The G4 has no onboard memory controller, except for its DDR level 3 cache. If the controller didn't recognize DDR and its function, it couldn't even be on the motherboard.[/quote]



    But it is. The memory controller on the XServe recognizes DDR; the system bus, however, does not. So the memory controller can feed up to 2GB/s theoretical to the CPU and DMA devices and AGP all at once, but only 1 GB/s theoretical to the CPU.



    Cache is relevant where locality of reference is relevant, but if you're doing work on streams and huge chunks of data - the sort of work AltiVec excels at - what you want is a fast pipe to main RAM. The G4s on the XServe can only access main RAM at about half the maximum possible speed.



    [quote]6) The performance boost offered by DDR is moderate at best.[/quote]



    This is a quality of implementation issue. Apple's SDR bus implementation is as fast or faster than some PC DDR implementations in real-world use, which some people here take as an indication that their DDR implementation will be similarly high quality.



    [quote]7) Has anyone used an Xserve to verify that it is no faster than the PowerMac at processor-intensive stuff? Or is all this a lot of hot air?[/quote]



    It follows from the simple fact that the XServe has one or two 1GHz MPC7455s connected to RAM by a shared bus (MaxBus, 133MHz, not double-pumped). This is an identical configuration to the PowerMacs. In memory- and compute- bound calculations, the XServe should thus yield identical results. In any application that requires I/O throughput, however, the XServe will demolish the PowerMac, because the RAM is faster, there are more (and faster) I/O controllers, especially for ATA, and the memory controller and the DMA implementation are superb.
  • Reply 12 of 14
    rickag Posts: 1,626 member
    [quote]Originally posted by cthulu:

    5) I hear people claiming the G4 is limited in bandwidth in some way.[/quote]



    I don't claim to understand this; however, maybe you can explain it to me in layman's terms. There do appear to be bandwidth issues????



    Info posted on Arstechnica: http://arstechnica.infopop.net/OpenTopic/page?q=Y&a=tpc&s=50009562&f=8300945231&m=8790959504&p=5



    From a post by BadAndy

    Ars YIQtoPAL test vers 1.1

    Timing results in microseconds for the Ars YIQtoRGB of the following versions:

    #Pixels    #Loops    Standard    Matrix 3x3    Unrolled_2X    Vector
    16         65536     337524      181672        93985          24409
    32         32768     333938      172208        82820          16039
    64         16384     332115      167402        77460          11957
    128        8192      331216      165026        74579          9912
    256        4096      331163      163667        73224          8899
    512        2048      330436      163070        72522          8405
    1024       1024      330385      162711        72322          8481
    2048       512       337419      168021        73478          20390
    4096       256       337432      168322        74742          20269
    8192       128       339020      169868        75592          21397
    16384      64        343992      173942        80736          27730
    32768      32        353733      181665        90746          42028
    65536      16        365741      191575        110349         66890
    131072     8         372499      197416        123523         84452
    262144     4         372612      196059        123571         85748
    524288     2         373002      196734        123567         85077
    1048576    1         372507      197022        122778         85152

    Remember that each "pixel" reads 3 and writes 3 floats.

    For the very smallest loop calls you can see the vector-splat coefficient matrix set-up penalty for the vector version, but it is still clobbering even the fastest scalar code.

    For slightly larger run lengths, but still inside L1, the vector version is 6 times faster than the unrolled-2X scalar version ... this is what we expect from the vector 4X advantage plus the "no bubble" VFPU pipeline.

    This result shows what altivec can do ... it is smoking fast if you can keep it fed.

    But after that we start to progressively take the bandwidth hits. You can see the abrupt jump where it breaks out into L2 and then another when it breaks out into main memory.

    This version of the code adds dst (three of them for the three input streams) ... which doesn't seem to help a great deal.

    Sadly, this little code is another damn blitter, and an all-too-vivid demonstration of why we are CHOKING on PC133. For the largest memory size tested (12 MB of float input data, equal output) the vector algorithm is "only" 50% faster than the unrolled scalar algorithm ... and this is all due to bandwidth.

    And for _virtually_all_ of these vector algorithms 9600man is going to get the vivid, nasty demonstration that running another thread will do about squat for these bandwidth-limited cases... except to the extent that the other thread inadvertently helps with tasker pogoing. And remember folks... we are both testing on 533 MHz CPUs ... think about the situation on the modern machines (although their PC133 does get about 1 GB/s where ours gets about 700 MB/s)

    While checking I substituted the variable names in the code below to make it more readable... it is the same code as the previous one EXCEPT that the input variable declaration is added and the DSTs have been added too.





    Note: I added the bold and italics to his post.
  • Reply 13 of 14
    telomar Posts: 1,804 member
    [quote]Originally posted by cthulu:

    1) No commercially available bus runs faster than 133MHz right now. If Apple releases a system with a 166MHz FSB, it will be the first.[/quote]

    I believe you would find the Sahara chips (current G3s) use a 200MHz FSB that isn't a double-pumped 100MHz bus.
  • Reply 14 of 14
    Not only that, but the PowerPC in the GameCube has a 162MHz FSB.