Multiprocessors...


Comments

  • Reply 21 of 39
    yevgenyyevgeny Posts: 1,148member
    [quote]Originally posted by Aphelion:

    MERSI ME!

    I'd rather have 8 (or even 4) 7410s @ 500 MHz than 2 74xx @ 1.5 GHz.[/quote]



    As I wrote above, you couldn't possibly feed 8 processors at 500 MHz. Eight AltiVec units all trying to get data? What a disaster. People need to remember that the Daystar Genesis MP was a four-way PPC 604 machine, and that the 604 did not consume bus bandwidth the way a G4 does. All the Genesis MP did quickly was call the QuickTime API (mostly all that was MP-aware at the time was QuickTime and QuickDraw 3D). In every other category, the Genesis MP was no faster than a stock 9500 tower.



    Secondly, IF this machine could feed its CPUs, such a parallel machine would only do well at tasks that have been parallelized (use multiple threads), but would not be very good at anything that is not parallelized. It might make for a good email server, but probably not much more.



    Eight 0.5 GHz CPUs do not make for a 4 GHz computing experience! Aside from bus issues, you run into data integrity issues where it becomes more difficult to keep the CPUs from stepping on each other's toes. God help you if your parallelized app makes heavy use of shared mutexes or semaphores. I will take two 1.4 GHz CPUs over eight 500 MHz CPUs any day of the week.



    I think that Apple needs to ship dual CPUs with a new bus (hey, a pun!) across the entire Pro line. If they can get the MHz rating up and a bus that can feed the CPUs, then this will hold people over until something better comes along.
  • Reply 22 of 39
    bluejekyllbluejekyll Posts: 103member
    Everyone keeps harping on four processors not being usable because apps need to be parallelized. This is true if you want a single app to run faster on your box, but what if you have tons of apps all running at the same time? The OS will schedule apps to run on different processors, which means that you can utilize both (or more) processors just by running more applications.



    In any case, with the move to OS X almost every application engineer out there will be utilizing multiple threads, guaranteed. It may take a while, but every app will be multi-threaded.
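
    To make that concrete, here is a minimal sketch in C using POSIX threads (which OS X supports); the worker function and thread count are made up purely for illustration. The point is only that once an app hands the kernel several runnable threads, the scheduler is free to run each one on a different processor.

    [code]
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4                  /* hypothetical: one worker per CPU */

    /* Each worker is an independently schedulable unit; on an MP box the
       kernel can run them on different processors at the same time. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        long i, sum = 0;
        for (i = 0; i < 100000000L; i++)   /* CPU-bound busywork */
            sum += i ^ id;
        printf("worker %ld done (sum=%ld)\n", id, sum);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        long t;

        for (t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, worker, (void *)t);
        for (t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);
        return 0;
    }
    [/code]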
  • Reply 23 of 39
    jcgjcg Posts: 777member
    [quote]Originally posted by BlueJekyll:

    Everyone keeps harping on four processors not being usable because apps need to be parallelized. This is true if you want a single app to run faster on your box, but what if you have tons of apps all running at the same time? The OS will schedule apps to run on different processors, which means that you can utilize both (or more) processors just by running more applications.

    In any case, with the move to OS X almost every application engineer out there will be utilizing multiple threads, guaranteed. It may take a while, but every app will be multi-threaded.[/quote]



    Very true, I never have just one app running. Also remember that the OS itself is MP-aware, so all applications, whether multi-threaded or not, will gain at least a minor speed improvement.



    Looking at the software that is coming out, I don't think Apple will wait too much longer before quads are released. As soon as they have a chip that will support the memory, they will put together a MoBo for a high-end AV/graphics system and a new Xserve that will take advantage of quads. Apple is releasing Shake, at quite a discount, for OS X. This could bring in a lot of converts IF Apple has the hardware to compete. Right now they don't, but I would be willing to bet Shake, FCP, DVD Studio Pro, and probably WebObjects could all benefit from a quad-processor system, and this would make them competitive in that market.
  • Reply 24 of 39
    yevgenyyevgeny Posts: 1,148member
    [quote]Originally posted by BlueJekyll:

    Everyone keeps harping on four processors not being usable because apps need to be parallelized. This is true if you want a single app to run faster on your box, but what if you have tons of apps all running at the same time? The OS will schedule apps to run on different processors, which means that you can utilize both (or more) processors just by running more applications.

    In any case, with the move to OS X almost every application engineer out there will be utilizing multiple threads, guaranteed. It may take a while, but every app will be multi-threaded.[/quote]



    If you have tons of apps running at the same time (mind you, I mean running and taking up cycles, not just sitting idle in memory), then having 4-8 CPUs would be useful so long as they do not saturate the bus. This is the mail-server case from my response above.



    If you have several instances of Final Cut Pro all performing MPEG compression, then you are in the same boat as with two processors at the same speed, because you are completely limited by bus bandwidth.



    Just because you are writing a threaded program does not mean that you will use threads extensively throughout your code. A far more likely scenario is that you make one particular portion of your code use threads because it can easily be broken up into individual chunks that have nothing to do with each other. There is no way to make a program that uses multiple threads from top to bottom, because you run into data concurrency issues which negate the possible benefits of having more than one CPU. If multiple threads have to share data that they are modifying, then each has to lock the other threads out of that data while it makes the modification. Locked-out threads don't take up CPU cycles, but they don't get work done either. It is like asking how fast four people can mow a lawn with one lawn mower. The answer is that they mow the lawn a bit slower than one person would, because one person has to stop mowing so the next can take over.
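
    To put the lawn-mower analogy in code, here is a toy C/pthreads sketch (my own invention, not from any real app) where four threads all fight over one mutex. Add as many CPUs as you like; the locked section still runs one thread at a time.

    [code]
    #include <pthread.h>
    #include <stdio.h>

    /* One shared counter protected by one mutex: the "single lawn mower".
       Only one thread can hold the lock at a time, so this part of the
       work is effectively serial no matter how many CPUs are present. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_total = 0;

    static void *mow(void *arg)
    {
        long i;
        (void)arg;
        for (i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);     /* other threads block here */
            shared_total++;                /* the only "real" work      */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        int i;
        for (i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, mow, NULL);
        for (i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("total = %ld\n", shared_total);
        return 0;
    }
    [/code]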



    I know that it is difficult for some people to understand that not everything can take advantage of threading, but please understand that this is the case.



    [ 07-23-2002: Message edited by: Yevgeny ]
  • Reply 25 of 39
    yevgenyyevgeny Posts: 1,148member
    [quote]Originally posted by JCG:

    Looking at the software that is coming out, I don't think Apple will wait too much longer before quads are released. As soon as they have a chip that will support the memory, they will put together a MoBo for a high-end AV/graphics system and a new Xserve that will take advantage of quads. Apple is releasing Shake, at quite a discount, for OS X. This could bring in a lot of converts IF Apple has the hardware to compete. Right now they don't, but I would be willing to bet Shake, FCP, DVD Studio Pro, and probably WebObjects could all benefit from a quad-processor system, and this would make them competitive in that market.[/quote]



    From the numbers that I have seen, a modern x86 bus should be able to feed quad G4s running at 1.4 GHz. Such a machine would be rather impressive (and expensive). The key is getting a real bus.



    Yes, all the mentioned programs would benefit from having more CPUs and more bus bandwidth.
  • Reply 26 of 39
    ed m.ed m. Posts: 222member
    Here is a Q and A I had with a PPC designer over at Motorola when I first discovered that the "R" state (MERSI) disappeared with the advent of the 7450.

    Remember, the point of the argument is that the 7400/7410 microarchitecture supports MERSI, and the 7450 microarchitecture supports MESI.



    Q: What good is it to get rid of cache snooping? I always believed that that was one of the MAIN reasons the G4s utilized the bus so much better. I'm not sure I like the sound of the PPC losing the "R"... How will that affect performance?



    A: For various design reasons (reasons that were valid at the time the decision was made), the "R" state was not a good choice to implement on the 7450. Cache snooping still occurs, but the intervention is not processor to processor.



    G4 bus utilization is already better in uniprocessor, so it is not surprising that it would be better in MP with or without the "R" state.



    Remember, a large cache will soak up a lot of potential traffic that would otherwise go to the bus, and allow the processor to continue doing other things. However, if you are talking about performance gains, then programmers also need to add more parallelism into their code structure - something we aren't seeing in a lot of Macintosh apps that are currently shipping. This is a HUGE waste of processor resources. The embedded market seems to understand the architecture better, and there the chip is being utilized in the manner it was designed for.



    Anyway, I'm getting a bit off topic. Besides, I'm not arguing that the "R" state is useless - it depends greatly on how the compiler/OS interact with the data layout of the application/workload. Also, the kind of application/workload will make a big difference, so it depends. If you are interested, perhaps a third party could run these two-way MP systems and prove whether or not MERSI was a big deal (and in which applications). Presumably, anyone who has one of the old MP G4 systems and access to some (or several) MP benchmarks should be able to handle this testing. Lots of companies and people run benchmarks (although not always correctly or accurately), so this probably wouldn't be that hard. I suggest you contact Adobe. Adobe does a lot of benchmarking across many different platforms. Adobe's testing methods are *among* the best (and most accurate) in the industry, and they are geared more toward testing the performance of various desktop implementations.



    End Q/A.

    ---------------------------------------------------------------------------------------------------------------



    These documents might help:



    <a href="http://e-www.motorola.com/brdata/PDFDB/docs/MPC7450UM_CH3.pdf"; target="_blank">http://e-www.motorola.com/brdata/PDFDB/docs/MPC7450UM_CH3.pdf</a>;

    <a href="http://e-www.motorola.com/collateral/SNDFH1102.pdf"; target="_blank">http://e-www.motorola.com/collateral/SNDFH1102.pdf</a>;



    In the meantime, I e-mailed Moto to see if there is a document posted that talks about the differences and describes what to expect as far as MP configurations go. I'm going to assume that the "R" state will return in future processors -- though I could be wrong.



    --

    Ed M.
  • Reply 27 of 39
    franckfranck Posts: 135member
    [quote]Originally posted by Yevgeny:

    From the numbers that I have seen, a modern x86 bus should be able to feed quad G4s running at 1.4 GHz. Such a machine would be rather impressive (and expensive). The key is getting a real bus.[/quote]



    I highly doubt your point is valid.

    On selected AltiVec algorithms, a G4/533 can easily choke the main memory bandwidth (about 700 MB/s) of a PC133 memory subsystem (link: http://arstechnica.infopop.net/OpenTopic/page?q=Y&a=tpc&s=50009562&f=8300945231&m=8790959504&p=3 , 6th post before end of page).

    G4s have been memory starved since day one when using AltiVec. Apple badly needs a high-bandwidth bus in order to show the G4's real performance.

    For those who are interested in AltiVec programming, it's a VERY interesting thread.



    [quote]Yes, all the mentioned programs would benefit from having more CPUs and more bus bandwidth.[/quote]



    Well, a single G4 (< 1 GHz) has to fight memory bandwidth (except maybe for RC5). IMO, if Apple implements DDR333, i.e. 2.5 GB/s between procs and memory (more than doubling effective memory bandwidth), the performance increase should be awesome.

    Of course, I'm not talking about word-processing performance, but multimedia AltiVec-aware apps.



    [quote]Originally posted by Ed M.:

    However, if you are talking about performance gains, then programmers also need to add more parallelism into their code structure - something we aren't seeing in a lot of Macintosh apps that are currently shipping. This is a HUGE waste of processor resources.[/quote]



    Agreed. Moreover, I think PPC processors are highly hindered by BAD compilers. GCC 3.1 may be better than 2.95, but it still sucks compared to x86 compilers. (See the link above for details.)



    [ 07-23-2002: Message edited by: Franck ]
  • Reply 28 of 39
    ed m.ed m. Posts: 222member
    Franck wrote:



    [[[Well, a single G4 (< 1 GHz) has to fight memory bandwidth (except maybe for RC5). IMO, if Apple implements DDR333, i.e. 2.5 GB/s between procs and memory (more than doubling effective memory bandwidth), the performance increase should be awesome.

    Of course, I'm not talking about word-processing performance, but multimedia AltiVec-aware apps. ]]]



    Well, what about forgoing the bus altogether? If you read one of my previous posts, titled "G5 Speculation Revisited", I mentioned a few interesting snippets from some articles. Everything I posted in that thread is traceable back to a legit source, BTW. Anyway, what I found MOST promising was this...



    [[[To maximize data bandwidth and reduce memory latency, Motorola Inc. said it will likely integrate a DRAM controller directly onto a future high-end PowerPC processor ... By doing so, the processor could bypass an external bus and have a direct link to the DRAM. ... "It makes a lot more sense to add high-speed memory controllers on processors," ... "Anytime you have a bus, you have to arbitrate for the bus. Rather than let it go hungry, you could feed the processor as fast as it can be fed." ]]]



    You can find the original topic discussed here:



    <a href="http://forums.appleinsider.com/cgi-bin/ultimatebb.cgi?ubb=get_topic&f=1&t=001997"; target="_blank">http://forums.appleinsider.com/cgi-bin/ultimatebb.cgi?ubb=get_topic&f=1&t=001997</a>;



    --

    Ed M.
  • Reply 29 of 39
    nevynnevyn Posts: 360member
    MERSI & 4x boxes.



    For number crunching, AltiVec is basically the redeeming feature of the G4. At these speeds, it's darn near the only redeeming feature.



    But... an AltiVec unit chews through 128 bits/op. It needs 16 bytes/op -> 16 GB/s on a 1 GHz box, assuming the integer unit(s) and FPU(s) are sitting idle. And AltiVec takes up a relatively tiny patch of silicon... so more than one vector unit on a chip can be done OK. At 16 GB/s the caches are chewed through in no time at all, and it's off to main memory, which blows.
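
    For the curious, a quick back-of-the-envelope check of those numbers in C. It assumes one 16-byte operand per vector op per cycle (real code also reads a second operand and writes a result, which only makes things worse) and a 64-bit MPX bus at 133 MHz.

    [code]
    #include <stdio.h>

    int main(void)
    {
        double clock_hz      = 1.0e9;       /* 1 GHz core                 */
        double bytes_per_op  = 16.0;        /* one 128-bit vector operand */
        double ops_per_cycle = 1.0;         /* one vector op per cycle    */

        double demand = clock_hz * ops_per_cycle * bytes_per_op;
        printf("streaming demand: %.1f GB/s\n", demand / 1.0e9);   /* 16.0 */

        double mpx = 133.0e6 * 8.0;         /* 133 MHz x 64-bit MPX bus   */
        printf("MPX can supply:   %.1f GB/s\n", mpx / 1.0e9);      /* ~1.1 */
        return 0;
    }
    [/code]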



    Two G4s operating as an SMP machine on the same bus -> twice as resource starved.



    But... what if it's _not_ SMP, but asymmetric multiprocessing? That is, some chips have (slightly) different access to resources, or at least access to different memory. Like each G4 is connected to one of the DDR slots, and all the G4s are connected together across some sort of high-bandwidth bus. Like... RapidIO, or HyperTransport.



    At that point, you say 'gosh, MERSI is smoke, to heck with it.' MERSI makes sense when your memory-bandwidth-to-memory-consumption ratio is high, but when you are starved for bandwidth...



    Note also that there's a clear demand for high-end machines in the niche(s) Apple has been targeting recently. Would people pay $10,000 for a BLAST box with very scary throughput? An effects machine? Some of the 'low-end workstations' and 'low-end servers' cost a LOT more than that. When you talk about price per chip, the G4 is a bargain. How many will it take to make a seriously competitive box? For me, it's about 4. And as long as there's _some_ proven market, Apple can sell enough to cover R&D while making the OS better and better at MP.
  • Reply 30 of 39
    programmerprogrammer Posts: 3,458member
    [quote]Originally posted by Nevyn:

    MERSI & 4x boxes.

    ...while making the OS better and better at MP.[/quote]



    The hardware needs to get better at MP first. Since (currently) all memory transactions must cross the MPX bus, the MPX puts an upper limit on the overall bandwidth of the system. If you double the number of processors in the system, each processor's bandwidth is halved. Considering most algorithms are currently bandwidth starved, it makes little sense to go beyond two processors. Even the duals wouldn't make much sense, except that there are usually a few things going on in the system that can keep the second processor busy and playing happily in its local caches.



    As mentioned above and in another thread, one solution could be a G4 w/ MPX bus & on-chip memory controller. This would be an intermediate step to the future advanced bus (RapidIO, HyperTransport, whatever). This is preferable to an asymmetric setup because each processor then has the opportunity to run from its own local (fast) memory. In the asymmetric setup you improve one processor, but the rest are the same as the current situation. And that one faster one is hamstrung by constantly having to serve the others (and the I/O & graphics systems).
  • Reply 31 of 39
    bungebunge Posts: 7,329member
    Any chance in hell Quartz Extreme will be able to use the AGP slot to pull info directly from RAM, process it on the GPU, and then move it to the CPU (if necessary) to combine it with other data that's been pulled in across the MPX bus?



    Any chance in hell that Mac OS X Server will have a kernel capable of more than 2 processors while the standard client OS keeps the 2-processor kernel?
  • Reply 32 of 39
    programmerprogrammer Posts: 3,458member
    [quote]Originally posted by bunge:

    Any chance in hell Quartz Extreme will be able to use the AGP slot to pull info directly from RAM, process it on the GPU, and then move it to the CPU (if necessary) to combine it with other data that's been pulled in across the MPX bus?[/quote]



    The only communications channel to the CPU is the MPX bus, so if you're thinking this would be a way around the MPX speed limit... it's not. Having the GPU read across AGP, do some computations, and then write back across AGP is feasible, though. There are some performance issues related to caching and AGP bottlenecks, but conceptually it's possible.



    [quote]Any chance in hell that Mac OS X Server will have a kernel capable of more than 2 processors while the standard client OS keeps the 2-processor kernel?[/quote]



    Yes. People keep talking about the Darwin kernel configuration like it is some huge deal... I haven't looked at it, but I doubt that it really is. I'm sure that internally, at least, Apple has experimental versions tweaked to run on >2 processors. This is just the kernel we're talking about, which is the core scheduling algorithm, and it's not really that big a piece of code. The whole idea behind Mach, after all, is that it is a microkernel. The design may have bloated a bit for performance reasons, but the amount of code which cares about the exact number of processors is small.
  • Reply 33 of 39
    Just a thought: people have been doing a hostinfo on their builds of Jag and seeing "kernel supports 2 processors". Could it be possible that the installer only installs a 4-proc kernel if it detects a flag in the firmware or something? Perhaps one of the packages has the answer. Unfortunately I don't have a build to check. (I only have one Mac and I need it to work, so I can!)



    Otherwise... a memory controller moved from the north bridge to the CPU would make for some pretty interesting board layouts. Perhaps that mystery pic floating around moved the drive bays down to permit a much larger processor daughtercard.



    Ah, blind speculation...
  • Reply 34 of 39
    eskimoeskimo Posts: 474member
    [quote]Originally posted by Franck:

    What has MOESI to do with AMD ?[/quote]



    That's AMD's form of cache snooping for multiprocessor use. The joke was a nod to all the rampant rumors here that Apple was going to switch to AMD processors - thus, who cares what Motorola's cache snoop protocol is, since you'd have dual AMD chips in your computer. It was funnier when I typed it.
  • Reply 35 of 39
    woozlewoozle Posts: 64member
    [quote]Originally posted by Yevgeny:

    From the numbers that I have seen, a modern x86 bus should be able to feed quad G4s running at 1.4 GHz. Such a machine would be rather impressive (and expensive). The key is getting a real bus.

    Yes, all the mentioned programs would benefit from having more CPUs and more bus bandwidth.[/quote]





    A modern x86 bus (P4 533, PC1066 RDRAM) can do about 4100 MB/s. That's peak, not sustained; sustained is at best half that.



    The guys in the know on comp.arch think that a bandwidth of 1 byte per flop is quite reasonable for a home computer.



    Apple claims that AltiVec can do 15 GFLOPS with two CPUs, meaning it needs 15 GB/s to maintain that desired ratio.



    That's a pretty fast bus; video cards have only just broken that speed, with 256-bit wide 300 MHz DDR.



    So I don't think we'll see the G4 achieve its full capabilities anytime soon, but it does mean that any improvement in the bus should translate into improved performance for AltiVec code (on the other hand, I don't believe the G4 integer or FPU units are really held back by memory at the moment).
  • Reply 36 of 39
    nevynnevyn Posts: 360member
    [quote]Originally posted by Programmer:

    The hardware needs to get better at MP first. Since (currently) all memory transactions must cross the MPX bus, the MPX puts an upper limit on the overall bandwidth of the system. If you double the number of processors in the system, each processor's bandwidth is halved. Considering most algorithms are currently bandwidth starved, it makes little sense to go beyond two processors. Even the duals wouldn't make much sense, except that there are usually a few things going on in the system that can keep the second processor busy and playing happily in its local caches.

    As mentioned above and in another thread, one solution could be a G4 w/ MPX bus & on-chip memory controller. This would be an intermediate step to the future advanced bus (RapidIO, HyperTransport, whatever). This is preferable to an asymmetric setup because each processor then has the opportunity to run from its own local (fast) memory. In the asymmetric setup you improve one processor, but the rest are the same as the current situation. And that one faster one is hamstrung by constantly having to serve the others (and the I/O & graphics systems).[/quote]



    Exactly.



    In my asymmetric (potentially crazy) plan, though, it isn't that any one proc has better throughput to memory, just better throughput to PCI-land or other hardware.



    Motorola's made a comment about 'adding a memory controller to the chip'. Ok.

    Motorola's made a comment that implies MPX is _the_ bus for now & on into the future.

    MPX simply can't run at 133 MHz and do anything useful for Apple.



    But... what if MPX is just one of the buses, and that bus is the one used for PCI stuff, while the 'on-chip memory controller' is running memory through a 'RIO' bus?



    As soon as one processor has all the PCI widgets hanging off of it and the other doesn't have perfectly equal access, it's asymmetric. But access to _memory_ is no longer limited by the bus off the CPU. SMP is predicated on all CPUs having full and equal access to main memory. What if each CPU has its own _bank_ of main memory, and access to the other CPUs' banks through a slower bus? It starts adding more wires, but the point of RIO/HT buses is switching from accessing things in parallel to accessing them over an extremely fast serial link -> many fewer motherboard traces.
  • Reply 37 of 39
    programmerprogrammer Posts: 3,458member
    [quote]Originally posted by Nevyn:

    But... what if MPX is just one of the buses, and that bus is the one used for PCI stuff, while the 'on-chip memory controller' is running memory through a 'RIO' bus?[/quote]



    The on-chip memory controller is directly connected to memory, so no real "bus" is required. The MPX stays so that other parts of the system (I/O, AGP, other CPUs) can get at the memory attached to the processor. Eventually MPX will be replaced by RIO or HT.



    This is "NUMA" -- Non-Uniform Memory Access. Some memory is connected to the processor, some is accessed via a bus. The OS handled the burden of moving memory the a processor will need into that processor's local memory. This would likely be part of the virtual memory subsystem and DMA engines would actually handle the transfers. The OS would probably try to keep a given process bound to a particular processor so that a minimum amount of inter-processor paging would be required.



    [ 07-25-2002: Message edited by: Programmer ]
  • Reply 38 of 39
    outsideroutsider Posts: 6,008member
    I can see the problem with Apple making a motherboard for a PPC that has a built-in memory controller. With a single processor it will work fine: one processor accessing the memory on the motherboard. But adding another processor complicates the matter. With no method of keeping track of changes the other processor made to memory addresses, you'll get corrupted information in memory. And if you use the MPX bus to go through the other processor's memory controller and check the tags before accessing the memory, you lose the speedup of having an on-die memory controller.



    What if you removed all the tags from the processor and put them on a separate chip? This chip would be small, but even a single-processor module would need it for the memory controller to work. That way you could have 1, 2, or 4 processors using the same memory banks, and the only added latency would be the extra cycles it takes to check the external memory tags. Would this work?
  • Reply 39 of 39
    programmerprogrammer Posts: 3,458member
    [quote]Originally posted by Outsider:

    I can see the problem with Apple making a motherboard for a PPC that has a built-in memory controller. With a single processor it will work fine: one processor accessing the memory on the motherboard. But adding another processor complicates the matter. With no method of keeping track of changes the other processor made to memory addresses, you'll get corrupted information in memory. And if you use the MPX bus to go through the other processor's memory controller and check the tags before accessing the memory, you lose the speedup of having an on-die memory controller.

    What if you removed all the tags from the processor and put them on a separate chip? This chip would be small, but even a single-processor module would need it for the memory controller to work. That way you could have 1, 2, or 4 processors using the same memory banks, and the only added latency would be the extra cycles it takes to check the external memory tags. Would this work?[/quote]



    I don't think this is really a problem -- the existing bus-snooping solution could be used, with some loss of efficiency. One possible improvement might be for the on-chip memory controller to track cachelines (or possibly memory pages?) that have been remotely accessed (i.e. by something other than the processor that "owns" the memory controller); an MPX transaction would then only be needed for lines that have been. If a line hasn't been touched by anything except the local processor, it could be accessed at full speed (i.e. without waiting for an MPX transaction to clear). This is after 5 seconds of thought on my part; I'm sure there are many other solutions that have been developed by a great many smart engineers. You'd have to do a bit of research to see what the actual solutions are in practice. NUMA isn't a new thing - the idea has been kicking around for some time now.
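
    A rough sketch of what I mean, strictly a toy model and not a real controller design: keep one "touched remotely" flag per cacheline, and only pay for an MPX transaction on lines that a remote agent has actually used.

    [code]
    #include <stdio.h>

    #define LINES 1024

    static int remote_touched[LINES];       /* one flag per cacheline (toy) */

    enum requester { LOCAL_CPU, REMOTE_AGENT };

    static const char *serve(int line, enum requester who)
    {
        if (who == REMOTE_AGENT) {
            remote_touched[line] = 1;       /* remember the remote access   */
            return "snoop + MPX transaction";
        }
        return remote_touched[line] ? "wait for MPX to clear"
                                    : "full-speed local access";
    }

    int main(void)
    {
        printf("%s\n", serve(7, LOCAL_CPU));    /* full-speed local access  */
        printf("%s\n", serve(7, REMOTE_AGENT)); /* snoop + MPX transaction  */
        printf("%s\n", serve(7, LOCAL_CPU));    /* wait for MPX to clear    */
        return 0;
    }
    [/code]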