Why dual processor


Comments

  • Reply 21 of 64
    Quote:

    Originally posted by Zapchud

    Great. Do you have any numbers?





    I have absolutely no Apple-specific numbers (I've never worked for Apple at all!!), but from my experience, a quad mobo costs about $400 while a single mobo costs $150 (I'm not talking about crappy boards here), and a top-end CPU costs $300-350 while a low-performance CPU costs somewhere between $50 and $80.

    These prices are NOT huge-quantity prices (I've never designed systems for high-volume production), but I think they give a good idea anyway.



    BTW, I'm sorry about the "sorry" thing! That didn't belong here...
  • Reply 22 of 64
    slughead Posts: 1,169
    Dual is the best price point for super-powered processors, but for mid-range processors you can get more power for less.



    Three $100 Athlon MP 2000s are way faster than two $145 Athlon MP 2600s.



    Performance/price is never highest at either end of the processor spectrum.



    This is where the future is--not with 1 or 2 uber chips, but with multiple, smaller, low power chips.



    Intel didn't get it, AMD did. That's why Intel is totally 0wn3d.
  • Reply 23 of 64
    Quote:

    Originally posted by The One to Rescue

    I have absolutely no Apple-specific numbers (I've never worked for Apple at all!!), but from my experience, a quad mobo costs about $400 while a single mobo costs $150 (I'm not talking about crappy boards here), and a top-end CPU costs $300-350 while a low-performance CPU costs somewhere between $50 and $80.



    How do you define a top-end CPU? One with clock frequency pumped to spectacular levels, with an average transistor count/die size (like the Xeon or Opteron)? Or perhaps a lower-clocked monster like the POWER5 chips?



    If we use your numbers, a quad processor monster would have these price properties:

    CPUs: $1200

    MoBo: $400

    = $1600



    Or more efficiently, with lower-clocked CPUs:

    CPUs: $800

    MoBo: $400

    = $1200



    I don't know about you, but I don't find the motherboard to be insignificant, especially not if we go with somewhat lower-clocked processors. It's not the biggest factor in the equation, that's for sure. It adds up, though.
  • Reply 24 of 64
    I think the problem is not the cost but that there's no market (or too small a one).
  • Reply 25 of 64
    Yes, there probably is a small (or no) market because of the cost.
  • Reply 26 of 64
    The Daystar Genesis was first a dual and, with added CPU cards, became a quad (for US$12,000). Discontinued in 1996. And, given the deficiencies of the MPX bus, it was a relatively poor bang/buck upgrade for most tasks.



    IBM has some POWER4/5 systems in quad or octo configurations, but they're not cheap.



    There are rumours of quad G5 mules (but you thought the duals were hot/noisy with 9 fans!).



    We are far more likely to see 'quads' as dual dual-core CPUs than as four single G5s.



    As for tri-processor systems, I'd be surprised if that were seamless in a predominantly binary environment, but I suppose one might consider the ApplePi (system controller) ASIC a third 'processor' if you really stretch the definition.
  • Reply 27 of 64
    Quote:

    Originally posted by Zapchud

    How do you define a top-end CPU? One with clock frequency pumped to spectacular levels, with an average transistor count/die size (like the Xeon or Opteron)? Or perhaps a lower-clocked monster like the POWER5 chips?





    Top-end CPU = PPC970-like CPU, here.



    Quote:



    I don't know about you, but I don't find the motherboard to be insignificant, especially not if we go with somewhat lower-clocked processors. It's not the biggest factor in the equation, that's for sure. It adds up, though.




    Agreed! It adds up. But I believe $300 is not a big difference when dealing with high-end workstations. ~$1000 is...
  • Reply 28 of 64
    Quote:

    Originally posted by The One to Rescue

    Top-end CPU = PPC970-like CPU, here.







    Agreed! It adds up. But I believe $300 is not a big difference when dealing with high-end workstations. ~$1000 is...




    When considering G5-based machines, the increased cost of more than two processors is considerable, as a new northbridge would be required. The G5's interface requires the northbridge to do snooping to maintain cache coherence across multiple processors, so the northbridge in a four-processor machine would need four processor interfaces, and so on. The cost of developing a new northbridge for use in only a small number of machines would be prohibitive. On top of this, the memory interface for such a machine would have to be changed to provide enough bandwidth for all the processors. All this extra complexity would also mean a vast number of extra traces on the motherboard, and therefore more layers, again rapidly increasing the cost.



    Xeon-based multiprocessor machines use a bus, and thus can use a single connection to the northbridge for any number of processors. The disadvantage of this technique is that the bus bandwidth does not scale with the number of processors but actually decreases as you add them, since the bus needs to run more slowly to cope with all the discontinuities the processors present on the bus.



    Opteron-based multiprocessors use yet another technique, based on multiple point-to-point links between processors and tunneling through intermediate processors, with memory bandwidth scalable with the number of processors. This, to my mind, is the most elegant solution, and it significantly reduces the cost of adding effective processors, but it is currently limited to a maximum of only eight processors because of addressing limits built into the processors.



    michael
  • Reply 29 of 64
    Doesn't xGrid accomplish the same thing by connecting multiple computers to act as one? In environments where more than one person is networked together, it seems like this would be the best of both worlds... affordable individual computers and the extra performance of multiple processors.
  • Reply 30 of 64
    shawk Posts: 116
    I assembled a G5 dual 2.5GHz with an xServe and two xServe cluster nodes.

    If you need high performance computing, this is a good bet.

    The performance was quite brisk.



    As the result pinned the xGrid tachometer at 8GHz, I estimate the actual xGrid performance to be about 12GHz.

    More cluster nodes will be added.



    Not future hardware, exactly.

    Sometimes the best solution is one that is available.
  • Reply 31 of 64
    Quote:

    Originally posted by jaslu81

    Doesn't xGrid accomplish the same thing by connecting multiple computers to act as one? In environments where more than one person is networked together, it seems like this would be the best of both worlds... affordable individual computers and the extra performance of multiple processors.



    There is a big performance gap between shared-memory machines and network-based grid computing -- both in terms of latency and bandwidth. xGrid is useful for some problems, but many require tighter hardware integration between the processors.





    One issue which I didn't see mentioned (apologies if I missed it) is that adding more processors requires more inter-processor communication and synchronization. This results in diminishing performance returns as you add more processors (indeed, in some pathological cases you'll find that adding processors slows the system down). Multiple cores on a chip usually improve the rate of communication between those cores compared to inter-chip rates, so the effects of this will depend more on the number of chips than on the number of cores.
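

    To put rough numbers on that, the usual back-of-the-envelope model is Amdahl's law: if a fraction p of the work can be spread across n processors, the best possible speedup is 1/((1-p) + p/n). A minimal C sketch (the 90% parallel figure is just an illustrative assumption):

    Code:

        #include <stdio.h>

        /* Amdahl's law: the upper bound on speedup when a fraction p of
           the work parallelizes perfectly across n processors. */
        static double amdahl(double p, int n)
        {
            return 1.0 / ((1.0 - p) + p / n);
        }

        int main(void)
        {
            int n;
            for (n = 1; n <= 8; n *= 2)
                printf("%d processors: %.2fx\n", n, amdahl(0.9, n));
            return 0;
        }

    With 90% parallel work that prints roughly 1.00x, 1.82x, 3.08x and 4.71x for 1, 2, 4 and 8 processors -- exactly the diminishing-returns curve described above.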
  • Reply 32 of 64
    slugheadslughead Posts: 1,169member
    Quote:

    Originally posted by Programmer

    There is a big performance gap between shared-memory machines and network-based grid computing -- both in terms of latency and bandwidth. xGrid is useful for some problems, but many require tighter hardware integration between the processors.





    One issue which I didn't see mentioned (apologies if I missed it) is that adding more processors requires more inter-processor communication and synchronization. This results in diminishing performance returns as you add more processors (indeed, in some pathological cases you'll find that adding processors slows the system down). Multiple cores on a chip usually improve the rate of communication between those cores compared to inter-chip rates, so the effects of this will depend more on the number of chips than on the number of cores.




    yeah but what about the advantages of multi-procs in multithreading?



    And at what point do more processors slow the computer down? What are the dependencies?



    This stuff is incredibly interesting to me
  • Reply 33 of 64
    A Quadra design might surely be overkill for the average home user in both price and complexity; however, for a single-tower workstation configuration, Apple has plenty of competition from Alienware, with just about dual everything available, including onboard RAID.



    For those with even deeper pockets, the XServes cover just about anything you might dream up in the near future.



    Even so, with all this, it's going to be a while before 64-bit software availability catches up with the computing power that we already have.



    Re-writing that software to take advantage of 64-bit Quadra configurations will surely boost the need for much additional training as well.



    The price of that Daystar configuration is relative to the times, when a 300MHz G3 tower was $3,000.00, a CD burner was $1,500, and a DVD burner was still a wet dream.



    Last week I priced out a "dream G5 2.5 system" at well over $12,000, and I'm sure it would blow away the Daystar with ease.



    So to answer your question about "why dual processor," my best response is: why not?
  • Reply 34 of 64
    Quote:

    Originally posted by slughead

    yeah but what about the advantages of multi-procs in multithreading? And at what point do more processors slow the computer down?



    The amount of interaction between threads is a function of what they are doing, the impact this has on performance depends on the hardware and operating system. If the software is multi-threaded but the threads are completely serialized (probably by poor software design) then you will see zero performance advantage by adding processors, and you may lose performance because of having multiple caches. If, on the other hand, the software's threads are completely independent and don't touch the same memory or resources (rarely the case) then multiple processors can give you a linear performance increase (i.e. 2 processors is twice as fast as 1, 3 is three times as fast, etc). The usual case is somewhere in between these two extremes.
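

    Here is a tiny pthreads sketch of those two extremes (the names and iteration counts are made up for illustration): two threads bumping private counters scale almost linearly, while the same threads forced through one mutex are effectively serialized again:

    Code:

        #include <pthread.h>
        #include <stdio.h>
        #include <sys/time.h>

        #define ITERS 10000000L

        static long independent[2];   /* one counter per thread  */
        static long shared;           /* one counter for both    */
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        /* Fully independent threads: no sharing, so two processors
           can give close to a 2x speedup. */
        static void *run_independent(void *arg)
        {
            long *mine = &independent[(long)arg];
            long i;
            for (i = 0; i < ITERS; i++)
                (*mine)++;
            return NULL;
        }

        /* Fully serialized threads: the mutex means only one thread
           makes progress at a time, so a second processor buys nothing. */
        static void *run_serialized(void *arg)
        {
            long i;
            (void)arg;
            for (i = 0; i < ITERS; i++) {
                pthread_mutex_lock(&lock);
                shared++;
                pthread_mutex_unlock(&lock);
            }
            return NULL;
        }

        /* Run two threads of the given function and return elapsed seconds. */
        static double run_pair(void *(*fn)(void *))
        {
            pthread_t t[2];
            struct timeval a, b;
            long i;
            gettimeofday(&a, NULL);
            for (i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, fn, (void *)i);
            for (i = 0; i < 2; i++)
                pthread_join(t[i], NULL);
            gettimeofday(&b, NULL);
            return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
        }

        int main(void)
        {
            printf("independent: %.2fs\n", run_pair(run_independent));
            printf("serialized:  %.2fs\n", run_pair(run_serialized));
            return 0;
        }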



    Quote:



    what are the dependencies?









    Anything that must be shared. If two threads have to write to the same memory location they must synchronize (i.e. agree who goes first); otherwise one could stomp the other's results. In deeply pipelined processors, the cost of this synchronization can be considerable -- if the PPC970 has 200+ instructions in flight and has to scrub what it's doing because somebody else got access first, then it's possible that all those instructions have to be thrown away and done over again (admittedly this is a simplification, but it gives you some idea of what goes on). It is definitely something you want to avoid doing too much of; otherwise you'll quickly find that you spend more time synchronizing than doing real work.
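

    To see the "stomping" concretely, here is a minimal sketch with the synchronization deliberately left out (the counts are arbitrary):

    Code:

        #include <pthread.h>
        #include <stdio.h>

        #define ITERS 1000000L

        static volatile long counter;   /* shared and deliberately unprotected */

        /* Each ++ is really a load, an add, and a store; two processors can
           interleave those steps so one store overwrites the other's result. */
        static void *racer(void *arg)
        {
            long i;
            (void)arg;
            for (i = 0; i < ITERS; i++)
                counter++;
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, racer, NULL);
            pthread_create(&t2, NULL, racer, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            /* On a dual, this almost always prints less than 2000000. */
            printf("%ld\n", counter);
            return 0;
        }

    Run it a few times on a dual and the total comes up short, because increments from the two processors overwrite each other.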



    Accessing the file system, network (or another part of the OS) is another common case -- there is only one disk so they need to take turns. If the system isn't very smart then two threads writing at the same time could take turns at a sector level (usually 4K bytes) and you end up spending most of your time seeking. Having them take longer turns (i.e. writing 400K instead of 4K) means less seeking and therefore much higher performance...



    Consider a really nasty case, such as playing 2 QuickTime movies from a DVD at the same time. Typical DVD seek times are about 100ms (1/10 of a second), and the read rate is about, say, 2 MB/sec. Discs are formatted in 2K blocks so that is the smallest read possible at a hardware level. If there were two threads and they each read 2K, used it, read 2K more, used that, etc (and your OS was pretty dumb) then you might see one thread read 2K from one location, the other thread read 2K from another location, and so on. The reading of a 2K block takes 1ms (1/1000 of a second) at 2MB/sec, but the seek between locations takes 100ms. This means that in 1 second you can do just about 10 seeks + 10 reads, for which you'll get 20K of data. If you were to have one thread read for half a second, then the other thread read for half a second you would have done 2 seeks and spent the rest of the 800ms reading... for a total of about 1.6 MB. So just by changing the order in which you do things you've increased your realized performance with the same hardware from 20K/second to 1.6 MB/sec... an 80x improvement.
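

    The same arithmetic as a throwaway C calculation, using the assumed figures above (100ms seeks, 2MB/sec reads, 2K blocks):

    Code:

        #include <stdio.h>

        int main(void)
        {
            const double seek  = 0.100;               /* 100 ms per seek */
            const double rate  = 2.0 * 1024 * 1024;   /* 2 MB/sec reads  */
            const double block = 2048;                /* 2K per block    */

            /* Interleaved: every 2K block costs a seek plus a ~1ms read. */
            double per_block = seek + block / rate;
            printf("interleaved: %.0f KB/sec\n", (block / per_block) / 1024);

            /* Half-second turns: 2 seeks per second, the rest spent reading. */
            double reading = 1.0 - 2 * seek;
            printf("long turns:  %.2f MB/sec\n", reading * rate / (1024 * 1024));
            return 0;
        }

    It prints about 20 KB/sec for the interleaved case and 1.60 MB/sec for the long turns -- the same 80x gap.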



    Of course this ignores things like whether the second thread can afford to wait half a second before it starts reading, is there anybody else reading at the same time, and other fun stuff that comes along to complicate life -- like how read rates and seek times vary depending on where you are on the disk, how far you are seeking, whether you get any read errors, and simply differences between different drives and disks even of the same type.





    Even without going to the OS or I/O, things are very complex. Multiple processors each have their own caches, and there might be shared caches. In most systems the memory is a shared resource as well. Fetching something from another processor's cache isn't particularly fast, and it can interfere with the other processor's access to its own cache. Main memory can usually only have a limited number of pages open at once (4-8 is common), so if you have multiple processors accessing memory in different locations they can get in each other's way because the memory controller is busy opening and closing these pages as they are needed. In systems where memory is not shared you start to incur more costs when you do need to pass data back and forth between the separate memories. Often all these are fairly minor costs, but they add up... very quickly in machines which do billions of operations per second.
  • Reply 35 of 64
    And hoping that Apple is like-minded on dual cores, this Register article indicates that dual core is just around the corner.



    http://www.theregister.co.uk/2004/11/24/microsoft_dell_amd/
  • Reply 36 of 64
    Hi all - my first post on these forums. Just for the record: I don't have a Mac (boo!), but I am waiting for the G5 PowerBook. Oh, and the imminent announcement by SJ regarding MacOSX on x86.



    Anyway, interesting thread, but there are some inaccuracies.



    Quote:

    Originally posted by Programmer

    ... you may lose performance because of having multiple caches



    You may lose performance if the OS scheduler runs both threads on the same processor; if they run on separate processors, performance will increase because of reduced cache thrashing and fewer context switches.



    Quote:

    Originally posted by Programmer

    If two threads have to write to the same memory location they must synchronize



    They don't have to synchronize - but it is recommended! For certain types of variables ( sizeof(word) ) you can use atomic operations, which do not require synchronization. Synchronization is expensive because of the user-land to kernel-space transition.

    On Unix you can time a command, say "time ls", to find out how much time an app spends in user-land vs kernel time.



    Quote:

    Originally posted by Programmer

    Accessing the file system, network (or another part of the OS) is another common case -- there is only one disk so they need to take turns. If the system isn't very smart then two threads writing at the same time could take turns at a sector level (usually 4K bytes) and you end up spending most of your time seeking. Having them take longer turns (i.e. writing 400K instead of 4K) means less seeking and therefore much higher performance...



    Now, this bit I don't understand why you wrote. Unless you are running an OS written by a 9-year-old, disk I/O is buffered. On a typical machine there will be hundreds of threads running, and they don't need to take "turns".



    Quote:

    Originally posted by Programmer

    Fetching something from another processor's cache isn't particularly fast, and it can interfere with the other processor's access to its own cache.




    Fetching something from another processor's cache is impossible, be it L2 or L1. A challenge in SMP design is when cache coherency is lost. In that case the OS has to flush the cache of the processor with the old data. Expensive.



    Quote:

    Originally posted by Programmer

    Main memory can usually only have a limited number of pages open at once (4-8 is common).




    You must be joking, sir! All of main memory is made up of pages. I think you're confusing it with virtual memory here. If the OS runs out of real RAM it can swap pages to disk as needed, the number of which is limited only by the address space. All of the pages are "open" all of the time. If by "open" you meant residing in main memory, then the number of open pages is 1GB (the amount of RAM in my machine) / 4K (the page size). Of course, the algorithms deciding which pages to swap and when are the subject of ongoing research. See, for example, lkml.org for recent posts on the issue.



    Quote:

    Originally posted by Programmer

    ... so if you have multiple processors accessing memory in different locations they can get in each other's way because the memory controller is busy opening and closing these pages as they are needed.




    The memory controller's job is to move cache lines (the unit of cache swapping) between RAM and the L1/L2 caches. The problem here is not only bus contention but bandwidth as well. AMD Opteron processors come with an onboard memory controller, which eliminates contention.



    As regards the original post: for PCs the reason is simple - cost. Most people are happy enough with their machine's performance. Games today are more sensitive to the GPU anyway - getting a 20% faster CPU will not bring about a 20% FPS increase; overclocking the GPU will. I got myself a dual proc when I was doing some CPU-intensive stuff, and the performance is, of course, better, but not 100% better.
  • Reply 37 of 64
    Those weren't inaccuracies, thank you very much:



    Quote:

    Originally posted by UnixPoet

    You may lose performance if the OS scheduler runs both threads on the same processor; if they run on separate processors, performance will increase because of reduced cache thrashing and fewer context switches.



    No, I meant what I said. If two threads on different cores are using the same cacheline then they end up sending it back and forth across the bus. In the single processor case it stays in the one cache.
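

    The classic demonstration, as a minimal sketch (the 128-byte pad assumes a 970-sized cache line, and that the struct doesn't straddle a line boundary):

    Code:

        #include <pthread.h>
        #include <stdio.h>

        #define ITERS 50000000L

        /* 'a' and 'b' share a cache line; the pad puts 'c' on its own. */
        static struct {
            volatile long a;
            volatile long b;
            char pad[128];      /* assumed cache-line size */
            volatile long c;
        } s;

        static void *bump(void *p)
        {
            volatile long *n = p;
            long i;
            for (i = 0; i < ITERS; i++)
                (*n)++;
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;

            /* Same line: every write forces the cacheline to migrate
               between the two processors' caches. */
            pthread_create(&t1, NULL, bump, (void *)&s.a);
            pthread_create(&t2, NULL, bump, (void *)&s.b);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            /* Separate lines: each processor keeps its own line. */
            pthread_create(&t1, NULL, bump, (void *)&s.a);
            pthread_create(&t2, NULL, bump, (void *)&s.c);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            printf("%ld %ld %ld\n", s.a, s.b, s.c);
            return 0;
        }

    Time the first pair of threads against the second and the difference is dramatic, even though both pairs do exactly the same amount of work.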



    Quote:

    They don't have to synchronize - but it is recommended! For certain types of variables ( sizeof(word) ) you can use atomic operations, which do not require synchronization. Synchronization is expensive because of the user-land to kernel-space transition.

    On Unix you can time a command, say "time ls", to find out how much time an app spends in user-land vs kernel time.



    Atomic operations aren't actually atomic on modern processors -- there is no such thing as a read/modify/write cycle. Go read the PowerPC manual.



    Synchronization that requires a user/kernel transition is extremely expensive. I was talking about the more efficient user-space kind. It is still expensive, more than most realize.



    Quote:

    Now, this bit I don't understand why you wrote. Unless you are running an OS written by a 9-year-old, disk I/O is buffered. On a typical machine there will be hundreds of threads running, and they don't need to take "turns".



    I was merely trying to demonstrate how different factors can affect performance. And threads need to take turns any time there are more of them than available hardware resources.



    Quote:

    Fetching something from another processor's cache is impossible, be it L2 or L1. A challenge in SMP design is when cache coherency is lost. In that case the OS has to flush the cache of the processor with the old data. Expensive.



    Not true -- the PowerPC will do a direct processor-to-processor transfer if one processor holds a cache line that the other wants. It's part of the MERSI protocol. If only the MESI protocol is supported then the situation is worse, because the cacheline has to go back to memory and then from there to the requesting processor.



    Quote:

    You must be joking, sir! All of main memory is made up of pages. I think you're confusing it with virtual memory here. If the OS runs out of real RAM it can swap pages to disk as needed, the number of which is limited only by the address space. All of the pages are "open" all of the time. If by "open" you meant residing in main memory, then the number of open pages is 1GB (the amount of RAM in my machine) / 4K (the page size). Of course, the algorithms deciding which pages to swap and when are the subject of ongoing research. See, for example, lkml.org for recent posts on the issue.



    No, you are confusing virtual memory pages with what I'm talking about -- the internal operation of how the memory controller communicates with physical memory.



    Quote:

    The memory controller's job is to move cache lines (the unit of cache swapping) between RAM and the L1/L2 caches. The problem here is not only bus contention but bandwidth as well. AMD Opteron processors come with an onboard memory controller, which eliminates contention.



    The memory controller also (oddly enough) "controls" memory. It manages the control signals required to fetch the stored data from the RAM chips. Latency is another issue -- how long it takes a memory transaction to complete from the time it is initiated.



    The Opteron's memory controller does a good job of reducing latency... unless you have a dual CPU system with a single operational memory controller. In that case the second processor is still using an off-chip memory controller and there is again contention. And most dual Opterons use that configuration, as opposed to a system with multiple operational memory controllers and banks of memory (in which case the OS will need to figure out how to move which virtual memory pages to which bank of memory in order to optimize access).
  • Reply 38 of 64
    slugheadslughead Posts: 1,169member
    AHHH!! MY BRAIN!





    You guys rock.





    Every time I think I know a lot about computers, I get mentally served
  • Reply 39 of 64
    Quote:

    Originally posted by Zapchud

    This has been, and still is, something of a "chicken and egg" problem.




    Not at all!



    Quote:

    Originally posted by Zapchud

    If there are few processors capable of taking advantage of multi-threaded code, why multi-thread code? If there is little multi-threaded code out there, why bother implementing more cores on the chip?




    Programmers use multi-threaded code not just to take advantage of multiple CPUs but also because such a design is cleaner/more elegant/logical. Multi-threaded apps work fine on single-CPU systems. There is more multi-threaded code out there than you think. All of the CPU-heavy apps use threads: Photoshop, Max, Lightwave, most games (oops! this is a Mac forum, sorry ). Hell, even Word does.



    Quote:

    Originally posted by Programmer

    If two threads on different cores are using the same cacheline then they end up sending it back and forth across the bus. In the single processor case it stays in the one cache.





    If two threads are heavily accessing the same cache-line then the code was probably written badly, and they should be using synchronization anyway.

    Maybe they are using a global variable? And if the accesses are reads there will be no impact.



    Quote:

    Originally posted by Programmer

    Atomic operations aren't actually atomic on modern processors -- there is no such thing as a read/modify/write cycle. Go read the PowerPC manual.



    Oh dear! Atomic operations are called such because, even if they are converted into multiple micro-ops (on a CISC CPU), the operation is not interruptible and is guaranteed to finish. It's an all-or-nothing operation - think of it as a transaction.



    On the PowerPC arch you need to use lwarx and stwcx, which, according to http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixassem/alangref/lwarx.htm#HDRC050214982JEFF can be used for "atomically loading and replacing a word in storage". The two ops basically define transaction boundaries, so from the point of view of the OS/application, taken together, they are atomic.
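

    For anyone following along, the canonical lwarx/stwcx. sequence looks something like this sketch of an atomic increment (GCC inline-assembly syntax and PowerPC-only, obviously; the constraint details are from memory, so check them against the manual before trusting them):

    Code:

        #include <stdio.h>

        /* Sketch: atomic increment via lwarx/stwcx. Returns the new value. */
        static int atomic_inc(volatile int *addr)
        {
            int tmp;
            __asm__ __volatile__(
                "1: lwarx   %0,0,%1  \n"  /* load word and set a reservation  */
                "   addi    %0,%0,1  \n"  /* modify the value in a register   */
                "   stwcx.  %0,0,%1  \n"  /* store only if reservation held   */
                "   bne-    1b       \n"  /* lost the reservation: start over */
                : "=&r"(tmp)
                : "r"(addr)
                : "cr0", "memory");
            return tmp;
        }

        int main(void)
        {
            volatile int n = 0;
            printf("%d\n", atomic_inc(&n));  /* prints 1 */
            return 0;
        }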



    On the x86, unless you're implying that it isn't a modern processor, atomic operations do not need to be emulated - they are directly supported. See CMPXCHG, etc.



    Quote:

    Originally posted by Programmer

    Synchronization that requires a user/kernel transition is extremely expensive. I was talking about the more efficient user-space kind. It is still expensive, more than most realize.





    The only cross-process (and not just cross-thread) user-space synch primitives I know of come from the FUTEX subsystem in the Linux 2.6 kernel. Please correct me if I'm wrong.



    Quote:

    Originally posted by Programmer

    No, you are confusing virtual memory pages with what I'm talking about -- the internal operation of how the memory controller communicates with physical memory.



    The memory controller also (oddly enough) "controls" memory.





    You were using incorrect terminology, and some of the statements you made were just plain false. Memory controllers do not "open" pages; they do not just have 4 or 8 or whatever pages open, etc.



    You've got to remember that people might be coming across this board in the future and, unless corrected, will take the contents of a post as fact. The wisdom of relying on assertions made in forums as facts is another matter...
  • Reply 40 of 64
    Please correct me if I'm wrong here.



    Apparently we have a discussion here based on 64-bit Apples and Opterons, noting how each DUAL system would or should function using a UNIX-based OS (considering that Longhorn will be UNIX-based).



    IF.... you could compare the architecture and efficiency of both systems, both running on OSX or at least a similar UNIX-based OS, which system would run more efficiently and why?



    Then what effect would the use of dual-core processors have on each of these systems?



    For the sake of argument, I suppose we should compare the CPUs that are currently available.