Why dual processor


Comments

  • Reply 21 of 64
    Quote:

    Originally posted by Zapchud

    Great. Do you have any numbers?





    I have absolutely no Apple-specific numbers (I've never worked for Apple at all!!), but from my experience, a quad mobo costs about $400 while a single mobo costs $150 (I'm not talking about crappy boards here), and a top-end CPU costs $300-350 while a low-performance CPU costs somewhere between $50 and $80.

    These prices are NOT huge-quantity prices (I've never designed systems for high-volume production), but I think they give a good idea anyway.



    BTW, I'm sorry about the "sorry" thing! That didn't belong here...
  • Reply 22 of 64
    slughead Posts: 1,169
    Dual is the best price point for super-powered processors, but for mid-range processors you can get more power for less.



    Three $100 Athlon MP 2000s are way faster than two $145 Athlon MP 2600s.



    Performance/price is never highest at either end of the processor spectrum.



    This is where the future is--not with 1 or 2 uber chips, but with multiple, smaller, low power chips.



    Intel didn't get it, AMD did. That's why Intel is totally 0wn3d.
  • Reply 23 of 64
    Quote:

    Originally posted by The One to Rescue

    I have absolutely no Apple-specific numbers (I've never worked for Apple at all!!), but from my experience, a quad mobo costs about $400 while a single mobo costs $150 (I'm not talking about crappy boards here), and a top-end CPU costs $300-350 while a low-performance CPU costs somewhere between $50 and $80.



    How do you define a top-end CPU? One with clock frequency pumped to spectacular levels, with an average transistor count/die size (like the Xeon or Opteron)? Or perhaps a lower-clocked monster like the POWER5 chips?



    If we use your numbers, a quad processor monster would have these price properties:

    CPUs: $1200

    MoBo: $400

    = $1600



    Or more efficiently, with lower-clocked CPUs:

    CPUs: $800

    MoBo: $400

    = $1200



    I don't know about you, but I don't find the motherboard to be insignificant, especially not if we go with somewhat lower-clocked processors. It's not the biggest factor in the equation, that's for sure. It adds up, though.
  • Reply 24 of 64
    I think the problem is not the cost but that there's no market (or too small a one).
  • Reply 25 of 64
    Yes, there probably is a small (or no) market because of the cost.
  • Reply 26 of 64
    The Daystar Genesis was first a dual and, with added CPU cards, became a quad (for US$12,000). Discontinued in 1996. And, given the deficiencies of the MPX bus, it was a relatively poor bang/buck upgrade for most tasks.



    IBM has some POWER4/5 systems in quad or octo configurations, but they're not cheap.



    There are rumours of quad G5 mules (but you thought the duals were hot/noisy with 9 fans!).



    We are far more likely to see 'quads' as dual dual-core CPUs than as four single G5s.



    As for tri-processor systems, I'd be surprised if that were seamless in a predominantly binary environment, but I suppose one might consider the ApplePi (system controller) ASIC a third 'processor' if you really stretch the definition.
  • Reply 27 of 64
    Quote:

    Originally posted by Zapchud

    How do you define a top-end CPU? One with clock frequency pumped to spectacular levels, with an average transistor count/die size (like the Xeon or Opteron)? Or perhaps a lower-clocked monster like the POWER5 chips?





    Top-end CPU = PPC970-like CPU, here.



    Quote:



    I don't know about you, but I don't find the motherboard to be insignificant, especially not if we go with somewhat lower-clocked processors. It's not the biggest factor in the equation, that's for sure. It adds up, though.




    Agreed! It adds up. But I believe $300 is not a big difference when dealing with high-end workstations. ~$1000 is...
  • Reply 28 of 64
    Quote:

    Originally posted by The One to Rescue

    Top-end CPU = PPC970-like CPU, here.







    Agreed! It adds up. But I believe $300 is not a big difference when dealing with high-end workstations. ~$1000 is...




    When considering G5-based machines, the increased cost of more than two processors is considerable, as a new northbridge would be required. The G5's interface requires the northbridge to do snooping to maintain cache coherence across multiple processors, so the northbridge in a four-processor machine would need four processor interfaces, and so on. The cost of developing a new northbridge for use in only a small number of machines would be prohibitive. On top of this, the memory interface for such a machine would have to be changed to provide enough bandwidth for all the processors. All this extra complexity would also mean a vast number of extra traces on the motherboard, and therefore more layers, again rapidly increasing the cost.



    Xeon-based multiprocessor machines use a bus, and thus can use a single connection to the northbridge for any number of processors. The disadvantage of this technique is that the bus bandwidth does not scale with the number of processors but actually decreases as you add them, since the bus needs to run more slowly to cope with all the discontinuities the processors present on the bus.



    Opteron-based multiprocessors use yet another technique, based on multiple point-to-point links between processors and tunneling through intermediate processors, with memory bandwidth scalable with the number of processors. This, to my mind, is the most elegant solution, and it significantly reduces the cost of adding effective processors, but it is currently limited to a maximum of only eight processors because of addressing limits built into the processors.



    michael
  • Reply 29 of 64
    Doesn't xGrid accomplish the same thing by connecting multiple computers to act as one? In environments where more than one person is networked together, it seems like this would be the best of both worlds... affordable individual computers and the extra performance of multiple processors.
  • Reply 30 of 64
    shawk Posts: 116
    I assembled a G5 dual 2.5GHz with an xServe and two xServe cluster nodes.

    If you need high performance computing, this is a good bet.

    The performance was quite brisk.



    As the result pinned the xGrid tachometer at 8GHz, I estimate the actual xGrid performance to be about 12GHz.

    More cluster nodes will be added.



    Not future hardware, exactly.

    Sometimes the best solution is one that is available.
  • Reply 31 of 64
    Quote:

    Originally posted by jaslu81

    Doesn't xGrid accomplish the same thing by connecting multiple computers to act as one? In environments where more than one person is networked together, it seems like this would be the best of both worlds... affordable individual computers and the extra performance of multiple processors.



    There is a big performance gap between shared-memory machines and network-based grid computing -- both in terms of latency and bandwidth. xGrid is useful for some problems, but many require tighter hardware integration between the processors.





    One issue which I didn't see mentioned (apologies if I missed it) is that adding more processors requires more inter-processor communication and synchronization. This results in diminishing performance returns as you add more processors (indeed, in some pathological cases you'll find that adding processors slows the system down). Multiple cores on a chip usually improve the rate of communication between those cores compared to inter-chip rates, so the effects of this will depend more on the number of chips than on the number of cores.
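

    To put rough numbers on that, the usual back-of-the-envelope model is Amdahl's law: if a fraction p of the work can be spread across n processors, the best possible speedup is 1/((1-p) + p/n). A minimal C sketch (the 90% parallel figure is just an illustrative assumption):

    Code:

        #include <stdio.h>

        /* Amdahl's law: the upper bound on speedup when a fraction p of
           the work parallelizes perfectly across n processors. */
        static double amdahl(double p, int n)
        {
            return 1.0 / ((1.0 - p) + p / n);
        }

        int main(void)
        {
            int n;
            for (n = 1; n <= 8; n *= 2)
                printf("%d processors: %.2fx\n", n, amdahl(0.9, n));
            return 0;
        }

    With 90% parallel work that prints roughly 1.00x, 1.82x, 3.08x and 4.71x for 1, 2, 4 and 8 processors -- exactly the diminishing-returns curve described above.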
  • Reply 32 of 64
    slugheadslughead Posts: 1,169member
    Quote:

    Originally posted by Programmer

    There is a big performance gap between shared-memory machines and network-based grid computing -- both in terms of latency and bandwidth. xGrid is useful for some problems, but many require tighter hardware integration between the processors.





    One issue which I didn't see mentioned (apologies if I missed it) is that adding more processors requires more inter-processor communication and synchronization. This results in diminishing performance returns as you add more processors (indeed, in some pathological cases you'll find that adding processors slows the system down). Multiple cores on a chip usually improve the rate of communication between those cores compared to inter-chip rates, so the effects of this will depend more on the number of chips than on the number of cores.




    yeah but what about the advantages of multi-procs in multithreading?



    And at what point do more processors slow the computer down? What are the dependencies?



    This stuff is incredibly interesting to me
  • Reply 33 of 64
    A Quadra design might surely be overkill for the average home user in both price and complexity; however, for a single-tower workstation configuration, Apple has plenty of competition from Alienware, with just about dual everything available, including onboard RAID.



    For those with even deeper pockets, the XServes cover just about anything you might dream up in the near future.



    Even so, with all this, it's going to be a while before 64-bit software availability catches up with the computing power that we already have.



    Re-writing that software to take advantage of 64-bit Quadra configurations will surely boost the need for much additional training as well.



    The price of that Daystar configuration is relative to the times, when a 300MHz G3 tower was $3,000.00, a CD burner was $1,500, and a DVD burner was still a wet dream.



    Last week I priced out a "dream G5 2.5 system" at well over $12,000, and I'm sure it would blow away the Daystar with ease.



    So to answer your question about "why dual processor," my best response is: why not?
  • Reply 34 of 64
    Quote:

    Originally posted by slughead

    yeah but what about the advantages of multi-procs in multithreading? And at what point do more processors slow the computer down?



    The amount of interaction between threads is a function of what they are doing, the impact this has on performance depends on the hardware and operating system. If the software is multi-threaded but the threads are completely serialized (probably by poor software design) then you will see zero performance advantage by adding processors, and you may lose performance because of having multiple caches. If, on the other hand, the software's threads are completely independent and don't touch the same memory or resources (rarely the case) then multiple processors can give you a linear performance increase (i.e. 2 processors is twice as fast as 1, 3 is three times as fast, etc). The usual case is somewhere in between these two extremes.
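

    Here is a tiny pthreads sketch of those two extremes (the names and iteration counts are made up for illustration): two threads bumping private counters scale almost linearly, while the same threads forced through one mutex are effectively serialized again:

    Code:

        #include <pthread.h>
        #include <stdio.h>
        #include <sys/time.h>

        #define ITERS 10000000L

        static long independent[2];   /* one counter per thread  */
        static long shared;           /* one counter for both    */
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        /* Fully independent threads: no sharing, so two processors
           can give close to a 2x speedup. */
        static void *run_independent(void *arg)
        {
            long *mine = &independent[(long)arg];
            long i;
            for (i = 0; i < ITERS; i++)
                (*mine)++;
            return NULL;
        }

        /* Fully serialized threads: the mutex means only one thread
           makes progress at a time, so a second processor buys nothing. */
        static void *run_serialized(void *arg)
        {
            long i;
            (void)arg;
            for (i = 0; i < ITERS; i++) {
                pthread_mutex_lock(&lock);
                shared++;
                pthread_mutex_unlock(&lock);
            }
            return NULL;
        }

        /* Run two threads of the given function and return elapsed seconds. */
        static double run_pair(void *(*fn)(void *))
        {
            pthread_t t[2];
            struct timeval a, b;
            long i;
            gettimeofday(&a, NULL);
            for (i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, fn, (void *)i);
            for (i = 0; i < 2; i++)
                pthread_join(t[i], NULL);
            gettimeofday(&b, NULL);
            return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
        }

        int main(void)
        {
            printf("independent: %.2fs\n", run_pair(run_independent));
            printf("serialized:  %.2fs\n", run_pair(run_serialized));
            return 0;
        }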



    Quote:



    what are the dependencies?









    Anything that must be shared. If two threads have to write to the same memory location they must synchronize (i.e. agree who goes first); otherwise one could stomp the other's results. In deeply pipelined processors, the cost of this synchronization can be considerable -- if the PPC970 has 200+ instructions in flight and has to scrub what it's doing because somebody else got access first, then it's possible that all those instructions have to be thrown away and done over again (admittedly this is a simplification, but it gives you some idea of what goes on). It is definitely something you want to avoid doing too much of; otherwise you'll quickly find that you spend more time synchronizing than doing real work.
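

    To see the "stomping" concretely, here is a minimal sketch with the synchronization deliberately left out (the counts are arbitrary):

    Code:

        #include <pthread.h>
        #include <stdio.h>

        #define ITERS 1000000L

        static volatile long counter;   /* shared and deliberately unprotected */

        /* Each ++ is really a load, an add, and a store; two processors can
           interleave those steps so one store overwrites the other's result. */
        static void *racer(void *arg)
        {
            long i;
            (void)arg;
            for (i = 0; i < ITERS; i++)
                counter++;
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, racer, NULL);
            pthread_create(&t2, NULL, racer, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            /* On a dual, this almost always prints less than 2000000. */
            printf("%ld\n", counter);
            return 0;
        }

    Run it a few times on a dual and the total comes up short, because increments from the two processors overwrite each other.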



    Accessing the file system, network (or another part of the OS) is another common case -- there is only one disk so they need to take turns. If the system isn't very smart then two threads writing at the same time could take turns at a sector level (usually 4K bytes) and you end up spending most of your time seeking. Having them take longer turns (i.e. writing 400K instead of 4K) means less seeking and therefore much higher performance...



    Consider a really nasty case, such as playing 2 QuickTime movies from a DVD at the same time. Typical DVD seek times are about 100ms (1/10 of a second), and the read rate is about, say, 2 MB/sec. Discs are formatted in 2K blocks so that is the smallest read possible at a hardware level. If there were two threads and they each read 2K, used it, read 2K more, used that, etc (and your OS was pretty dumb) then you might see one thread read 2K from one location, the other thread read 2K from another location, and so on. The reading of a 2K block takes 1ms (1/1000 of a second) at 2MB/sec, but the seek between locations takes 100ms. This means that in 1 second you can do just about 10 seeks + 10 reads, for which you'll get 20K of data. If you were to have one thread read for half a second, then the other thread read for half a second you would have done 2 seeks and spent the rest of the 800ms reading... for a total of about 1.6 MB. So just by changing the order in which you do things you've increased your realized performance with the same hardware from 20K/second to 1.6 MB/sec... an 80x improvement.
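

    The same arithmetic as a throwaway C calculation, using the assumed figures above (100ms seeks, 2MB/sec reads, 2K blocks):

    Code:

        #include <stdio.h>

        int main(void)
        {
            const double seek  = 0.100;               /* 100 ms per seek */
            const double rate  = 2.0 * 1024 * 1024;   /* 2 MB/sec reads  */
            const double block = 2048;                /* 2K per block    */

            /* Interleaved: every 2K block costs a seek plus a ~1ms read. */
            double per_block = seek + block / rate;
            printf("interleaved: %.0f KB/sec\n", (block / per_block) / 1024);

            /* Half-second turns: 2 seeks per second, the rest spent reading. */
            double reading = 1.0 - 2 * seek;
            printf("long turns:  %.2f MB/sec\n", reading * rate / (1024 * 1024));
            return 0;
        }

    It prints about 20 KB/sec for the interleaved case and 1.60 MB/sec for the long turns -- the same 80x gap.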



    Of course this ignores things like whether the second thread can afford to wait half a second before it starts reading, is there anybody else reading at the same time, and other fun stuff that comes along to complicate life -- like how read rates and seek times vary depending on where you are on the disk, how far you are seeking, whether you get any read errors, and simply differences between different drives and disks even of the same type.





    Even without going to the OS or I/O, things are very complex. Multiple processors each have their own caches, and there might be shared caches. In most systems the memory is a shared resource as well. Fetching something from another processor's cache isn't particularly fast, and it can interfere with the other processor's access to its own cache. Main memory can usually only have a limited number of pages open at once (4-8 is common), so if you have multiple processors accessing memory in different locations they can get in each other's way because the memory controller is busy opening and closing these pages as they are needed. In systems where memory is not shared you start to incur more costs when you do need to pass data back and forth between the separate memories. Often all these are fairly minor costs, but they add up... very quickly in machines which do billions of operations per second.
  • Reply 35 of 64
    And hoping that Apple is like-minded on dual cores, this Register article indicates that dual core is just around the corner.



    http://www.theregister.co.uk/2004/11/24/microsoft_dell_amd/
  • Reply 36 of 64
    Hi all - my first post on these forums. Just for the record: I don't have a Mac (boo!), but I am waiting for the G5 PowerBook. Oh, and the imminent announcement by SJ regarding MacOSX on x86.



    Anyway, interesting thread, but there are some inaccuracies.



    Quote:

    Originally posted by Programmer

    ... you may lose performance because of having multiple caches



    You may lose performance if the OS scheduler runs both threads on the same processor; if they run on separate processors, performance will increase because of reduced cache thrashing and fewer context switches.



    Quote:

    Originally posted by Programmer

    If two threads have to write to the same memory location they must synchronize



    They don't have to synchronize - but it is recommended! For certain types of variables ( sizeof(word) ) you can use atomic operations, which do not require synchronization. Synchronization is expensive because of the user-land to kernel-space transition.

    On Unix you can time a command, say "time ls", to find out how much time an app spends in user-land vs kernel time.



    Quote:

    Originally posted by Programmer

    Accessing the file system, network (or another part of the OS) is another common case -- there is only one disk so they need to take turns. If the system isn't very smart then two threads writing at the same time could take turns at a sector level (usually 4K bytes) and you end up spending most of your time seeking. Having them take longer turns (i.e. writing 400K instead of 4K) means less seeking and therefore much higher performance...



    Now, this bit I don't understand why you wrote. Unless you are running an OS written by a 9-year-old, disk I/O is buffered. On a typical machine there will be hundreds of threads running, and they don't need to take "turns".



    Quote:

    Originally posted by Programmer

    Fetching something from another processor's cache isn't particularly fast, and it can interfere with the other processor's access to its own cache.




    Fetching something from another processor's cache is impossible, be it L2 or L1. A challenge in SMP design is when cache coherency is lost. In that case the OS has to flush the cache of the processor with the old data. Expensive.



    Quote:

    Originally posted by Programmer

    Main memory can usually only have a limited number of pages open at once (4-8 is common).




    You must be joking, sir! All of main memory is made up of pages. I think you're confusing it with virtual memory here. If the OS runs out of real RAM it can swap pages to disk as needed, the number of which is limited only by the address space. All of the pages are "open" all of the time. If by "open" you meant residing in main memory, then the number of open pages is 1GB (the amount of RAM in my machine) / 4K (the page size). Of course, the algorithms deciding which pages to swap and when are the subject of ongoing research. See, for example, lkml.org for recent posts on the issue.



    Quote:

    Originally posted by Programmer

    ... so if you have multiple processors accessing memory in different locations they can get in each other's way because the memory controller is busy opening and closing these pages as they are needed.




    The memory controller's job is to move cache lines (the unit of cache swapping) between RAM and the L1/L2 caches. The problem here is not only bus contention but bandwidth as well. AMD Opteron processors come with an onboard memory controller, which eliminates contention.



    As regards the original post: for PCs the reason is simple - cost. Most people are happy enough with their machine's performance. Games today are more sensitive to the GPU anyway - getting a 20% faster CPU will not bring about a 20% FPS increase; overclocking the GPU will. I got myself a dual proc when I was doing some CPU-intensive stuff, and the performance is, of course, better, but not 100% better.
  • Reply 37 of 64
    Those weren't inaccuracies, thank you very much:



    Quote:

    Originally posted by UnixPoet

    You may lose performance if the OS scheduler runs both threads on the same processor; if they run on separate processors, performance will increase because of reduced cache thrashing and fewer context switches.



    No, I meant what I said. If two threads on different cores are using the same cacheline then they end up sending it back and forth across the bus. In the single processor case it stays in the one cache.
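

    The classic demonstration, as a minimal sketch (the 128-byte pad assumes a 970-sized cache line, and that the struct doesn't straddle a line boundary):

    Code:

        #include <pthread.h>
        #include <stdio.h>

        #define ITERS 50000000L

        /* 'a' and 'b' share a cache line; the pad puts 'c' on its own. */
        static struct {
            volatile long a;
            volatile long b;
            char pad[128];      /* assumed cache-line size */
            volatile long c;
        } s;

        static void *bump(void *p)
        {
            volatile long *n = p;
            long i;
            for (i = 0; i < ITERS; i++)
                (*n)++;
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;

            /* Same line: every write forces the cacheline to migrate
               between the two processors' caches. */
            pthread_create(&t1, NULL, bump, (void *)&s.a);
            pthread_create(&t2, NULL, bump, (void *)&s.b);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            /* Separate lines: each processor keeps its own line. */
            pthread_create(&t1, NULL, bump, (void *)&s.a);
            pthread_create(&t2, NULL, bump, (void *)&s.c);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            printf("%ld %ld %ld\n", s.a, s.b, s.c);
            return 0;
        }

    Time the first pair of threads against the second and the difference is dramatic, even though both pairs do exactly the same amount of work.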



    Quote:

    They don't have to synchronize - but it is recommended! For certain types of variables ( sizeof(word) ) you can use atomic operations, which do not require synchronization. Synchronization is expensive because of the user-land to kernel-space transition.

    On Unix you can time a command, say "time ls", to find out how much time an app spends in user-land vs kernel time.



    Atomic operations aren't actually atomic on modern processors -- there is no such thing as a read/modify/write cycle. Go read the PowerPC manual.



    Synchronization that requires a user/kernel transition is extremely expensive. I was talking about the more efficient user-space kind. It is still expensive, more than most realize.



    Quote:

    Now, this bit I don't understand why you wrote. Unless you are running an OS written by a 9-year-old, disk I/O is buffered. On a typical machine there will be hundreds of threads running, and they don't need to take "turns".



    I was merely trying to demonstrate how different factors can affect performance. And threads need to take turns any time there are more of them than available hardware resources.



    Quote:

    Fetching something from another processor's cache is impossible, be it L2 or L1. A challenge in SMP design is when cache coherency is lost. In that case the OS has to flush the cache of the processor with the old data. Expensive.



    Not true -- the PowerPC will do a direct processor-to-processor transfer if one processor holds a cache line that the other wants. It's part of the MERSI protocol. If only the MESI protocol is supported then the situation is worse, because the cacheline has to go back to memory and then from there to the requesting processor.



    Quote:

    You must be joking, sir! All of main memory is made up of pages. I think you're confusing it with virtual memory here. If the OS runs out of real RAM it can swap pages to disk as needed, the number of which is limited only by the address space. All of the pages are "open" all of the time. If by "open" you meant residing in main memory, then the number of open pages is 1GB (the amount of RAM in my machine) / 4K (the page size). Of course, the algorithms deciding which pages to swap and when are the subject of ongoing research. See, for example, lkml.org for recent posts on the issue.



    No, you are confusing virtual memory pages with what I'm talking about -- the internal operation of how the memory controller communicates with physical memory.



    Quote:

    The memory controller's job is to move cache lines (the unit of cache swapping) between RAM and the L1/L2 caches. The problem here is not only bus contention but bandwidth as well. AMD Opteron processors come with an onboard memory controller, which eliminates contention.



    The memory controller also (oddly enough) "controls" memory. It manages the control signals required to fetch the stored data from the RAM chips. Latency is another issue -- how long it takes a memory transaction to complete from the time it is initiated.



    The Opteron's memory controller does a good job of reducing latency... unless you have a dual CPU system with a single operational memory controller. In that case the second processor is still using an off-chip memory controller and there is again contention. And most dual Opterons use that configuration, as opposed to a system with multiple operational memory controllers and banks of memory (in which case the OS will need to figure out how to move which virtual memory pages to which bank of memory in order to optimize access).
  • Reply 38 of 64
    slugheadslughead Posts: 1,169member
    AHHH!! MY BRAIN!





    You guys rock.





    Every time I think I know a lot about computers, I get mentally served
  • Reply 39 of 64
    Quote:

    Originally posted by Zapchud

    This has been, and still is, something of a "chicken and egg" problem.




    Not at all!



    Quote:

    Originally posted by Zapchud

    If there are few processors capable of taking advantage of multi-threaded code, why multi-thread code? If there is little multi-threaded code out there, why bother implementing more cores on the chip?




    Programmers use multi-threaded code not just to take advantage of multiple CPUs but also because such a design is cleaner/more elegant/logical. Multi-threaded apps work fine on single-CPU systems. There is more multi-threaded code out there than you think. All of the CPU-heavy apps use threads: Photoshop, Max, Lightwave, most games (oops! this is a Mac forum, sorry ). Hell, even Word does.



    Quote:

    Originally posted by Programmer

    If two threads on different cores are using the same cacheline then they end up sending it back and forth across the bus. In the single processor case it stays in the one cache.





    If two threads are heavily accessing the same cache-line then the code was probably written badly, and they should be using synchronization anyway.

    Maybe they are using a global variable? And if the accesses are reads there will be no impact.



    Quote:

    Originally posted by Programmer

    Atomic operations aren't actually atomic on modern processors -- there is no such thing as a read/modify/write cycle. Go read the PowerPC manual.



    Oh dear! Atomic operations are called such because, even if they are converted into multiple micro-ops (on a CISC CPU), the operation is not interruptible and is guaranteed to finish. It's an all-or-nothing operation - think of it as a transaction.



    On the PowerPC arch you need to use lwarx and stwcx, which, according to http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixassem/alangref/lwarx.htm#HDRC050214982JEFF can be used for "atomically loading and replacing a word in storage". The two ops basically define transaction boundaries, so from the point of view of the OS/application, taken together, they are atomic.
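

    For anyone following along, the canonical lwarx/stwcx. sequence looks something like this sketch of an atomic increment (GCC inline-assembly syntax and PowerPC-only, obviously; the constraint details are from memory, so check them against the manual before trusting them):

    Code:

        #include <stdio.h>

        /* Sketch: atomic increment via lwarx/stwcx. Returns the new value. */
        static int atomic_inc(volatile int *addr)
        {
            int tmp;
            __asm__ __volatile__(
                "1: lwarx   %0,0,%1  \n"  /* load word and set a reservation  */
                "   addi    %0,%0,1  \n"  /* modify the value in a register   */
                "   stwcx.  %0,0,%1  \n"  /* store only if reservation held   */
                "   bne-    1b       \n"  /* lost the reservation: start over */
                : "=&r"(tmp)
                : "r"(addr)
                : "cr0", "memory");
            return tmp;
        }

        int main(void)
        {
            volatile int n = 0;
            printf("%d\n", atomic_inc(&n));  /* prints 1 */
            return 0;
        }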



    On the x86, unless you're implying that it isn't a modern processor, atomic operations do not need to be emulated - they are directly supported. See CMPXCHG, etc.



    Quote:

    Originally posted by Programmer

    Synchronization that requires a user/kernel transition is extremely expensive. I was talking about the more efficient user-space kind. It is still expensive, more than most realize.





    The only cross-process (and not just cross-thread) user-space synch primitives I know of come from the FUTEX subsystem in the Linux 2.6 kernel. Please correct me if I'm wrong.



    Quote:

    Originally posted by Programmer

    No, you are confusing virtual memory pages with what I'm talking about -- the internal operation of how the memory controller communicates with physical memory.



    The memory controller also (oddly enough) "controls" memory.





    You were using incorrect terminology, and some of the statements you made were just plain false. Memory controllers do not "open" pages; they do not just have 4 or 8 or whatever pages open, etc.



    You've got to remember that people might be coming across this board in the future and, unless corrected, will take the contents of a post as fact. The wisdom of relying on assertions made in forums as facts is another matter...
  • Reply 40 of 64
    Please correct me if I'm wrong here.



    Apparently we have a discussion here based on 64-bit Apples and Opterons, noting how each DUAL system would or should function using a UNIX-based OS (considering that Longhorn will be UNIX-based).



    IF.... you could compare the architecture and efficiency of both systems, both running on OSX or at least a similar UNIX-based OS, which system would run more efficiently and why?



    Then what effect would the use of dual-core processors have on each of these systems?



    For the sake of argument, I suppose we should compare the CPUs that are currently available.