The number of sockets on the motherboard isn't particularly important. A quad socket board will typically be less efficient than a single socket quad core machine. It will also be more expensive. So why bother talking about the number of sockets/chips, when what we really care about is the number of processing cores?
I would be surprised if Apple ever went to more than 2 sockets on a single motherboard. The system controller(s) and FSBs just get too big and expensive. The chips coming in the next few years will move to dual core, and then beyond. The existing 970FX is ~60 million transistors. The 970MP reportedly doubles the cache per core, which probably pushes each core up to something like 80-90 million transistors, for a chip total of 160-180 million (about the same as the original POWER4's 170 million), and about the same area as the original 970 on 130nm. That means that if they can get the 90nm yields under control (and reports are that they are), then the price of this 2-core chip will be roughly the same as the price of the original 970. Going to 65nm will allow this to double to four 970FX-class cores on one chip for the same cost, and IBM is claiming that they'll be able to do that by the end of 2005 / early 2006.
Posted by Henroik:
How does SMT work? I understand that it'll enhance the performance of a processor by around 30% or so by doing two threads simultaneously. These two threads... is one thread getting ~100% and the other ~30%, or is it that each thread gets ~65%?
If I was doing some heavy single-threaded stuff, I wouldn't like it to be stuck with only 30% or 65%. Will there be a way to disable SMT on a per-thread basis?
The first thing you have to understand in order to understand SMT is that modern processors are quite often idle, even when running flat out. This is because they spend a lot of time waiting for things, and because they have many long pipelines. Imagine the machine as a 2D grid with execution units along one axis and pipeline stages along the other. Each of the boxes in the grid holds an instruction that is part way through its execution. In some boxes the instruction needs information obtained from somewhere else (a register, the cache, another instruction, etc.). If that information is not available then it cannot advance to the next stage in the pipeline, and as the clock ticks forward an empty box appears in the grid -- a "bubble". This bubble represents a little bit of inactivity. Imagine what the grid would look like if each instruction got held up 50% of the time... the whole grid would have bubbles strewn about, with less than half the boxes holding actual instructions. Each of the empty boxes represents work that could potentially have been done, but failed to happen because of something the thread needed to wait for.
The idea behind SMT is to introduce an additional thread (or threads, in more extreme designs) which shares this grid with the first thread, but has its own work to do and its own registers to hold its information. If this thread had its own processor then it would have its own grid, but instead it shares the grid with the first thread. The second thread's filled little boxes then get intermingled with the first thread's empty boxes, and in a perfect world the whole grid is filled up. In reality the grid merely ends up fuller if both threads are inefficient, and the threads contend for boxes if both are too efficient. Either way, the goal of increasing the net amount of work done by the available execution units is achieved.
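To make the bubble-filling intuition concrete, here's a toy Python sketch of a single issue slot. Each cycle, every thread is independently ready with some probability; the slot is wasted only if all of them stall at once. The 50% ready-probability is a made-up illustration, not a measurement of any real chip, and real SMT sharing is far more complicated than this.

```python
import random

def simulate(cycles, threads, ready_prob, seed=0):
    """Fraction of cycles in which one issue slot does useful work.

    Each cycle, every thread is independently ready with probability
    ready_prob; the slot goes unused (a bubble) only if all threads stall.
    """
    rng = random.Random(seed)
    busy = 0
    for _ in range(cycles):
        if any(rng.random() < ready_prob for _ in range(threads)):
            busy += 1
    return busy / cycles

one = simulate(100_000, threads=1, ready_prob=0.5)  # bubbles half the time
two = simulate(100_000, threads=2, ready_prob=0.5)  # second thread fills many
print(f"1 thread: {one:.2f}, 2 threads: {two:.2f}")
```

With these made-up numbers, utilization rises from roughly 50% to roughly 75%: the second thread recovers about half of the first thread's bubbles, which is the general flavour of the gains mentioned in the question.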
How the two threads interact is a function of the processor hardware. In the IBM POWER5 each thread is assigned a number from 0-31, and an instruction group (1-5 instructions) is dispatched from one of the two threads each cycle depending on the relative priority numbers of the threads. If they have equal numbers they trade back and forth. If one has double the number of the other, then it dispatches two groups for each of the other's. By setting a thread to the maximum value it will get all of the instruction groups, starving the other. If a thread's turn to dispatch instruction groups comes up, but it can't because it is stalled, then the other gets an extra kick at the can. [note: this is actually just a very rough approximation of the actual scheme implemented].
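The dispatch scheme above can be sketched as a weighted round-robin. This is my own toy model of the "double the number gets two groups per one" behaviour, not IBM's actual dispatch logic (which, as noted, is only roughly approximated here):

```python
def dispatch_counts(priorities, cycles):
    """Toy weighted round-robin: each thread's share of dispatch slots
    is proportional to its priority number; a priority of 0 is starved."""
    total = sum(priorities)
    credit = [0] * len(priorities)
    counts = [0] * len(priorities)
    for _ in range(cycles):
        for i, p in enumerate(priorities):
            credit[i] += p                 # accumulate credit each cycle
        winner = max(range(len(priorities)), key=lambda i: credit[i])
        if credit[winner] > 0:             # a zero-priority thread never wins a slot
            counts[winner] += 1            # dispatch one instruction group
            credit[winner] -= total        # pay for the slot
    return counts

print(dispatch_counts([2, 1], 300))  # → [200, 100]: double the priority, double the groups
print(dispatch_counts([1, 0], 10))   # → [10, 0]: the zero thread is starved
```

Equal priorities make the threads trade back and forth one group at a time, matching the behaviour described above.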
How does all this work out in practice? That depends on how the software and hardware interact. If your multi-threaded software spends most of its time waiting for memory to arrive in the cache and then be transferred to a register, then it is entirely possible that the POWER5's SMT will double the speed of your application. If you are completely computationally bound and have no stalls in your code at all (very rare), then running a second thread will only slow down your computationally bound one. Fortunately, as I stated above, IBM's SMT lets a software author tell the hardware how much priority his thread should get. There are enough diagnostics in these processors for the OS to monitor a thread's behaviour and adjust its priority automagically in cases where the author hasn't explicitly done so.
Tasks switching processors is usually a bad thing because, as you observed, it kills the caches. This might not matter in some cases because the cache would get blown by the other task being timesliced in, but some OSes implement "processor affinity" to reduce the problem. IIRC, Darwin implements some level of this. SMT somewhat reduces the problem because the caches are at least partially shared between the threads of the same core. Multi-core designs reduce the problem a little as well since communication between the cores is usually at the chip's clock rate and they might share cache as well.
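On the affinity point, Linux happens to expose this directly and Python wraps it, so here's a minimal sketch of pinning a task to one core so its working set stays in that core's caches. This is Linux-only and purely illustrative -- Darwin does not expose this particular call, so don't take it as how Mac OS X's affinity support works.

```python
import os

# Linux-only: ask which CPUs this process may run on, pin it to CPU 0,
# then restore the original mask. Pinning keeps the task's cache-warm
# data on one core instead of letting it migrate and start cold.
original = os.sched_getaffinity(0)   # 0 means "this process"
os.sched_setaffinity(0, {0})         # run only on CPU 0
print(os.sched_getaffinity(0))       # now restricted to {0}
os.sched_setaffinity(0, original)    # undo the pinning
```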