Some interesting things I found while doing a little reading to understand NUMA.
This seems to imply that a really fast bus will be needed for a NUMA architecture to work well.
" target="_blank">Article about AMD's use of NUMA</a>
[quote] Advanced Micro Devices will drop its current Athlon and Duron EV6 bus architecture for its upcoming 64-bit Hammer-series processors to allow for connecting large multiprocessing arrays, the Platform Conference was told Monday.
Bob Mitton, AMD marketing manager for workstations and servers, told the meeting here that the 64-bit processors will use a new NUMA (Non-Uniform Memory Access) bus which can link eight-way or more MPUs for high performance multiprocessing. NUMA uses AMD's projected LDT (Lightning Data Transport) controller to handle both the Northbridge memory and Southbridge I/O buses in an array of processors, he said.
Mitton asserted that NUMA is highly scalable and allows each processor to have full access to the processor bus bandwidth.
By contrast, he claimed Intel Corp.'s new IA-64 architecture for Itanium and the follow-on McKinley processors have a shared processor bus that divides the bandwidth among all the processors.
He conceded that in the NUMA scheme a CPU accessing memory at the far end of the multiprocessor array must travel farther to fetch data than on a shared bus, but claimed the much-faster LDT offsets any potential delay. [/quote]
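The bandwidth-versus-latency trade-off Mitton describes can be put in rough numbers. Here is a toy model (all figures are my own illustrative assumptions, not AMD or Intel specs): on a shared bus, N CPUs split one bus's bandwidth, while in a NUMA design each CPU keeps full bandwidth to its local memory but pays extra latency per hop for remote accesses.

```python
# Toy model of the claims above. Every number here is an assumption
# for illustration only, not a real hardware figure.

def shared_bus_bw(total_bw_gbs, n_cpus):
    """Per-CPU bandwidth when one bus is shared by all CPUs."""
    return total_bw_gbs / n_cpus

def numa_avg_latency(local_ns, hop_ns, max_hops, local_fraction):
    """Average memory latency when local_fraction of accesses hit local
    memory and remote accesses average half the maximum hop distance."""
    remote_ns = local_ns + hop_ns * (max_hops / 2)
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

if __name__ == "__main__":
    # Eight CPUs sharing one hypothetical 6.4 GB/s bus: 0.8 GB/s each.
    print(shared_bus_bw(6.4, 8))
    # NUMA: if most traffic stays local, average latency stays near the
    # local cost even though far-end fetches are slower.
    print(numa_avg_latency(local_ns=100, hop_ns=40, max_hops=3,
                           local_fraction=0.9))
```

The model makes Mitton's point concrete: the shared bus penalizes every access equally as CPUs are added, while NUMA only penalizes the remote fraction, which a fast interconnect (and good data placement) keeps small.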
" target="_blank">An explanation of NUMA</a>
This seems to imply that a significant tweak of the UNIX architecture would be needed to adapt to a NUMA architecture (so a chip change like programmer spoke of may well coincide with an OS update).
quoted from: <a href="http://www-flash.stanford.edu/Hive/papers/SIGMETRICS95/abstract.html" target="_blank">http://www-flash.stanford.edu/Hive/papers/SIGMETRICS95/abstract.html</a> (Google cache)
[quote] Abstract: Memory System Performance of UNIX on CC-NUMA Multiprocessors
This study characterizes the performance of a variant of UNIX SVR4 on a large shared-memory multiprocessor and analyzes the effects of possible OS and architectural changes. We use a nonintrusive cache miss monitor to trace the execution of an OS-intensive multiprogrammed workload on the Stanford DASH, a 32-CPU CC-NUMA multiprocessor (CC-NUMA multiprocessors have cache-coherent shared memory that is physically distributed across the machine). We find that our version of UNIX accounts for 24% of the workload's total execution time. A surprisingly large fraction of OS time (79%) is spent on memory system stalls, divided equally between instruction and data cache miss time. In analyzing techniques to reduce instruction cache miss stall time, we find that replication of only 7% of the OS code would allow 80% of instruction cache misses to be serviced locally on a CC-NUMA machine. For data cache misses, we find that a small number of routines account for 96% of OS data cache stall time. We find that most of these misses are coherence (communication) misses, and larger caches will not necessarily help. After presenting detailed performance data, we analyze the benefits of several OS changes and predict the effects of altering the cache configuration, degree of clustering, and cache coherence mechanism of the machine. [/quote]
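The abstract's percentages can be chained together to estimate how much total runtime OS-code replication might recover. The sketch below uses the paper's figures (24% OS time, 79% of it stalls, split evenly between instruction and data misses, 80% of instruction misses made local) plus one assumption of my own: that a remote miss costs some fixed multiple of a local one.

```python
# Back-of-envelope arithmetic from the abstract's numbers. The
# remote_to_local_ratio is an illustrative assumption, not from the paper.

def replication_savings(os_frac, stall_frac, icache_share,
                        local_miss_frac, remote_to_local_ratio):
    """Fraction of total execution time recovered by replicating OS code
    so that local_miss_frac of instruction misses are serviced locally."""
    os_stall = os_frac * stall_frac          # OS memory-stall share of runtime
    icache_stall = os_stall * icache_share   # instruction-miss portion of it
    # A locally serviced miss costs 1/ratio of its former remote cost.
    return icache_stall * local_miss_frac * (1 - 1 / remote_to_local_ratio)

if __name__ == "__main__":
    # 24% OS time, 79% stalls, half instruction misses, 80% made local,
    # remote misses assumed 3x the cost of local ones.
    print(round(replication_savings(0.24, 0.79, 0.5, 0.8, 3.0), 4))
```

Under these assumptions, replicating 7% of the OS code would shave roughly 5% off total execution time, which helps explain why the paper treats such a small code change as worthwhile on CC-NUMA hardware.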