VMX 2


Comments

  • Reply 101 of 114
    wizard69 Posts: 13,377
    Quote:

    Originally posted by Programmer

    Changing the instruction set is very disruptive to the market. Intel gets away with it because they sell such huge volumes. The PowerPC guys can afford less disruption so the potential gain had better be huge in order to try it.







    Changing the instruction set is disruptive; additions to the instruction set are far less so. Even the G5 introduces changes that impact its ability to execute existing code, but even here one learns to work around them.



    Additional instructions would be far less of an issue than the compatibility concerns of the G5. You are correct, though, that it is silly to implement anything that does not have the potential to deliver a large gain.



    Quote:

    Exactly... there is less scheduling opportunity. Perhaps looking at the problem in an extreme fashion makes it a bit more obvious -- why not extend the vector registers to 2560 bits, surely that would make it 20 times faster than it currently is? Even if you ignore the implementation cost and the context switching cost, when you start looking at the limitations forced on algorithms trying to use these registers you realize that you've lost a lot of flexibility.



    Well, that is going to the other extreme! 256 bits is doable if the cache line is that wide and transfers to and from cache can be handled at that width. If this facility exists at all it will be implemented on a new core using a new process technology, so I do not believe it is going to be a big implementation challenge.

    I'm hardly privy to what the current vector code base implements overall, but I can imagine that many routines would make good use of a doubling in register width. All I can see is a gain in flexibility; you still have all the old features of the unit.

    Quote:

    Except that each execution unit is significantly more expensive if it's wider, the registers are more expensive, and the internal buses between them and to the cache must be widened as well.



    I do not see, in the case of a SIMD unit, the execution units becoming more expensive relative to implementing parallel units. In fact I see the resulting units being comparable, with the wide unit slightly simpler. SIMD units work on parts of the register in parallel; there is no need for extremely wide adders, multipliers and such. The permute unit would be another story, though.
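
    To put that in concrete terms, here is a plain C model of lane-wise addition (the type and function names are mine, purely illustrative):

        /* A 128-bit SIMD add is four independent 32-bit adds; no carry
           propagates between lanes, so no 128-bit-wide adder is needed. */
        typedef struct { unsigned int lane[4]; } v128;

        static v128 simd_add(v128 a, v128 b)
        {
            v128 r;
            for (int i = 0; i < 4; i++)
                r.lane[i] = a.lane[i] + b.lane[i]; /* each lane stands alone */
            return r;
        }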



    The cache line width and internal bus size will increase anyway, even if VMX is left untouched. These are really non-issues.

    Quote:

    The performance of the Pentium 4 casts serious doubt on that statement... and VMX already has 4 times as many registers as that. The optimal number of rename registers is a function of pipeline depth and width (i.e. number of in-flight instructions). If you can afford to increase the in-flight instruction count then you can certainly afford to increase the dispatch rate. Currently it's only in groups of 5 -- going to 2 or 3 groups of 5 per cycle would not be a stretch (in terms of expense) if the execution hardware was there to back it up. This is pointless unless you have minimal dependencies (e.g. programming with a larger "virtual" vector width), or you've got SMT. If you actually have wider registers then this is only to your advantage if you're working in the middle of a large vector (which is less often, because your boundary zones are larger).



    I'm not sure what you are trying to say here, possibly that the Pentium 4 does not perform as well as it could? It would be interesting to learn how the Centrino manages its excellent performance relative to the P4; this might provide some insight into what is feasible design-wise.



    So you argue for large virtual vector widths or SMT; well, the vectors are already wide, so wide registers fit the problem you describe. As to SMT, it will be a problem if you do not address the data movement issues. Running a thread through a pipe that is already running at maximum is not going to benefit anyone. It is my contention that much of the code run through a vector unit does saturate it, if only for brief periods, so conventional thread scheduling should work just as well. The evidence is there that SMT is a mixed bag on Intel hardware; there is every reason to believe that similar anomalies will show up on the PPC.





    Quote:

    I completely disagree on this one. VMX is spending lots of time waiting for memory which means there are lots of bubbles in the pipeline that can be filled up by other threads. Some vector algorithms and many non-vector algorithms are not bandwidth bound, or live nicely in the caches ... these will fill in the bubbles while waiting for memory.



    Well, we both agree on that first statement! The problem then becomes this: if you can't get the data in and out fast enough for one thread, how will we be able to do that with two or more threads? Due to the nature of SIMD code both of those threads are likely to be going to main memory, so you're likely to have contention all the way to the DRAM chip.

    I suppose it is possible to have two threads of vector code operating on unrelated data at the same time, but I do not see this as the norm.



    I still believe at this point that having parallel execution units operating on 128-bit data will produce slightly more contention and throughput issues than 256-bit data. With two execution units and 128-bit data, the data transfers would have to be interleaved, so you really can't look at the two implementations as producing the same results.

    Quote:

    As for instruction bandwidth, most vector code consists of fairly tight loops that fit easily into the L1 I-cache. The instruction bandwidth from I-cache is simply not an issue, so dispatching at 2 or 3 times the current rate ought not to be difficult to manage. Nobody does it yet because of the reasons I described above.







    Right, which supports my position that there is no need to improve the vector element to instruction ratio at the cost of larger registers, more expensive context switching, fragmenting the ISA, and giving up flexibility by forcing longer vectors.



    I'm not sure where the idea comes from that one is giving up flexibility; we are gaining additional capability. As far as the expense of context switches goes, parallel units add their own concerns with instruction completion, and you still have additional registers implemented in hardware. Maybe rename registers and buffers are not "real", but they nonetheless contribute to the transistor count.



    Besides all of the above, this is not a processor designed for realtime systems anyway. Considering that this feature, if it exists, will be implemented on much faster hardware, the impact on a context switch may not even be seen.

    Quote:



    I stand by my view that unless 64-bit float (and integer) support in the vector unit is extremely compelling, it would not be worth widening the registers to 256-bits. Personally I don't think it is that compelling and would rather see the FPU hardware beefed up. If VMX2 really comes along I would rather see it add new operations (dot-product, cross-product, XYZW swizzle, etc) on the existing registers, instead of widening the registers. I guess we'll have to wait and see if IBM thinks the same way.



    Well, if we drop the doubles and 64-bit integers, you would still stand to gain for some applications. I really don't know where the original register width was defined, but I have to think that it was selected due to transistor allocation and cache and main memory performance. The capabilities of all of these have steadily increased, so at some time in the future it will make sense to reevaluate how best to accelerate the VMX facility. I believe register width adjustment is a viable feature and will likely be used in conjunction with other enhancements.



    Everybody would like to see the FPU beefed up, but at some point diminishing returns will be reached. For integer and FPU operations that are not obviously vector based you have to resort, at some point, to clock rate increases or SMP. Fortunately, the 970 has a great deal of room for improvements to its execution units. I'm a big believer in SMP, though, and would rather see dual-core dies before a huge explosion in core size for a single processor.



    If the core size were to be enlarged by a great deal, then I would have to agree that additional capabilities in the VMX unit are a good place for those enhancements. Well, that and another integer unit. I would much prefer these over SMT, but who knows, maybe IBM's SMT implementation will be that much better. Now the interesting question is whether those enhancements would benefit from wider registers. I'm sure both IBM and Apple are running or have run simulations to answer some of the questions that pop up in this thread. We as customers probably won't know the answers for a couple of years - that hurts. I do know one thing: VMX is a feather in Apple's cap and a great marketing feature, and they will do whatever is required to keep VMX, or whatever it is called, at the forefront performance-wise. The next couple of years should be very interesting at Apple!



    Thanks

    dave
  • Reply 102 of 114
    programmer Posts: 3,503
    Quote:

    Originally posted by wizard69

    Changing the instruction set is disruptive; additions to the instruction set are far less so. Even the G5 introduces changes that impact its ability to execute existing code, but even here one learns to work around them.







    The differences in the current G5 are very minor and it is easy to write 32-bit code that works on both the G4 and G5. VMX2 code would only work on G5 processors and probably wouldn't work on any lower-power versions. You'd be back in the G3/G4 situation again, just after finally getting out of it.



    Quote:



    Well, that is going to the other extreme!



    I'm hardly privy to what the current vector code base implements overall, but I can imagine that many routines would make good use of a doubling in register width. All I can see is a gain in flexibility; you still have all the old features of the unit.





    The "other" extreme? No, I just took the widening of the vector register to a ridiculous degree -- it is the same direction as going 256-bits. I was trying to point out how inflexible programming with this unit would be. I clearly get the impression that you haven't written any VMX code before. VMX code is often a struggle to get the data aligned and into a place where you can operate on the full vector width, and increasing this width can make the struggle worse.



    Quote:



    I do not see, in the case of a SIMD unit, the execution units becoming more expensive relative to implementing parallel units. In fact I see the resulting units being comparable, with the wide unit slightly simpler. SIMD units work on parts of the register in parallel; there is no need for extremely wide adders, multipliers and such. The permute unit would be another story, though.



    The cache line width and internal bus size will increase anyway, even if VMX is left untouched. These are really non-issues.




    Making these things wider is not automatic in new processor designs. There are three components -- speed, width, and number. How many buses to how many execution units, how wide are they, and how fast are they? Designs are always tradeoffs between these three and their cost. Modern buses, for example, have gone narrower and much faster. Wider is very expensive in terms of board real estate, and the same is true on-chip irrespective of the process used.



    Quote:



    I'm not sure what you are trying to say here, possibly that the Pentium 4 does not perform as well as it could? It would be interesting to learn how the Centrino manages its excellent performance relative to the P4; this might provide some insight into what is feasible design-wise.




    No, I was saying that the x86 chips (Pentium, Centrino, Athlon, etc) are doing pretty darn well with only 8 of each kind of register. PowerPC has 32 of each. Increasing the amount of architected register space is not necessary to improve performance.



    Quote:



    So you argue for large virtual vector widths or SMT; well, the vectors are already wide, so wide registers fit the problem you describe.









    "Wide" is not a binary thing. Registers are not "wide" or "narrow"; 128-bits is not equivalent to 256-bits. The current vector width is a balance between how much work a given instruction represents, and how hard it is to work in terms of so much data at once.



    Quote:



    As to SMT, it will be a problem if you do not address the data movement issues. Running a thread through a pipe that is already running at maximum is not going to benefit anyone. It is my contention that much of the code run through a vector unit does saturate it, if only for brief periods, so conventional thread scheduling should work just as well. The evidence is there that SMT is a mixed bag on Intel hardware; there is every reason to believe that similar anomalies will show up on the PPC.




    Intel's first shot at it was hacked onto the existing processor. Their next attempt, and IBM's first, will do considerably better. IBM, in particular, has said that you can choose on a per-processor (or per-thread, it's not clear) basis how many threads you want active on the core at a time. Even ignoring that, however, not all code on a processor is using the vector unit at the same time, SMT isn't necessarily restricted to all threads in the same process, and not all vector algorithms are completely memory bound. Even if they are, not all memory spaces are equivalent, especially in future machines. In a NUMA architecture, for example, some algorithms will stall worse than others because the memory access speeds are non-uniform. SMT provides flexible dynamic scheduling between arbitrary threads.



    Quote:

    Due to the nature of SIMD code both of those threads are likely to be going to main memory, so you're likely to have contention all the way to the DRAM chip. I suppose it is possible to have two threads of vector code operating on unrelated data at the same time, but I do not see this as the norm.







    If all threads were bandwidth-bound SIMD code running from the same source, then you'd either run some other non-bandwidth-bound code at the same time or just run the SIMD code in isolation. That doesn't mean all the other cases aren't going to happen, and they'll probably be more common than the single bandwidth-bound thread case. Look how many tasks are running on any given MacOS X box even when you've only got one app open.



    Quote:



    I still believe at this point that having parallel execution units operating on 128-bit data will produce slightly more contention and throughput issues than 256-bit data. With two execution units and 128-bit data, the data transfers would have to be interleaved, so you really can't look at the two implementations as producing the same results.





    In a single thread the only difference is that you had to issue 2 instructions to work on 256 bits of data rather than one on a wider register. And they don't interfere with each other more, because instead of building one 256-bit wide data path, you build two 128-bit data paths that are independent. I don't know how to explain it better than that.
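
    A sketch of the two-instruction version (AltiVec intrinsics; the loop is illustrative). The two vec_adds have no dependency on each other, so two 128-bit units can retire 256 bits per cycle just as one double-width unit would:

        #include <altivec.h>

        /* process 256 bits per iteration as two independent 128-bit adds */
        void add_arrays(const vector float *a, const vector float *b,
                        vector float *c, int n /* in vectors, assumed even */)
        {
            for (int i = 0; i < n; i += 2) {
                c[i]     = vec_add(a[i],     b[i]);     /* these two can */
                c[i + 1] = vec_add(a[i + 1], b[i + 1]); /* dual-issue    */
            }
        }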



    Quote:



    I'm not sure where the idea comes from that one is giving up flexibility; we are gaining additional capability. As far as the expense of context switches goes, parallel units add their own concerns with instruction completion, and you still have additional registers implemented in hardware. Maybe rename registers and buffers are not "real", but they nonetheless contribute to the transistor count.








    If you issue one instruction that operates on 256-bits, what do you do if you only need to operate on 128-bits? That is flexibility in a nutshell (and there are many ramifications).



    The context switch is independent of the number of execution units. All that must be saved is the current architected machine state. If you change the register widths, then you've increased the context switch cost. That's why a context switch on the 970 is more expensive in 64-bit mode than it is in 32-bit mode, but the 32-bit mode cost is the same as it was on the G4. All the non-architected stuff is dealt with "behind the scenes" because the programming model doesn't expose it to the user.
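
    As a rough sketch of what "architected machine state" means for the vector unit (a hypothetical layout, not any OS's actual save area):

        /* Only architected state must be saved on a context switch; rename
           registers and other machinery stay behind the scenes. Doubling
           the register width doubles the dominant vr[] term below. */
        typedef struct {
            unsigned char vr[32][16]; /* 32 x 128-bit vector registers: 512 bytes */
            unsigned int  vrsave;     /* VRSAVE: bitmask of live vector registers */
            unsigned int  vscr;       /* vector status and control register */
        } vmx_context;                /* 256-bit registers would push vr[] to 1 KB */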



    Quote:



    Besides all of the above, this is not a processor designed for realtime systems anyway. Considering that this feature, if it exists, will be implemented on much faster hardware, the impact on a context switch may not even be seen.





    This is a complete cop-out. It will work in realtime systems just fine, and near-realtime is an important problem domain. Context switching costs are a percentage of your performance, so making the processor faster doesn't factor into it... it just means you get to make more context switches per unit time. A context switch isn't just a pre-emptive multitasking switch; it includes function calls and synchronization hand-offs between threads.



    Quote:

    I really don't know where the original register width was defined, but I have to think that it was selected due to transistor allocation and cache and main memory performance.



    Usability also factored into it heavily.



    Quote:

    Everybody would like to see the FPU beefed up, but at some point diminishing returns will be reached



    The same holds true of the vector unit.



    Quote:



    If the core size were to be enlarged by a great deal, then I would have to agree that additional capabilities in the VMX unit are a good place for those enhancements. Well, that and another integer unit. I would much prefer these over SMT, but who knows, maybe IBM's SMT implementation will be that much better.




    Much better. So will Intel's next attempt at it. Don't let the first generation attempt at SMT discourage you -- the first generation superscalar processors were pretty lame as well.



    Quote:

    I do know one thing: VMX is a feather in Apple's cap and a great marketing feature, and they will do whatever is required to keep VMX, or whatever it is called, at the forefront performance-wise. The next couple of years should be very interesting at Apple!



    We certainly agree there. This discussion is really just about what VMX enhancements are worthwhile and effective, not whether any should be done at all.
  • Reply 103 of 114
    wizard69 Posts: 13,377
    Quote:

    Originally posted by Programmer




    This has been a very interesting discussion, but I'm still reluctant to accept your point of view.



    Quote:

    The differences in the current G5 are very minor and it is easy to write 32-bit code that works on both the G4 and G5. VMX2 code would only work on G5 processors and probably wouldn't work on any lower-power versions. You'd be back in the G3/G4 situation again, just after finally getting out of it.



    Yes, I understand this, but we do have to move forward. Even later on you mention that you would like to see an expanded VMX instruction set; wouldn't this expansion of VMX put us into the same position again?



    I'm all for improvements to the capabilities of the processor as long as they are not half-baked ideas like those that have been offered in the x86 market. It is obvious that a great deal of thought went into the original AltiVec implementation; it is reasonable to believe that VMX can be extended again in a rational manner.



    Quote:

    The "other" extreme? No, I just took the widening of the vector register to a ridiculous degree -- it is the same direction as going 256-bits. I was trying to point out how inflexible programming with this unit would be. I clearly get the impression that you haven't written any VMX code before. VMX code is often a struggle to get the data aligned and into a place where you can operate on the full vector width, and increasing this width can make the struggle worse.







    Making these things wider is not automatic in new processor designs. There are three components -- speed, width, and number. How many buses to how many execution units, how wide are they, and how fast are they? Designs are always tradeoffs between these three and their cost. Modern buses, for example, have gone narrower and much faster. Wider is very expensive in terms of board real estate, and the same is true on-chip irrespective of the process used.







    No, I was saying that the x86 chips (Pentium, Centrino, Athlon, etc) are doing pretty darn well with only 8 of each kind of register. PowerPC has 32 of each. Increasing the amount of architected register space is not necessary to improve performance.



    Yep, there is a lot of room to improve VMX; I think everybody can agree on that. I will even accept that another execution unit will help to an extent, but I still have a hard time believing that you will be moving data in and out as fast as you would with wider registers.

    Quote:






    "Wide" is not a binary thing. Registers are not "wide" or "narrow"; 128-bits is not equivalent to 256-bits. The current vector width is a balance between how much work a given instruction represents, and how hard it is to work in terms of so much data at once.



    Well, part of my argument is that a better balance would be achieved by wider registers - that is, 256-bit instead of 128-bit. I'm still wondering why you suspect that 256-bit registers make life harder; you would still have all the existing capabilities of the VMX unit. You would simply take advantage of the wider registers when the problem you are solving can take advantage of them.



    Quote:

    Intel's first shot at it was hacked onto the existing processor. Their next attempt, and IBM's first, will do considerably better. IBM, in particular, has said that you can choose on a per-processor (or per-thread, it's not clear) basis how many threads you want active on the core at a time. Even ignoring that, however, not all code on a processor is using the vector unit at the same time, SMT isn't necessarily restricted to all threads in the same process, and not all vector algorithms are completely memory bound. Even if they are, not all memory spaces are equivalent, especially in future machines. In a NUMA architecture, for example, some algorithms will stall worse than others because the memory access speeds are non-uniform. SMT provides flexible dynamic scheduling between arbitrary threads.






    If all threads were bandwidth-bound SIMD code running from the same source, then you'd either run some other non-bandwidth-bound code at the same time or just run the SIMD code in isolation. That doesn't mean all the other cases aren't going to happen, and they'll probably be more common than the single bandwidth-bound thread case. Look how many tasks are running on any given MacOS X box even when you've only got one app open.



    Given a single-processor machine, all of those running tasks are serialized through the CPU. They are not running at the same time in the sense that SMT threads would be. In effect, between each context switch as directed by the scheduler, you have two or more threads running on an SMT-capable machine, while on a standard CPU you will only have one thread running. This works OK, at times, with threads flowing through the integer and FP units, but I have a harder time believing that you will get consistently good results through a VMX unit. The primary reason is the need to move that data through the unit.



    Quote:

    In a single thread the only difference is that you had to issue 2 instructions to work on 256 bits of data rather than one on a wider register. And they don't interfere with each other more, because instead of building one 256-bit wide data path, you build two 128-bit data paths that are independent. I don't know how to explain it better than that.



    Yes, I know what you are trying to explain, but I see tremendous complexity in multiplexing two 128-bit data buses into the VMX register set.

    Quote:






    If you issue one instruction that operates on 256-bits, what do you do if you only need to operate on 128-bits? That is flexibility in a nutshell (and there are many ramifications).



    VMX already provides instructions to operate on 128-bit registers.

    Quote:



    The context switch is independent of the number of execution units. All that must be saved is the current architected machine state. If you change the register widths, then you've increased the context switch cost. That's why a context switch on the 970 is more expensive in 64-bit mode than it is in 32-bit mode, but the 32-bit mode cost is the same as it was on the G4. All the non-architected stuff is dealt with "behind the scenes" because the programming model doesn't expose it to the user.



    Yes!! The context switches are longer, but a decision has been made that the advantages of a 64-bit processor are worth the increase in context switch time. Likewise, a decision could be made with respect to VMX; I'm not saying it will be made, but it is possible.

    Quote:



    This is a complete cop-out. It will work in realtime systems just fine, and near-realtime is an important problem domain. Context switching costs are a percentage of your performance, so making the processor faster doesn't factor into it... it just means you get to make more context switches per unit time. A context switch isn't just a pre-emptive multitasking switch; it includes function calls and synchronization hand-offs between threads.



    Though not really related to Mac OS X, in realtime systems time can be an issue. Even in Mac OS X some operations could be impacted, which is probably an argument against wider registers. It is possible to address the time issue by speeding up the processor.



    Quote:

    Usability also factored into it heavily.







    The same holds true of the vector unit.







    Much better. So will Intel's next attempt at it. Don't let the first generation attempt at SMT discourage you -- the first generation superscalar processors were pretty lame as well.



    I wouldn't use the word discouraged; I'd like to think in terms of caution. SMT is one of those things that I can see working well in some applications, and completely futzing things up in others. I'm even willing to say that part of the issue is probably operating-system related.

    Quote:

    We certainly agree there. This discussion is really just about what VMX enhancements are worthwhile and effective, not whether any should be done at all.



    A very interesting discussion it is. No matter what approach is used on VMX2, I have to imagine a lot of effort will be put into moving data around. It will be very interesting to see a processor with double or triple the transistors currently on the 970 die.
  • Reply 104 of 114
    zapchud Posts: 844
    Quote:

    Originally posted by wizard69

    Yes, I understand this, but we do have to move forward. Even later on you mention that you would like to see an expanded VMX instruction set; wouldn't this expansion of VMX put us into the same position again?



    It would not put us into the same position again. The VMX2 situation Programmer describes here would be a "VMX2 is incompatible with VMX" situation. An expanded instruction set would let a programmer write code that works on both VMX platforms without rewriting anything, while writing a small fraction of the code to take advantage of the expanded instructions for extra efficiency. That fraction would be much less work to rewrite against the regular VMX instruction set, or against scalar instructions.
  • Reply 105 of 114
    programmer Posts: 3,503
    Quote:

    Originally posted by wizard69

    Yes, I understand this, but we do have to move forward. Even later on you mention that you would like to see an expanded VMX instruction set; wouldn't this expansion of VMX put us into the same position again?



    Moving forward doesn't have to mean changing the ISA again. I mention other ways to change VMX only because this thread was started by the idea that there is a VMX2 coming, and therefore it must be something. If it is going to be changed, what is the lowest-impact, most cost-effective, and most widely applicable way to change it?



    Quote:

    It is obvious that a great deal of thought went into the original AltiVec implementation; it is reasonable to believe that VMX can be extended again in a rational manner.



    Agreed. IBM is going to have a better idea of what rational is than either of us. It will also depend on their goals for the extensions.



    Quote:



    Well, part of my argument is that a better balance would be achieved by wider registers - that is, 256-bit instead of 128-bit. I'm still wondering why you suspect that 256-bit registers make life harder; you would still have all the existing capabilities of the VMX unit. You would simply take advantage of the wider registers when the problem you are solving can take advantage of them.





    My argument is that (a) it is cheaper and more backward compatible to provide more 128-bit vector units, and (b) any speedup available by going to a 256-bit vector unit can be achieved by doubling the number of 128-bit execution units and ensuring the dispatch rate is sufficient. This is true because of the nature of SIMD, and having more units of half the width gives better programming flexibility, more OoOE flexibility, and avoids a needless complication to the instruction set. Spend the transistor budget where everyone benefits automatically, on both new and old code.



    Quote:

    This works OK, at times, with threads flowing through the integer and FP units, but I have a harder time believing that you will get consistently good results through a VMX unit. The primary reason is the need to move that data through the unit.



    Not everything is bandwidth bound, and not everything is bandwidth bound on the same thing. IBM has already said there will be thread-level control of whether SMT is active for a given thread, so the developer or OS will be able to make a decision on a per-thread basis. Much of the time, however, there will be additional threads to run simultaneously with the bandwidth-bound thread -- it is an ideal candidate for this. In the future caches will be even larger and memory pools more non-uniform, making the definition of bandwidth bound more complicated and thus more amenable to SMT.



    Quote:



    Yes, I know what you are trying to explain, but I see tremendous complexity in multiplexing two 128-bit data buses into the VMX register set.




    It's no worse than doubling the width of the internal bus, and it may allow multiplexing of the internal bus to more execution units by just adding a pipeline stage or two. Again, there is flexibility in having more units instead of wider ones. The 970's registers are already dual-ported, by the way.



    Quote:



    VMX already provides instructions to operate on 128-bit registers.




    So? You are making an assumption about how the double-width registers are achieved. In the 32-bit to 64-bit PPC transition there is a mode switch, and most instructions have their behaviour changed to operate on 64-bit registers. You cannot interchange 32-bit and 64-bit operations without changing the mode, and that requires a couple of instructions and might have performance ramifications. If you look at the VMX instruction coding you'll see that there aren't any available bits in most instructions to indicate whether to operate on the narrow or wide register, which means they'd need to almost double the number of instructions just to support the same operations without adding any new capabilities.



    Note that the rumour mentioned only an ~33% increase in instruction count, which would be insufficient to achieve this.
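
    For reference, a back-of-the-envelope on the VX instruction form (assuming the standard layout):

        /* VX-form encoding budget:
           primary opcode:      6 bits
           VRT, VRA, VRB:   3 x 5 = 15 bits
           extended opcode:    11 bits
           6 + 15 + 11 = 32 -- every bit of the word is spoken for, so a
           "wide" flag would require a whole new block of opcodes. */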



    Quote:



    Yes!! The context switches are longer, but a decision has been made that the advantages of a 64-bit processor are worth the increase in context switch time. Likewise, a decision could be made with respect to VMX; I'm not saying it will be made, but it is possible.





    Ah, but the cost/benefit is dramatically different between expanding the general-purpose registers and the VPU registers. With the GPRs we get the capability for 64-bit memory addressing. With the VPU we get no new capability, just a performance increase that I argue can be achieved by having more execution units.



    Quote:



    Though not really related to Mac OS X, in realtime systems time can be an issue. Even in Mac OS X some operations could be impacted, which is probably an argument against wider registers. It is possible to address the time issue by speeding up the processor.





    But using up the processor's speed increase making up for an architectural change is a waste of potential performance... especially when a less intrusive change would have given you the same performance improvement.



    Quote:



    A very interesting discussion it is. No matter what approach is used on VMX2, I have to imagine a lot of effort will be put into moving data around. It will be very interesting to see a processor with double or triple the transistors currently on the 970 die.




    Yes, bandwidth will continue to be pivotal but this is bandwidth in and out of the processor not between cache and register. I'm interested to see what new capabilities they choose to put into the vector unit -- there are some operations which are currently quite slow but would make the unit much more versatile and make it easier to auto-vectorize code. This is where I think the big payoff in VMX2 would be.
  • Reply 106 of 114
    amorph Posts: 7,112
    Quote:

    Originally posted by wizard69

    Yes, I understand this, but we do have to move forward. Even later on you mention that you would like to see an expanded VMX instruction set; wouldn't this expansion of VMX put us into the same position again?



    Any excessive eagerness to move it in some direction just to move it fails to respect the amount of careful design work that went into the first version. The PPC ISA hasn't moved anywhere since 1994 because they got it right the first time.



    A little ways up, there was some back and forth about what "flexibility" was. Hardware is inflexible almost by default. The more constraints a hardware implementation places on software, the less flexible it is. SIMD is inflexible almost by definition - the massive parallelism of modern clusters and supercomputers is far better suited to working with vectors than AltiVec is (because there is no limit on the size of a vector, mathematically); or, for that matter, matrices, or any other large data sets. The correlation between data and the hardware units that process them is 1:1 (one FP value per FPU), and that means maximum flexibility in manipulating or representing data.



    Unfortunately, until languages and programmers get better about threading (OO is a step in that direction, but few if any of the currently popular OO languages are), and as long as there is a significant difference between register size and real data size, SIMD is a useful compromise.



    The improvements I've seen called for were based on the real-world experience of talented AltiVec programmers; they're not notional or careless, and they ignore SSE2 altogether. In fact, as a historical matter, the only person who's argued for a feature because it made sense for x86 is you. At any rate, given the excellence of the first design and the lack of follow-ups or tweaks, we can all expect another careful, long-term revision when it makes sense.



    Quote:

    Yep, there is a lot of room to improve VMX; I think everybody can agree on that. I will even accept that another execution unit will help to an extent, but I still have a hard time believing that you will be moving data in and out as fast as you would with wider registers.



    Wider registers introduce internal fragmentation into data storage (because any value takes up space in memory corresponding to the size of the register it will be loaded into); for the same reason, wider registers reduce the effectiveness of the cache (they either store half as much or double in size, slowing the CPU down and increasing their own latency - and low latency is the whole point of a cache). In other words, if you go to 256 bit registers, all data becomes 256 bit - it's just a matter of whether the top 128 bits are wasted.



    If you double up an execution unit, you improve the implementation of the existing VMX - there are no changes to the ISA or the macro language required, and no changes to any existing code are required. Furthermore, there is no further internal fragmentation of AltiVec data (the programmer already has to introduce some to fit the data into the multiples that AltiVec expects - imagine, for example, working on five 32-bit floats). Lastly, you open up opportunities for SMT when another thread is either not using, say, the VFPU all that heavily, or when it's waiting the 50-odd cycles before a request for data is answered across the system bus (on the 970 at least - the G4's memory latency is somewhat lower, but still significant).
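
    A minimal sketch of that five-float case (the array names are mine):

        #include <altivec.h>
        #include <string.h>

        /* Five floats must be padded to a multiple of the vector width: two
           128-bit vectors hold eight lanes, so three lanes are pure waste.
           Wider registers only increase the waste per partially filled one. */
        float src[5] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f };
        float padded[8] __attribute__((aligned(16)));

        void pack_and_load(void)
        {
            memset(padded, 0, sizeof padded);     /* zero the pad lanes       */
            memcpy(padded, src, sizeof src);
            vector float v0 = vec_ld(0,  padded); /* src[0..3]                */
            vector float v1 = vec_ld(16, padded); /* src[4] + three pad lanes */
            (void)v0; (void)v1;
        }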



    Quote:

    I'm still wondering why you suspect that 256-bit registers make life harder; you would still have all the existing capabilities of the VMX unit. You would simply take advantage of the wider registers when the problem you are solving can take advantage of them.



    Let me give you an example. Several years ago, my employer bought an Alpha computer - a pure 64-bit RISC platform. The Alpha's registers were all 64 bit, so it aligned data in RAM in 64-bit words and fetched data along those word boundaries. There were facilities to, for example, fetch the 8-bit character in the third byte of one of these words, but the performance hit was horrendous. So as a result, a simple array of ASCII characters took up 64 bits times the number of characters in the array (plus 64 bits for either the string length prefix or the nul)! That means that only 12.5% of the memory - and the bus bandwidth, and the processor's capability - was actually being used.



    Your implementation introduces the same problem, only with SIMD it's worse, because the whole point of SIMD is to deal with densely packed data! (In fact, it seems to me that SIMD started to appear at about the point when general-purpose CPUs started getting inefficient at processing small chunks of data - which is to say, most data.) A VMX programmer could take that array, pack the characters 16 at a time into vectors, and gain both a fourfold reduction in memory use and bandwidth and a fourfold increase in performance - usually you have to choose between one and the other (as in the Alpha). Now, you say, he could pack them 32 at a time into 256-bit registers. True, but we're not operating in a vacuum here: the resulting code would run poorly on all existing AltiVec implementations, and all existing AltiVec code would run poorly on the wider implementation (because of the internal fragmentation). This is why it's so important to get hardware specifications right the first time, and why it is crucial to be circumspect about making any architectural changes to them. Implementation changes, on the other hand, are welcome.



    Quote:

    Given a single-processor machine, all of those running tasks are serialized through the CPU. They are not running at the same time in the sense that SMT threads would be. In effect, between each context switch as directed by the scheduler, you have two or more threads running on an SMT-capable machine, while on a standard CPU you will only have one thread running. This works OK, at times, with threads flowing through the integer and FP units, but I have a harder time believing that you will get consistently good results through a VMX unit. The primary reason is the need to move that data through the unit.



    Remember, VMX isn't a physical unit. It's four units. If one process is doing vector permutes on long streams of data, there's still an opportunity for another process to do an intense amount of numerical (FP or integer) work on a relatively small dataset sitting in cache - or if it's not in cache, it can be requested while the other thread is still chewing through a stream of data, so that its request is answered while the other is still waiting (in effect interleaving the 50+ cycle latencies, and thus hiding their impact on system bandwidth). This is a trick that good AltiVec programmers already use, although it's implemented by hand.
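
    Sketched by hand with AltiVec's stream-prefetch intrinsics (the control-word values here are illustrative, not tuned):

        #include <altivec.h>

        /* start a prefetch stream so memory latency on upcoming blocks
           overlaps with computation on the current one */
        void square(const vector float *in, vector float *out, int n)
        {
            const vector float zero = (vector float){ 0, 0, 0, 0 };
            /* control word: block size 8 vectors, count 16 blocks, 128-byte stride */
            vec_dst(in, (8 << 24) | (16 << 16) | 128, 0);
            for (int i = 0; i < n; ++i)
                out[i] = vec_madd(in[i], in[i], zero); /* in[i] * in[i] + 0 */
            vec_dss(0); /* shut the stream down when finished */
        }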



    Quote:

    VMX already provides instructions to operate on 128-bit registers.



    But there are no 128-bit registers in a 256-bit implementation, unless you really want the design to become byzantine. Unless what you're suggesting is a separate set of units from VMX, with its own register set and its own buses? How would that be any less complicated than doubling up an existing unit or two?



    Quote:

    Though not really related to Mac OS X, in realtime systems time can be an issue. Even in Mac OS X some operations could be impacted, which is probably an argument against wider registers. It is possible to address the time issue by speeding up the processor.



    OS X has realtime capabilities, and they're used by e.g. audio.



    Speeding up the processor does not reliably address context switch penalties, either.



    Quote:

    I wouldn't use the word discouraged; I'd like to think in terms of caution. SMT is one of those things that I can see working well in some applications, and completely futzing things up in others.



    This is the same way that I am looking at a register widening. Either the impact on existing implementations would be painful (if the whole thing went 256 bit, with the same support for 128 bit instructions that 64 bit scalar implementations have for 32 bit data), or the complexity and real estate demands on the CPU would be painful (if you came out with some beast that supported both 128 bit and 256 bit registers).



    Quote:

    A very interesting discussion it is. No matter what approach is used on VMX2, I have to imagine a lot of effort will be put into moving data around. It will be very interesting to see a processor with double or triple the transistors currently on the 970 die.



    It would be more interesting to me to see a CPU half the size of the 970, frankly. Big, hot processors constrain both the design and implementation of hardware. And the future of high performance (not to mention much of the present) is massive parallelism of inexpensive machines. Two FPUs on two CPUs have all the parallelism of 4 FPUs on one CPU, with twice the bandwidth (assuming that we've left the shared-bus topology where it belongs) and twice the cache. You can't add units to a CPU if you don't have the bandwidth to back them up - SMT should, if anything, slow down the expansion of CPUs in practice, because it will make sure that the existing units are used more efficiently.
  • Reply 107 of 114
    wizard69 Posts: 13,377
    First, I am not at all worried that Apple, Mot or IBM will show excessive enthusiasm for extending VMX. I was talking about your enthusiasm for extending VMX.



    Quote:

    I really can't see how you can make a statement suggesting that PPC hasn't moved anywhere; AltiVec, the G4 and the 970 are all steps forward.



    You're confusing implementation with architecture again. The PPC ISA (instruction set architecture, the set of instructions that a CPU must implement in order to be called a PowerPC) has not changed. Not even once. VMX is a separate ISA that happens to appear on some PowerPCs.



    I was addressing your idea that things like VMX have to "move forward" with new instructions. Ideally, the fewer revisions a hardware platform goes through, the better, and the PPC is a sterling example.



    I've been arguing consistently for better implementations of the VMX architecture, which is what extra execution units would represent. Fortunately, neither Mot nor IBM have been slow with those.



    Quote:

    If the above were true Apple would not have had much success selling the G4s into compute clusters. Many times the existence of the vector unit proved to be a fine selling point when dealing with technical people. Even the Navy has seen that there can be a cost benefit to buying Apple's G4 hardware. Yes, the Navy is using the machines in clusters, but without the vector unit they would not be there.



    Did I say they were bad? I was talking about the idea of flexibility and what it meant. My position on the desirability of VMX should be clear at this point.



    To be blunt, SIMD is an attractive compromise between efficiency of execution and flexibility. I described what the tradeoffs were, and why they're desirable, in the post you replied to, so I won't repeat them.



    Quote:

    Threading solves a completely different problem than SIMD.



    Not completely different; there is a broad intersection proven by the nature of high-end computing. They achieve the same sort of concurrent execution that SIMD does by using lots of CPUs and massive threading. In fact, since SIMD's parallelism is negligible in comparison, they'll use both parallelism and SIMD when it suits their needs.



    In other words, threading can and does solve the same problem, which is doing a lot of identical operations on large amounts of data at once.



    Again, as I said in the post you replied to, SIMD might always be useful for dealing with data types that are too small to be handled efficiently in scalar units, due to the scalar register size.



    Quote:

    Now, parallel processing may be used to speed up math operations on large data sets, and a threading implementation may implement parallel operations, but you will still derive maximum benefit from using the vector instructions. That is, provided the problem is vector based in the first place.



    Vector != SIMD and SIMD != vector. I've already made the case for the superiority of massive parallelism in vector processing.



    On the other hand, Motorola provided a routine to use AltiVec to do fast searches through text, which is hardly a problem that lends itself to vector math, but which is a great use of AltiVec (after all, it was one of the things that Motorola was showing off!).
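
    In that spirit, a minimal sketch of a 16-bytes-at-a-time character search (my illustration, not Motorola's routine; the input is assumed 16-byte aligned):

        #include <altivec.h>
        #include <string.h>

        /* test 16 characters against a target byte with one vector compare */
        int block_has_char(const unsigned char *s, unsigned char c)
        {
            unsigned char splat[16] __attribute__((aligned(16)));
            memset(splat, c, sizeof splat);  /* replicate c into every lane */
            vector unsigned char target = vec_ld(0, splat);
            vector unsigned char text   = vec_ld(0, s);
            return vec_any_eq(text, target); /* 16 compares, one predicate */
        }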



    Quote:

    Now you must be confused; I have never argued in favor of a feature because it made sense in the x86 world.



    Does "[j]ust as SSE2 improved on the standard Intel FPU an enhanced VMX unit could improve on the PPC FPU" ring a bell?



    Quote:

    Data does not become 256 bit any more than it does in a 128-bit register; you are confusing what happens in an integer unit when it is widened with what happens in a vector unit.



    The issue is: How is data loaded into registers from memory? The nature of the instructions is irrelevant to this concern. What matters is the size of the register. Because the PPC and VMX ISAs are separate, when you are running scalar code, the size of the scalar registers determines the smallest quantity of data that will be fetched from memory at once. When you are running VMX code, the size of the VMX registers determines the smallest quantity of data that will be fetched from memory. (In practice, cache line sizes determine the amount of data fetched, but it's never less than the amount that can be stored in the relevant register, and it's always a multiple of that amount). So, 256 bit VMX registers will force VMX instructions and data to be stored in 256 bit words.



    Quote:

    This is a well-known problem with RISC processors in general. But again I have to ask what that has to do with arranging data for vector operations. You aren't trying to tell me that you allocate 128 bits for each float, are you? Placing data on 128-bit boundaries is another thing, but data alignment issues have been around a long time.



    Data is arranged for SIMD manually. All existing AltiVec code arranges it into chunks 128 bits wide, which are loaded into registers 128 bits wide. If you widen the registers to 256 bits, the chunks are still 128 bits unless the programmer goes and manually rearranges everything - in which case the code no longer runs on the accumulated 4 years' worth of VMX1 units.



    This has nothing to do with the representation of the data types that are packed into the SIMD vectors. It has to do with the size of the vectors themselves, and that's where the analogy to scalar values come in. If you say that a vector is to a vector register as a float is to an FP register, then going from 32 bits to 64 bits means that every 32 bit float takes up 64 bits in scalar code (32 bits of which are padding), and every 128 bit vector takes up 256 bits in vector code (128 bits of which are padding).



    Quote:

    OS X has realtime capabilities, and they're used by e.g. audio.



    Yes but that wasn't the types of systems I was thinking about.





    Realtime needs are realtime needs. Latency (or rather, the lack thereof) is absolutely crucial to pro audio, and Apple's audio is all software at this point. Anything that negatively affects OS X's ability to do something now had better be damn good.



    Quote:

    As to your statement that 'two FPUs on two CPUs have all the parallelism of 4 FPUs on one CPU', that simply isn't true. It is not true in the case of any current SMT implementation and certainly won't be in the future.



    I'm not talking about SMT, I'm talking about parallelism, or concurrency if you prefer. Both are capable of doing 4 FP instructions on 4 FP data simultaneously, no? And the first implementation has twice the bandwidth, so if anything 2 FPUs on 2 CPUs is better for SMT. The only real-world bottleneck - and I've acknowledged this repeatedly - is that threading is not a widely used discipline among desktop programmers, and it's not been encouraged by traditional desktop hardware. But going back to your original proposition, if you want to do supercomputer work on a desktop, you need the implementation used by every hard-core number cruncher in existence: massive parallelism, and massive threading to match. Even for SIMD work.



    I'm waiting and hoping for languages to catch up to this need. All of the big ones now tack threading on as a sort of afterthought, or a system capability when they acknowledge it at all (does Standard C++ know what a thread is even now? I don't think so). This discourages vendors from designing threaded software, which discourages platform designers from designing thread-friendly hardware, and off we go.
  • Reply 108 of 114
    programmer Posts: 3,503
    Quote:

    Originally posted by wizard69

    I've been arguing consistently for better implementations of the VMX architecture, which is what extra execution units would represent. Fortunately, neither Mot nor IBM have been slow with those.



    Interesting statement for somebody who has been advocating wider registers for the past dozen posts.



    Quote:



    Data is arranged for SIMD manually.





    Only in an ideal world. Sadly most of the time an AltiVec coder spends half his effort rearranging data into the correct form to use the vector instructions on it, and then rearranges it back into the memory form for storage. This includes reordering elements, setting up alignment, and packing/unpacking fields. The permute unit gets half the instructions for a reason. Increasing the register width is only going to make this process more awkward because each instruction has to deal with twice as many data elements and when you're futzing around at the end of the array with ones and twos that means you're "farther" from the vector unit's happy place. By itself this wouldn't matter, but combined with the other costs of the wider registers it doesn't help the situation.



    Quote:



    I'm waiting and hoping for languages to catch up to this need. All of the big ones now tack threading on as a sort of afterthought, or a system capability when they acknowledge it at all (does Standard C++ know what a thread is even now? I don't think so). This discourages vendors from designing threaded software, which discourages platform designers from designing thread-friendly hardware, and off we go.




    As with many other things, C++ forces the developer to deal with the problem(s). That is both its strength and weakness. It is a strength because you get control, it is a weakness because you must exercise control. C++ is a very powerful language and can be used to attack a wide range of problems. Unfortunately it is used to attack too many problems where other tools exist that are probably more appropriate. Developers often run into trouble with threading, in particular, due to lack of training / knowledge / experience. It is often a difficult problem and the tools available to work with it are fairly weak. All too often it is retrofitted onto code that wasn't designed to be threaded.



    The thread-friendly hardware is coming, and coming fast. Multiple chips with multiple cores equipped with SMT will be with us soon. It has been coming for many years -- pretty much an inevitability due to the problems in making a single instruction stream run faster. Eventually you hit the wall of diminishing returns. Fortunately the OSes have advanced to the point where their threading support is acceptable, even if the tools & languages are lacking.







    There was some strangeness in this thread when I first tried replying to one of your comments about chip size. I don't know where your comment went, but I thought I'd toss my comments into the ring...



    In chip manufacture there is an "ideal" physical size for a chip at a given wafer size. Offhand I can't remember why, but I'll leave tracking down that reason as an exercise for the reader. The upshot, however, is that chips will remain physically the same size as the process shrinks, which increases the transistor budget. Designers have essentially reached the point where they are trying to figure out more effective things to put in the chip -- caches have reached the point of diminishing returns, pipelines are about as long as they want to be, register sets are large enough for the number of execution units, and the number of execution units has grown to the point where it's not really worth adding more because they'd just spend most of their time stalled waiting on dependencies. SMT provides a way to fill the current stalls, and provides justification for adding more execution units and dispatch capability. SMT itself will not cause chips to expand or contract, but instead provides a way to more effectively utilize the available transistors that a given process allows.
  • Reply 109 of 114
    amorph Posts: 7,112
    Quote:

    Originally posted by Programmer

    Interesting statement for somebody who has been advocating wider registers for the past dozen posts.



    Check your attributions. I wrote the post you're quoting.



    Quote:

    Only in an ideal world. Sadly most of the time an AltiVec coder spends half his effort rearranging data into the correct form to use the vector instructions on it, and then rearranges it back into the memory form for storage. This includes reordering elements, setting up alignment, and packing/unpacking fields. The permute unit gets half the instructions for a reason. Increasing the register width is only going to make this process more awkward because each instruction has to deal with twice as many data elements and when you're futzing around at the end of the array with ones and twos that means you're "farther" from the vector unit's happy place. By itself this wouldn't matter, but combined with the other costs of the wider registers it doesn't help the situation.



    Hmm. Interesting. But the basic idea that wider registers introduce rampant internal fragmentation holds, right? Because the existing code is still arranging data into a form half the size of the registers?



    Quote:

    As with many other things, C++ forces the developer to deal with the problem(s). That is both its strength and weakness. It is a strength because you get control, it is a weakness because you must exercise control. C++ is a very powerful language and can be used to attack a wide range of problems. Unfortunately it is used to attack too many problems where other tools exist that are probably more appropriate. Developers often run into trouble with threading, in particular, due to lack of training / knowledge / experience. It is often a difficult problem and the tools available to work with it are fairly weak. All too often it is retrofitted onto code that wasn't designed to be threaded.



    Exactly.



    I remember something Dennis Ritchie said about C, that its strength lies in the fact that it doesn't try to do everything. There's something to be said for using the right tool for the job; "standardization" on a single language hasn't made sense since IBM tried it with PL/I.



    Quote:

    SMT provides a way to fill the current stalls, and provides justification for adding more execution units and dispatch capability. SMT itself will not cause chips to expand or contract, but instead provides a way to more effectively utilize the available transistors that a given process allows.



    Hmm. After everything I'd written about latency and stalls, I'd forgotten to account for that. Ah, well.



    OK, so maybe it's a wash. I'll buy that.
  • Reply 110 of 114
    airsluf Posts: 1,861
  • Reply 111 of 114
    amorph Posts: 7,112
    I believe AirSluf is trying to tell us that we're beating a dead horse, but the sheer magnitude of our effort has struck him dumb.
  • Reply 112 of 114
    airsluf Posts: 1,861
    Quote:

    Originally posted by Amorph

    I believe AirSluf is trying to tell us that we're beating a dead horse, but the sheer magnitude of our effort has struck him dumb.



    You got that one right.



    It was painful, but hissy-fits, threats of electrical mayhem and much beating of head against the desk finally won out over a self-induced case of image apoplexy.
  • Reply 113 of 114
    programmer Posts: 3,503
    Quote:

    Originally posted by Amorph

    Check your attributions. I wrote the post you're quoting.



    Then you'd better check your forum database/software -- the post above mine is the source of the quote and it said, and still says, "wizard69".





    Quote:



    Hmm. Interesting. But the basic idea that wider registers introduce rampant internal fragmentation holds, right? Because the existing code is still arranging data into a form half the size of the registers?




    I've been avoiding replying to your posts because I don't understand what you're talking about with "internal fragmentation".





    Quote:



    ...dead horse...





    There are places where horsemeat has value.
  • Reply 114 of 114
    wizard69 Posts: 13,377
    Quote:

    Originally posted by Programmer

    Then you'd better check your forum database/software -- the post above mine is the source of the quote and it said, and still says, "wizard69".







    I seem to be seeing some confusion in the attributions also. Maybe we confused the forum software along with everybody else.





    Quote:



    I've been avoiding replying to your posts because I don't understand what you're talking about with "internal fragmentation".









    There are places where horsemeat has value.


