Vmx 2

Comments

  • Reply 81 of 114
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Programmer

    You seem to be obsessed with the accuracy of physical measurements. There are a very large set of software problems where there are no physical measurements to be dealt with -- perhaps most problems? Even if you have measurements, and they are lower precision than a 32-bit float, you still need higher precision math in order to run many algorithms on this data to avoid inaccuracies creeping in because of the nature of fixed-precision math.







    It's more an issue of being drawn into the accuracy argument by others. Your paragraph above, though, is very clear on the issue: there are many reasons to support 64-bit operations.
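
    A minimal C sketch of the "inaccuracies creeping in" point quoted above (illustrative only, not from the thread; the exact printed values depend on compiler and platform):

        #include <stdio.h>

        int main(void)
        {
            float  fsum = 0.0f;  /* 24-bit significand */
            double dsum = 0.0;   /* 53-bit significand */
            /* Sum ten million copies of 0.1; every addition rounds, and
               the float's error compounds once the running total dwarfs
               the addend. */
            for (int i = 0; i < 10000000; i++) {
                fsum += 0.1f;
                dsum += 0.1;
            }
            printf("float:  %f\n", fsum); /* off from 1e6 by several percent */
            printf("double: %f\n", dsum); /* within a fraction of a unit of 1e6 */
            return 0;
        }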

    Quote:



    You are absolutely right that there is a need for 64-bit numbers; my position is that the FPU(s) is where this data should exist, not the vector unit. The advent of SMT, the need to write new code to leverage VMX2, splitting the hardware base, the constraints on vector processing, and the cost of context switching with a large vector register set are all reasons why you wouldn't want to put doubles into the vector unit (even with 100% backwards compatibility). There are some perfectly valid reasons to extend VMX2 in this way, but while it might be an obvious thing to do, I believe that the reasons not to do it outweigh the advantages. We'll see if IBM agrees with me when they announce the details of VMX2. I have a lot of confidence that whichever course they choose will be the right one, since they know a lot more about the subject than anybody here.



    I'm not totally convinced one way or the other. Frankly, it would be a stretch for me to design a CPU off the cuff. In any event, I'm sure that both Apple and IBM are looking into how best to improve the PPC for math ops. In the end both the FPU and the VMX units will be improved; I'm just not convinced that vector operations will be handled in the current FPU. I could see them using the current FPU and an alternative register set to accomplish 64-bit vector operations; that would be neat. In any event, I think everybody could agree that the current VMX unit could be improved greatly just by eliminating bottlenecks.



    Thanks

    Dave
  • Reply 82 of 114
    Quote:

    Originally posted by rickag

    AltiVec has a simple integer, complex integer, floating point and permute unit. Would it be possible, or even worthwhile, to increase the number of execution units, and to provide more flexible units (execution units that could perform one or more of the functions, for example an execution unit that could do both simple and/or complex integer)? And if possible, would this allow for the retirement of more instructions per cycle?



    What you describe is certainly possible, but the question really is: what problem would that solve?



    Typically, adding execution units is a good thing to resolve issues where you have more data than time to process it. It can give you better scheduling and more IPC, for instance. But this assumes you have a huge cache and/or a bus that is capable of keeping your execution units fed.



    As has been pointed out here before, AltiVec as currently designed is still capable of chewing through data so fast that even the 2GHz G5's bus is not capable of pumping data fast enough to prevent stalls. The problem with vector operations isn't having too few execution units, it's how to locate and move the data fast enough to make sure it's ready when the execution unit is.
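
    To make the data-feeding problem concrete, here is a hedged C/AltiVec sketch (assuming altivec.h, 16-byte-aligned arrays, and n a multiple of 4). Each iteration does one vector multiply-add but moves 32 bytes across the bus, so on a large out-of-cache array the bus, not the vector ALU, sets the pace:

        #include <altivec.h>

        void scale_bias(float *dst, const float *src, int n,
                        vector float scale, vector float bias)
        {
            for (int i = 0; i < n; i += 4) {
                /* 16-byte load, 8 flops (4 mul + 4 add), 16-byte store;
                   offsets are in bytes (4 bytes per float) */
                vector float v = vec_ld(i * 4, src);
                vec_st(vec_madd(v, scale, bias), i * 4, dst);
            }
        }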
  • Reply 83 of 114
    Quote:

    Originally posted by Programmer

    Furthermore, the cost of these huge [256 bit] registers would affect the context switching cost of the processor in a very negative way. Doubling the vector registers would be an extra 0.5K for every context, on top of the already substantial 1K... a 50% increase.





    One of the benefits I see with 256-bit/4x64-bit vector registers is that keeping the 32 standard registers plus the same number of rename registers would not cause a performance hit (except due to bandwidth, which would be the same in a 4-FPU chip).



    Increasing the number of 64-bit FPUs to 4, to give the same double-precision throughput capability as a 256-bit VMX unit, would run into a problem with the number of registers. With FPU math you need to store and operate on more 64-bit numbers than vectors in a 256-bit VPU to perform the same operations, i.e. "4I4D" vs. SIMD. I assume this would be a factor-of-4 difference.



    So how could you handle this in a 4 FPU chip? Would 32 FP registers plus the rename registers be sufficient? Would you extend the PPC spec to allow 64 (or 80 or 96...) registers with even more rename registers?



    Do you understand my point? Do I understand the problem?... I feel that this issue is probably much broader than what I've posed here; I feel inadequate to ask all the right questions.



    MM
  • Reply 84 of 114
    programmer Posts: 3,467 member
    Quote:

    Originally posted by MartianMatt

    Do you understand my point? Do I understand the problem?... I feel that this issue is probably much broader than what I've posed here; I feel inadequate to ask all the right questions.



    Yes.

    No.



    The number of architected FPU registers doesn't change when more FPUs are added. The number of rename registers probably does. Only architected registers need to be preserved in a context switch, however, so the rename registers don't cost anything except transistors. The OoOE nature of the processor allows you to write code ignoring the fact that there are multiple FPUs and the hardware takes care of the dynamic scheduling requirements. The instruction dispatch and load/store capabilities would need to be scaled with the increased FPU capability, but that is likely to happen anyhow. The addition of SMT means that dependent instruction streams (which stall a lot) can be multi-threaded very effectively, while vector-style threads can fully utilize the increased number of execution units.
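
    A rough C sketch of the register-save arithmetic behind this (the layout is illustrative, not any real OS structure): only the architected state below must be copied on a context switch, and its size is fixed by the ISA no matter how many physical FPUs or rename registers the core has.

        /* Architected user-level register state of a 64-bit PowerPC,
           approximately; rename registers never appear here. */
        typedef struct {
            unsigned long long gpr[32]; /* 32 x 64-bit GPRs = 256 bytes */
            double             fpr[32]; /* 32 x 64-bit FPRs = 256 bytes */
            unsigned char   vr[32][16]; /* 32 x 128-bit VRs = 512 bytes */
            /* plus CR, LR, CTR, XER, FPSCR, VRSAVE, VSCR... roughly the
               ~1K quoted earlier; widening the VRs to 256 bits would add
               another 512 bytes per context. */
        } ppc_context;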
  • Reply 85 of 114
    rickag Posts: 1,626 member
    Quote:

    Originally posted by Tomb of the Unknown

    ....The problem with vector operations isn't having too few execution units, it's how to locate and move the data fast enough to make sure it's ready when the execution unit is.



    Thank you for the response. You distilled it down enough so that even I could understand, kinda.
  • Reply 86 of 114
    bigc Posts: 1,224 member
    Quote:

    Originally posted by Programmer

    Yes.

    No.



    The number of architected FPU registers doesn't change when more FPUs are added. The number of rename registers probably does. Only architected registers need to be preserved in a context switch, however, so the rename registers don't cost anything except transistors. The OoOE nature of the processor allows you to write code ignoring the fact that there are multiple FPUs and the hardware takes care of the dynamic scheduling requirements. The instruction dispatch and load/store capabilities would need to be scaled with the increased FPU capability, but that is likely to happen anyhow. The addition of SMT means that dependent instruction streams (which stall a lot) can be multi-threaded very effectively, while vector-style threads can fully utilize the increased number of execution units.




    ...and for those of you (like me) who don't know what a context switch is and why it is important for SMP.
  • Reply 87 of 114
    Quote:

    Originally posted by Programmer

    Yes.

    No.



    The number of architected FPU registers doesn't change when more FPUs are added. The number of rename registers probably does.




    But my point was: in doubling the number of FPUs, is there a need for more architected registers?



    Quote:

    Only architected registers need to be preserved in a context switch, however, so the rename registers don't cost anything except transistors.



    Since this is the case, would it be a problem that you can only save 32 registers of info when you are now working on twice the amount of instructions and data (with 4 FPUs)?



    Quote:

    The OoOE nature of the processor allows you to write code ignoring the fact that there are multiple FPUs and the hardware takes care of the dynamic scheduling requirements. The instruction dispatch and load/store capabilities would need to be scaled with the increased FPU capability, but that is likely to happen anyhow. The addition of SMT means that dependent instruction streams (which stall a lot) can be multi-threaded very effectively, while vector-style threads can fully utilize the increased number of execution units.



    Wouldn't SMT also put pressure on the registers since you are now maximising the usage of all the execution units?



    MM
  • Reply 88 of 114
    programmer Posts: 3,467 member
    Quote:

    Originally posted by MartianMatt

    But my point was: in doubling the number of FPUs, is there a need for more architected registers?



    Since this is the case, would it be a problem that you can only save 32 registers of info when you are now working on twice the amount of instructions and data (with 4 FPUs)?





    Nope, no need.



    Quote:



    Wouldn't SMT also put pressure on the registers since you are now maximising the usage of all the execution units?





    It puts pressure on physical registers, not architected ones.
  • Reply 89 of 114
    blabla Posts: 185 member
    Quote:

    Originally posted by Programmer



    Furthermore, the cost of these huge registers would affect the context switching cost of the processor in a very negative way. Doubling the vector registers would be an extra 0.5K for every context, on top of the already substantial 1K... a 50% increase.





    Correct me if I'm wrong here, but... I thought it was the compiler's (and the assembly coder's) responsibility to "mark" the AltiVec registers that should be saved for a context switch. A context switch will only save those AltiVec registers actually used.

    http://e-www.motorola.com/files/32bi...ALTIVECPEM.pdf

    chapter 2.3.3
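
    For readers following along, a hypothetical C sketch of what that VRSAVE convention buys the kernel; store_vr is a made-up stand-in for an stvx instruction, and the bit order follows the manual's big-endian numbering (bit 0 = v0):

        extern void store_vr(int vr, unsigned char dst[16]); /* hypothetical helper */

        void save_live_vectors(unsigned int vrsave,
                               unsigned char save_area[32][16])
        {
            for (int vr = 0; vr < 32; vr++) {
                /* skip any vector register the compiler never marked live */
                if (vrsave & (0x80000000u >> vr))
                    store_vr(vr, save_area[vr]);
            }
        }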
  • Reply 90 of 114
    amorph Posts: 7,112 member
    The voltage example misses the point; I'm tired of this tangent, so I'll drop the argument. (Food for thought: With what precision does the source supply 0.5v? If it's accurate to the hundredth of a volt, then yes, it makes perfect sense to ask for 0.25v. If not, then you can't ask for 0.25v in the first place.)



    Quote:

    Originally posted by wizard69

    Well, no, he is either not communicating well or has at least a few concepts wrong. It does not make sense to compare changing the width of a register in the main CPU ALU with a change of register size in a vector unit.



    In an ALU you are always doing one operation on one piece of data in a register. Within a vector unit you are operating on a number of pieces of data at the same time. The effects of changing the width of a vector unit are different from those experienced when changing the width of a processor's register.




    Then perhaps I didn't understand what you were saying, because the problem of "packing" two 128-bit SIMD operand groups into a 256-bit SIMD register is like the problem of packing two 32-bit floats into a 64-bit register. The difference becomes relevant if you implement a set of 256-bit SIMD operations so that programmers can rework their VMX code to use 256-bit vectors rather than 128-bit vectors. But if you're talking about doubling the efficiency of 128-bit code by packing its operands into 256-bit registers, that's not going to happen, for the same reason it's never happened in scalar registers: transistors, and bandwidth.



    The advantage the vector unit would have is the ability to use an existing 256-bit operation on two 128-bit values, since they would be logically indistinguishable from one 256-bit vector in all but one case (4 x 64 bit values, since there are no operations for 2x64). But that doesn't solve the main problems with doubling up, which are related to scheduling, intelligently determining which 128-bit operands could be packed, which 128-bit operations could be packed with them, and which could be mapped to 256-bit operations; bandwidth; and the extra transistors required to do the packing and unpacking.



    The bottom line is that if you want to simply and efficiently double the amount of work a unit does, pair it with another independent unit and mate them to an intelligent scheduler and a fat pipe.
  • Reply 91 of 114
    programmer Posts: 3,467 member
    Quote:

    Originally posted by blabla

    Correct me if I'm wrong here, but... I thought it was the compiler's (and the assembly coder's) responsibility to "mark" the AltiVec registers that should be saved for a context switch. A context switch will only save those AltiVec registers actually used.

    http://e-www.motorola.com/files/32bi...ALTIVECPEM.pdf

    chapter 2.3.3




    Yes, if it's used. Any VMX code will be using a fair number of registers in order to maximize performance... and in a 256-bit register system each register saved is twice as expensive to save/restore.
  • Reply 92 of 114
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Amorph



    Then perhaps I didn't understand what you were saying, because the problem of "packing" two 128-bit SIMD operand groups into a 256-bit SIMD register is like the problem of packing as two 32-bit floats into a 64 bit register. The difference becomes relevant if you implement a set of 256-bit SIMD operations so that programmers can rework their VMX code to use 256-bit vectors rather than 128-bit vectors. But if you're talking about doubling the efficiency of 128-bit code by packing its operands into 256-bit registers, that's not going to happen, for the same reason it's never happened in scalar registers: Transistors, and bandwidth.







    Well, obviously you will not speed up code that already assumes 128-bit registers (the current VMX implementation). And there is really no comparison between SIMD registers and FPU or integer unit registers: the FPU or integer unit operates on one data type in a register at a time, while the VMX unit operates on several pieces of data in parallel. Widening the registers, with the addition of the corresponding instructions, will provide a doubling of performance. Yes, a programmer or compiler would have to implement the new features; that would be dependent on how those features enhance an application.

    Quote:



    The advantage the vector unit would have is the ability to use an existing 256-bit operation on two 128-bit values, since they would be logically indistinguishable from one 256-bit vector in all but one case (4 x 64 bit values, since there are no operations for 2x64). But that doesn't solve the main problems with doubling up, which are related to scheduling, intelligently determining which 128-bit operands could be packed, which 128-bit operations could be packed with them, and which could be mapped to 256-bit operations; bandwidth; and the extra transistors required to do the packing and unpacking.



    The use of extended functionality and organization of data will still be up to the programmer, as it is now. From my perspective it should be easier to schedule wider instructions and the corresponding data than to schedule more of the current instructions and data. Bandwidth is always an issue, but I can't imagine transistors being dedicated to packing and unpacking for scheduling purposes.

    Quote:



    The bottom line is that if you want to simply and efficiently double the amount of work a unit does, pair it with another independent unit and mate them to an intelligent scheduler and a fat pipe.



    This is an interesting point and may very well be one way to improve performance. We have seen from previous experience that it certainly works on FPUs and integer units. On vector code, where you may be working on very large data sets relative to the "width" of the VMX unit, I'm not convinced that this would be the best approach. Is there really that much schedulable code in normal VMX applications to justify the intelligent scheduler? I'm thinking that most vector code is much more regular compared to what runs through an FPU or integer unit. This code will either schedule very easily or gain very little from it.



    Dave
  • Reply 93 of 114
    amorph Posts: 7,112 member
    Quote:

    Originally posted by wizard69

    Well, obviously you will not speed up code that already assumes 128-bit registers (the current VMX implementation).



    OK, so we're in agreement on that. Good.



    Quote:

    From my perspective it should be easier to schedule wider instructions and the corresponding data than to schedule more of the current instructions and data. Bandwidth is always an issue, but I can't imagine transistors being dedicated to packing and unpacking for scheduling purposes.



    Well, if we've both dismissed packing as an issue (because the programmer will do that, as is proper) then it's not an issue, and we don't have to worry about it.



    However, the registers will still be 256 bits wide, which means that fetches will call for data in multiples of 256 bits, which means that 128 bit AltiVec data will suffer from massive internal fragmentation and take a performance hit - and, of course, the caches will be about half as effective.





    Quote:

    This is an interesting point and may very well be one way to improve performance. We have seen from previous experience that it certianly works on FPU's and integer units. On vector code where you may be working on very large data sets relative to the "width" of the VMX unit I'm not convinced that this would be the best approach. Is there really that much schedulable code in normal VMX applications to justify the intelligent scheduler.



    Not much intelligence is necessary. "A is busy, so B gets this one." Or even, "A got the last one, so B gets this one." Just something to make sure that as many units as possible were being kept busy. Someone more familiar with VMX programming can confirm or deny this, but from my survey of the issue dependencies aren't a problem in vector code to nearly the degree that they are in scalar.



    With this arrangement, the instruction set wouldn't have to be extended - the vectors would still be 128 bit. But the vector engine could devour two at a time, gaining you the practical benefit of a 256 bit implementation (except, again, without support for 64 bit data types in vectors) with a less complicated solution that also happens to speed up existing AltiVec code. Since VMX is not monolithic in implementation, you could also target specific physical units, doubling, for example, the vector float unit but not the vector permute unit.



    Of course, bandwidth is still the elephant in the room, and the engineers would also have to do a study of the usual instruction mix to see whether and where it was worth doing. Doubling the theoretical power of a unit is of no help if one unit spends most of its time idle for whatever reason.
  • Reply 94 of 114
    programmer Posts: 3,467 member
    Amorph is saying what I've always said about a 256-bit VMX variation -- it costs you in terms of context switching, less flexibility, and fragmenting the PowerPC installed base further (i.e. now there is non-VMX, VMX, and VMX2 to develop for)... and it doesn't buy you much.



    Consider that a 128-bit vector is a collection of smaller types all being operated on in parallel (e.g. 8 16-bit integers). The typical use of the unit is to run through long arrays of these smaller types (e.g. a thousand 16-bit integers). This means the existing code is already a loop that the OoOE engine is dynamically scheduling. By doubling the vector issue & execution rate you can do twice as much work using the same VMX code as you currently use. The OoOE engine allows multiple loop iterations to be in progress at the same time.
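
    A hedged C/AltiVec sketch of that loop shape (aligned inputs and n a multiple of 8 assumed): the two accumulator chains below are independent, so a core that can issue two vector multiply-adds per cycle finishes roughly twice as fast with no change to the instruction set.

        #include <altivec.h>

        vector float sum_products(const float *a, const float *b, int n)
        {
            vector float acc0 = (vector float)vec_splat_u32(0); /* zeros */
            vector float acc1 = acc0;
            for (int i = 0; i < n; i += 8) {
                /* independent chains: the OoOE engine can overlap them,
                   or a second vector unit can take one of the two */
                acc0 = vec_madd(vec_ld(i * 4,      a),
                                vec_ld(i * 4,      b), acc0);
                acc1 = vec_madd(vec_ld(i * 4 + 16, a),
                                vec_ld(i * 4 + 16, b), acc1);
            }
            return vec_add(acc0, acc1); /* four partial sums per lane */
        }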



    Widening the vector registers really only makes sense if you want to do atomic operations on larger data types... i.e. 64-bit floats. If you add support for 64-bit floats but don't increase the register size then each vector unit is only as powerful as 2 FPUs and less flexible. To achieve a worthwhile improvement you want each instruction to do more operations. Unfortunately that register size increase costs you quite a bit. Personally I think a lot more mileage is to be had by improving the FPUs and combining this with SMT so that they can be used by multiple threads at once if a single thread can't leverage them by itself. In that case then the same applies to having more vector units -- if a single thread can't use all the vector units then others will fill up the pipe. Going to a larger vector length makes this harder to achieve because you have fewer instructions consuming more bandwidth.
  • Reply 95 of 114
    bigc Posts: 1,224 member
    Quote:

    Originally posted by Programmer

    Amorph is saying what I've always said about a 256-bit VMX variation -- it costs you in terms of context switching, less flexibility, and fragmenting the PowerPC installed base further (i.e. now there is non-VMX, VMX, and VMX2 to develop for)... and it doesn't buy you much.



    Consider that a 128-bit vector is a collection of smaller types all being operated on in parallel (e.g. 8 16-bit integers). The typical use of the unit is to run through long arrays of these smaller types (e.g. a thousand 16-bit integers). This means the existing code is already a loop that the OoOE engine is dynamically scheduling. By doubling the vector issue & execution rate you can do twice as much work using the same VMX code as you currently use. The OoOE engine allows multiple loop iterations to be in progress at the same time.



    Widening the vector registers really only makes sense if you want to do atomic operations on larger data types... i.e. 64-bit floats. If you add support for 64-bit floats but don't increase the register size then each vector unit is only as powerful as 2 FPUs and less flexible. To achieve a worthwhile improvement you want each instruction to do more operations. Unfortunately that register size increase costs you quite a bit. Personally I think a lot more mileage is to be had by improving the FPUs and combining this with SMT so that they can be used by multiple threads at once if a single thread can't leverage them by itself. In that case then the same applies to having more vector units -- if a single thread can't use all the vector units then others will fill up the pipe. Going to a larger vector length makes this harder to achieve because you have fewer instructions consuming more bandwidth.






    Best explanation I've seen yet as to why not to increase the size of the Altivec units.
  • Reply 96 of 114
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Amorph





    However, the registers will still be 256 bits wide, which means that fetches will call for data in multiples of 256 bits, which means that 128 bit AltiVec data will suffer from massive internal fragmentation and take a performance hit - and, of course, the caches will be about half as effective.







    I'm not seeing this at all: a cache will contain the data it contains, so there would be no change there. With SIMD instructions operating across more elements, you would likely reduce instruction demand on the cache and the schedulers. So this should be a win in both the L1 and L2 caches. I'm not sure where you would see the fragmentation coming from, nor the performance hit. 128-bit data would be moved into the registers the same as before.



    Quote:



    Not much intelligence is necessary. "A is busy, so B gets this one." Or even, "A got the last one, so B gets this one." Just something to make sure that as many units as possible were being kept busy. Someone more familiar with VMX programming can confirm or deny this, but from my survey of the issue dependencies aren't a problem in vector code to nearly the degree that they are in scalar.



    We agree again, in a sense anyway. One of the primary reasons that a vector unit is useful is that dependencies do not exist in the code like they do with scalar code. This is one of the reasons I believe a wider SIMD unit is feasible. Yes, parallel units will in some cases perform the same, but you do add complexity that is not needed, in my estimation. On the other hand, if we went wider with additional units we would have the best of both worlds. Keeping such a beast fed would be even more difficult.

    Quote:



    With this arrangement, the instruction set wouldn't have to be extended - the vectors would still be 128 bit. But the vector engine could devour two at a time, gaining you the practical benefit of a 256 bit implementation (except, again, without support for 64 bit data types in vectors) with a less complicated solution that also happens to speed up existing AltiVec code. Since VMX is not monolithic in implementation, you could also target specific physical units, doubling, for example, the vector float unit but not the vector permute unit.



    If all you wanted was better throughput then yes, an additional execution unit would help with that. I'm not convinced that it would be an optimal solution, but it is a practice that has prior implementation in FPUs and integer units. Those extra units in the FPU and integer units, though, solve a slightly different problem.



    This is a discussion about a future VMX unit, which we suspect has enhanced capabilities. One of those new features I hope to see is doubles support. This would best be done with wider registers. Those wider registers should also support the contemporary data types with double the capacity. It may also be the case that completely new instructions are working their way into the facility, which may take advantage of the wider registers.



    Existing code speed-up is a valid concern. This is especially the case when it appears that the 970 barely delivered a real performance increase. Does the addition of a new execution unit justify a VMX2 label? Not in my estimation; the vector unit has a great deal of potential that has yet to be taken advantage of.

    Quote:



    Of course, bandwidth is still the elephant in the room, and the engineers would also have to do a study of the usual instruction mix to see whether and where it was worth doing. Doubling the theoretical power of a unit is of no help if one unit spends most of its time idle for whatever reason.



    Yes, this will be an issue for some time. It could very well be that VMX2 is more an exercise in addressing these issues than the concerns we have discussed so far. The rumored increase in size represents almost half a chip; that gives us a lot of possibilities, some of which we probably have yet to think about.
  • Reply 97 of 114
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Programmer

    Amorph is saying what I've always said about a 256-bit VMX variation -- it costs you in terms of context switching, less flexibility, and fragmenting the PowerPC installed base further (i.e. now there is non-VMX, VMX, and VMX2 to develop for)... and it doesn't buy you much.







    This is true in the sense that you have another feature set to develop for, but to say it won't buy you much is a bit of a stretch. If you're concerned about the time cost of the context switch, it may not be much of a concern on a faster processor. Keeping the instructions simpler may make handling the context switch a bit easier, as instruction completion is a little more straightforward. But the real problem I see with this statement is the stagnation effect of not being willing to adapt and improve.

    Quote:



    Consider that a 128-bit vector is a collection of smaller types all being operated on in parallel (e.g. 8 16-bit integers). The typical use of the unit is to run through long arrays of these smaller types (e.g. a thousand 16-bit integers). This means the existing code is already a loop that the OoOE engine is dynamically scheduling. By doubling the vector issue & execution rate you can do twice as much work using the same VMX code as you currently use. The OoOE engine allows multiple loop iterations to be in progress at the same time.



    With a wider unit you cut your instruction issue in half, so there is less scheduling in the first place. Even if you add an additional execution unit, making it a wider unit will still offer a payoff.

    In many cases the additional execution unit would, I admit, have the same effect as a wider unit. After all, SIMD can be approximated as parallel execution of a single instruction across several data elements. At some point, though, the current allocation of registers will not handle the data flow well; it will be easier to just make things wider. After all, if you're going to dedicate electronics to rename registers and buffers, you could just as well enhance the real registers with your allocation of transistors.



    Quote:



    Widening the vector registers really only makes sense if you want to do atomic operations on larger data types... i.e. 64-bit floats. If you add support for 64-bit floats but don't increase the register size then each vector unit is only as powerful as 2 FPUs and less flexible. To achieve a worthwhile improvement you want each instruction to do more operations. Unfortunately that register size increase costs you quite a bit. Personally I think a lot more mileage is to be had by improving the FPUs and combining this with SMT so that they can be used by multiple threads at once if a single thread can't leverage them by itself. In that case then the same applies to having more vector units -- if a single thread can't use all the vector units then others will fill up the pipe. Going to a larger vector length makes this harder to achieve because you have fewer instructions consuming more bandwidth.



    Well, supporting wider data types is one of the goals, but not the only one. Much could be gained by widening support for currently supported data types. For example, you are not likely to be able to operate on sixteen 16-bit data words at the same time in the current integer unit. This would be possible with a 256-bit register. I'm not so sure that the register size increase would be a disadvantage; in many cases it would be an advantage.



    Actually I see very little potential for SMT and vector operations to coexist. Let's face it: vector operations on all current implementations saturate the interface. Given that, I have to believe that SMT and vector ops are not going to play well together. Maybe VMX2 will eliminate or reduce this saturation, but I do not think it is likely. The nature of vector code is such that it will make use of all the bandwidth it can muster. Like you mentioned, vector operations often take place on thousands (or millions) of data items of the same size. How SMT would interleave another thread when the VMX unit is doing something like this is beyond me. Going to wider registers in a SIMD unit would lower the bandwidth required for instructions and simplify bandwidth management for data. It should be noted that if you have two execution units each running against 128 bits of data, the bandwidth required is the same but handled with more complexity: you are then moving two sets of 128-bit data to and from different registers.



    So in the end, yes, you would have fewer instructions dealing with the SAME bandwidth, with a lot less contention. The data has to get into and out of the VMX unit fast; in many cases it cannot hang around in the registers.



    I'm not saying wider registers are perfect, but it seems reasonable that the processor would evolve this way. This does not mean that an additional execution unit would not have a reason for being; I'm just not convinced that parallel operation will be as big a gain as one might expect from experience with an integer unit or FPU. The reason being that much of the vector code that is out there saturates the vector units due to data-flow issues.



    Dave
  • Reply 98 of 114
    yevgeny Posts: 1,148 member
    Quote:

    Originally posted by wizard69

    Well, supporting wider data types is one of the goals, but not the only one. Much could be gained by widening support for currently supported data types. For example, you are not likely to be able to operate on sixteen 16-bit data words at the same time in the current integer unit. This would be possible with a 256-bit register. I'm not so sure that the register size increase would be a disadvantage; in many cases it would be an advantage.



    Why would this be so much better than operating on eight 16-bit values (a.k.a. shorts)? Why the need for the ability to operate on 16 shorts? I think that Programmer was VERY correct in saying that the only good motivation would be to be able to operate on 4 doubles (64 bits each).



    More parallel processing is only better if you have more parallel data to deal with. Needless parallelization slows the chip down in innumerable ways!



    This is not an issue of the PPC architecture becoming stale or not moving forward. The question is how many people need this extra ability, and I bet that many do not. Why add features that nobody needs? Adding silicon slows the whole chip down and makes it more costly. You have to have a reason why the functionality is added, and I think that is what is missing: just saying that it can be done is an inadequate reason for doing it.
  • Reply 99 of 114
    airsluf Posts: 1,861 member
    Bingo! Especially the last sentence.
  • Reply 100 of 114
    programmer Posts: 3,467 member
    Quote:

    Originally posted by wizard69

    But the real problem I see with this statement is the stagnation effect of not being willing to adapt and improve.



    Changing the instruction set is very disruptive to the market. Intel gets away with it because they sell such huge volumes. The PowerPC guys can afford less disruption so the potential gain had better be huge in order to try it.



    Quote:



    With a wider unit you cut your instruction issue in half, so there is less scheduling in the first place.




    Exactly... there is less scheduling opportunity. Perhaps looking at the problem in an extreme fashion makes it a bit more obvious -- why not extend the vector registers to 2560 bits; surely that would make it 20 times faster than it currently is? Even if you ignore the implementation cost and the context switching cost, when you start looking at the limitations forced on algorithms trying to use these registers, you realize that you've lost a lot of flexibility.



    Quote:

    Even if you add an additional execution unit, making it a wider unit will still offer a payoff.







    Except that each execution unit is significantly more expensive if it's wider, the registers are more expensive, and the internal buses between them and to the cache must be widened as well.



    Quote:

    At some point, though, the current allocation of registers will not handle the data flow well; it will be easier to just make things wider. After all, if you're going to dedicate electronics to rename registers and buffers, you could just as well enhance the real registers with your allocation of transistors.



    The performance of the Pentium 4 casts serious doubt on that statement... and VMX already has 4 times as many registers as that. The optimal number of rename registers is a function of pipeline depth and width (i.e. number of in-flight instructions). If you can afford to increase the in-flight instruction count then you can certainly afford to increase the dispatch rate. Currently it's only in groups of 5; going to 2 or 3 groups of 5 per cycle would not be a stretch (in terms of expense) if the execution hardware was there to back it up. This is pointless unless you have minimal dependencies (e.g. programming with a larger "virtual" vector width), or you've got SMT. If you actually have wider registers then this is only to your advantage if you're working in the middle of a large vector (which happens less often because your boundary zones are larger).



    Quote:



    Actually I see very little potential for SMT and vector operations to coexist. Let's face it: vector operations on all current implementations saturate the interface.




    I completely disagree on this one. VMX spends lots of time waiting for memory, which means there are lots of bubbles in the pipeline that can be filled up by other threads. Some vector algorithms and many non-vector algorithms are not bandwidth bound, or live nicely in the caches... these will fill in the bubbles while waiting for memory.



    As for instruction bandwidth, most vector code is fairly tight loops that fit easily into the L1 I-cache. The instruction bandwidth from I-cache is simply not an issue, so dispatching at 2 or 3 times the current rate ought not be difficult to manage. Nobody does it yet because of the reasons I described above.



    Quote:

    The reason being that much of the vector code that is out there saturates the vector units due to data-flow issues.



    Right, which supports my position that there is no need to improve the vector element to instruction ratio at the cost of larger registers, more expensive context switching, fragmenting the ISA, and giving up flexibility by forcing longer vectors.



    I stand by my view that unless 64-bit float (and integer) support in the vector unit is extremely compelling, it would not be worth widening the registers to 256-bits. Personally I don't think it is that compelling and would rather see the FPU hardware beefed up. If VMX2 really comes along I would rather see it add new operations (dot-product, cross-product, XYZW swizzle, etc) on the existing registers, instead of widening the registers. I guess we'll have to wait and see if IBM thinks the same way.
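
    For what it's worth, here is roughly what such a dot-product operation would collapse today: a C/AltiVec sketch under the usual alignment assumptions, spending two extra instructions in the shift/permute path for the horizontal reduction.

        #include <altivec.h>

        float dot4(vector float a, vector float b)
        {
            /* four products in one multiply-add, starting from zero */
            vector float p = vec_madd(a, b, (vector float)vec_splat_u32(0));
            p = vec_add(p, vec_sld(p, p, 8));  /* pairwise sums            */
            p = vec_add(p, vec_sld(p, p, 4));  /* all four products summed */
            float out;
            vec_ste(vec_splat(p, 0), 0, &out); /* write lane 0 to memory   */
            return out;
        }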