VMX 2

Posted in Future Apple Hardware, edited January 2014
Mac Bidouille has published a new rumor about VMX 2. They warn that it is a rumor with a big R.



I'll translate for you: IBM has finished drawing up the first specifications of VMX 2 (VMX is to IBM what AltiVec is to Motorola and the Velocity Engine is to Apple). This new version will have a set of 65 supplementary instructions. The final specifications will be ready in early 2004. However, the complexity of VMX 2, which on its own will require 24 million transistors, has led IBM to wait for the 65 nm process before implementing it in its chips.

One thing is extremely interesting to note: it is the PPC 990 and the POWER6 that will adopt VMX 2. IBM seems to have decided to use AltiVec in its professional chips, which will be even more in phase with Mac OS X. VMX 2 should triple the performance of AltiVec and will also ensure backward compatibility.



If this rumor is true, it means a great future for Apple's supply of great CPUs.

Comments

  • Reply 1 of 114
    zapchud Posts: 844 (member)
    Triple the performance of AltiVec... do I smell 256-bit AltiVec?



    Ouch!



    Although it's a rumor, it isn't a ridiculous one in my opinion.



    I wonder, is there a real need for 256-bit AltiVec? Or is there plenty of parallelism to take advantage of in today's code?
  • Reply 2 of 114
    powerdoc Posts: 8,123 (member)
    Quote:

    Originally posted by Zapchud

    Triple the performance of AltiVec... do I smell 256-bit AltiVec?



    Ouch!



    Although it's a rumor, it isn't a ridiculous one in my opinion.



    I wonder, is there a real need for 256-bit AltiVec? Or is there plenty of parallelism to take advantage of in today's code?






    I think the SIMD unit is the only part of a modern CPU where you can take advantage of massive parallelism. I think 256 bits is the next logical move for VMX. I will even say that in the coming decade, 512 or even 1024 bits won't be stupid.

    The main problem with a 256-bit AltiVec is removing the internal bottlenecks in order to be able to feed such a beast.
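
    To put the bandwidth point in concrete terms, here is a rough C sketch of a streaming AltiVec loop, using the standard <altivec.h> intrinsics (the function name is made up, and the arrays are assumed 16-byte aligned with a length that is a multiple of 4). Every 128-bit multiply-add consumes 32 bytes of input and writes 16 bytes back, so a hypothetical 256-bit unit would need double the load/store bandwidth just to stay busy:

        #include <altivec.h>  /* GCC: compile with -maltivec (Apple GCC: -faltivec) */

        /* y[i] = a * x[i] + y[i], four floats per iteration */
        void saxpy4(float a, const float *x, float *y, int n)
        {
            vector float va = (vector float){a, a, a, a};
            for (int i = 0; i < n; i += 4) {
                /* 32 bytes loaded and 16 stored per 128-bit vec_madd */
                vector float vy = vec_madd(va, vec_ld(0, x + i), vec_ld(0, y + i));
                vec_st(vy, 0, y + i);
            }
        }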
  • Reply 3 of 114
    programmer Posts: 3,458 (member)
    I've always been skeptical of the benefits of extending SIMD to greater sizes. As the vector length grows it suffers from diminishing returns. It seems to me that improving the implementation of the existing instruction set could yield equivalent performance gains without having to change existing programs. If a new VMX1 unit could process 128-bit vectors twice as fast it would be as fast as a 256-bit vector unit, more flexible and you wouldn't have to rewrite your code to gain the advantage of it.



    Consider that in a SIMD unit of double width you must perform exactly the same operations on the new half of the data. That means that the halves of the operation on the vector cannot be inter-dependent, and they must be exactly the same. If instead of doubling the vector width you double the instruction dispatch rate and number of execution units you now have the same performance... but more flexibility because you don't have to do the same operation on the second half of the vector.



    Furthermore, the cost of these huge registers would affect the context switching cost of the processor in a very negative way. Doubling the vector registers would be an extra 0.5K for every context, on top of the already substantial 1K... a 50% increase.





    I find it interesting that this (questionable) rumour doesn't mention a longer vector size, but just says "supplementary instructions". There are a bunch of things that would be nice to have in the existing VMX but weren't implemented because they wouldn't have been possible to do in a single cycle in the initial implementation (adding across a vector, for example). These operations might be possible to implement efficiently with an increased pipeline depth like in the 970, and with a larger number of transistors. This would allow many algorithms to be implemented more efficiently than is currently possible, and increase the versatility of VMX. I expect a major goal for the VMX2 design would be to support compiler auto-vectorization, which would be the single biggest possible boost to the adoption of VMX.
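
    For instance, summing across a vector of floats today takes a short rotate-and-add sequence -- exactly the kind of multi-instruction idiom that a single VMX2 instruction could collapse. A minimal sketch with the existing AltiVec intrinsics (the helper name is made up):

        #include <altivec.h>

        /* Sum the four floats in v with today's VMX: vec_sld rotates the
           vector by whole bytes, so two rotate+add steps fold all lanes. */
        static float sum_across(vector float v)
        {
            v = vec_add(v, vec_sld(v, v, 8));  /* lanes: {a+c, b+d, c+a, d+b} */
            v = vec_add(v, vec_sld(v, v, 4));  /* every lane now holds a+b+c+d */
            float out[4] __attribute__((aligned(16)));
            vec_st(v, 0, out);
            return out[0];
        }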
  • Reply 4 of 114
    zapchud Posts: 844 (member)
    Quote:

    Originally posted by Programmer

    I've always been skeptical of the benefits of extending SIMD to greater sizes. As the vector length grows it suffers from diminishing returns. It seems to me that improving the implementation of the existing instruction set could yield equivalent performance gains without having to change existing programs. If a new VMX1 unit could process 128-bit vectors twice as fast it would be as fast as a 256-bit vector unit, more flexible and you wouldn't have to rewrite your code to gain the advantage of it.



    Then maybe that's what the rumor is talking about. Improving the implementation substantially, and adding the "missing" instructions.



    Tweak it!
  • Reply 5 of 114
    Quote:

    Originally posted by Programmer

    There are a bunch of things that would be nice to have in the existing VMX but weren't implemented because they wouldn't have been possible to do in a single cycle in the initial implementation (adding across a vector, for example).



    Would one of those things be double precision FP operations, by any chance? I know that Altivec code can be munged to deliver double precision results, but the overhead involved, and the complexity of coding, makes it less than competitive with a DP FP unit in any event. Could some additional instructions (esp. vector permutes?) mean improved DP FP operations?

    Quote:

    These operations might be possible to implement efficiently with an increased pipeline depth like in the 970, and with a larger number of transistors. This would allow many algorithms to be implemented more efficiently than is currently possible, and increase the versatility of VMX.



    And might have something to do with "fastpath", I suppose.

    Quote:

    I expect a major goal for the VMX2 design would be to support compiler auto-vectorization, which would be the single biggest possible boost to the adoption of VMX.



    Another might be SMT. That might explain at least some additional transistor counts.
  • Reply 6 of 114
    programmer Posts: 3,458 (member)
    Quote:

    Originally posted by Tomb of the Unknown

    Would one of those things be double precision FP operations, by any chance? I know that Altivec code can be munged to deliver double precision results, but the overhead involved, and the complexity of coding, makes it less than competitive with a DP FP unit in any event. Could some additional instructions (esp. vector permutes?) mean improved DP FP operations?



    This is possible, but of dubious value unless you increase the vector size. If they decide to increase the vector size to 256 bits, then double precision makes sense (4-element double precision vectors). In 128 bits, though, you can only fit a pair of doubles, which means you're only getting the throughput of a pair of FPUs... which the 970 has. Personally I don't think it's worth it -- better to increase the number of FPUs. Again, it allows existing code to run unchanged at higher speeds and means you don't have to struggle to get things into vectors.



    Quote:



    And might have something to do with "fastpath", I suppose.





    I wouldn't suppose any such thing. The only relationship between FastPath and VMX, in my opinion, is that FastPath allows the processor to spend more uninterrupted time executing VMX instructions on long streams of data.



    Quote:



    Another might be SMT. That might explain at least some additional transistor counts.




    I agree in what I suspect is a backwards fashion to what you meant -- supporting SMT would not explain the transistor counts, but having SMT would justify the transistor counts. What I mean is that right now the VMX unit is probably spending a fair bit of time waiting....



    Latency is the big stumbling block in processor design right now (and probably from now on). The individual stages of the pipeline are getting faster and faster, but memory keeps falling behind, pipelines keep getting longer, and inter-chip communication requires sending signals over larger distances than within the chip. This means you are waiting for things more and more... either waiting for data to come in from memory (or cache or another chip), or waiting for a result to finish being calculated before using it in the next calculation. All of this waiting means wasted opportunity to do something else. What is the point in having more execution resources if you can't even keep the existing ones busy as it is? SMT allows you to keep many more busy because you can fill the wait times of each thread with the non-wait times of other threads. If you're waiting 90% of the time in each thread when it is running "full speed", then you can run 10 threads at "full speed". This doesn't speed up the individual threads, but if you can implement your task in multi-threaded fashion it is a huge improvement.
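
    As a toy illustration of that arithmetic (this sketches only the software side -- SMT itself is invisible to the code, the OS simply has more runnable threads), here is a hedged C example where every thread is memory-latency-bound, so running several of them raises total throughput even though no single thread gets faster. All names and sizes are made up:

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* link with -lpthread */
        #define NODES (1 << 20)
        #define NTHREADS 4

        typedef struct node { struct node *next; } node_t;

        /* Chase pointers through a big list: almost every step is a cache
           miss, so the thread spends most of its time waiting on memory. */
        static void *chase(void *arg)
        {
            node_t *p = (node_t *)arg;
            for (long i = 0; i < NODES; i++)
                p = p->next;
            return p;  /* returning p keeps the loop from being optimized away */
        }

        int main(void)
        {
            node_t *pool = malloc(NODES * sizeof(node_t));
            for (long i = 0; i < NODES; i++)  /* scattered links defeat the cache */
                pool[i].next = &pool[(i * 9973 + 1) % NODES];

            pthread_t t[NTHREADS];
            for (int i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, chase, &pool[i]);
            for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);
            puts("done");
            return 0;
        }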
  • Reply 7 of 114
    Quote:

    Originally posted by Programmer

    Personally I don't think it's worth it -- better to increase the number of FPUs. Again, it allows existing code to run unchanged at higher speeds and means you don't have to struggle to get things into vectors.



    Except that, with autovectorizing compilers, it wouldn't be as much of a struggle, and streaming DP FP functions could be a major boost to certain applications. I would think 256-bit vectorized DP FP ops as you described would outperform even four FP units? Perhaps not, in which case I agree that expanding AltiVec to 256 bits might be less cost effective than spending the transistors on additional FP units.

    Quote:

    I wouldn't suppose any such thing. The only relationship between FastPath and VMX, in my opinion, is that FastPath allows the processor to spend more uninterrupted time executing VMX instructions on long streams of data.



    I'm not as sanguine that the two are unrelated. We don't know enough about what "fastpath" is to be certain of anything, I should think.
  • Reply 8 of 114
    programmer Posts: 3,458 (member)
    Quote:

    Originally posted by Tomb of the Unknown

    Except that, with autovectorizing compilers, it wouldn't be as much of a struggle, and streaming DP FP functions could be a major boost to certain applications. I would think 256-bit vectorized DP FP ops as you described would outperform even four FP units? Perhaps not, in which case I agree that expanding AltiVec to 256 bits might be less cost effective than spending the transistors on additional FP units.



    Auto-vectorization has yet to be proven particularly effective except in rather specialized situations. I don't think I would want to take the chance and burn a lot of transistors on something so questionable... and I don't think IBM will either.



    4 FPUs have a lot more scheduling opportunities than a 4-way SIMD unit, so while they would require processing more instructions to do the same work, they could get the work done as fast (or faster) -- see the sketch at the end of this post. The instruction dispatch limits will probably be quite a bit higher by the time VMX2 arrives.



    Quote:

    I'm not as sanguine that the two are unrelated. We don't know enough about what "fastpath" is to be certain of anything, I should think.



    Nice word. If you go and dig out what is known about FastPath and some of the IBM research papers I think you'd probably agree with me on this one. But for now I'll let you wallow in your pessimism.
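
    A minimal sketch of the scheduling point above (plain C, nothing AltiVec-specific; the function name is made up). Written with four independent accumulators, the same reduction gives four FPUs -- or one deeply pipelined FPU -- separate dependency chains to overlap, a freedom a single 4-wide SIMD register doesn't give the scheduler:

        /* n is assumed to be a multiple of 4 */
        float sum4way(const float *x, int n)
        {
            float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
            for (int i = 0; i < n; i += 4) {
                s0 += x[i];      /* four independent dependency chains; */
                s1 += x[i + 1];  /* the hardware can schedule these     */
                s2 += x[i + 2];  /* adds in any order it likes          */
                s3 += x[i + 3];
            }
            return (s0 + s1) + (s2 + s3);
        }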
  • Reply 10 of 114
    henriok Posts: 537 (member)
    24 million transistors seems like A LOT considering that the current VMX uses, what... a tenth of that? Or even less; I couldn't find the exact number. The original G4 (7400) used 6.4 million transistors in total. When the 970 counts 58 million transistors in total and a mere corner of it seems to be used for the VMX units, doesn't 24 million seem a bit much?
  • Reply 11 of 114
    Quote:

    Originally posted by Henriok

    24 million transistors seems like A LOT considering that the current VMX uses, what... a tenth of that? Or even less; I couldn't find the exact number. The original G4 (7400) used 6.4 million transistors in total. When the 970 counts 58 million transistors in total and a mere corner of it seems to be used for the VMX units, doesn't 24 million seem a bit much?



    It does, but I suppose it depends on how you count? For instance, the figure may include transistors spent on an increased L2 cache to support it. So, technically the revised VMX unit would not require that many additional transistors to support the additional instructions, but IBM may be planning for this much of an increase to support a longer pipeline, greater cache, etc.



    Or the entire rumor could just be the fever-induced maunderings of an addled mind.
  • Reply 12 of 114
    stoo Posts: 1,490 (member)
    Quote:

    Does Altivec do vector dot or cross products?



    Yes, and a whole lot more beside.



    Off topic, what's the best way to start (hobbyist) programming in Altivec?
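
    For reference, the most basic AltiVec C program I've seen looks something like this (assuming GCC with -maltivec; Apple's GCC uses -faltivec, and its vector literal syntax differs slightly):

        #include <stdio.h>
        #include <altivec.h>

        int main(void)
        {
            /* a "vector float" holds four floats in one 128-bit register */
            vector float a = (vector float){1.0f, 2.0f, 3.0f, 4.0f};
            vector float b = (vector float){10.0f, 20.0f, 30.0f, 40.0f};
            vector float c = vec_add(a, b);   /* one instruction, four adds */

            float out[4] __attribute__((aligned(16)));
            vec_st(c, 0, out);                /* store the vector to memory */
            printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
            return 0;
        }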
  • Reply 13 of 114
    programmer Posts: 3,458 (member)
    Quote:

    Originally posted by Stoo

    Yes, and a whole lot more beside.



    Off topic, what's the best way to start (hobbyist) programming in Altivec?




    Actually it doesn't -- at least not intrinsically. You can do those operations using a bunch of instructions, but if you are trying to hold an xyzw vector in a single VMX register then it isn't very efficient -- the FPU is generally better at it. What is efficient is to have many xyzw vectors, represented as long arrays of the x's, the y's, the z's, and the w's. Then the VMX unit can do those operations very efficiently by doing 4 operations at a time.



    For 3D operations (xyzw vectors and 4x4 matrices) there could be many instructions added to VMX that would help tremendously -- basically all of the instructions in the OpenGL vertex and fragment program specs, including the swizzle instruction. A couple of the simple ones are supported by VMX already, but most of the complex ones are not. VMX was designed for signal processing, i.e. grinding through long arrays of data. It is much less suited to "scalar operations on 4-vectors", if you know what I mean.
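
    To illustrate the structure-of-arrays point, here is a hedged C sketch that computes four dot products at once with the standard AltiVec intrinsics (all pointers assumed 16-byte aligned; the function name is made up). A swizzle, by contrast, takes a vec_perm with a hand-built control vector, which is part of why dedicated 3D instructions would help:

        #include <altivec.h>

        /* Four dot products in one pass: ax[i], ay[i], az[i] hold the
           components of the i-th vector of set A; likewise bx/by/bz for B. */
        void dot4(const float *ax, const float *ay, const float *az,
                  const float *bx, const float *by, const float *bz,
                  float *out)
        {
            vector float zero = (vector float){0, 0, 0, 0};
            vector float d;
            d = vec_madd(vec_ld(0, ax), vec_ld(0, bx), zero); /* x*x       */
            d = vec_madd(vec_ld(0, ay), vec_ld(0, by), d);    /* ... + y*y */
            d = vec_madd(vec_ld(0, az), vec_ld(0, bz), d);    /* ... + z*z */
            vec_st(d, 0, out);  /* out[0..3] = the four dot products */
        }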
  • Reply 14 of 114
    powerdoc Posts: 8,123 (member)
    A 256-bit vector unit will help with Photoshop filters, but I guess it will be useless for MP encoding (I could be wrong, so correct me if that's the case).



    The extra set of instructions will help in many situations; I remember seeing programmers ask for new features in the past. But on their own, these new instructions will not triple the level of performance.



    Instead of a giant 256-bit-wide VMX unit with 162 + 65 instructions, can we imagine that the VMX 2 engine is the combination of 2 sub-units: a simple one and a complex one?



    The simple unit would basically be the current VMX one.

    The complex one would support both sets of instructions (162 + 65).



    Simple stuff, like some Photoshop filters, would take advantage of both VMX units simultaneously.

    Complicated stuff, like 3D operations, would take advantage of the complex unit.



    Many chip manufacturers are already doing this for their FP units and their integer units. This kind of architecture allows you to improve performance without raising the transistor count too much.

    In this way you can triple the performance without creating a monstrous behemoth.
  • Reply 15 of 114
    shawk Posts: 116 (member)
    There are two articles in today's New York Times that may be of interest: NYT Cray and NYT IBM Fishkill.

    A look in the library at Cray might give some insight into possible additional VMX instructions and a possible motive for adding them.



    With some thought, the VMX2 could offer a Cray-compatible instruction set; the VMX2 would then become the computer and the 9*0 CPU the interface.



    Perhaps someone could put more time into this speculation.
  • Reply 16 of 114
    programmer Posts: 3,458 (member)
    Quote:

    Originally posted by Powerdoc

    Instead of a giant 256-bit-wide VMX unit with 162 + 65 instructions, can we imagine that the VMX 2 engine is the combination of 2 sub-units: a simple one and a complex one?



    The simple unit would basically be the current VMX one.

    The complex one would support both sets of instructions (162 + 65).




    The current AltiVec units on all chips with them already do this with the existing VMX instruction set. The definition of the execution units and which instructions they operate on is an implementation detail that is not specified in the user programming model. VMX2 refers to the user programming model because it defines new instructions. The programmer writes these instructions and the processor's dispatch mechanism takes care of sending them to the correct execution unit(s). This is the PowerPC way (and it is similar to the x86, but dissimilar to the IA-64).
  • Reply 17 of 114
    vinney57 Posts: 1,162 (member)
    I'm very ignorant of all this stuff, but implementing the OpenGL instructions in VMX seems an exciting prospect. The manipulation of graphics is a very important development thread for Apple, which will want to '3D' a lot more of the user interface in the future, plus provide APIs for developers to do so as well. Will this help in that endeavour? Will it also help to ease the optimisation of graphics drivers for specific apps? As I said, I have only slightly more than a clue in this dept.
  • Reply 18 of 114
    powerdoc Posts: 8,123 (member)
    Quote:

    Originally posted by Programmer

    The current AltiVec units on all chips with them already do this with the existing VMX instruction set. The definition of the execution units and which instructions they operate on is an implementation detail that is not specified in the user programming model. VMX2 refers to the user programming model because it defines new instructions. The programmer writes these instructions and the processor's dispatch mechanism takes care of sending them to the correct execution unit(s). This is the PowerPC way (and it is similar to the x86, but dissimilar to the IA-64).



    You are right; for example, the G4e has a vector issue queue that dispatches to 4 execution units: simple integer, complex integer, floating point, and permute.

    Now imagine that most of the instructions in the new set are dedicated to 3D; then IBM will make a new execution unit, let's call it V3D. Let's imagine that V3D is very large.



    This new SIMD unit would have, for example, one permute unit, two simple integer units, one complex integer unit, two floating point units and one V3D unit, all with 128-bit registers. In this way some tasks can be performed by the two simple integer units simultaneously, without enlarging the registers to 256 bits.



    It's different from a system where the registers are 256 bits and the execution units are permute, simple integer, complex integer, floating point, and a new unit supporting a large set of new instructions like 3D. That system seems heavier to me than the previous one, and the previous one still works with 128-bit registers.
  • Reply 19 of 114
    programmer Posts: 3,458 (member)
    Quote:

    Originally posted by shawk

    There are two articles in today's New York Times that may be of interest: NYT Cray and NYT IBM Fishkill.

    A look in the library at Cray might give some insight into possible additional VMX instructions and a possible motive for adding them.



    With some thought, the VMX2 could offer a Cray-compatible instruction set; the VMX2 would then become the computer and the 9*0 CPU the interface.



    Perhaps someone could put more time into this speculation.




    A Cray-compatible instruction set isn't necessary; in this market the users recompile their software for their ultra-expensive hardware.



    You bring up an interesting point, however -- money is being funneled back into high-node-performance supercomputing, and IBM wants a piece of that action. Certainly they are in a good position to attack it, especially since they've just added VMX for Apple.



    Looking at the Cray documentation it looks like VMX isn't terribly far off in capabilities or speed; the main thing it lacks is 64-bit floating point capabilities. If this is what IBM wants to challenge with the POWER6 then I would agree that longer vectors and 64-bit floating point is definitely the way to go. Considering that Cray is only at about the 12.8 gigaflop level per node, IBM can easily get there with a POWER5/6 + VMX... but in that market double precision is essential. This might explain why IBM is suddenly interested in vector processing and auto-vectorizing compilers.



    Heh, the Cray X1 instruction set includes a "vrip" instruction that you use when you're done with the vector registers. It declares them dead so that they don't need to be preserved across context switches. I like it... "vector rest in peace". VMX already has support for something similar (the VRSAVE register), but it's not so well named.
  • Reply 20 of 114
    programmer Posts: 3,458 (member)
    Quote:

    Originally posted by Powerdoc

    You are right; for example, the G4e has a vector issue queue that dispatches to 4 execution units: simple integer, complex integer, floating point, and permute.

    Now imagine that most of the instructions in the new set are dedicated to 3D; then IBM will make a new execution unit, let's call it V3D. Let's imagine that V3D is very large.



    This new SIMD unit would have, for example, one permute unit, two simple integer units, one complex integer unit, two floating point units and one V3D unit, all with 128-bit registers. In this way some tasks can be performed by the two simple integer units simultaneously, without enlarging the registers to 256 bits.



    It's different from a system where the registers are 256 bits and the execution units are permute, simple integer, complex integer, floating point, and a new unit supporting a large set of new instructions like 3D. That system seems heavier to me than the previous one, and the previous one still works with 128-bit registers.




    If adding these "3D" instructions is their direction then it requires no additional architected registers, and I would expect the various new instructions to be handled in whichever execution unit is most appropriate in the new implementation. I doubt you would see a "V3D" unit per se. The 970 VMX unit already has dual VPUs which can process vectors in parallel, effectively acting as a 256-bit vector unit like you allude to. In this scheme all of the execution units would grow in size a little, and there would be more copies of them, which allows all instructions to operate in parallel... effectively giving you more than 256-bit vectors in parallel.



    Regarding the OpenGL issue -- this might help a software implementation of the vertex programs, but pretty much all GPUs going forward are going to include this kind of vector hardware, and graphics are better handled in the GPU, which is tightly coupled to the pixel rasterization hardware and high-speed memory subsystem. I don't see this capability being particularly useful to OpenGL itself, but it would be useful to applications using OpenGL... a lot of calculations are required in some applications to figure out what to pass to the graphics engine. Also, some kinds of simulations (like those found in games) would really benefit. While some of these new instructions might appear, I would expect VMX2 to primarily remain true to its signal processing heritage, especially if IBM is going after the supercomputing market as alluded to above.