VMX 2


Comments

  • Reply 61 of 114
    amorph Posts: 7,112 member
    Quote:

    Originally posted by Bigc

    now that's the way to start a mathematical argument...



    It's true in Amorphomatics, alright? Sheesh.



    Picky, picky, picky.
  • Reply 62 of 114
    bigc Posts: 1,224 member
    8)
  • Reply 63 of 114
    airsluf Posts: 1,861 member
    Kickaha and Amorph couldn't moderate themselves out of a paper bag. Abdicate responsibility and succumb to idiocy. Two years of letting a member make personal attacks against others, then stepping aside when someone won't put up with it. Not only that but go ahead and shut down my posting privileges but not the one making the attacks. Not even the common decency to abide by their warning (after three days of absorbing personal attacks with no mods in sight), just shut my posting down and then say it might happen later if a certain line is crossed. Bullshit flag is flying, I won't abide by lying and coddling of liars who go off-site, create accounts differing in a single letter from my handle with the express purpose to deceive and then claim here that I did it. Everyone be warned, kim kap sol is a lying, deceitful poster.



    Now I guess they should have banned me rather than just shut off posting privileges, because kickaha and Amorph definitely aren't going to like being called to task when they thought they had it all ignored *cough* *cough* I mean under control. Just a couple o' tools.



    Don't worry, as soon as my work resetting my posts is done I'll disappear forever.
  • Reply 64 of 114
    nevyn Posts: 360 member
    I'm in the "let's just go to 4x real 64-bit FPUs" camp too.



    On another note entirely, one key point of the Altivec unit is dealing with piles of streaming data.



    What about an FPGA unit in addition, instead of a change to the Altivec units themselves? It would seem like a great coprocessor to the AV unit from my POV - a lot of the issues involved in using AV involve massaging data into/out of various formats.



    For those that don't know, "FPGA" stands for field programmable gate array, which essentially means the _hardware_ is configured for the specific job at hand. They aren't as fast as 'normal' chips, but you can explicitly ignore/change various things that you know your algorithm doesn't care about. (So it could do 65-bit math if you wanted, or 9-bit math, or whatever.)
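    To make the bit-width idea concrete, here is a minimal C sketch (an illustration only - the add9 helper is made up, and real FPGA work would be done in a hardware description language, not C) of what "9-bit math" means: a true 9-bit adder wraps at 512, which software can only imitate by masking.

    #include <stdio.h>
    #include <stdint.h>

    #define MASK9 0x1FFu  /* keep the low 9 bits: values 0..511 */

    /* Hypothetical helper: a software imitation of a 9-bit hardware adder. */
    static uint16_t add9(uint16_t a, uint16_t b)
    {
        return (uint16_t)((a + b) & MASK9);  /* wraps at 512, like a 9-bit register would */
    }

    int main(void)
    {
        printf("%u\n", add9(510, 5));  /* prints 3, i.e. 515 mod 512 */
        return 0;
    }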
  • Reply 65 of 114
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Amorph

    No, it's not what you're getting at. You're missing the point. The mathematical real number 0.25 is the mathematical real 0.5 divided by the mathematical real 2. Anything measured is a lot messier than that, and any responsible scientist has to account for that. A measured value of "0.25" could be any of: 0.24999, 0.253, 0.25000000001, or even precisely 1/4 (although what are the odds of that?). You don't know what the exact value is, so any assumptions beyond the initial significant digits are almost guaranteed to be false.







    If you can't see what I'm getting at then we have to reconsider who is missing the point. You bounce back and forth between the concept of measurements and math, apparently to confuse yourself or the readers of this thread.



    Anybody with a little bit of experience in the real world knows that there is uncertainty in measurement. Figuring out which digits are significant is much more involved than just grabbing the "initial significant digits".

    Quote:

    At this point I have no idea what you're talking about. Precision represents the accuracy with which something can be represented. It applies both to measurements and to representations in floating point, which is why people refer to "64 bit precision" and "precision tools".



    So now you are trying to claim that a 64 bit float has 64 bits of precision. You really don't expect me to believe that, do you?

    Quote:

    How is resolution different from precision, anyway? Both specify a quantum value beneath which the representation is no longer accurate.



    There is a huge difference between resolution and precision or accuracy. They are two completely different things. It is one of the reasons why manufacturers of test equipment print detailed data sheets on their instrumentation. It is very possible that an instrument can be fairly precise on one range and marginal on another even though it may be resolving the same number of digits.

    Quote:



    You've misinterpreted what I've said. If you have a measurement of "0.25 and change" you don't know what that change is. It could be zero, or it could not be. Disposing of it is not an issue, because you don't know what it is in the first place. If you could measure it in any meaningful way, there would be significant digits to represent it! The fact that FP might, over the course of calculations, introduce a whole bunch of extra digits (but not significant digits, because you can't get signal from noise), is an unwelcome artifact. It's not anything you can use, and it's not the "change" I was referring to.



    Again I find myself agreeing with you, then disagreeing with you. Just because you can resolve a value, it does not imply that the value is accurate or significant. If it can be resolved then keep it around, but if you have no significant measurement then you don't have any change.



    You are correct that FP math can add extra digits, but it is up to the algorithm designer to determine if they are significant at any point. Premature rounding can introduce as many errors as excessive reliance on extra digits. You cannot just make rash decisions on when information can be dropped.
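    A small C sketch (my own example with made-up numbers, not anything from the thread) of why premature rounding is dangerous: round after every step and small increments vanish entirely; carry the extra digits and round once at the end.

    #include <stdio.h>
    #include <math.h>

    /* Round to 2 decimal places - the "premature" precision cutoff. */
    static double round2(double x)
    {
        return floor(x * 100.0 + 0.5) / 100.0;
    }

    int main(void)
    {
        double eager = 0.0, lazy = 0.0;
        for (int i = 0; i < 1000; i++) {
            eager = round2(eager + 0.004);   /* the 0.004 rounds away every single time */
            lazy += 0.004;                   /* keep the digits, decide significance later */
        }
        printf("rounded each step: %.2f\n", eager);         /* 0.00 */
        printf("rounded at the end: %.2f\n", round2(lazy)); /* 4.00 */
        return 0;
    }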

    Quote:



    All you have to do is answer the question: How many applications, and are they worth the cost of implementing a vector unit vs. using parallelism and conventional FP units? It's nice that there are supercomputers that can do this, but Apple doesn't make supercomputers (silly marketing hype aside). I don't pretend to know the answer to that question, but it's not a simple question, and it can't be blown off.



    I agree that the design question is not simple. In the context of the rumored revision to the VMX unit it is worth looking at. Considering Apple's push into the workstation market it may well be worth the investment.



    As an aside I have to believe that Apple and IBM see a great deal of potential in the VMX subsystem. How that will be improved and extended in the future will be an interesting subject to debate.

    Quote:



    We don't know that IBM and Apple (and probably Mot) are revisiting VMX to provide this functionality, either. There are all kinds of capabilities they could add that would greatly improve its appeal for streaming and signal processing work, without changing the sizes of the registers or supporting 64 bit anything.



    I still don't know what the hang up about 64 bits is. If VMX2 has 256-bit registers and keeps symmetry with the current implementation then you will have greatly improved VMX in handling the original data types. In effect you would double single cycle performance on the data types currently processed by this unit. The side effect of being able to handle doubles could be very useful.



    The issue boils down to: is the VMX unit a good place to be doing this, that is, handling double vector math? Not being a hardware engineer I can only guess that it would be easier to do it here rather than in the normal FPU.
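    For reference, a minimal AltiVec C sketch (assuming an AltiVec-capable compiler, e.g. gcc with -maltivec; the values are arbitrary) of what the current 128-bit width buys: one vec_madd performs four single-precision multiply-adds at once, and a hypothetical 256-bit register would perform eight.

    #include <stdio.h>
    #include <altivec.h>

    int main(void)
    {
        vector float a = (vector float){1.0f, 2.0f, 3.0f, 4.0f};
        vector float b = (vector float){0.5f, 0.5f, 0.5f, 0.5f};
        vector float c = (vector float){1.0f, 1.0f, 1.0f, 1.0f};
        vector float r = vec_madd(a, b, c);  /* r[i] = a[i]*b[i] + c[i], all four lanes at once */

        float out[4] __attribute__((aligned(16)));  /* vec_st needs 16-byte alignment */
        vec_st(r, 0, out);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 1.5 2 2.5 3 */
        return 0;
    }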

    Quote:

    No, they're valid, they just aren't anything more than "concerns." I'm not saying can't or won't or shouldn't. I'm merely pointing out that the concerns are hairy enough, and the payoff uncertain enough, and other features desirable enough, that the support you want might or might not happen even at 65nm. Whether it happens depends on a large number of variables whose values are currently unknown.



    Personally, looking at the problem, I am leaning toward "won't happen." The current top of the line PowerMac can crunch through 4 FP calculations per clock (2 CPUs with 2 FPUs each). There you go - no additional hardware necessary, and twice the memory bandwidth and twice the cache that would be available to a 64-bit-savvy VMX unit on one CPU.



    I would hope that the PPC would not become stagnant, or worse yet fall into the confusion that is Intel hardware. So I have to think that some improvements and extensions are on the way - hopefully well planned. The issue I have is that some operations currently done on the VMX unit do not fit well into the FPU and integer units. I do not think that IBM or Apple would try to cram into those units instructions and operations that would slow them down or hamper other improvements.

    Quote:

    So what are the tradeoffs in going to 256 bit or 512 bit registers, and how easy are they to surmount? Is it worth it to the target market? There is zero use for it in the embedded space (any time soon), zero use on the desktop (any time soon), so that leaves the workstation and server markets. Servers might be able to use it for IPv6 networking, but IBM seems to have other ideas for that sort of thing (FastPath, which will intercept a lot of the system interrupts and allow the main CPU to keep crunching away).



    Well these are all the arguments that people first used when AltiVec came out. Eventually the facility was found to be very useful, sometimes in unexpected ways. In any event I have to disagree with you with respect to the thought that there would be zero demand in the embedded space or the desktop. You are also making an assumption as to what a server may be doing with the hardware; computation servers are big business nowadays. In fact computation servers are such a big thing that IBM is back into the business of selling computer time.



    Even I was a bit reluctant to believe that that would ever happen again.



    Dave
  • Reply 66 of 114
    powerdoc Posts: 8,123 member
    I am ready to bet that IBM and Apple are already calculating which subunits are used most in VMX, and which instructions deserve a speed bump.



    Currently the VMX has four execution units, and can dispatch 2 instructions per cycle to any of them (for the G4e). Imagine that one of these units is used very often and is the bottleneck, while the other units are not used at 100%: the solution is simple, duplicate the right subunit.



    VMX 2 should still be 128 bits, but will have larger execution units that can deal with more instructions, and some of these units may be duplicated.



    Add to the chip some new features like SMT, some extra FP units, and why not a third integer unit, and you will have a great chip: the G7 on a 65 nm process.
  • Reply 67 of 114
    yevgeny Posts: 1,148 member
    Quote:

    Originally posted by Nevyn

    I'm in the "let's just go to 4x real 64-bit FPUs" camp too.



    On another note entirely, one key point of the Altivec unit is dealing with piles of streaming data.



    What about an FPGA unit in addition, instead of a change to the Altivec units themselves? It would seem like a great coprocessor to the AV unit from my POV - a lot of the issues involved in using AV involve massaging data into/out of various formats.



    For those that don't know, "FPGA" stands for field programmable gate array, which essentially means the _hardware_ is configured for the specific job at hand. They aren't as fast as 'normal' chips, but you can explicitly ignore/change various things that you know your algorithm doesn't care about. (So it could do 65-bit math if you wanted, or 9-bit math, or whatever.)




    4 FPUs? You can reach a point of diminishing returns where you have all this extra silicon on the CPU and some of it is infrequently used. 4 FPUs is a LOT of FPUs. It is very hard for the CPU to keep 4 FPUs busy.



    FPGAs are used in different situations than general purpose CPUs. I think that it would be very difficult to use them effectively in most "normal" programming models. For example, the programmer would have to set the "bitness" of a number (so that the FPGA could adjust its bitness). This would invariably leak CPU details up to the programmer, which is a bad thing in many cases. I don't think that an FPGA would work out too well (esp. since they do run much slower than a normal CPU).
  • Reply 68 of 114
    amorph Posts: 7,112 member
    Quote:

    Originally posted by wizard69

    If you can't see what I'm getting at then we have to reconsider who is missing the point. You bounce back and forth between the concept of measurements and math, apparently to confuse yourself or the readers of this thread.



    No, that's the whole point: Abstract math is one thing. Experimentally derived data is another. Abstract mathematics is arbitrarily precise. Nothing else is. "0.5" means exactly one half in mathematics, and "about 0.5" to a field scientist. This is neither difficult nor confusing: One discipline does not have to account for error or noise, the other does. Their conventions differ accordingly.



    Quote:

    You are correct that FP math can add extra digits, but it is up to the algorithm designer to determine if they are significant at any point.



    YES. FINALLY. Significant digits matter, it's up to the person doing the work to figure out and track what is significant, and FP contributes nothing to this but the occasional approximation. Welcome to my thesis. Now, if significant digits matter, then 0.5 / 2 is not necessarily 0.25, is it? Mathematically, yes. Experimentally, no, because the hundredths digit in the result claims a level of precision that doesn't appear anywhere else. (The denominator is a special case, because it has no decimal point, so it's assumed to be mathematically precise. The example would be clearer if it said 0.5 / 2.0 = 0.25.) If you're trying to contain noise, you pick the worst case, not the best case. Better to deal with acknowledged, controlled lack of precision than apparently significant noise.
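    One way to put that in code, as a toy C sketch (the interval framing is mine, added for illustration): treat a one-significant-digit "0.5" as the range of values it actually admits and carry the bounds through the division.

    #include <stdio.h>

    int main(void)
    {
        /* A measured "0.5" with one significant digit admits anything in [0.45, 0.55]. */
        double lo = 0.45, hi = 0.55;

        /* Dividing by a mathematically exact 2.0 only bounds the answer. */
        printf("0.5 / 2.0 lies somewhere in [%g, %g]\n", lo / 2.0, hi / 2.0);
        /* [0.225, 0.275] - so the hundredths digit of "0.25" is not actually known */
        return 0;
    }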



    Quote:

    I still don't know what the hang up about 64 bits is. If VMX2 has 256-bit registers and keeps symmetry with the current implementation then you will have greatly improved VMX in handling the original data types. In effect you would double single cycle performance on the data types currently processed by this unit. The side effect of being able to handle doubles could be very useful.



    No you wouldn't, for the same reason that the rest of the 970 doesn't double 32 bit performance that way. If AltiVec went 256 bit, the original 128 bit functions would continue to function as they had, and half of all of the registers would simply go unused. As far as I know, it has never been considered worthwhile for the CPU to "pack" operations like that, and 64 bit CPUs have been around for a long time now. The benefit is considerably less than 2x, and the cost in transistors is steep.



    If anything, 128-bit AltiVec operations would slow down because the bandwidth requirements would double. The same thing happened to scalar 32 bit operations in 64 bit CPUs. Application developers could recode for 256 bit registers to get around that, but there'd still be a lot of the old code out there.



    Quote:

    The issue boils down to: is the VMX unit a good place to be doing this, that is, handling double vector math? Not being a hardware engineer I can only guess that it would be easier to do it here rather than in the normal FPU.



    No, the issue boils down to: Are vectors of 64 bit values a common and soluble enough problem to be implemented in hardware? If so, then the vector unit is certainly the right place to put them. (Note that "the vector unit" is a logical sum of several physical units, so the 64 bit portion could even be separate). If not, then they can continue to be implemented in software. 4 64-bit FPUs will be available in about a month to any programmer who wants them, although it will take some clever threading to really use them all. But then, massive SMP is how supercomputers work, so anyone trying to do supercomputer work on a G5 will already be familiar with the problem.
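    A minimal sketch of that threading in C with POSIX threads (the chunk struct and partial_sum helper are made up for illustration): split a long FP loop into halves so each CPU of a dual G5 gets independent work for its FPUs. Compile with -lpthread.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double data[N];

    struct chunk { int lo, hi; double sum; };

    static void *partial_sum(void *arg)
    {
        struct chunk *c = arg;
        double s = 0.0;
        for (int i = c->lo; i < c->hi; i++)
            s += data[i] * data[i];     /* independent FP work per thread */
        c->sum = s;
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) data[i] = 1.0;

        struct chunk a = {0, N / 2, 0.0}, b = {N / 2, N, 0.0};
        pthread_t ta, tb;
        pthread_create(&ta, NULL, partial_sum, &a);
        pthread_create(&tb, NULL, partial_sum, &b);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);

        printf("sum = %g\n", a.sum + b.sum);  /* 1e+06 */
        return 0;
    }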



    Quote:

    Well these are all the arguments that people first used when AltiVec came out. Eventually the facility was found to be very useful, sometimes in unexpected ways. In any event I have to disagree with you with respect to the thought that there would be zero demand in the embedded space or the desktop. You are also making an assumption as to what a server may be doing with the hardware; computation servers are big business nowadays. In fact computation servers are such a big thing that IBM is back into the business of selling computer time.



    Some people were skeptical, but not anyone who knew anything about the problem. It was pretty obvious out of the starting gate that it would be router and telephony heaven in the embedded space, and filter and codec heaven on the desktop. Even tech columnists caught on fast to the idea that SIMD on the desktop would usher in vastly improved multimedia. Most of the skepticism I read centered around whether anyone would use it - not because it wasn't inherently useful, but because it was an extension that was only shipping in one product line. Those people had the Quadra 840av - which had a dedicated DSP chip that was hardly ever used, and eventually orphaned - firmly in mind. The consensus among the embedded AltiVec programmers around Ars Technica and altivec.org is that vectors of 64 bit values are of no use in the embedded space; they'd know, so I'm taking them at their word. As for servers, clusters are the computational engines, and of course Apple is selling racks of Xserves to bioinformatics and genetics firms in large part because of the dual AltiVec engines. But that still doesn't address the larger server space, much of which involves pushing files around, networking, and firing off the odd Perl script.



    I don't remember too many people worrying about the G4's bandwidth at the outset. By the end, of course, Apple's own Advanced Mathematics Group was bitching in WWDC sessions about the lack of bandwidth, but them's the breaks. It is worth noting that even a hypothetical "double wide" VMX wouldn't tax the bus as much as the 12:1 demand:supply ratio of AltiVec to the G4's MaxBus. But then, that disparity crippled AltiVec on the G4, and Apple would be wise to make sure that doesn't happen again.
  • Reply 69 of 114
    yevgenyyevgeny Posts: 1,148member
    Wizard69, Amorph made a good post and you should listen to him; he is right.
  • Reply 70 of 114
    nevyn Posts: 360 member
    Quote:

    Originally posted by Yevgeny

    4 FPUs? You can reach a point of diminishing returns where you have all this extra silicon on the CPU and some of it is infrequently used. 4 FPUs is a LOT of FPUs. It is very hard for the CPU to keep 4 FPUs busy.



    So, when we have dual cores we'll have trouble keeping one busy? That's silliness. It seems like we're headed towards taking an entire 'core' and replicating it -> you end up with 4 FPUs, 2 in one core, 2 in the other core. Using two cores on one chip would seem to take pretty much the identical amount of silicon as one core that happens to have twice as many functional units. (Acknowledging that the dispatching will be trickier, which is a factor in why we're headed to two cores.) I don't actually care how the units are packaged, just that the end-user box has more FPUs. Preferably lots more.



    Quote:

    Originally posted by Yevgeny

    FPGA's are used in different situations than general purpose CPUs. I think that it would be very difficult to use them effectively in most "normal" programming models.



    Good AltiVec is brain-pretzelizing also. So? It could be packaged API-wise as a complex pre-packaged addition to the vector instructions (with the penalties for each added). There's a big difference between an extra 5-cycle (or 10-cycle) penalty on some operations, and being required to run the integer unit full tilt to massage things going into the AV unit.
  • Reply 71 of 114
    yevgeny Posts: 1,148 member
    Quote:

    Originally posted by Nevyn

    So, when we have dual cores we'll have trouble keeping one busy? That's silliness. It seems like we're headed towards taking an entire 'core' and replicating it -> you end up with 4 FPUs, 2 in one core, 2 in the other core. Using two cores on one chip would seem to take pretty much the identical amount of silicon as one core that happens to have twice as many functional units. (Acknowledging that the dispatching will be trickier, which is a factor in why we're headed to two cores.) I don't actually care how the units are packaged, just that the end-user box has more FPUs. Preferably lots more.



    There is a very large difference between having dual cores with two FP units each and having one core with 4 FP units. First of all, if you have 4 FP units, then you have to find some way (either in the compiler or in the hardware) to keep all 4 busy in an operation. Not everything needs 4 FP pipes. Secondly, if you have dual cores but not multithreaded software, then one of your cores is just sitting there waiting for something to do. Multithreaded software has taken it upon itself to resolve the issue of parallelizing the process, but adding more FPUs means that the chip or the compiler must find a way to do this, and it isn't always possible. If you had 8 FPUs, then you would have to find some insane way to keep them busy, and most of the time, most of them would just be sitting there wasting silicon. Multi core CPUs are different from massively parallel CPUs.



    Of course, I can turn your logic on you just as easily. If making massively parallel CPUs with tons of FP and integer units is so easy, then why are IBM and Moto going for multi core CPUs? Answer: Multicore CPUs are better than massively parallel CPUs (like Itanium). Why are multi core CPUs better? Because making the programmer create multithreaded software is better than making the chip or the compiler smarter.



    Quote:

    Good AltiVec is brain-pretzelizing also. So? It could be packaged API-wise as a complex pre-packaged addition to the vector instructions (with the penalties for each added). There's a big difference between an extra 5-cycle (or 10-cycle) penalty on some operations, and being required to run the integer unit full tilt to massage things going into the AV unit.



    It would be a trade off. Maybe it would work in some cases, maybe it wouldn't work in other cases. I am not sufficiently familiar with FPGAs to say which is the case, but since I have never heard of any R&D into this issue, I can only assume that those who are familiar do not think that it is worth the effort.
  • Reply 72 of 114
    nevyn Posts: 360 member
    Quote:

    Originally posted by Yevgeny

    Why are multi core CPUs better? Because making the programmer create multithreaded software is better than making the chip or the compiler smarter.



    And yet the _next_ step is both multi-core _and_ hyperthreading. Where hyperthreading is essentially a way of eking real work out of unused units (or portions of units) without the programmer explicitly doing anything other than threading. Like the 'other FPU(s)' in the single-core-with-4FPU design.



    Whatever the Power5-lite ends up having, it isn't going to be a floating point slouch.
  • Reply 74 of 114
    programmer Posts: 3,458 member
    Quote:

    Originally posted by Yevgeny

    There is a very large difference between having dual cores with two FP units each and having one core with 4 FP units. First of all, if you have 4 FP units, then you have to find some way (either in the compiler or in the hardware) to keep all 4 busy in an operation. Not everything needs 4 FP pipes. Secondly, if you have dual cores but not multithreaded software, then one of your cores is just sitting there waiting for something to do. Multithreaded software has taken it upon itself to resolve the issue of parallelizing the process, but adding more FPUs means that the chip or the compiler must find a way to do this, and it isn't always possible. If you had 8 FPUs, then you would have to find some insane way to keep them busy, and most of the time, most of them would just be sitting there wasting silicon. Multi core CPUs are different from massively parallel CPUs.



    If you have 4 FPUs then you keep them busy the same way you keep the VMX unit busy -- you have them operate on long vectors of non-interrelated data. The major difference is that the instruction dispatch rate has to be increased to feed the larger number of execution units, but that also buys you flexibility since not all the instructions have to be the same. The OoOE means that your loops will automatically take advantage of the available hardware, as long as you avoid data dependencies.
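    A short C sketch of the data-dependency point (my own example; the dot_serial and dot_unrolled names are hypothetical): a single accumulator chains every add onto the previous one, while four independent accumulators give out-of-order hardware several parallel streams of FP work.

    #include <stdio.h>
    #include <stddef.h>

    double dot_serial(const double *x, const double *y, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i] * y[i];            /* each add waits on the previous one */
        return s;
    }

    double dot_unrolled(const double *x, const double *y, size_t n)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {  /* four independent dependency chains */
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        for (; i < n; i++) s0 += x[i] * y[i];  /* leftover elements */
        return (s0 + s1) + (s2 + s3);
    }

    int main(void)
    {
        double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double y[8] = {1, 1, 1, 1, 1, 1, 1, 1};
        printf("%g %g\n", dot_serial(x, y, 8), dot_unrolled(x, y, 8));  /* 36 36 */
        return 0;
    }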



    Most people don't understand that the AltiVec unit is not good at doing 3D operations where a 4-vector is stored in a single vector register, and a 4x4 matrix is stored in 4 vector registers. This does not work well, and the VMX unit generally doesn't outperform the FPU in this case... even on the G4, and the G5 is even more extreme because they doubled the FPU units. The way to make VMX do this kind of math efficiently is to do 4 vectors worth of math at a time, and spread your 4x4 matrix across 16 registers.
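    A sketch of that layout in AltiVec C (assuming gcc -maltivec; the row0_of_transform helper is made up for illustration): instead of one (x,y,z,w) point per register, hold the x's of four points in one register, the y's in another, and so on, so every lane of each vec_madd does independent, useful work.

    #include <stdio.h>
    #include <altivec.h>

    /* Apply the first row (m00..m03) of a 4x4 matrix to four points at once.
     * xs/ys/zs/ws each hold one coordinate of all four points. */
    vector float row0_of_transform(vector float xs, vector float ys,
                                   vector float zs, vector float ws,
                                   float m00, float m01, float m02, float m03)
    {
        vector float r;
        r = vec_madd(xs, (vector float){m00, m00, m00, m00},
                     (vector float){0.0f, 0.0f, 0.0f, 0.0f});
        r = vec_madd(ys, (vector float){m01, m01, m01, m01}, r);
        r = vec_madd(zs, (vector float){m02, m02, m02, m02}, r);
        r = vec_madd(ws, (vector float){m03, m03, m03, m03}, r);
        return r;  /* four transformed x-coordinates, one per lane */
    }

    int main(void)
    {
        /* Four points, all (1, 2, 3, 1), against row (1, 0, 0, 5): x' = x + 5*w = 6 */
        vector float xs = (vector float){1.0f, 1.0f, 1.0f, 1.0f};
        vector float ys = (vector float){2.0f, 2.0f, 2.0f, 2.0f};
        vector float zs = (vector float){3.0f, 3.0f, 3.0f, 3.0f};
        vector float ws = (vector float){1.0f, 1.0f, 1.0f, 1.0f};
        vector float r = row0_of_transform(xs, ys, zs, ws, 1.0f, 0.0f, 0.0f, 5.0f);

        float out[4] __attribute__((aligned(16)));
        vec_st(r, 0, out);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 6 6 6 6 */
        return 0;
    }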



    It is in the non-floating-point support and the permute capabilities that AltiVec really shines. The 4-way floating point operations primarily benefit from dispatching 1/4 the number of instructions, but the individual fields of the register cannot be inter-dependent.





    On the subject of precision and accuracy, consider for a moment that a 32-bit float cannot represent more values than a 32-bit integer (~4 billion of them). In fact it can represent fewer because a few values are reserved for special meanings. The power of the floating point representation is that the scale of the representable values is non-linear and can thus cover a much wider range with a varying degree of precision. So which is more "precise", a 32-bit integer or a 32-bit float?
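    A quick C check of that counting argument: with 24 significand bits, a 32-bit float stops being able to represent every integer at 2^24 + 1, even though a 32-bit int holds that value exactly.

    #include <stdio.h>

    int main(void)
    {
        float f = 16777217.0f;     /* 2^24 + 1: rounds to the nearest representable float */
        printf("%.1f\n", f);       /* prints 16777216.0 - the +1 is lost */
        printf("%d\n", 16777217);  /* the 32-bit int holds it exactly */
        return 0;
    }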
  • Reply 75 of 114
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Amorph





    YES. FINALLY. Significant digits matter, it's up to the person doing the work to figure out and track what is significant, and FP contributes nothing to this but the occasional approximation. Welcome to my thesis. Now, if significant digits matter, then 0.5 / 2 is not necessarily 0.25, is it? Mathematically, yes. Experimentally, no, because the hundredths digit in the result claims a level of precision that doesn't appear anywhere else. (The denominator is a special case, because it has no decimal point, so it's assumed to be mathematically precise. The example would be clearer if it said 0.5 / 2.0 = 0.25.) If you're trying to contain noise, you pick the worst case, not the best case. Better to deal with acknowledged, controlled lack of precision than apparently significant noise.







    Let's see if I can approach this from a different direction. Let's say you're working in a lab and a technician has a widget running at half a volt (0.5) and you ask him to cut the voltage in half (division by two). Would you be happy if the resultant voltage were 0.3 or 0.2, or would you expect a value of 0.25?



    Quote:

    No you wouldn't, for the same reason that the rest of the 970 doesn't double 32 bit performance that way. If AltiVec went 256 bit, the original 128 bit functions would continue to function as they had, and half of all of the registers would simply go unused. As far as I know, it has never been considered worthwhile for the CPU to "pack" operations like that, and 64 bit CPUs have been around for a long time now. The benefit is considerably less than 2x, and the cost in transistors is steep.



    Now here you are wrong, or at least have not alluded to things properly. Remember SIMD is "single instruction multiple data"; the current VMX registers are working on multiple quantities of data at the same time. This is not comparable to widening the registers in a conventional ALU. In effect that is what VMX does; it takes packs of data values, of a supported data type, and executes an instruction against them.



    Yes, in some cases the VMX instructions would have to be extended to handle the additional data, but that is not an unusual thing to do in processor development. Now the question becomes how useful that would be. In some instances I could see it being very useful.

    Quote:



    If anything, 128-bit AltiVec operations would slow down because the bandwidth requirements would double. The same thing happened to scalar 32 bit operations in 64 bit CPUs. Application developers could recode for 256 bit registers to get around that, but there'd still be a lot of the old code out there.







    No, the issue boils down to: Are vectors of 64 bit values a common and soluble enough problem to be implemented in hardware? If so, then the vector unit is certainly the right place to put them. (Note that "the vector unit" is a logical sum of several physical units, so the 64 bit portion could even be separate). If not, then they can continue to be implemented in software. 4 64-bit FPUs will be available in about a month to any programmer who wants them, although it will take some clever threading to really use them all. But then, massive SMP is how supercomputers work, so anyone trying to do supercomputer work on a G5 will already be familiar with the problem.



    It is interesting to look at DSP chips and how they progressed over time. The integer registers tended to widen and eventually the DSPs started to support floats. It is a normal progression; as more knowledge and understanding become available, it becomes easier to apply the new capabilities.



    The only reason we see stunted growth in the DSP market is the very good performance that is now possible with AltiVec and the SIMD units on the Intel side of the fence. If the capability to support data types greater than 32 bits does not come to these processors, it will eventually show up somewhere else.



    Quote:

    Some people were skeptical, but not anyone who knew anything about the problem. It was pretty obvious out of the starting gate that it would be router and telephony heaven in the embedded space, and filter and codec heaven on the desktop. Even tech columnists caught on fast to the idea that SIMD on the desktop would usher in vastly improved multimedia. Most of the skepticism I read centered around whether anyone would use it - not because it wasn't inherently useful, but because it was an extension that was only shipping in one product line. Those people had the Quadra 840av - which had a dedicated DSP chip that was hardly ever used, and eventually orphaned - firmly in mind. The consensus among the embedded AltiVec programmers around Ars Technica and altivec.org is that vectors of 64 bit values are of no use in the embedded space; they'd know, so I'm taking them at their word. As for servers, clusters are the computational engines, and of course Apple is selling racks of Xserves to bioinformatics and genetics firms in large part because of the dual AltiVec engines. But that still doesn't address the larger server space, much of which involves pushing files around, networking, and firing off the odd Perl script.



    I don't remember too many people worrying about the G4's bandwidth at the outset. By the end, of course, Apple's own Advanced Mathematics Group was bitching in WWDC sessions about the lack of bandwidth, but them's the breaks. It is worth noting that even a hypothetical "double wide" VMX wouldn't tax the bus as much as the 12:1 demand:supply ratio of AltiVec to the G4's MaxBus. But then, that disparity crippled AltiVec on the G4, and Apple would be wise to make sure that doesn't happen again.



    Now that last statement is something we can agree on!!!



    Thanks

    Dave
  • Reply 76 of 114
    wizard69 Posts: 13,377 member
    Well, no, he is either not communicating well or has at least a few concepts wrong. It does not make sense to compare changing the width of a register in the main CPU ALU with a change of register size in a vector unit.



    In an ALU you are always doing one operation on one piece of data in a register. Within a vector unit you are operating on a number of pieces of data at the same time. The effects of changing the width of a vector unit are different from those experienced when changing the width of a processor's register.



    Amorph has made some very good arguments, so I find it hard to understand how he let that slip out.



    Thanks

    Dave





    Quote:

    Originally posted by Yevgeny

    Wizard69, Amorph made a good post and you should listen to him; he is right.



  • Reply 77 of 114
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Programmer

    If you have 4 FPUs then you keep them busy the same way you keep the VMX unit busy -- you have them operate on long vectors of non-interrelated data. The major difference is that the instruction dispatch rate has to be increased to feed the larger number of execution units, but that also buys you flexibility since not all the instructions have to be the same. The OoOE means that your loops will automatically take advantage of the available hardware, as long as you avoid data dependencies.



    Most people don't understand that the AltiVec unit is not good at doing 3D operations where a 4-vector is stored in a single vector register, and a 4x4 matrix is stored in 4 vector registers. This does not work well, and the VMX unit generally doesn't outperform the FPU in this case... even on the G4, and the G5 is even more extreme because they doubled the FPU units. The way to make VMX do this kind of math efficiently is to do 4 vectors worth of math at a time, and spread your 4x4 matrix across 16 registers.



    It is in the non-floating-point support and the permute capabilities that AltiVec really shines. The 4-way floating point operations primarily benefit from dispatching 1/4 the number of instructions, but the individual fields of the register cannot be inter-dependent.





    On the subject of precision and accuracy, consider for a moment that a 32-bit float cannot represent more values than a 32-bit integer (~4 billion of them). In fact it can represent fewer because a few values are reserved for special meanings. The power of the floating point representation is that the scale of the representable values is non-linear and can thus cover a much wider range with a varying degree of precision. So which is more "precise", a 32-bit integer or a 32-bit float?




    Now we have some very interesting points. A single-precision float allocates one bit to the sign, 8 bits to the exponent, and the rest to the significand: 23 stored bits, or 24 counting the implicit leading bit. A 23-bit fraction is not much to speak of, especially in this day and age when an A-to-D converter may spit out more bits for each measurement it takes.
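    A small C sketch of that layout (my own illustration): pull the sign, exponent, and stored fraction fields out of an IEEE 754 single-precision value.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        float f = 0.25f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);  /* reinterpret the bytes, avoiding aliasing traps */

        unsigned sign     = bits >> 31;
        unsigned exponent = (bits >> 23) & 0xFF;
        unsigned fraction = bits & 0x7FFFFF;

        /* 0.25 = 1.0 x 2^-2: sign 0, biased exponent 125, fraction 0 */
        printf("sign=%u exponent=%u fraction=0x%06X\n", sign, exponent, fraction);
        return 0;
    }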



    As I have said, 64 bits can be justified easily, simply because of the limits imposed by the float data type. Just because someone cannot imagine a need for VMX support for this data type does not mean that it does not exist or won't exist in the future.



    Everyone should remember when we were told that 640K would be all the memory you would ever need. This short-sightedness is much the same with respect to this discussion of VMX2. Sure, it does not exist yet (VMX2), but we would all be fools if we were to run around believing that VMX is all we will ever need.



    Dave
  • Reply 79 of 114
    programmer Posts: 3,458 member
    Quote:

    Originally posted by wizard69

    As I have said, 64 bits can be justified easily, simply because of the limits imposed by the float data type. Just because someone cannot imagine a need for VMX support for this data type does not mean that it does not exist or won't exist in the future.



    You seem to be obsessed with the accuracy of physical measurements. There is a very large set of software problems where there are no physical measurements to be dealt with -- perhaps most problems? Even if you have measurements, and they are lower precision than a 32-bit float, you still need higher precision math in order to run many algorithms on this data, to avoid inaccuracies creeping in because of the nature of fixed-precision math.



    You are absolutely right that there is a need for 64-bit numbers; my position is that the FPU(s) is where this data should exist, not the vector unit. The advent of SMT, the need to write new code to leverage VMX2, splitting the hardware base, the constraints on vector processing, and the cost of context switching with a large vector register set are all reasons why you wouldn't want to put doubles into the vector unit (even with 100% backwards compatibility). There are some perfectly valid reasons to extend VMX2 in this way, but while it might be an obvious thing to do I believe that the reasons not to do it outweigh the advantages. We'll see if IBM agrees with me when they announce the details of VMX2. I have a lot of confidence that whichever course they choose will be the right one, since they know a lot more about the subject than anybody here.
  • Reply 80 of 114
    rickag Posts: 1,626 member
    I'll probably screw this up, but hopefully someone will understand my question (somewhat embarrassing for me that I'm so ignorant that I can't frame a good question).



    I've kind of kept up with this thread, but cannot remember if someone has brought up adding to the number of execution units in AltiVec; 4 currently, right?



    AltiVec has a simple integer, complex integer, floating point, and permute unit. Would it be possible or even worthwhile to increase the number of execution units, and to provide more flexible units (execution units that could perform more than one of these functions - for example, an execution unit that could do simple and/or complex integer work)? And if possible, would this allow for the retirement of more instructions per cycle?