
Vmx 2 - Page 2

post #41 of 115
Quote:
Originally posted by Powerdoc
It's quite different from what we were discussing here. It's a discussion of how to do 256-bit FP operations on an AltiVec unit, rather than a 256-bit AltiVec or a parallel AltiVec unit (the current one has only 4 execution units, all different).

I know what the paper was about, after all, that's what I said. What I didn't know was what was smalM's application. So I posted the link in the hopes that it would help him if his application was related to the paper.

As to the need for the level of precision provided by 256-bit calculations: if you are looking for gross physical proof of quantum-level events, then you are likely to need extreme precision in your calculations just to get a sense of the scale of the phenomenon you seek. Recent discussions about the gravitational lensing effects of relatively small Newtonian bodies (Jovian class) demonstrated the need for extremely accurate calculations just so a determination could be made as to whether the anticipated effect could be measured, as I recall.
"Spec" is short for "specification" not "speculation".
Reply
"Spec" is short for "specification" not "speculation".
Reply
post #42 of 115
Quote:
Originally posted by Amorph
Actually, IEEE specifies an 80 bit floating point type, which was implemented in the 68040 and lost in the transition to PowerPC.


Yes; I always thought that this was a step backwards. Apple did at one time have a reference to a data type they called doubledouble; I'm not sure if it is still defined, but obviously somebody thought that there was a need for such a data type and the code to support it. I'm not currently an Apple developer, but if doubledouble is still supported it is still only one of many extended-precision floats out in the wild.
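
For what it's worth, the usual trick behind a doubledouble-style type is to carry a value as an unevaluated sum of two doubles: a head plus a tail that holds the rounding error. A minimal sketch of that building block in C (the type and function names here are mine, purely illustrative, not Apple's API):

Code:
#include <stdio.h>

/* Illustrative "double-double" building block: the exact rounding error of
 * a double addition is recovered with Knuth's two-sum and kept in a tail. */
typedef struct { double hi, lo; } ddouble;

static ddouble two_sum(double a, double b) {
    double s   = a + b;                      /* rounded sum           */
    double bb  = s - a;                      /* what b became         */
    double err = (a - (s - bb)) + (b - bb);  /* exact rounding error  */
    ddouble r = { s, err };
    return r;
}

int main(void) {
    ddouble r = two_sum(1.0, 1e-17);  /* 1e-17 vanishes in a plain double */
    printf("hi = %.17g  lo = %.17g\n", r.hi, r.lo);
    return 0;
}

Chaining sums like this gives roughly 106 bits of significand in software, without any 80-bit hardware.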
Quote:

For the national debt, you'd want one of those big IBMs that can do fixed-point math in hardware.
Real soon now we will have true 64-bit integers. This is a tremendous expansion in capability for applications. I sometimes believe that people are underestimating just how important 64 bits will be to the future of desktop computing, and to Apple specifically.
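
A rough back-of-the-envelope (my own numbers, just to show the headroom): a signed 64-bit integer tops out around 9.2 x 10^18, so even a multi-trillion-dollar sum held exactly in whole cents barely dents it.

Code:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* A $4 trillion figure held exactly as whole cents. */
    int64_t debt_cents = INT64_C(4000000000000) * 100;   /* 4e14 cents */
    printf("debt in cents  : %" PRId64 "\n", debt_cents);
    printf("64-bit headroom: %" PRId64 "x\n", INT64_MAX / debt_cents);
    return 0;
}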
Quote:

None of this is really relevant to a revision of VMX - G4s have always been able to do 64 bit floating point just fine (if a bit slowly). The issue is whether it's worth the extra bandwidth and silicon to handle 64 bit values in vectors, and currently the answer appears to be no. SSE2 can do 2x64 bit FP, but that's only to make up for the fact that the x86's built-in floating point unit is hilariously bad. That's not the case on any PowerPC (that has an FP unit in the first place).

I have to disagree just a bit here. As long as code and algorithms exist that make use of 64-bit vectors, it MAY be worthwhile to support such vectors in hardware. Now, it may not be a good idea to support them with the current VMX design, which was architected for specific issues on a desktop machine. We certainly would not want support for 64-bit operations disturbing or impacting the performance of current VMX code.

Just as SSE2 improved on the standard Intel FPU, an enhanced VMX unit could improve on the PPC FPU. It could be argued, though, that the thing to do would be to rev the PPC FPU to handle vector-type operations on 64-bit data types. Neither of these solutions or capability expansions would deal with the fact that past 64-bit vector machines worked on long vectors, as opposed to the short-vector design seen in VMX/AltiVec.

To rev VMX to truly support 64-bit vectors well, I would suspect that one of two things would be required. One would be very wide registers, 256 to 512 bits wide. The other would be very long 64-bit-wide FIFOs or buffers. I could imagine the instantaneous power dissipated in a 512-bit-wide VMX unit would be very high, so 64-bit vector operations against a FIFO or buffer make sense - assuming, of course, that a FIFO or buffer could deliver data in a single cycle.

Thanks
Dave
post #43 of 115
Quote:
Originally posted by Nevyn
???
If you're running out of bits in double precision floating point math doing anything related to money, fire your programmer.

Yes, the US national debt exceeds $4B, it is still NOWHERE near something that needs floating point sizes beyond double precision. Yes, I grok compound interest and other methods of getting "partial cents" etc. - but _in_the_end_ it all has to be rounded. (Because we don't cut coins into pieces anymore.) The other aspect is that something like the debt isn't really a 'single item'. It is a collection of a very large number of smaller items... each of which has its own individual interest rate, maturity, etc. -> Each of those is _individually_ calculable with 100% accuracy on hand calculators (once you acknowledge the inherent discreteness + rounding rules).


Hmm, I thought that debt was $4T. Well, yes, the budget (not just the debt) is a collection of smaller items, but that does not mean that one goes about ignoring each and every dollar allocated to them. They cannot be rounded out of the picture until the budget is totaled up; the resulting error would be huge. So yeah, when it comes time to talk about the budget we may talk about trillions here or there, but in the end it is the summation of a bunch of little things that are all significant.
Quote:

On another note, the text in front of me tabulates only e & pi beyond 16 digits. Well, and pi/2, and a slew of other silliness like that.

You don't really mean "If you've got the more precise info, you're insane to chuck it, all calculations must proceed with all available information" do you? You really mean "I've assessed 1) the number of significant bits I need in the end, and 2) I've assessed how my algorithm will spread my error bars/reduce the number of significant bits, and 3) I've measured as accurately as I need to, with a fair chunk of extra accuracy"
First off, it is never a good idea to throw away measured data if you have a way to represent it. I really hope that you aren't suggesting this.

The algorithm du jour that you are using at the moment may not process all of that information, but that is nothing new in the history of mankind. We often improve our algorithms as we develop better understandings of our subjects of interest.

Often it is not a question of measuring as accurately as you need to, but of measuring as accurately as you are capable of. By "a fair chunk of extra accuracy" I think you are talking about resolution. A measurement's resolution has absolutely nothing to do with its accuracy.
Quote:

Because if you really mean "never use approximations when better data is available", please call me when you've accurately and precisely calculated the circumference of a circle exactly 1 meter in diameter. In units of _meters_ please, not multiples of pi. Oh, and here's the first million digits of pi or so.

I mean exactly what I said. To clarify: never use 3.14 for a value of pi when pi is available at the full resolution of the data type you are using. Since you have kindly pointed to one of the pi sites that inhabit the net, you should be aware that this constant is covered for just about any data type we can dream up and process at the moment.

I suppose that when a circle of exactly 1 meter in diameter is ever measured we would then be able to know its circumference; I suspect that this is a wee bit in the future. Given that you have a circle of approximately 1 meter in diameter, you certainly would use all of the resolution of pi that you have available to calculate the circumference. I know that some people will respond in disgust at that last statement, but a little bit of thought should clear things up.
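
As a rough illustration of the scale involved (a quick sketch assuming a plain C double and the usual M_PI constant): truncating pi to 3.14 shifts the computed circumference of a roughly 1 m circle by more than a millimeter, while the full double-precision constant contributes error far below anything a shop tool could resolve.

Code:
#include <math.h>
#include <stdio.h>

int main(void) {
    double d = 1.0;                       /* nominal 1 m diameter       */
    double crude = 3.14 * d;              /* pi chopped to three digits */
    double full  = M_PI * d;              /* full double-precision pi   */
    printf("3.14 * d   = %.10f m\n", crude);
    printf("M_PI * d   = %.10f m\n", full);
    printf("difference = %.3f mm\n", (full - crude) * 1000.0);
    return 0;
}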

Thanks
Dave
post #44 of 115
Quote:
Originally posted by wizard69
Yes; I always thought that this was a step backwards. Apple did at one time have a reference to a data type they called doubledouble; I'm not sure if it is still defined, but obviously somebody thought that there was a need for such a data type and the code to support it.

In this case it fell to the RISC philosophy, which made no provision for large oddball sizes like 80 bits. The great battle cry was that all instructions and data were in identically sized chunks, to remove the complexity of instruction fetching and decoding inherent in CISC architectures.

Quote:
Real soon now we will have true 64-bit integers. This is a tremendous expansion in capability for applications. I sometimes believe that people are underestimating just how important 64 bits will be to the future of desktop computing, and to Apple specifically.

Given that Apple is currently gunning for the UNIX workstation and enterprise server spaces, I don't think anyone is pooh-poohing the relevance of 64 bit there. The question is, when will it become a crucial feature of, say, the iBook? I'm not going to bet against the ingenuity of developers, but right now the obvious uses of 64 bit CPUs in consumer applications are thin on the ground.

Quote:
I have to disagree just a bit here. As long as code and algorithms exist that make use of 64-bit vectors, it MAY be worthwhile to support such vectors in hardware. Now, it may not be a good idea to support them with the current VMX design, which was architected for specific issues on a desktop machine.

You aren't disagreeing with me, except that I think VMX has legs. Currently, there are not enough uses for 64 bit values in vector math to justify an implementation in hardware. Maybe the demands of high-end 3D apps will change that down the road. But it won't be a simple change: 2x64 bit "vectors" are hardly worth it, because the 970's dual FPUs can do that just as well, and without the need to pack and unpack the vectors. 4x64 bit vectors mean 256-bit registers, a whole slew of transistors, new instructions, and even more phenomenal bandwidth demands (currently, as fast as it is, the 970's bus can't even come close to keeping VMX fed).

Quote:
Just as SSE2 improved on the standard Intel FPU, an enhanced VMX unit could improve on the PPC FPU.

No, there's no analogy there. The x86 FPU is a miserably designed piece of crap that they can't improve without breaking legacy code because of the nature of its design. So the SIMD engine gets to function as a replacement. The PowerPC FPUs have always been better, and more importantly, they've always been designed in a way that allows the implementation to be improved without breaking everything. So if you want better FPU performance in, say, the 971, you just beef up the FPUs or add more units. You don't touch the SIMD engine unless you want to improve that.

Quote:
It could be argued, though, that the thing to do would be to rev the PPC FPU to handle vector-type operations on 64-bit data types. Neither of these solutions or capability expansions would deal with the fact that past 64-bit vector machines worked on long vectors, as opposed to the short-vector design seen in VMX/AltiVec.

I wouldn't argue that. FPUs should do FP, and SIMD engines should do SIMD.

Quote:
To rev VMX to truly support 64-bit vectors well, I would suspect that one of two things would be required. One would be very wide registers, 256 to 512 bits wide. The other would be very long 64-bit-wide FIFOs or buffers.

The stack based (FIFO) design is what crippled the x86 FPU permanently.
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
post #45 of 115
Quote:
Originally posted by wizard69
Given that you have a circle of approximately 1 meter in diameter, you certainly would use all of the resolution of pi that you have available to calculate the circumference.

Of course it is daft to use just 3.14.
But it is equally daft to use "all of the resolution that you have available" unless you _know_ that the end result will use it.

A guy with a hand saw and a string trying to make a 1 meter circle in a board doesn't need to go beyond a couple of digits -> he isn't going to come anywhere near the size he calculated he needed anyway.

A guy with better normal tools might need another couple of digits.

A guy using several sets of laser interferometry measurements with statistical precision calculations to ensure tool position could use a really solid set of calculations.

But if the millionth digit of pi appears anywhere in any of these three calculations, one hell of a lot of wasted work occurred.

I am _NOT_ saying anyone should wantonly discard useful starting information. Just that "useful" is dependent on context, and the contexts where the millionth digit of pi is useful are... rare. A reliance on "I'll just use higher precision in my calculation" is often a sloppy avoidance of analyzing the error propagation through your algorithm. It can also lead to overconfidence, particularly if there's an overlooked discontinuity.
post #46 of 115
You wouldn't want double-precision in the vector unit. Period.

21st post down from the top:

http://forums.appleinsider.com/showt...X%2FSSE%2FSSE2

--
Ed
post #47 of 115
Quote:
Originally posted by Ed M.
21st post down from the top:

21? Sorry, can't count that high. Not enough bits.

Oh, wait... missed one.
"Spec" is short for "specification" not "speculation".
Reply
"Spec" is short for "specification" not "speculation".
Reply
post #48 of 115
Quote:
Originally posted by Nevyn
Of course it is daft to use just 3.14.
But it is equally daft to use "all of the resolution that you have available" unless you _know_ that the end result will use it.


Maybe I did not communicate this well. My point is that if your development system has pi defined as a full double-precision value, it makes no sense to round it off or use 3.14 before use. The same goes for a single-precision value or doubledouble.
Quote:

A guy with a hand saw and a string trying to make a 1 meter circle in a board doesn't need to go beyond a couple of digits -> he isn't going to come anywhere near the size he calculated he needed anyway.

A guy with better normal tools might need another couple of digits.

A guy using several sets of laser interferometry measurements with statistical precision calculations to ensure tool position could use a really solid set of calculations.

But if the millionth digit of pi appears anywhere in any of these three calculations, one hell of a lot of wasted work occurred.

Yes, this was not communicated well: resolving pi beyond the resolution of your data type is wasted energy. Likewise, it is not too smart to use a single-precision value of pi when the rest of your calculations are double precision.

It is funny, some of the examples you gave, as I was thinking along similar lines: a sheet-metal craftsman trying to make a tube 1 meter in diameter. He would have a very hard time getting good results using some of the rounding suggestions mentioned in this thread.
Quote:

I am _NOT_ saying anyone should wantonly discard useful starting information. Just that "useful" is dependent on context, and the contexts where the millionth digit of pi is useful are... rare. A reliance on "I'll just use higher precision in my calculation" is often a sloppy avoidance of analyzing the error propagation through your algorithm. It can also lead to overconfidence, particularly if there's an overlooked discontinuity.

I could not agree more with the above statement. Yet at the same time I've seen many an instance where people have thrown away resolution and then wondered why they were having so much trouble. It is amazing that people accept that when you divide 1/2 by 2 you get 1/4, yet reject that 0.5 divided by 2 = 0.25 is a valid result. It certainly is in the real world. A simplification, of course, but I've had educated people try to convince me that this is the only point of view on the subject.

I'm willing to state that, in a similar manner, these off-the-cuff old wives' tales about rounding and resolution often end up producing the same overconfidence. It takes a bit of thought to determine the best place to drop resolution or introduce rounding.
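
Incidentally, the 0.5-divided-by-2 example above is one of the friendly cases: 0.5 and 0.25 are powers of two, so an IEEE double holds them exactly and the division really is as clean as the fraction form. A quick check (and it is safe only because of that special case):

Code:
#include <stdio.h>

int main(void) {
    double x = 0.5;
    /* 0.5 and 0.25 are exact in binary floating point, so this exact
     * comparison is safe here (it would not be for, say, 0.1 / 2). */
    printf("0.5 / 2 == 0.25 ? %d\n", (x / 2.0) == 0.25);   /* prints 1 */
    return 0;
}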
post #49 of 115
Quote:
Originally posted by Amorph
Given that Apple is currently gunning for the UNIX workstation and enterprise server spaces, I don't think anyone is pooh-poohing the relevance of 64 bit there. The question is, when will it become a crucial feature of, say, the iBook? I'm not going to bet against the ingenuity of developers, but right now the obvious uses of 64 bit CPUs in consumer applications are thin on the ground.


I'm also left with the impression that the workstation is the direction Apple is headed in. While it may be a while before the iBook moves to 64 bits, I think Apple will find that it has no choice in the matter. It will come down to an issue of addressable memory; not strictly a 64-bit issue, but 64 bits is probably the easiest way to solve it.
Quote:

You aren't disagreeing with me, except that I think VMX has legs. Currently, there are not enough uses for 64 bit values in vector math to justify an implementation in hardware. Maybe the demands of high-end 3D apps will change that down the road. But it won't be a simple change: 2x64 bit "vectors" are hardly worth it, because the 970's dual FPUs can do that just as well, and without the need to pack and unpack the vectors. 4x64 bit vectors mean 256-bit registers, a whole slew of transistors, new instructions, and even more phenomenal bandwidth demands (currently, as fast as it is, the 970's bus can't even come close to keeping VMX fed).

Yes, we are real close here. The point I'm trying to make is that there are applications for floating-point (single and double) vector processing that VMX is not optimised for. I'm thinking about the type of applications that old Crays and other supercomputers were optimised for. The current PPC register-based FPU could be improved a great deal for certain types of applications. But it may make more sense to leave the complexity out of the FPU and add it to a specialized unit such as the rumored VMX2.


Quote:


No, there's no analogy there. The x86 FPU is a miserably designed piece of crap that they can't improve without breaking legacy code because of the nature of its design. So the SIMD engine gets to function as a replacement. The PowerPC FPUs have always been better, and more importantly, they've always been designed in a way that allows the implementation to be improved without breaking everything. So if you want better FPU performance in, say, the 971, you just beef up the FPUs or add more units. You don't touch the SIMD engine unless you want to improve that.

Again I can't totally disagree here, other than to say that a capability for vector math with 64-bit floats is a positive addition to the CPU. Whether that happens in the FPU, the VMX unit, or a new unit doesn't make much difference. Logically, though, the VMX unit would take on some of these capabilities cleanly. That is, we are talking about new data types for existing instructions.
Quote:

I wouldn't argue that. FPUs should do FP, and SIMD engines should do SIMD.

Yep, all I'm really talking about is extending the SIMD unit to add doubles to its single-precision FP capability. Part of this involves a much wider register set, but that is an advantage all around.

Thanks
Dave


post #50 of 115
Quote:
Originally posted by wizard69
Yet at the same time I've seen many an instance where people have thrown away resolution and then wondered why they were having so much trouble. It is amazing that people accept that when you divide 1/2 by 2 you get 1/4, yet reject that 0.5 divided by 2 = 0.25 is a valid result. It certainly is in the real world. A simplification, of course, but I've had educated people try to convince me that this is the only point of view on the subject.

It is, if you think about what the problem they're identifying is.

Fractional representations are perfectly precise because they are abstract: 1/2 is exactly 1 divided by 2, with no possibility of noise or inaccuracy.

0.5, on the other hand, by scientific convention means 0.5<and any additional precision was lost to noise, coarse measuring tools etc.>. In other words, 0.5 does not mean 1/2. It could be 0.503, or even 0.55, or 0.49. In that case, the best you can do is say that 0.5/2 = 0.2 - the approximation is simply an admission that the data is noisy, and the equal sign is analogous at best to its mathematical counterpart. Along these lines, 0.50 / 2.00 = 0.25 - but that still isn't the same as 1/2 divided by 2 = 1/4. You've just pushed the noise back one significant digit.

One of the things that floating point does, actually, is present the illusion of precision by ignoring the idea of significant digits, and this along with the lack of a built in "equality within a given delta" operator actually introduces inaccuracy to measurements, and instills false confidence. (FP also introduces inaccuracies via approximation, but at 64 bits that is only a problem at unusual extremes, and problem children like 1/3 and 1/10.) If all you know is that you have a measurement of 0.25-and-change, then it's not accurate to say that you have a measurement of 0.25395264823. You might... As a result, responsible FP code tracks the delta that represents the real, guaranteed accuracy, and uses it when appropriate to clip the machine's overly optimistic "precision" and compensate for its exacting comparison operators.
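
A minimal sketch of the kind of clipping described above (a toy helper of my own, not anyone's production code): carry the quantum you actually trust alongside the value, and snap computed results back to it before reporting or comparing.

Code:
#include <math.h>
#include <stdio.h>

/* Hypothetical helper: round a computed value to the quantum we can
 * actually vouch for, e.g. 0.01 if only two decimal places are trusted. */
static double clip_to_quantum(double x, double quantum) {
    return round(x / quantum) * quantum;
}

int main(void) {
    double computed = 0.25395264823;   /* machine digits beyond what we know */
    double quantum  = 0.01;            /* guaranteed accuracy of the input   */
    printf("%.11f -> %.2f\n", computed, clip_to_quantum(computed, quantum));
    return 0;
}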

Quote:
Yep, all I'm really talking about is extending the SIMD unit to add doubles to its single-precision FP capability. Part of this involves a much wider register set, but that is an advantage all around.

Yes, but you still haven't made a case for it. I'm sure that, somewhere, there's a problem involving 1,024 bit vectors. Maybe someone's trying to model the impact of the solar wind on the Milky Way down to the cubic femtometer? Before you build it into hardware, you have to ask what the benefit is vs. the cost. The cost is not inconsiderable: much wider registers are a benefit all around until your massive vector unit spends most of its time twiddling its thumbs while the bus and main RAM struggle under the load, and the caches thrash constantly.

In short, you can't really argue for the adoption of this technology until you sit down and figure out how hard it is to implement, and what the implications will be for the rest of the CPU and the rest of the board. Right now, today, VMX will cheerfully eat four times the total bandwidth of the 970's bus, starving out the rest of the CPU. Double the register width, and the bandwidth requirements double. If you want something to replace a supercomputer, you need to give it all the bandwidth it could ever want, or it'll sit there twiddling its thumbs at incredibly high speed. And, if you're Apple, you have to figure out when your ersatz Cray will appear in a PowerBook or an iMac - an eventuality which every extra transistor delays.
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
post #51 of 115
Quote:
Originally posted by Amorph
It is, if you think about what the problem they're identifying is.

Fractional representations are perfectly precise because they are abstract: 1/2 is exactly 1 divided by 2, with no possibility of noise or inaccuracy.

0.5, on the other hand, by scientific convention means 0.5<and any additional precision was lost to noise, coarse measuring tools etc.>.

I don't think I agree with this... the decimal notation in-and-of itself doesn't imply the loss of any precision. If I write 0.5, I mean 0.5. If I write 0.5 +/- 0.05 then I mean there was a loss of precision. The problem is that people don't pay attention to their levels of precision.

An additional problem is that 0.5 is a decimal representation, and IEEE floating point is a binary representation. Some decimal values cannot be represented precisely in binary floating point, and the binary value you actually get rarely prints back as the tidy decimal you wrote. This leads to inaccuracies that people don't usually pay attention to.

The real problem in both cases is that people don't understand computational math, or choose to ignore its finer points.
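
A small demonstration of that decimal/binary mismatch (plain C, nothing exotic): 0.5 is a power of two and stores exactly, while 0.1 has no finite binary expansion, so printing enough digits exposes the stored approximation.

Code:
#include <stdio.h>

int main(void) {
    printf("0.5 stored as %.20f\n", 0.5);  /* exact: a power of two           */
    printf("0.1 stored as %.20f\n", 0.1);  /* approx: 0.10000000000000000555  */
    return 0;
}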
post #52 of 115
Thread Starter 
Quote:
Originally posted by Programmer
I don't think I agree with this... the decimal notation in-and-of itself doesn't imply the loss of any precision. If I write 0.5, I mean 0.5. If I write 0.5 +/- 0.05 then I mean there was a loss of precision. The problem is that people don't pay attention to their levels of precision.


Yes, but by 0.5 he means 0.5 as displayed by a calculator, and calculators are limited in digits. Otherwise you are right: 0.5 has the same absolute precision as 1/2.
However, there is no way to display 1/3 without loss of precision unless you employ fractions.
Your 0.5 +/- 0.05 is an interesting notion. Unfortunately, such info is not often provided in computed results. It would be fine if the software gave this sort of info. It's easy to guess for a simple division, but it's very difficult to guess after a whole complex calculation. Do you know of any software that is able to give the loss of precision of a complex calculation (after millions of operations, for example)?
post #53 of 115
Quote:
Originally posted by Powerdoc
Yes, but by 0.5 he means 0.5 as displayed by a calculator, and calculators are limited in digits. Otherwise you are right: 0.5 has the same absolute precision as 1/2.

But if you punch 1 / 2 = into a calculator you will get 0.5 and it will be an exact answer.
post #54 of 115
Thread Starter 
Quote:
Originally posted by Programmer
But if you punch 1 / 2 = into a calculator you will get 0.5 and it will be an exact answer.

Yes, but if you take 0.5 and apply log then inverse log n times, you won't have an exact answer; it will give you something like 0.500000001. And for the computer there is no difference between exact answers and approximate ones.

Some mathematicians have made tricky programs that multiply the imprecision in such an exponential way that it leads to great errors. In this way, they show the limits of math simulation.
The important thing is to be able to detect when such a thing appears. For a basic calculation like 1/2 it's simple; for complex mathematical calculations or simulations it's a much more difficult task. You can't just blindly follow what the computer says.
post #55 of 115
Quote:
Originally posted by Powerdoc
Yes, but if you take 0.5 and apply log then inverse log n times, you won't have an exact answer; it will give you something like 0.500000001. And for the computer there is no difference between exact answers and approximate ones.

Some mathematicians have made tricky programs that multiply the imprecision in such an exponential way that it leads to great errors. In this way, they show the limits of math simulation.
The important thing is to be able to detect when such a thing appears. For a basic calculation like 1/2 it's simple; for complex mathematical calculations or simulations it's a much more difficult task. You can't just blindly follow what the computer says.

Yes, but the statement I'm objecting to was that 0.5 is somehow less accurate than 1/2. This is not correct. A given floating point number represents a number precisely; it just may not be the number you wanted, and it may not be possible to convert that number to a decimal representation in an exact way. The imprecision comes from the calculations (including conversion), and occurs because of using a fixed-format representation (e.g. 32-bit or 64-bit floating point).

The IEEE standard contains the "inexact" flag and exception. Any calculation that involves rounding, overflow, or underflow will set the flag or throw the exception. Unfortunately it doesn't track the amount of error for you, though it could be used to detect when rounding happens. In practice, most calculations are going to have something inexact in them, so the flag isn't going to help you a whole lot.
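
For the curious, the C99 <fenv.h> interface exposes that flag; a minimal sketch of polling it (how faithfully a given compiler honors FENV_ACCESS is its own question):

Code:
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void) {
    feclearexcept(FE_INEXACT);
    volatile double a = 1.0, b = 3.0;  /* volatile: keep the division at run time */
    volatile double q = a / b;         /* 1/3 cannot be represented exactly       */
    printf("1/3 = %.17g, inexact raised = %d\n",
           (double)q, fetestexcept(FE_INEXACT) != 0);
    return 0;
}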
post #56 of 115
Quote:
Originally posted by Programmer
Yes, but the statement I'm objecting to was that 0.5 is somehow less accurate than 1/2. This is not correct.

That depends on how you got it. If you measured 0.5 experimentally, there's an implicit +/-. The shorthand has always been that you explicitly give the number of significant digits you're sure of. So in this realm, 0.5, 0.50, 0.500, and 0.5000 are all slightly different, and converge on the mathematical real number 0.5, which is obviously equivalent to 1/2.

Put it this way: If you have to write down thousands of measurements, would you consistently write down 0.5 +/- some delta, or would you adopt the "significant digits" shorthand?

Quote:
A given floating point number represents a number precisely; it just may not be the number you wanted, and it may not be possible to convert that number to a decimal representation in an exact way.

If the only precision you care about is that of the number you want - and under what circumstance would any other definition apply? - then this isn't precision at all. In the best case, it's the value you want to the precision you're guaranteed (by the quality of your measurement or calculation) plus or minus some essentially random noise introduced by the FP hardware. This infects both attempts at pure mathematics (because of flaws in FP representation of real numbers) and calculation from experimental or observational results (because of the former reason, and because of illusory precision).

Quote:
The IEEE standard contains the "inexact" flag and exception. Any calculation that involves rounding, overflow, or underflow will set the flag or throw the exception. Unfortunately it doesn't track the amount of error for you, though it could be used to detect when rounding happens. In practice, most calculations are going to have something inexact in them, so the flag isn't going to help you a whole lot.

So you end up with the code I've seen that tracks the delta manually and ignores the built-in comparison operators in favor of hand-rolled functions that take the delta into account, and track the number of significant digits.
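
In the spirit of those hand-rolled comparisons (a generic sketch, not the specific code Amorph saw): treat two values as equal when they agree within the delta you can actually guarantee, rather than trusting the exact == operator.

Code:
#include <math.h>
#include <stdio.h>

/* Equality within a known delta, instead of bit-exact comparison. */
static int approx_equal(double a, double b, double delta) {
    return fabs(a - b) <= delta;
}

int main(void) {
    double x = 0.1 + 0.2;          /* 0.30000000000000004 as a double */
    printf("x == 0.3     : %d\n", x == 0.3);                    /* 0 */
    printf("approx_equal : %d\n", approx_equal(x, 0.3, 1e-9));  /* 1 */
    return 0;
}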
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
post #57 of 115
Quote:
Originally posted by Amorph
It is, if you think about what the problem they're identifying is.

Fractional representations are perfectly precise because they are abstract: 1/2 is exactly 1 divided by 2, with no possibility of noise or inaccuracy.

0.5, on the other hand, by scientific convention means 0.5<and any additional precision was lost to noise, coarse measuring tools etc.>. In other words, 0.5 does not mean 1/2. It could be 0.503, or even 0.55, or 0.49. In that case, the best you can do is say that 0.5/2 = 0.2 - the approximation is simply an admission that the data is noisy, and the equal sign is analogous at best to its mathematical counterpart. Along these lines, 0.50 / 2.00 = 0.25 - but that still isn't the same as 1/2 divided by 2 = 1/4. You've just pushed the noise back one significant digit.


This is exactly what I'm getting at. There is no reason to infer lost precision or coarse measuring tools or anything of that nature. In the end you are specifying the same value - 0.25 represents the same thing as 1/4.
Quote:
One of the things that floating point does, actually, is present the illusion of precision by ignoring the idea of significant digits, and this along with the lack of a built in "equality within a given delta" operator actually introduces inaccuracy to measurements, and instills false confidence. (FP also introduces inaccuracies via approximation, but at 64 bits that is only a problem at unusual extremes, and problem children like 1/3 and 1/10.) If all you know is that you have a measurement of 0.25-and-change, then it's not accurate to say that you have a measurement of 0.25395264823. You might... As a result, responsible FP code tracks the delta that represents the real, guaranteed accuracy, and uses it when appropriate to clip the machine's overly optimistic "precision" and compensate for its exacting comparison operators.

If you have a measurement of 0.25 and change, as you say, it is silly to dispose of that "change" until you know whether it is relevant. There may be little precision in that measurement, but you do want to keep and track all of the resolution that you had when you made the measurement. You seem to be making the mistake of confusing resolution with precision; they are not the same thing.
Quote:

Yes, but you still haven't made a case for it. I'm sure that, somewhere, there's a problem involving 1,024 bit vectors. Maybe someone's trying to model the impact of the solar wind on the Milky Way down to the cubic femtometer? Before you build it into hardware, you have to ask what the benefit is vs. the cost. The cost is not inconsiderable: much wider registers are a benefit all around until your massive vector unit spends most of its time twiddling its thumbs while the bus and main RAM struggle under the load, and the caches thrash constantly.

Frankly you have not made any case at all for keeping doubles out of a SIMD unit of any type. All you have to accept is that singles, that is 32-bit floats, do not have the dynamic range for many applications. It really doesn't matter whether the application needs one or two extra bits or a lot more; the next logical data type is double. This implies that it would be reasonable to expand VMX to handle 64-bit data types.

I don't deny that implementing such capabilities will cost transistors. It is probably for this reason that the rumor is directed at 65nm devices. The same argument can be made that the reason AltiVec has its current limitations is one of economics with respect to the process it was first targeted at.

The issues with caches and data transfers are real, which is one of the reasons I believe that Apple & IBM are looking seriously at improving VMX. Maybe not VMX2 as the rumors describe it, but improvements nonetheless. Since such improvements would address data movement and buffering as much as anything else, new data types and instructions would take less of a hit. Remember, this rumor revolves around a new or improved VMX unit; hopefully it will not be saddled with the current unit's limitations.

Quote:
In short, you can't really argue for the adoption of this technology until you sit down and figure out how hard it is to implement, and what the implications will be for the rest of the CPU and the rest of the board. Right now, today, VMX will cheerfully eat four times the total bandwidth of the 970's bus, starving out the rest of the CPU. Double the register width, and the bandwidth requirements double. If you want something to replace a supercomputer, you need to give it all the bandwidth it could ever want, or it'll sit there twiddling its thumbs at incredibly high speed. And, if you're Apple, you have to figure out when your ersatz Cray will appear in a PowerBook or an iMac - an eventuality which every extra transistor delays.

Since this is a chip that apparently hasn't even reached the realization stage, the above concerns are not really valid yet. Bandwidth is and always will be an issue. Any new design would have to address those bandwidth issues, but those issues would be in place even if the SIMD unit were not touched. You only need to give the improved VMX unit enough bandwidth to make it a significant improvement over other performance options. Trade-offs in processor design are not going away; it is a matter of getting the best bang for the buck for the processor's targeted market. As you have accurately indicated, there is a great deal of room for improvement in the current design.

Thanks
Dave
post #58 of 115
Not needing 64-bit floats is not the reason to avoid putting them into VMX2.

Better arguments are to be found in examining the costs of such an addition in terms of the amount of machine state to be preserved, the opportunity cost, and the effects of fragmenting the PowerPC user programming model further. Intel and AMD have been changing their programming model on a whim for years and look at the mess it has gotten them into, and developers either don't bother using any of it or choose to support some very small subset (but the hardware has to support it all).

For a given number of transistors, is it better to add double support to VMX2, or to improve the normal FPU implementation? If you add transistors to the VMX2 units then nobody gets the benefit until they recode for VMX2 specifically. If you add to the FPUs then everybody already doing double precision benefits immediately. In an SMT design there is probably a thread running on the processor somewhere that can use all those FPUs all the time. If you extend VMX to have doubles you pretty much have to double the register width to be useful, adding another 512 bytes to a full context switch.

The AltiVec unit is great. It is a terrific design. I just don't see that increasing the register size and adding the huge complexity of a quad double precision unit is worth it, however. There are other instructions they could add first that improve what the unit can do already without having to introduce such an expensive new type. With this kind of an addition everybody pays the price, but few reap the benefits.
post #59 of 115
Quote:
Originally posted by wizard69
This is exactly what I'm getting at. There is no reason to infer lost precision or coarse measuring tools or anything of that nature. In the end you are specifying the same value - 0.25 represents the same thing as 1/4.

No, it's not what you're getting at. You're missing the point. The mathematical real number 0.25 is the mathematical real 1 divided by the mathematical real 2. Anything measured is a lot messier than that, and any responsible scientist has to account for that. A measured value of "0.25" could be any of: 0.24999, 0.253, 0.25000000001, or even precisely 1/2 (although what are the odds of that?). You don't know what the exact value is, so any assumptions beyond the initial significant digits are almost guaranteed to be false.

Quote:
If you have a measurement of 0.25 and change, as you say, it is silly to dispose of that "change" until you know whether it is relevant. There may be little precision in that measurement, but you do want to keep and track all of the resolution that you had when you made the measurement. You seem to be making the mistake of confusing resolution with precision; they are not the same thing.

At this point I have no idea what you're talking about. Precision represents the accuracy with which something can be represented. It applies both to measurements and to representations in floating point, which is why people refer to "64 bit precision" and "precision tools".

How is resolution different from precision, anyway? Both specify a quantum value beneath which the representation is no longer accurate.

You've misinterpreted what I've said. If you have a measurement of "0.25 and change" you don't know what that change is. It could be zero, or it could not be. Disposing of it is not an issue, because you don't know what it is in the first place. If you could measure it in any meaningful way, there would be significant digits to represent it! The fact that FP might, over the course of calculations, introduce a whole bunch of extra digits (but not significant digits, because you can't get signal from noise), is an unwelcome artifact. It's not anything you can use, and it's not the "change" I was referring to.

Quote:
Frankly you have not made any case at all for keeping doubles out of a SIMD unit of any type. All you have to accept is that singles, that is 32-bit floats, do not have the dynamic range for many applications.

All you have to do is answer the question: How many applications, and are they worth the cost of implementing a vector unit vs. using parallelism and conventional FP units? It's nice that there are supercomputers that can do this, but Apple doesn't make supercomputers (silly marketing hype aside). I don't pretend to know the answer to that question, but it's not a simple question, and it can't be blown off.

We don't know that IBM and Apple (and probably Mot) are revisiting VMX to provide this functionality, either. There are all kinds of capabilities they could add that would greatly improve its appeal for streaming and signal processing work, without changing the sizes of the registers or supporting 64 bit anything.

Quote:
Since this is a chip that apparently hasn't even reached the realization stage, the above concerns are not really valid yet.

No, they're valid, they just aren't anything more than "concerns." I'm not saying can't or won't or shouldn't. I'm merely pointing out that the concerns are hairy enough, and the payoff uncertain enough, and other features desirable enough, that the support you want might or might not happen even at 65nm. Whether it happens depends on a large number of variables whose values are currently unknown.

Personally, looking at the problem, I am leaning toward "won't happen." The current top of the line PowerMac can crunch through 4 FP calculations per clock (2 CPUs with 2 FPUs each). There you go - no additional hardware necessary, and twice the memory bandwidth and twice the cache that would be available to a 64-bit-savvy VMX unit on one CPU.

Quote:
Trade-offs in processor design are not going away; it is a matter of getting the best bang for the buck for the processor's targeted market.

So what are the tradeoffs in going to 256 bit or 512 bit registers, and how easy are they to surmount? Is it worth it to the target market? There is zero use for it in the embedded space (any time soon), zero use on the desktop (any time soon), so that leaves the workstation and server markets. Servers might be able to use it for IPv6 networking, but IBM seems to have other ideas for that sort of thing (FastPath, which will intercept a lot of the system interrupts and allow the main CPU to keep crunching away).
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
post #60 of 115
Quote:
Originally posted by Amorph
No, it's not what you're getting at. You're missing the point. The mathematical real number 0.25 is the mathematical real 1 divided by the mathematical real 2. Anything measured is a lot messier than that, and any responsible scientist has to account for that...


....

now that's the way to start a mathematical argument...
post #61 of 115
Thread Starter 
Quote:
Originally posted by Bigc
now that's the way to start a mathematical argument...

It was just a syntax error. Don't be hard on Amorph: he has the power
post #62 of 115
Quote:
Originally posted by Bigc
now that's the way to start a mathematical argument...

It's true in Amorphomatics, alright? Sheesh.

Picky, picky, picky.
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
post #63 of 115
8)
post #64 of 115
Kickaha and Amorph couldn't moderate themselves out of a paper bag. Abdicate responsibility and succumb to idiocy. Two years of letting a member make personal attacks against others, then stepping aside when someone won't put up with it. Not only that but go ahead and shut down my posting privileges but not the one making the attacks. Not even the common decency to abide by their warning (after three days of absorbing personal attacks with no mods in sight), just shut my posting down and then say it might happen later if a certain line is crossed. Bullshit flag is flying, I won't abide by lying and coddling of liars who go off-site, create accounts differing in a single letter from my handle with the express purpose to deceive and then claim here that I did it. Everyone be warned, kim kap sol is a lying, deceitful poster.

Now I guess they should have banned me rather than just shut off posting privileges, because kickaha and Amorph definitely aren't going to like being called to task when they thought they had it all ignored *cough* *cough* I mean under control. Just a couple o' tools.

Don't worry, as soon as my work resetting my posts is done I'll disappear forever.
post #65 of 115
I'm in the "let's just go to 4x real 64-bit FPUs" camp too.

On another note entirely, one key point of the Altivec unit is dealing with piles of streaming data.

What about a FPGA unit in addition, instead of a change to the Altivec units themselves? It would seem like a great coprocessor to the AV unit from my POV - a lot of the issues involved in using AV involve massaging data into/out of various formats.

For those that don't know, "FPGA" stands for field programmable gate array, which essentially means the _hardware_ is configured for the specific job at hand. They aren't as fast as 'normal' chips, but you can explicitly ignore/change various things that you know your algorithm doesn't care about. (So it could do 65-bit math if you wanted, or 9-bit math, or whatever.)
post #66 of 115
Quote:
Originally posted by Amorph
No, it's not what you're getting at. You're missing the point. The mathematical real number 0.25 is the mathematical real 1 divided by the mathematical real 2. Anything measured is a lot messier than that, and any responsible scientist has to account for that. A measured value of "0.25" could be any of: 0.24999, 0.253, 0.25000000001, or even precisely 1/2 (although what are the odds of that?). You don't know what the exact value is, so any assumptions beyond the initial significant digits are almost guaranteed to be false.


If you can't see what I'm getting at then we have to reconsider who is missing the point. You bounce back and forth between the concepts of measurement and math, apparently to confuse yourself or the readers of this thread.

Anybody with a little bit of experience in the real world knows that there is uncertainty in measurement. Figuring out which digits are significant is much more involved than just grabbing the "initial significant digits".
Quote:
At this point I have no idea what you're talking about. Precision represents the accuracy with which something can be represented. It applies both to measurements and to representations in floating point, which is why people refer to "64 bit precision" and "precision tools".

So now you are trying to claim that a 64-bit float has 64 bits of precision. You really don't expect me to believe that, do you?
Quote:
How is resolution different from precision, anyway? Both specify a quantum value beneath which the representation is no longer accurate.

There is a huge difference between resolution and precision or accuracy. They are two completely different things. It is one of the reasons why manufacturers of test equipment print detailed data sheets on their instrumentation. It is very possible that an instrument can be fairly precise on one range and marginal on another, even though it may be resolving the same number of digits.
Quote:

You've misinterpreted what I've said. If you have a measurement of "0.25 and change" you don't know what that change is. It could be zero, or it could not be. Disposing of it is not an issue, because you don't know what it is in the first place. If you could measure it in any meaningful way, there would be significant digits to represent it! The fact that FP might, over the course of calculations, introduce a whole bunch of extra digits (but not significant digits, because you can't get signal from noise), is an unwelcome artifact. It's not anything you can use, and it's not the "change" I was referring to.

Again I find myself agreeing with you, then disagreeing with you. Just because you can resolve a value does not imply that the value is accurate or significant. If it can be resolved then keep it around, but if you have no significant measurement then you don't have any change.

You are correct that FP math can add extra digits, but it is up to the algorithm designer to determine if they are significant at any point. Premature rounding can introduce as many errors as excessive reliance on extra digits. You cannot just make rash decisions on when information can be dropped.
Quote:

All you have to do is answer the question: How many applications, and are they worth the cost of implementing a vector unit vs. using parallelism and conventional FP units? It's nice that there are supercomputers that can do this, but Apple doesn't make supercomputers (silly marketing hype aside). I don't pretend to know the answer to that question, but it's not a simple question, and it can't be blown off.

I agree that the design question is not simple. In the context of the rumored revision to the VMX unit it is worth looking at. Considering Apple's push into the workstation market it may well be worth the investment.

As an aside I have to believe that Apple and IBM see a great deal of potential in the VMX subsystem. How that will be improved and extended in the future will be an interesting subject to debate.
Quote:

We don't know that IBM and Apple (and probably Mot) are revisiting VMX to provide this functionality, either. There are all kinds of capabilities they could add that would greatly improve its appeal for streaming and signal processing work, without changing the sizes of the registers or supporting 64 bit anything.

I still don't know what the hang-up about 64 bits is. If VMX2 has 256-bit registers and keeps symmetry with the current implementation then you will have greatly improved VMX's handling of the original data types. In effect you would double single-cycle performance on the data types currently processed by this unit. The side effect of being able to handle doubles could be very useful.

The issue boils down to: is the VMX unit a good place to be doing this, that is, handling double vector math? Not being a hardware engineer I can only guess that it would be easier to do it here rather than in the normal FPU.
Quote:
No, they're valid, they just aren't anything more than "concerns." I'm not saying can't or won't or shouldn't. I'm merely pointing out that the concerns are hairy enough, and the payoff uncertain enough, and other features desirable enough, that the support you want might or might not happen even at 65nm. Whether it happens depends on a large number of variables whose values are currently unknown.

Personally, looking at the problem, I am leaning toward "won't happen." The current top of the line PowerMac can crunch through 4 FP calculations per clock (2 CPUs with 2 FPUs each). There you go - no additional hardware necessary, and twice the memory bandwidth and twice the cache that would be available to a 64-bit-savvy VMX unit on one CPU.

I would hope that the PPC would not become stagnant, or worse yet fall into the confusion that is Intel hardware. So I have to think that some improvements and extensions are on the way - hopefully well planned. The issue I have is that some operations currently done on the VMX unit do not fit well into the FPU and integer units. I do not think that IBM or Apple would try to cram into those units instructions and operations that would slow them down or hamper other improvements.
Quote:
So what are the tradeoffs in going to 256 bit or 512 bit registers, and how easy are they to surmount? Is it worth it to the target market? There is zero use for it in the embedded space (any time soon), zero use on the desktop (any time soon), so that leaves the workstation and server markets. Servers might be able to use it for IPv6 networking, but IBM seems to have other ideas for that sort of thing (FastPath, which will intercept a lot of the system interrupts and allow the main CPU to keep crunching away).

Well, these are all the arguments that people first used when AltiVec came out. Eventually the facility was found to be very useful, sometimes in unexpected ways. In any event I have to disagree with you with respect to the thought that there would be zero demand in the embedded space or on the desktop. You are also making an assumption as to what a server may be doing with the hardware; computation servers are big business nowadays. In fact, computation servers are such a big thing that IBM is back in the business of selling computer time.

Even I was a bit reluctant to believe that that would ever happen again.

Dave
post #67 of 115
Thread Starter 
I am ready to bet that IBM and Apple are already calculating which subunits are used most in VMX, and which instructions deserve a speed bump.

Currently VMX has four execution units and can dispatch to any 2 of them per cycle (on the G4e). Imagine that one of these units is used very often and is the bottleneck, while the other units are not used at 100%: the solution is simple, duplicate the right subunit.

VMX2 should still be 128 bits, but will have larger execution units that can deal with more instructions, and some of these units may be duplicated.

Add to the chip some new features like SMT, some extra FP units, and why not a third integer unit, and you will have a great chip: the G7 on a 65 nm process.
post #68 of 115
Quote:
Originally posted by Nevyn
I'm in the "let's just go to 4x real 64-bit FPUs" camp too.

On another note entirely, one key point of the Altivec unit is dealing with piles of streaming data.

What about a FPGA unit in addition, instead of a change to the Altivec units themselves? It would seem like a great coprocessor to the AV unit from my POV - a lot of the issues involved in using AV involve massaging data into/out of various formats.

For those that don't know, "FPGA" stands for field programmable gate array, which essentially means the _hardware_ is configured for the specific job at hand. They aren't as fast as 'normal' chips, but you can explicitly ignore/change various things that you know your algorithm doesn't care about. (So it could do 65-bit math if you wanted, or 9-bit math, or whatever.)

4 FPUs? You can reach a point of diminishing returns where you have all this extra silicon on the CPU and some of it is infrequently used. 4 FPUs is a LOT of FPUs; it is very hard for the CPU to keep 4 FPUs busy.

FPGAs are used in different situations than general-purpose CPUs. I think that it would be very difficult to use them effectively in most "normal" programming models. For example, the programmer would have to set the "bitness" of a number (so that the FPGA could adjust its bit width). This would invariably leak CPU details up to the programmer, which is a bad thing in many cases. I don't think that an FPGA would work out too well (especially since they run much slower than a normal CPU).
King Felix
Reply
post #69 of 115
Quote:
Originally posted by wizard69
If you can't see what I'm getting at then we have to reconsider who is missing the point. You bounce back and forth between the concepts of measurement and math, apparently to confuse yourself or the readers of this thread.

No, that's the whole point: Abstract math is one thing. Experimentally derived data is another. Abstract mathematics is arbitrarily precise. Nothing else is. "0.5" means exactly one half in mathematics, and "about 0.5" to a field scientist. This is neither difficult nor confusing: One discipline does not have to account for error or noise, the other does. Their conventions differ accordingly.

Quote:
You are correct that FP math can add extra digits, but it is up to the algorithm designer to determine if they are significant at any point.

YES. FINALLY. Significant digits matter, it's up to the person doing the work to figure out and track what is significant, and FP contributes nothing to this but the occasional approximation. Welcome to my thesis. Now, if significant digits matter, then 0.5 / 2 is not necessarily 0.25, is it? Mathematically, yes. Experimentally, no, because the hundredths digit in the result claims a level of precision that doesn't appear anywhere else. (The denominator is a special case, because it has no decimal point, so it's assumed to be mathematically precise. The example would be clearer if it said 0.5 / 2.0 = 0.25.) If you're trying to contain noise, you pick the worst case, not the best case. Better to deal with acknowledged, controlled lack of precision than apparently significant noise.
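
A minimal sketch in plain C of that point: the hardware happily reports 0.25 no matter how many of the input digits were actually significant, so tracking significance stays with whoever is doing the work.

Code:
#include <stdio.h>

int main(void)
{
    double measured = 0.5;          /* one significant decimal digit from the bench */
    double result   = measured / 2.0;

    /* Prints 0.25 -- more apparent precision than the measurement supports. */
    printf("%.2f\n", result);
    return 0;
}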

Quote:
I still don't know what the hang-up about 64 bits is. If VMX2 has 256-bit registers and keeps symmetry with the current implementation then you will have greatly improved VMX's handling of the original data types. In effect you would double single-cycle performance on the data types currently processed by this unit. The side effect of being able to handle doubles could be very useful.

No you wouldn't, for the same reason that the rest of the 970 doesn't double 32 bit performance that way. If AltiVec went 256 bit, the original 128 bit functions would continue to function as they had, and half of all of the registers would simply go unused. As far as I know, it has never been considered worthwhile for the CPU to "pack" operations like that, and 64 bit CPUs have been around for a long time now. The benefit is considerably less than 2x, and the cost in transistors is steep.

If anything, 128-bit AltiVec operations would slow down because the bandwidth requirements would double. The same thing happened to scalar 32 bit operations in 64 bit CPUs. Application developers could recode for 256 bit registers to get around that, but there'd still be a lot of the old code out there.

Quote:
The issue boils down to: is the VMX unit a good place to be doing this, that is, handling double vector math? Not being a hardware engineer I can only guess that it would be easier to do it here rather than in the normal FPU.

No, the issue boils down to: Are vectors of 64 bit values a common and soluble enough problem to be implemented in hardware? If so, then the vector unit is certainly the right place to put them. (Note that "the vector unit" is a logical sum of several physical units, so the 64 bit portion could even be separate). If not, then they can continue to be implemented in software. 4 64-bit FPUs will be available in about a month to any programmer who wants them, although it will take some clever threading to really use them all. But then, massive SMP is how supercomputers work, so anyone trying to do supercomputer work on a G5 will already be familiar with the problem.

Quote:
Well these are all the arguments that people first used when AltiVec came out. Eventually the facility was found to be very useful, sometimes in unexpected ways. In any event I have to disagree with the idea that there would be zero demand in the embedded space or on the desktop. You are also making an assumption about what a server may be doing with the hardware; computation servers are big business nowadays. In fact computation servers are such a big thing that IBM is back in the business of selling computer time.

Some people were skeptical, but not anyone who knew anything about the problem. It was pretty obvious out of the starting gate that it would be router and telephony heaven in the embedded space, and filter and codec heaven on the desktop. Even tech columnists caught on fast to the idea that SIMD on the desktop would usher in vastly improved multimedia. Most of the skepticism I read centered around whether anyone would use it - not because it wasn't inherently useful, but because it was an extension that was only shipping in one product line. Those people had the Quadra 840av - which had a dedicated DSP chip that was hardly ever used, and eventually orphaned - firmly in mind. The consensus among the embedded AltiVec programmers around Ars Technica and altivec.org is that vectors of 64 bit values are of no use in the embedded space; they'd know, so I'm taking them at their word. As for servers, clusters are the computational engines, and of course Apple is selling racks of Xserves to bioinformatics and genetics firms in large part because of the dual AltiVec engines. But that still doesn't address the larger server space, much of which involves pushing files around, networking, and firing off the odd Perl script.

I don't remember too many people worrying about the G4's bandwidth at the outset. By the end, of course, Apple's own Advanced Mathematics Group was bitching in WWDC sessions about the lack of bandwidth, but them's the breaks. It is worth noting that even a hypothetical "double wide" VMX wouldn't tax the bus as much as the 12:1 demand:supply ratio of AltiVec to the G4's MaxBus. But then, that disparity crippled AltiVec on the G4, and Apple would be wise to make sure that doesn't happen again.
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
Reply
post #70 of 115
Wizard69, Amorph made a good post and you should listen to him - he is right.
King Felix
Reply
post #71 of 115
Quote:
Originally posted by Yevgeny
4 FPUs? You can reach a point of diminishing returns where you have all this extra silicon on the CPU and some of it is infrequently used. 4 FPUs is a LOT of FPUs. It is very hard for the CPU to keep 4 FPUs busy.

So, when we have dual cores we'll have trouble keeping one busy? That's silliness. It seems like we're headed towards taking an entire 'core' and replicating it -> you end up with 4 FPUs, 2 in one core, 2 in the other core. Using two cores on one chip would seem to take pretty much the identical amount of silicon as one core that happens to have twice as many functional units. (Acknowledging that the dispatching will be trickier, which is a factor in why we're headed to two cores.) I don't actually care how the units are packaged, just that the end-user box has more FPUs. Preferably lots more.

Quote:
Originally posted by Yevgeny
FPGA's are used in different situations than general purpose CPUs. I think that it would be very difficult to use them effectively in most "normal" programming models.

Good AltiVec is brain-pretzelizing also. So? It could be packaged API-wise as a complex pre-packaged addition to the vector instructions (with the penalties for each added). There's a big difference between an extra 5-cycle (or 10-cycle) penalty on some operations and being required to run the integer unit full tilt to massage things going into the AV unit.
post #72 of 115
Quote:
Originally posted by Nevyn
So, when we have dual cores we'll have trouble keeping one busy? That's silliness. It seems like we're headed towards taking an entire 'core' and replicating it -> you end up with 4 FPUs, 2 in one core, 2 in the other core. Using two cores on one chip would seem to take pretty much the identical amount of silicon as one core that happens to have twice as many functional units. (Acknowledging that the dispatching will be trickier, which is a factor in why we're headed to two cores.) I don't actually care how the units are packaged, just that the end-user box has more FPUs. Preferably lots more.

There is a very large difference between having dual cores with two FP units each and having one core with 4 FP units. First of all, if you have 4 FP units, then you have to find some way (either in the compiler or in the hardware) to keep all 4 busy in an operation. Not everything needs 4 FP pipes. Secondly, if you have dual cores but not multithreaded software, then one of your cores is just sitting there waiting for something to do. Multithreaded software has taken it upon itself to resolve the issue of parallelizing the process, but adding more FPUs means that the chip or the compiler must find a way to do this, and it isn't always possible. If you had 8 FPUs, then you would have to find some insane way to keep them busy, and most of the time, most of them would just be sitting there wasting silicon. Multi-core CPUs are different from massively parallel CPUs.

Of course, I can turn your logic on you just as easily. If making massively parallel CPUs with tons of FP and integer units is so easy, then why are IBM and Moto going for multi-core CPUs? Answer: multi-core CPUs are better than massively parallel CPUs (like Itanium). Why are multi-core CPUs better? Because making the programmer create multithreaded software is better than making the chip or the compiler smarter.
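
A minimal sketch, assuming plain C with POSIX threads (the names sum_half and partial are made up for illustration), of what "making the programmer create multithreaded software" means in practice: the work only spreads across two cores because the code explicitly splits it.

Code:
#include <pthread.h>
#include <stdio.h>

#define N 1000000

static double data[N];
static double partial[2];

/* sum_half is an illustrative name, not an existing API: each thread
 * sums the squares of its own half of the array. */
static void *sum_half(void *arg)
{
    int half = *(int *)arg;
    double s = 0.0;
    for (int i = half * (N / 2); i < (half + 1) * (N / 2); i++)
        s += data[i] * data[i];
    partial[half] = s;
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = { 0, 1 };

    for (int i = 0; i < N; i++)
        data[i] = (double)i / N;

    /* The programmer, not the chip, exposes the parallelism. */
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, sum_half, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("sum of squares = %f\n", partial[0] + partial[1]);
    return 0;
}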

Quote:
Good AltiVec is brain-pretzelizing also. So? It could be packaged API-wise as a complex pre-packaged addition to the vector instructions (with the penalties for each added). There's a big difference between an extra 5-cycle (or 10-cycle) penalty on some operations and being required to run the integer unit full tilt to massage things going into the AV unit.

It would be a trade-off. Maybe it would work in some cases, maybe it wouldn't work in other cases. I am not sufficiently familiar with FPGAs to say which is the case, but since I have never heard of any R&D into this issue, I can only assume that those who are familiar do not think that it is worth the effort.
King Felix
Reply
post #73 of 115
Quote:
Originally posted by Yevgeny
Why are multi-core CPUs better? Because making the programmer create multithreaded software is better than making the chip or the compiler smarter.

And yet the _next_ step is both multi-core _and_ hyperthreading, where hyperthreading is essentially a way of eking real work out of unused units (or portions of units) without the programmer explicitly doing anything other than threading. Like the 'other FPU(s)' in the single-core-with-4FPU design.

Whatever the Power5-lite ends up having, it isn't going to be a floating point slouch.
post #74 of 115
Kickaha and Amorph couldn't moderate themselves out of a paper bag. Abdicate responsibility and succumb to idiocy. Two years of letting a member make personal attacks against others, then stepping aside when someone won't put up with it. Not only that but go ahead and shut down my posting privileges but not the one making the attacks. Not even the common decency to abide by their warning (after three days of absorbing personal attacks with no mods in sight), just shut my posting down and then say it might happen later if a certain line is crossed. Bullshit flag is flying, I won't abide by lying and coddling of liars who go off-site, create accounts differing in a single letter from my handle with the express purpose to deceive and then claim here that I did it. Everyone be warned, kim kap sol is a lying, deceitful poster.

Now I guess they should have banned me rather than just shut off posting privileges, because kickaha and Amorph definitely aren't going to like being called to task when they thought they had it all ignored *cough* *cough* I mean under control. Just a couple o' tools.

Don't worry, as soon as my work resetting my posts is done I'll disappear forever.
post #75 of 115
Quote:
Originally posted by Yevgeny
There is a very large difference between having dual cores with two FP units each and having one core with 4 FP units. First of all, if you have 4 FP units, then you have to find some way (either in the compiler or in the hardware) to keep all 4 busy in an operation. Not everything needs 4 FP pipes. Secondly, if you have dual cores but not multithreaded software, then one of your cores is just sitting there waiting for something to do. Multithreaded software has taken it upon itself to resolve the issue of parallelizing the process, but adding more FPUs means that the chip or the compiler must find a way to do this, and it isn't always possible. If you had 8 FPUs, then you would have to find some insane way to keep them busy, and most of the time, most of them would just be sitting there wasting silicon. Multi-core CPUs are different from massively parallel CPUs.

If you have 4 FPUs then you keep them busy the same way you keep the VMX unit busy -- you have them operate on long vectors of non-interrelated data. The major difference is that the instruction dispatch rate has to be increased to feed the larger number of execution units, but that also buys you flexibility since not all the instructions have to be the same. The OoOE means that your loops will automatically take advantage of the available hardware, as long as you avoid data dependencies.
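
A small sketch in plain C of that idea (dot4 is a hypothetical helper, not an existing routine): four independent accumulators give the out-of-order core four dependency-free chains, so several FPU pipelines can be busy at once.

Code:
/* dot4 is an illustrative name.  n is assumed to be a multiple of 4
 * to keep the sketch short. */
double dot4(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];      /* each accumulator is independent of the others, */
        s1 += a[i + 1] * b[i + 1];  /* so no iteration stalls on the previous sum     */
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}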

Most people don't understand that the AltiVec unit is not good at doing 3D operations where a 4-vector is stored in a single vector register, and a 4x4 matrix is stored in 4 vector registers. This does not work well, and the VMX unit generally doesn't outperform the FPU in this case... even on the G4, and the G5 is even more extreme because they doubled the FPU units. The way to make VMX do this kind of math efficiently is to do 4 vectors worth of math at a time, and spread your 4x4 matrix across 16 registers.

It is in the non-floating point support, and the permute capabilities that AltiVec really shines. The 4-way floating point operations primarily benefit from dispatching 1/4 the number of instructions, but the individual fields of the register cannot be inter-dependent.
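
A rough sketch of that layout, assuming the standard <altivec.h> intrinsics (vec_splat, vec_madd) and GCC-style vector literals; the helper name transform4 is hypothetical. Four vertices are processed per call, each register holding one component of all four (structure-of-arrays), with the matrix elements splatted so the 4x4 effectively occupies 16 vector values.

Code:
#include <altivec.h>

/* transform4 is an illustrative name, not an existing API.
 * x, y, z, w: the same component of four different vertices (SoA layout).
 * row0..row3: the rows of the 4x4 matrix. */
void transform4(vector float row0, vector float row1,
                vector float row2, vector float row3,
                vector float x, vector float y,
                vector float z, vector float w,
                vector float *ox, vector float *oy,
                vector float *oz, vector float *ow)
{
    vector float zero = (vector float){ 0.0f, 0.0f, 0.0f, 0.0f };

    /* ox = m00*x + m01*y + m02*z + m03*w, computed for four vertices at once. */
    *ox = vec_madd(vec_splat(row0, 3), w,
          vec_madd(vec_splat(row0, 2), z,
          vec_madd(vec_splat(row0, 1), y,
          vec_madd(vec_splat(row0, 0), x, zero))));

    *oy = vec_madd(vec_splat(row1, 3), w,
          vec_madd(vec_splat(row1, 2), z,
          vec_madd(vec_splat(row1, 1), y,
          vec_madd(vec_splat(row1, 0), x, zero))));

    *oz = vec_madd(vec_splat(row2, 3), w,
          vec_madd(vec_splat(row2, 2), z,
          vec_madd(vec_splat(row2, 1), y,
          vec_madd(vec_splat(row2, 0), x, zero))));

    *ow = vec_madd(vec_splat(row3, 3), w,
          vec_madd(vec_splat(row3, 2), z,
          vec_madd(vec_splat(row3, 1), y,
          vec_madd(vec_splat(row3, 0), x, zero))));
}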


On the subject of precision and accuracy, consider for a moment that a 32-bit float cannot represent more values than a 32-bit integer (~4 billion of them). In fact it can represent fewer because a few values are reserved for special meanings. The power of the floating point representation is that the scale of the representable values is non-linear and can thus cover a much wider range with a varying degree of precision. So which is more "precise", a 32-bit integer or a 32-bit float?
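
A quick sketch in plain C of the representable-values point: a 32-bit float runs out of exact integers at 2^24, so adding 1 at that scale changes nothing.

Code:
#include <stdio.h>

int main(void)
{
    float a = 16777216.0f;   /* 2^24 is exactly representable              */
    float b = a + 1.0f;      /* 2^24 + 1 is not; it rounds back down to 2^24 */

    printf("%s\n", (a == b) ? "equal" : "different");   /* prints "equal" */
    return 0;
}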
Providing grist for the rumour mill since 2001.
Reply
post #76 of 115
Quote:
Originally posted by Amorph


YES. FINALLY. Significant digits matter, it's up to the person doing the work to figure out and track what is significant, and FP contributes nothing to this but the occasional approximation. Welcome to my thesis. Now, if significant digits matter, then 0.5 / 2 is not necessarily 0.25, is it? Mathematically, yes. Experimentally, no, because the hundredths digit in the result claims a level of precision that doesn't appear anywhere else. (The denominator is a special case, because it has no decimal point, so it's assumed to be mathematically precise. The example would be clearer if it said 0.5 / 2.0 = 0.25.) If you're trying to contain noise, you pick the worst case, not the best case. Better to deal with acknowledged, controlled lack of precision than apparently significant noise.


Let's see if I can approach this from a different direction. Let's say you're working in a lab and a technician has a widget running at half a volt (0.5) and you ask him to cut the voltage in half (division by two). Would you be happy if the resulting voltage is 0.3 or 0.2, or would you expect a value of 0.25?

Quote:
No you wouldn't, for the same reason that the rest of the 970 doesn't double 32 bit performance that way. If AltiVec went 256 bit, the original 128 bit functions would continue to function as they had, and half of all of the registers would simply go unused. As far as I know, it has never been considered worthwhile for the CPU to "pack" operations like that, and 64 bit CPUs have been around for a long time now. The benefit is considerably less than 2x, and the cost in transistors is steep.

Now here you are wrong, or at least have not explained things properly. Remember SIMD is "single instruction multiple data"; the current VMX registers are working on multiple quantities of data at the same time. This is not comparable to widening the registers in a conventional ALU. In effect that is what VMX does; it takes packs of data values, of a supported data type, and executes an instruction against them.

Yes, in some cases the VMX instructions would have to be extended to handle the additional data, but that is not an unusual thing to do in processor development. Now the question becomes how useful that would be. In some instances I could see it being very useful.
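
A minimal sketch of that packing, assuming <altivec.h> and GCC-style vector literals: one vec_add operates on four 32-bit floats at once, and a 256-bit register file of the same design would simply operate on eight.

Code:
#include <altivec.h>
#include <stdio.h>

int main(void)
{
    vector float a = (vector float){ 1.0f, 2.0f, 3.0f, 4.0f };
    vector float b = (vector float){ 10.0f, 20.0f, 30.0f, 40.0f };
    vector float c = vec_add(a, b);     /* four independent adds, one instruction */

    float out[4] __attribute__((aligned(16)));   /* vec_st needs 16-byte alignment */
    vec_st(c, 0, out);                  /* store all four lanes back to memory */

    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}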
Quote:

If anything, 128-bit AltiVec operations would slow down because the bandwidth requirements would double. The same thing happened to scalar 32 bit operations in 64 bit CPUs. Application developers could recode for 256 bit registers to get around that, but there'd still be a lot of the old code out there.



No, the issue boils down to: Are vectors of 64 bit values a common and soluble enough problem to be implemented in hardware? If so, then the vector unit is certainly the right place to put them. (Note that "the vector unit" is a logical sum of several physical units, so the 64 bit portion could even be separate). If not, then they can continue to be implemented in software. 4 64-bit FPUs will be available in about a month to any programmer who wants them, although it will take some clever threading to really use them all. But then, massive SMP is how supercomputers work, so anyone trying to do supercomputer work on a G5 will already be familiar with the problem.

It is interesting to look at DSP chips and how they progressed over time. The integer registers tended to widen, and eventually the DSPs started to support floats. It is a normal progression: as more knowledge and understanding become available, it becomes easier to apply the new capabilities.

The only reason we see stunted growth in the DSP market is the very good performance that is now possible with AltiVec and the SIMD units on the Intel side of the fence. If support for data types wider than 32 bits does not appear on these processors, it will eventually appear somewhere else.

Quote:
Some people were skeptical, but not anyone who knew anything about the problem. It was pretty obvious out of the starting gate that it would be router and telephony heaven in the embedded space, and filter and codec heaven on the desktop. Even tech columnists caught on fast to the idea that SIMD on the desktop would usher in vastly improved multimedia. Most of the skepticism I read centered around whether anyone would use it - not because it wasn't inherently useful, but because it was an extension that was only shipping in one product line. Those people had the Quadra 840av - which had a dedicated DSP chip that was hardly ever used, and eventually orphaned - firmly in mind. The consensus among the embedded AltiVec programmers around Ars Technica and altivec.org is that vectors of 64 bit values are of no use in the embedded space; they'd know, so I'm taking them at their word. As for servers, clusters are the computational engines, and of course Apple is selling racks of Xserves to bioinformatics and genetics firms in large part because of the dual AltiVec engines. But that still doesn't address the larger server space, much of which involves pushing files around, networking, and firing off the odd Perl script.

I don't remember too many people worrying about the G4's bandwidth at the outset. By the end, of course, Apple's own Advanced Mathematics Group was bitching in WWDC sessions about the lack of bandwidth, but them's the breaks. It is worth noting that even a hypothetical "double wide" VMX wouldn't tax the bus as much as the 12:1 demand:supply ratio of AltiVec to the G4's MaxBus. But then, that disparity crippled AltiVec on the G4, and Apple would be wise to make sure that doesn't happen again.

Now that last statement is something we can agree on!!!

Thanks
Dave
post #77 of 115
Well, no, he is either not communicating well or has at least a few concepts wrong. It does not make sense to compare changing the width of a register in the main CPU's ALU with changing the register size in a vector unit.

In an ALU you are always doing one operation on one piece of data in a register. Within a vector unit you are operating on a number of pieces of data at the same time. The effects of changing the width of a vector unit are different from those experienced when changing the width of a processor's registers.

Amorph has made some very good arguments, so I find it hard to understand how he let that slip out.

Thanks
Dave


Quote:
Originally posted by Yevgeny
Wizard69, Amorph made a good post and you should listen to him - he is right.
post #78 of 115
Quote:
Originally posted by Programmer
If you have 4 FPUs then you keep them busy the same way you keep the VMX unit busy -- you have them operate on long vectors of non-interrelated data. The major difference is that the instruction dispatch rate has to be increased to feed the larger number of execution units, but that also buys you flexibility since not all the instructions have to be the same. The OoOE means that your loops will automatically take advantage of the available hardware, as long as you avoid data dependencies.

Most people don't understand that the AltiVec unit is not good at doing 3D operations where a 4-vector is stored in a single vector register, and a 4x4 matrix is stored in 4 vector registers. This does not work well, and the VMX unit generally doesn't outperform the FPU in this case... even on the G4, and the G5 is even more extreme because they doubled the FPU units. The way to make VMX do this kind of math efficiently is to do 4 vectors worth of math at a time, and spread your 4x4 matrix across 16 registers.

It is in the non-floating point support, and the permute capabilities that AltiVec really shines. The 4-way floating point operations primarily benefit from dispatching 1/4 the number of instructions, but the individual fields of the register cannot be inter-dependent.


On the subject of precision and accuracy, consider for a moment that a 32-bit float cannot represent more values than a 32-bit integer (~4 billion of them). In fact it can represent fewer because a few values are reserved for special meanings. The power of the floating point representation is that the scale of the representable values is non-linear and can thus cover a much wider range with a varying degree of precision. So which is more "precise", a 32-bit integer or a 32-bit float?

Now we have some very interesting points. A single allocates one bit to the sign, 8 bits to the exponent, and the rest to the significand: 23 stored fraction bits, or 24 bits of effective precision counting the implicit leading bit. A 23-bit fraction is not much to speak of, especially in this day and age when an A-to-D converter may spit out more bits for each measurement it takes.
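
A small sketch in plain C that pulls those fields out of a float so the 1/8/23 split is visible:

Code:
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = -1.5f;
    unsigned int bits;
    memcpy(&bits, &f, sizeof bits);               /* reinterpret the 32 bits */

    unsigned int sign     = bits >> 31;           /* 1 bit                    */
    unsigned int exponent = (bits >> 23) & 0xFF;  /* 8 bits, biased by 127    */
    unsigned int fraction = bits & 0x7FFFFF;      /* 23 stored fraction bits  */

    printf("sign=%u exponent=%u fraction=0x%06X\n", sign, exponent, fraction);
    return 0;
}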

As I have said, 64 bits can be justified easily, simply because of the limits imposed by the float data type. Just because someone cannot imagine a need for VMX support for this data type does not mean that the need does not exist or won't exist in the future.

Everyone should remember when we were told that 640K would be all the memory you would ever need. This short-sightedness is much the same with respect to this discussion of VMX2. Sure, VMX2 does not exist yet, but we would all be fools if we were to run around believing that VMX is all we will ever need.

Dave
post #79 of 115
post #80 of 115
Quote:
Originally posted by wizard69
As I have said, 64 bits can be justified easily, simply because of the limits imposed by the float data type. Just because someone cannot imagine a need for VMX support for this data type does not mean that the need does not exist or won't exist in the future.

You seem to be obsessed with the accuracy of physical measurements. There is a very large set of software problems where there are no physical measurements to be dealt with -- perhaps most problems? Even if you have measurements, and they are lower precision than a 32-bit float, you still need higher-precision math in order to run many algorithms on this data to avoid inaccuracies creeping in because of the nature of fixed-precision math.

You are absolutely right that there is a need for 64-bit numbers; my position is that the FPU(s) is where this data should exist, not the vector unit. The advent of SMT, the need to write new code to leverage VMX2, splitting the hardware base, the constraints on vector processing, and the cost of context switching with a large vector register set are all reasons why you wouldn't want to put doubles into the vector unit (even with 100% backwards compatibility). There are some perfectly valid reasons to extend VMX2 in this way, but while it might be an obvious thing to do I believe that the reasons not to do it outweigh the advantages. We'll see if IBM agrees with me when they announce the details of VMX2. I have a lot of confidence that whichever course they choose will be the right one since they know a lot more about the subject than anybody here.
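
A minimal sketch in plain C of the "inaccuracies creeping in" point: the same running sum kept in float and in double drifts apart purely from accumulated rounding, which is why algorithm internals often want more precision than the input data.

Code:
#include <stdio.h>

int main(void)
{
    float  fsum = 0.0f;
    double dsum = 0.0;

    for (int i = 0; i < 10000000; i++) {
        fsum += 0.1f;     /* each add is rounded to 24-bit precision */
        dsum += 0.1f;     /* same inputs, accumulated at 53 bits     */
    }

    /* The float total drifts noticeably away from the double total. */
    printf("float  sum = %f\n", fsum);
    printf("double sum = %f\n", dsum);
    return 0;
}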
Providing grist for the rumour mill since 2001.
Reply