
VMX 2

post #1 of 115
Thread Starter 
Mac Bidouille has published a new rumor about VMX 2. They warn that it is a rumor with a big R.

I'll translate for you: IBM has finished the first specifications of VMX 2 (VMX is IBM's name for what Motorola calls AltiVec and Apple calls the Velocity Engine). This new version will have a set of 65 supplementary instructions. The final specifications will be ready in early 2004. However, the complexity of VMX 2, which will require 24 million transistors on its own, has led IBM to wait for the 65 nm process before implementing it in its chips.
One thing is extremely interesting to notice: it will be the PPC 990 and the POWER6 that adopt VMX 2. IBM seems to have decided to use AltiVec on its professional chips, which will be even more in phase with Mac OS X. VMX 2 should triple the performance of AltiVec and will also ensure backward compatibility.

If this rumor is true, it means a great future for Apple's supply of great CPUs.
post #2 of 115
Triple the performance of Altivec... do I smell 256-bit Altivec?

Ouch!

Although it's a rumor, it isn't a ridiculous one in my opinion.

I wonder, is there a real need for 256-bit Altivec? Or is there plenty of parallelism to take advantage of in today's code?
post #3 of 115
Thread Starter 
Quote:
Originally posted by Zapchud
Triple the performance of Altivec... do I smell 256-bit Altivec?

Ouch!

Although it's a rumor, it isn't a ridiculous one in my opinion.

I wonder, is there a real need for 256-bit Altivec? Or is there plenty of parallelism to take advantage of in today's code?


I think the SIMD unit is the only part of a modern CPU where you can take advantage of massive parallelism. I think that 256 bits is the next logical move for VMX. I would even say that in the coming decade, 512 or even 1024 bits won't be stupid.
The one real problem with a 256-bit AltiVec is removing the internal bottlenecks so that you can actually feed such a beast.
post #4 of 115
I've always been skeptical of the benefits of extending SIMD to greater sizes. As the vector length grows it suffers from diminishing returns. It seems to me that improving the implementation of the existing instruction set could yield equivalent performance gains without having to change existing programs. If a new VMX1 unit could process 128-bit vectors twice as fast it would be as fast as a 256-bit vector unit, more flexible and you wouldn't have to rewrite your code to gain the advantage of it.

Consider that in a SIMD unit of double width you must perform exactly the same operations on the new half of the data. That means that the halves of the operation on the vector cannot be inter-dependent, and they must be exactly the same. If instead of doubling the vector width you double the instruction dispatch rate and number of execution units you now have the same performance... but more flexibility because you don't have to do the same operation on the second half of the vector.

Furthermore, the cost of these huge registers would affect the context-switching cost of the processor in a very negative way. Doubling the vector registers would be an extra 0.5K for every context (32 architected vector registers x 16 bytes = 512 bytes), on top of the already substantial 1K... a 50% increase.


I find it interesting that this (questionable) rumour doesn't mention a longer vector size, but just says "supplementary instructions". There are a bunch of things that would be nice to have in the existing VMX but weren't implemented because they wouldn't have been possible to do in a single cycle in the initial implementation (adding across a vector, for example). These operations might be possible to implement efficiently with an increased pipeline depth like in the 970, and with a larger number of transistors. This would allow many algorithms to be implemented more efficiently than is currently possible, and increase the versatility of VMX. I expect a major goal for the VMX2 design would be to support compiler auto-vectorization, which would be the single biggest possible boost to the adoption of VMX.
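
For illustration, here is roughly what "adding across a vector" costs today -- a minimal sketch using the C AltiVec intrinsics (untested; the aligned scratch array assumes a GCC-style compiler):

```c
#include <altivec.h>

/* Sum the four floats in one VMX register. No single instruction does
   this, so we rotate-and-add twice, then pull one lane back out. */
float sum_across(vector float v)
{
    v = vec_add(v, vec_sld(v, v, 8));  /* lanes become a+c, b+d, c+a, d+b */
    v = vec_add(v, vec_sld(v, v, 4));  /* every lane now holds a+b+c+d   */

    float out[4] __attribute__((aligned(16)));
    vec_st(v, 0, out);
    return out[0];
}
```

A dedicated add-across instruction in VMX2 would collapse all of that into one operation.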
Providing grist for the rumour mill since 2001.
post #5 of 115
Quote:
Originally posted by Programmer
I've always been skeptical of the benefits of extending SIMD to greater sizes. As the vector length grows it suffers from diminishing returns. It seems to me that improving the implementation of the existing instruction set could yield equivalent performance gains without having to change existing programs. If a new VMX1 unit could process 128-bit vectors twice as fast it would be as fast as a 256-bit vector unit, more flexible and you wouldn't have to rewrite your code to gain the advantage of it.

Then maybe that's what the rumor is talking about. Improving the implementation substantially, and adding the "missing" instructions.

Tweak it!
post #6 of 115
Quote:
Originally posted by Programmer
There are a bunch of things that would be nice to have in the existing VMX but weren't implemented because they wouldn't have been possible to do in a single cycle in the initial implementation (adding across a vector, for example).

Would one of those things be double precision FP operations, by any chance? I know that Altivec code can be munged to deliver double precision results, but the overhead involved, and the complexity of coding, makes it less than competitive with a DP FP unit in any event. Could some additional instructions (esp. vector permutes?) mean improved DP FP operations?
Quote:
These operations might be possible to implement efficiently with an increased pipeline depth like in the 970, and with a larger number of transistors. This would allow many algorithms to be implemented more efficiently than is currently possible, and increase the versatility of VMX.

And might have something to do with "fastpath", I suppose.
Quote:
I expect a major goal for the VMX2 design would be to support compiler auto-vectorization, which would be the single biggest possible boost to the adoption of VMX.

Another might be SMT. That might explain at least some additional transistor counts.
"Spec" is short for "specification" not "speculation".
post #7 of 115
Quote:
Originally posted by Tomb of the Unknown
Would one of those things be double precision FP operations, by any chance? I know that Altivec code can be munged to deliver double precision results, but the overhead involved, and the complexity of coding, makes it less than competitive with a DP FP unit in any event. Could some additional instructions (esp. vector permutes?) mean improved DP FP operations?

This is possible, but of dubious value unless you increase the vector size. If they decide to increase the vector size to 256 bits, then double precision makes sense (4-element double precision vectors). In 128 bits, though, you can only fit a pair of doubles, which means you're only getting the throughput of a pair of FPUs... which the 970 has. Personally I don't think it's worth it -- better to increase the number of FPUs. Again, it allows existing code to run unchanged at higher speeds and means you don't have to struggle to get things into vectors.

Quote:

And might have something to do with "fastpath", I suppose.

I wouldn't suppose any such thing. The only relationship between FastPath and VMX, in my opinion, is that FastPath allows the processor to spend more uninterrupted time executing VMX instructions on long streams of data.

Quote:

Another might be SMT. That might explain at least some additional transistor counts.

I agree in what I suspect is a backwards fashion to what you meant -- supporting SMT would not explain the transistor counts, but having SMT would justify the transistor counts. What I mean is that right now the VMX unit is probably spending a fair bit of time waiting....

Latency is the big stumbling block in processor design right now (and probably from now on). The individual stages of the pipeline are getting faster and faster, but memory keeps falling behind, pipelines keep getting longer, and inter-chip communications requires sending signals over larger distances than within the chip. This means you are waiting for things more and more... either waiting for the data to come in from memory (or cache or another chip), or waiting for a result to finish being calculated before using the result in the next calculation. All of this waiting means wasted opportunity to do something else. What is the point in having more execution resources if you can't even keep the existing ones busy as it is? SMT allows you to keep many more busy because you can now fill the wait times of each thread with the non-wait times of other threads. If you're waiting 90% of the time in each thread when it is running "full speed", then you can run 10 threads at "full speed". This doesn't speed up the individual threads, but if you can implement your task in multi-threaded fashion it is a huge improvement.
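
To put toy numbers on that (my own back-of-envelope model, not a simulation of any real core):

```c
#include <stdio.h>

/* If each thread only keeps the core busy 10% of the time, utilisation
   climbs roughly linearly with thread count until the core saturates. */
int main(void)
{
    const double busy_fraction = 0.10;
    for (int threads = 1; threads <= 12; threads++) {
        double utilisation = threads * busy_fraction;
        if (utilisation > 1.0)
            utilisation = 1.0;   /* only one core's worth of issue slots */
        printf("%2d threads -> %3.0f%% busy\n", threads, utilisation * 100.0);
    }
    return 0;
}
```

Obviously real workloads don't overlap that cleanly, but it shows the shape of the argument.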
Providing grist for the rumour mill since 2001.
post #8 of 115
Quote:
Originally posted by Programmer
Personally I don't think it's worth it -- better to increase the number of FPUs. Again, it allows existing code to run unchanged at higher speeds and means you don't have to struggle to get things into vectors.

Except that, with autovectorizing compilers, it wouldn't be as much of a struggle and streaming DP FP functions could be a major boost to certain applications. I would think 256 bit vectorized DP FP ops as you described would outperform even four FP units? Perhaps not, in which case I agree that expanding Altivec to 256 might be less cost effective than spending the transistors on additional FP units.
Quote:
I wouldn't suppose any such thing. The only relationship between FastPath and VMX, in my opinion, is that FastPath allows the processor to spend more uninterrupted time executing VMX instructions on long streams of data.

I'm not as sanguine that the two are unrelated. We don't know enough about what "fastpath" is to be certain of anything, I should think.
"Spec" is short for "specification" not "speculation".
post #9 of 115
Quote:
Originally posted by Tomb of the Unknown
Except that, with autovectorizing compilers, it wouldn't be as much of a struggle and streaming DP FP functions could be a major boost to certain applications. I would think 256 bit vectorized DP FP ops as you described would outperform even four FP units? Perhaps not, in which case I agree that expanding Altivec to 256 might be less cost effective than spending the transistors on additional FP units.

Auto-vectorizing has yet to be proven to be particularly effective except in rather specialized situations. I don't think I would want to take the chance and burn a lot of transistors on something so questionable... and I don't think IBM will either.

4 FPUs have a lot more scheduling opportunities than a 4-way SIMD unit so while they would require processing more instructions to do the same work, they could get the work done as fast (or faster). The instruction dispatch limits will probably be quite a bit higher by the time VMX2 arrives.

Quote:
I'm not as sanguine that the two are unrelated. We don't know enough about what "fastpath" is to be certain of anything, I should think.

Nice word. If you go and dig out what is known about FastPath and some of the IBM research papers I think you'd probably agree with me on this one. But for now I'll let you wallow in your pessimism.
Providing grist for the rumour mill since 2001.
post #10 of 115
Kickaha and Amorph couldn't moderate themselves out of a paper bag. Abdicate responsibility and succumb to idiocy. Two years of letting a member make personal attacks against others, then stepping aside when someone won't put up with it. Not only that, but go ahead and shut down my posting privileges but not the one making the attacks. Not even the common decency to abide by their warning (after three days of absorbing personal attacks with no mods in sight), just shut my posting down and then say it might happen later if a certain line is crossed. Bullshit flag is flying, I won't abide by lying and coddling of liars who go off-site, create accounts differing in a single letter from my handle with the express purpose to deceive and then claim here that I did it. Everyone be warned, kim kap sol is a lying, deceitful poster.

Now I guess they should have banned me rather than just shut off posting privileges, because kickaha and Amorph definitely aren't going to like being called to task when they thought they had it all ignored *cough* *cough* I mean under control. Just a couple o' tools.

Don't worry, as soon as my work resetting my posts is done I'll disappear forever.
post #11 of 115
24 million transistors seems like A LOT considering that the current VMX uses, what... a tenth of that? Or even less; I couldn't find the exact number. The original G4 (7400) used 6.4 million transistors in total. When the 970 has 58 million transistors in total and only a mere corner of it seems to be used for the VMX units, doesn't 24 million seem a bit much?
post #12 of 115
Quote:
Originally posted by Henriok
24 million transistors seems like A LOT considering that the current VMX uses, what... a tenth of that? Or even less; I couldn't find the exact number. The original G4 (7400) used 6.4 million transistors in total. When the 970 has 58 million transistors in total and only a mere corner of it seems to be used for the VMX units, doesn't 24 million seem a bit much?

It does, but I suppose it depends on how you count? For instance, the figure may include transistors spent on an increased L2 cache to support it. So, technically the revised VMX unit would not require that many additional transistors to support the additional instructions, but IBM may be planning for this much of an increase to support a longer pipeline, greater cache, etc.

Or the entire rumor could just be the fever-induced maunderings of an addled mind.
"Spec" is short for "specification" not "speculation".
post #13 of 115
Quote:
Does Altivec do vector dot or cross products?

Yes, and a whole lot more beside.

Off topic, what's the best way to start (hobbyist) programming in Altivec?
Stoo
post #14 of 115
Quote:
Originally posted by Stoo
Yes, and a whole lot more beside.

Off topic, what's the best way to start (hobbyist) programming in Altivec?

Actually it doesn't -- at least not intrinsically. You can do those operations using a bunch of instructions, but if you are trying to hold an xyzw vector in a single VMX register then it isn't very efficient -- the FPU is generally better at it. What is efficient is to have many xyzw vectors, represented as long arrays of the x's, the y's, the z's, and the w's. Then the VMX unit can do those operations very efficiently by doing 4 at a time.

For 3D operations (xyzw vectors and 4x4 matrices) there could be many instructions added to VMX that would help tremendously -- basically all of the instructions in the OpenGL vertex and fragment program specs, including the swizzle instruction. A couple of the simple ones are supported by VMX already, but most of the complex ones are not. VMX was designed for signal processing, i.e. grinding through long arrays of data. It is much less useful for "scalar operations on 4-vectors", if you know what I mean.
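
To make the "long arrays" point concrete, here is a rough sketch (untested, using the C AltiVec intrinsics) of four dot products per loop iteration on structure-of-arrays data; the arrays are assumed to be 16-byte aligned and a multiple of 4 in length:

```c
#include <altivec.h>

/* ax..aw and bx..bw hold the x/y/z/w components of many vectors,
   one component per array (structure-of-arrays layout). */
void dot_products(const float *ax, const float *ay, const float *az, const float *aw,
                  const float *bx, const float *by, const float *bz, const float *bw,
                  float *out, int n)
{
    const vector float zero = (vector float) vec_splat_u32(0);
    for (int i = 0; i < n; i += 4) {
        vector float acc = vec_madd(vec_ld(0, &ax[i]), vec_ld(0, &bx[i]), zero);
        acc = vec_madd(vec_ld(0, &ay[i]), vec_ld(0, &by[i]), acc);
        acc = vec_madd(vec_ld(0, &az[i]), vec_ld(0, &bz[i]), acc);
        acc = vec_madd(vec_ld(0, &aw[i]), vec_ld(0, &bw[i]), acc);
        vec_st(acc, 0, &out[i]);   /* four finished dot products at once */
    }
}
```

Try the same thing with one xyzw vector per register and you spend most of your time permuting instead of multiplying.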
Providing grist for the rumour mill since 2001.
post #15 of 115
Thread Starter 
A 256-bit vector unit will help with Photoshop filters, but I guess it will be useless for MP encoding (I could be wrong, so correct me if that's the case).

The extra set of instructions will help in many situations; I remember programmers asking for new features in the past. But on their own, these new instructions will not triple the level of performance.

Instead of a giant 256-bit-wide VMX unit with 162 + 65 instructions, can we imagine that the VMX 2 engine is the combination of two sub-units: a simple one and a complex one?

The simple unit would be basically the current VMX one.
The complex one would support both sets of instructions (162 + 65).

Simple stuff, like some Photoshop filters, would take advantage of both VMX units simultaneously.
Complicated stuff, like 3D operations, would take advantage of the complex unit.

Many chip manufacturers already do this for their FP and integer units. This kind of architecture improves performance without raising the transistor count too much.
In this way you could triple performance without creating a monstrous behemoth.
post #16 of 115
There are two articles in today's New York Times that may be of interest: NYT Cray and NYT IBM Fishkill.
A look in the library at Cray might give some insight into possible additional VMX instructions and a possible motive for adding them.

With some thought, the VMX2 could offer a Cray-compatible instruction set; the VMX2 could then become the computer and the 9*0 CPU the interface.

Perhaps someone could put more time into this speculation.
post #17 of 115
Quote:
Originally posted by Powerdoc
Instead of a giant 256-bit-wide VMX unit with 162 + 65 instructions, can we imagine that the VMX 2 engine is the combination of two sub-units: a simple one and a complex one?

The simple unit would be basically the current VMX one.
The complex one would support both sets of instructions (162 + 65).

The current AltiVec units on all chips with them already do this with the existing VMX instruction set. The definition of the execution units and which instructions they operate on is an implementation detail that is not specified in the user programming model. VMX2 refers to the user programming model because it defines new instructions. The programmer writes these instructions and the processor's dispatch mechanism takes care of sending them to the correct execution unit(s). This is the PowerPC way (and it is similar to the x86, but dissimilar to the IA-64).
Providing grist for the rumour mill since 2001.
post #18 of 115
I'm very ignorant of all this stuff, but implementing the OpenGL instructions in VMX seems an exciting prospect. The manipulation of graphics is a very important development thread for Apple, who will want to '3D' a lot more of the user interface in the future, plus provide APIs for developers to do so as well. Will this help in this endeavour? Will it also help to ease the optimisation of graphics drivers for specific apps? As I said, I have only slightly more than a clue in this dept.
post #19 of 115
Thread Starter 
Quote:
Originally posted by Programmer
The current AltiVec units on all chips with them already do this with the existing VMX instruction set. The definition of the execution units and which instructions they operate on is an implementation detail that is not specified in the user programming model. VMX2 refers to the user programming model because it defines new instructions. The programmer writes these instructions and the processor's dispatch mechanism takes care of sending them to the correct execution unit(s). This is the PowerPC way (and it is similar to the x86, but dissimilar to the IA-64).

You are right. For example, the G4e has a vector issue queue that dispatches to 4 execution units: simple integer, complex integer, floating point, and permute.
Now imagine that most of the instructions in the new set are dedicated to 3D; then IBM would make a new execution unit, let's call it V3D. Let's imagine that V3D is very large.

This new SIMD unit would have, for example, one permute unit, two simple integer units, one complex integer unit, two floating point units and one V3D unit, all with 128-bit registers. In this way some tasks could be performed by the two simple integer units simultaneously without enlarging the registers to 256 bits.

It's different from a system where the registers are 256 bits and the execution units are permute, simple integer, complex integer, floating point, and a new unit that supports a large number of new instructions like 3D. That system seems heavier to me than the previous one. And the previous one still works with 128-bit registers.
post #20 of 115
Quote:
Originally posted by shawk
There are two articles in today's New York Times that may be of interest: NYT Cray and NYT IBM Fishkill.
A look in the library at Cray might give some insight into possible additional VMX instructions and a possible motive for adding them.

With some thought, the VMX2 could offer a Cray-compatible instruction set; the VMX2 could then become the computer and the 9*0 CPU the interface.

Perhaps someone could put more time into this speculation.

A Cray-compatible instruction set isn't necessary; in this market the users recompile their software for their ultra-expensive hardware.

You bring up an interesting point, however -- money is being funneled back into high node performance super-computing, and IBM wants a piece of that action. Certainly they are in a good position to attack it, especially since they've just added VMX in for Apple.

Looking at the Cray documentation it looks like VMX isn't terribly far off in capabilities or speed; the main thing it lacks is 64-bit floating point capabilities. If this is what IBM wants to challenge with the POWER6 then I would agree that longer vectors and 64-bit floating point is definitely the way to go. Considering that Cray is only at about the 12.8 gigaflop level per node, IBM can easily get there with a POWER5/6 + VMX... but in that market double precision is essential. This might explain why IBM is suddenly interested in vector processing and auto-vectorizing compilers.

Heh, the Cray X1 instruction set includes a "vrip" instruction that you use when you're done with the vector registers. It declares them dead so that they don't need to be preserved across context switches. I like it... "vector rest in peace". VMX already has support for something similar, but it's not so well named.
Providing grist for the rumour mill since 2001.
post #21 of 115
Quote:
Originally posted by Powerdoc
You are right. For example, the G4e has a vector issue queue that dispatches to 4 execution units: simple integer, complex integer, floating point, and permute.
Now imagine that most of the instructions in the new set are dedicated to 3D; then IBM would make a new execution unit, let's call it V3D. Let's imagine that V3D is very large.

This new SIMD unit would have, for example, one permute unit, two simple integer units, one complex integer unit, two floating point units and one V3D unit, all with 128-bit registers. In this way some tasks could be performed by the two simple integer units simultaneously without enlarging the registers to 256 bits.

It's different from a system where the registers are 256 bits and the execution units are permute, simple integer, complex integer, floating point, and a new unit that supports a large number of new instructions like 3D. That system seems heavier to me than the previous one. And the previous one still works with 128-bit registers.

If adding these "3D" instructions is their direction then it requires no additional architected registers, and I would expect the various new instructions to be handled in whichever execution unit is most appropriate in the new implementation. I doubt you would see a "V3D" unit per-se. The 970 VMX unit already has dual VPUs which can process vectors in parallel, effectively acting as a 256-bit vector unit like you allude to. In this scheme all of the execution units would grow in size a little, and there would be more copies of them which allows all instructions to operate in parallel... effectively giving you more than 256-bit vectors in parallel.

Regarding the OpenGL issue -- this might help a software implementation of the vertex programs, but pretty much all GPUs going forward are going to include this kind of vector hardware and graphics are better handled in the GPU which is tightly coupled to the pixel rasterization hardware and high speed memory subsystem. I don't see this capability being particularly useful to OpenGL, but it would be useful to applications using OpenGL... a lot of calculations are required in some applications to figure out what to pass to the graphics engine. Also, some kinds of simulations (like those found in games) would really benefit. While some of these new instructions might appear, I would expect VMX2 to primarily remain true to its heritage in signal processing, especially if IBM is going after the supercomputing market like alluded to above.
Providing grist for the rumour mill since 2001.
post #22 of 115
Thread Starter 
Thanks for your answer, Programmer.

I now have a clearer picture: VMX 2 will support more instructions than the current implementation, and its architecture will be more parallel than the previous one.
post #23 of 115
Quote:
Originally posted by Programmer
The 970 VMX unit already has dual VPUs which can process vectors in parallel, effectively acting as a 256-bit vector unit like you allude to.

Could you explain what you mean?
The 970 VMX consists of a permute unit fed by one queue, and a simple integer unit, complex integer unit and floating point unit fed by a second queue.
How can that act as a 256-bit unit?
post #24 of 115
Quote:
Originally posted by smalM
Could you explain what you mean?
The 970 VMX consists of a permute unit fed by one queue, and a simple integer unit, complex integer unit and floating point unit fed by a second queue.
How can that act as a 256-bit unit?

No, I can't explain what I mean because I don't understand how I could say that either (at least not and be correct at the same time). I can offer excuses like "it was early in the morning" or "it was late at night". You're quite right, of course. I had "two instructions per clock" stuck in my head.

Now a simple addition of another dispatch queue in an improved 970-design with extra execution units to back it up would result in effectively a 256-bit vector unit from a performance point of view, while retaining the flexibility and context costs of the 128-bit unit. Still wouldn't give you double precision, of course.

Sorry. Bad programmer, bad.
Providing grist for the rumour mill since 2001.
post #25 of 115
Quote:
Originally posted by smalM
Could you explain what you mean?
The 970 VMX consists of a permute unit fed by one queue, and a simple integer unit, complex integer unit and floating point unit fed by a second queue.
How can that act as a 256-bit unit?

You might be interested in this PDF file from Apple's developer site. It talks about 256-bit FP operations and how to get the Altivec unit to do them.

http://developer.apple.com/hardware/ve/pdf/oct3a.pdf
"Spec" is short for "specification" not "speculation".
post #26 of 115
Thread Starter 
Quote:
Originally posted by Tomb of the Unknown
You might be interested in this PDF file from Apple's developer site. It talks about 256-bit FP operations and how to get the Altivec unit to do them.

http://developer.apple.com/hardware/ve/pdf/oct3a.pdf

It's quite different from what we were discussing here. It's a discussion of how to do 256-bit FP operations on an AltiVec unit, rather than a 256-bit AltiVec or a more parallel AltiVec unit (the current one has only 4 execution units, all different).
post #27 of 115
Quote:
Originally posted by Tomb of the Unknown
You might be interested in this PDF file from Apple's developer site. It talks about 256-bit FP operations and how to get the Altivec unit to do them.

http://developer.apple.com/hardware/ve/pdf/oct3a.pdf

This is the difference between operating on a single very large number, and acting on 256-bits worth of small numbers (i.e. 8 32-bit numbers). SIMD is far better at doing the same operation to many small numbers than it is at doing an operation on a big number... that's what it is designed to do after all: process "vectors" of numbers.
Providing grist for the rumour mill since 2001.
post #28 of 115
Quote:
Originally posted by Programmer
This is the difference between operating on a single very large number, and acting on 256-bits worth of small numbers (i.e. 8 32-bit numbers). SIMD is far better at doing the same operation to many small numbers than it is at doing an operation on a big number... that's what it is designed to do after all: process "vectors" of numbers.

I don't even know what data type takes 256 bits. The only reason why you would have a 256 bit register is for parallel vector operations.

I only know of one 128 bit data type, a MS GUID. When you have to compare lots of them (e.g. in a COM QI call), it takes quite a bit of time. My coworkers and I have frequently wished that Intel chips had some way to do a 128 bit compare.
King Felix
post #29 of 115
Quote:
Originally posted by Yevgeny
I don't even know what data type takes 256 bits. The only reason why you would have a 256 bit register is for parallel vector operations.

I only know of one 128 bit data type, a MS GUID. When you have to compare lots of them (e.g. in a COM QI call), it takes quite a bit of time. My coworkers and I have frequently wished that Intel chips had some way to do a 128 bit compare.

The need for numbers that large (or precise) is very rare. Some scientific uses, and the like.

There are relatively quick ways to do a 128-bit GUID compare on a 32-bit machine (of course having enough registers to load 8 words at the same time always helps!). You shouldn't be doing this in assembly, and in a high level language just write the function once. I can't imagine that it would be your performance bottleneck -- if it is you have some serious design issues.
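
Something like this, for example (plain C, using my own hypothetical guid128 type rather than the real Windows headers):

```c
#include <stdint.h>

typedef struct { uint32_t w[4]; } guid128;   /* hypothetical 128-bit GUID */

/* Word-at-a-time compare; && short-circuits, so mismatching GUIDs
   usually bail out after the first or second 32-bit compare. */
static int guid_equal(const guid128 *a, const guid128 *b)
{
    return a->w[0] == b->w[0] && a->w[1] == b->w[1] &&
           a->w[2] == b->w[2] && a->w[3] == b->w[3];
}
```

Four compares and a handful of loads -- hardly worth a new vector instruction.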
Providing grist for the rumour mill since 2001.
post #30 of 115
Quote:
Originally posted by Programmer
The need for numbers that large (or precise) is very rare. Some scientific uses, and the like.

....

I find this comment amusing. Computers in the scientific world have three uses. One is to analyze data from experiments. Another is to simulate the real world. The third is to control instrumentation. A 128-bit number is relevant if you can measure two quantities differing by only the last digit. But the fact is that there is no instrument constructed by the hand of man (or woman) that approaches that kind of accuracy. There is no quantity in any system of measurements that is defined to 128 bits of accuracy. If there is nothing defined to 128-bit accuracy and nothing that can measure it anyway, then it most certainly makes no sense to simulate anything to that kind of accuracy. And finally, there is no instrument on the face of the Earth that requires control by 128-bit strings.

Is it possible to map 128-bit numbers into something useful scientifically? Probably, but that doesn't mean that a scientific problem is naturally expressed in 128-bit terms or that 128-bit numbers are the most efficient way to do it. Isn't it true that 128-bits are more accurate than 32-bits or even 64-bits? And if that is the case, then what is the harm in using 128-bit numbers? Yes, it is true. However, the major use of large-precision numbers is to hide bad algorithms. Bad algorithms waste time. Calculations performed using large-precision numbers also waste time and memory. That is time better spent making simulations more representative of the real world.
post #31 of 115
Thread Starter 
Quote:
Originally posted by Mr. Me
I find this comment amusing. Computers in the scientific world have three uses. One is to analyze data from experiments. Another is to simulate the real world. The third is to control instrumentation. A 128-bit number is relevant if you can measure two quantities differing by only the last digit. But the fact is that there is no instrument constructed by the hand of man (or woman) that approaches that kind of accuracy. There is no quantity in any system of measurements that is defined to 128 bits of accuracy. If there is nothing defined to 128-bit accuracy and nothing that can measure it anyway, then it most certainly makes no sense to simulate anything to that kind of accuracy. And finally, there is no instrument on the face of the Earth that requires control by 128-bit strings.

Is it possible to map 128-bit numbers into something useful scientifically? Probably, but that doesn't mean that a scientific problem is naturally expressed in 128-bit terms or that 128-bit numbers are the most efficient way to do it. Isn't it true that 128-bits are more accurate than 32-bits or even 64-bits? And if that is the case, then what is the harm in using 128-bit numbers? Yes, it is true. However, the major use of large-precision numbers is to hide bad algorithms. Bad algorithms waste time. Calculations performed using large-precision numbers also waste time and memory. That is time better spent making simulations more representative of the real world.

On the measurement point, I agree.
On the second part I disagree. Mathematicians have demonstrated that a lack of precision (and of large numbers) can lead to huge errors during a simulation.
A very small error can grow exponentially in some types of calculation. For most simulations this is not an issue, but for some others it's important.
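
A quick toy example of that (my own illustration, nothing to do with VMX itself): iterate the chaotic logistic map in single and double precision and watch the tiny initial rounding difference explode:

```c
#include <stdio.h>

int main(void)
{
    float  xf = 0.4f;   /* same starting value, two precisions */
    double xd = 0.4;

    for (int i = 1; i <= 60; i++) {
        xf = 3.9f * xf * (1.0f - xf);
        xd = 3.9  * xd * (1.0  - xd);
        if (i % 10 == 0)
            printf("iter %2d  float = %.7f   double = %.15f\n", i, xf, xd);
    }
    /* Well before iteration 60 the two trajectories no longer agree even
       in the first digit -- the rounding error has grown exponentially. */
    return 0;
}
```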
post #32 of 115
Quote:
Originally posted by Programmer
The need for numbers that large (or precise) is very rare. Some scientific uses, and the like.

There are relatively quick ways to do a 128-bit GUID compare on a 32-bit machine (of course having enough registers to load 8 words at the same time always helps!). You shouldn't be doing this in assembly, and in a high level language just write the function once. I can't imagine that it would be your performance bottleneck -- if it is you have some serious design issues.

Yes, generally speaking, there isn't really much of a need for 128 bit numbers.

In particular, it would be possible to check the portion of the GUID that corresponds to the time created, then check the portion of the GUID that corresponds to the IP address. To prove equality you would need to run all four 32-bit word compares, but inequality could be established pretty easily -- within the first or second one. IP address isn't that reliable a way of differentiating GUIDs, because programmers have a tendency to generate GUIDs on the same machine (the box they program on).

When you do lots of COM interface programming, GUID comparisons are actually an irritating bottleneck (the software design forces them on you, but they are slow in comparison to the actual work you have to get done). It is commonly the case that you must obtain an interface on an object that is almost the last interface queried for (in one case, the interface that I regularly needed was the 15th of 17 interfaces on an object!). Being able to do a GUID check in one clock cycle is something that we wish for (a quick check through our code base turned up 5 million lines of code implementing 6,611 unique COM interfaces implemented on 9,641 coclasses in 510 DLLs). At least the architecture is clean, and this is what is important because we expose all of our internal code to 3rd party developers so that they can extend our software using the same objects that we use. When you want to expose millions of lines of C++ code to developers, software engineering sometimes trumps speed.
King Felix
post #33 of 115
Quote:
Originally posted by Yevgeny
When you do lots of COM interface programming, GUID comparisons are actually an irritating bottleneck (the software design forces them on you, but they are slow in comparison to the actual work you have to get done). It is commonly the case that you must obtain an interface on an object that is almost the last interface queried for (in one case, the interface that I regularly needed was the 15th of 17 interfaces on an object!). Being able to do a GUID check in one clock cycle is something that we wish for (a quick check through our code base turned up 5 million lines of code implementing 6,611 unique COM interfaces implemented on 9,641 coclasses in 510 DLLs). At least the architecture is clean, and this is what is important because we expose all of our internal code to 3rd party developers so that they can extend our software using the same objects that we use. When you want to expose millions of lines of C++ code to developers, software engineering sometimes trumps speed.

It's not necessarily an either-or situation. I've seen a COM implementation that does a much better job of QueryInterface than Microsoft's does, and it still gives you all of the abstraction power that interface-based programming offers.
Providing grist for the rumour mill since 2001.
post #34 of 115
Quote:
Originally posted by Mr. Me
Is it possible to map 128-bit numbers into something useful scientifically?

Yes, there are situations where this is necessary. Is it necessary 99.9999% of the time? No. I agree that many coders squander the available precision unnecessarily, but there are problems out there which require this level of precision. It is not worth the circuits to implement this directly in commodity desktop processors, but that doesn't mean that occasionally it isn't useful. This is why Apple published a paper on doing 128-bit floating point arithmetic with the AltiVec unit. It's an interesting read, but they wouldn't have done it if it wasn't useful to somebody.
Providing grist for the rumour mill since 2001.
post #35 of 115
Quote:
Originally posted by Programmer
It's not necessarily an either-or situation. I've seen a COM implementation that does a much better job of QueryInterface than Microsoft's does, and it still gives you all of the abstraction power that interface-based programming offers.

True, it isn't an either-or situation. I thought of my own speed boost to QI that would use a hash table at the base object in the interface inheritance to reduce QI runtime from O(n) to O(1) (amortized). Of course, we would have to implement this in all our objects... actually it turns out that you can usually get around most QI problems by simply rearranging the order in which you compare a requested interface to the interfaces that a given object supports. Place the frequently requested interfaces at the front of the list of interfaces to query. Anyhow, I am sure that effective COM programming isn't the best use of a Future Hardware topic on VMX. It suffices to say that some people have a use for 128-bit operations that sadly are not doable on the x86 instruction set.
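
For what it's worth, a bare-bones sketch of that idea (plain C with made-up names, not real COM): keep the hot interfaces at the front of the table so the scan usually ends after one or two GUID compares, or drop the whole table into a hash keyed on the IID for O(1):

```c
#include <stddef.h>
#include <string.h>

typedef struct { unsigned int w[4]; } iid_t;            /* hypothetical 128-bit IID */
typedef struct { iid_t iid; void *vtable; } iface_entry;

/* Entries are ordered by how often each interface is actually requested,
   so frequent queries hit on the first compare or two. */
void *query_interface(const iface_entry *table, size_t count, const iid_t *wanted)
{
    for (size_t i = 0; i < count; i++)
        if (memcmp(&table[i].iid, wanted, sizeof(iid_t)) == 0)
            return table[i].vtable;
    return NULL;   /* caller maps this to its "no interface" error */
}
```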

It doesn't surprise me that there are better COM implementations out there. The MIDL compiler barely qualifies as a usable product in my mind...
King Felix
post #36 of 115
I have to disagree with this on several counts. You miss what is possibly the most important use of a computer in science, and that is as a communications tool. Floats don't often come into play here unless we are talking about visualization applications.

Second, resolution in measurement does not constrain the need for resolution in processing.

The advent and use of large-number processing libraries more or less discount your assertion that large-number capabilities are not needed. Much of what science deals with is not the product of mankind anyway.

Your statement about the definition of quantities is a bit ridiculous. Many an engineering math handbook has definitions for constants out past 20 digits. It is never a good thing to lose information simply because the current number system you are using can't handle it.

You want to use the largest number sizes that are consistent with not losing information. If your applications would work better with 68-bit floats, then the next logical size is 128-bit floats, though I suppose one could argue for 96-bit floats. Besides, there are even applications outside of science that can make use of large floats; just look at the national budget.

Thanks
Dave


Quote:
Originally posted by Mr. Me
I find this comment amusing. Computers in the scientific world have three uses. One is to analyze data from experiments. Another is to simulate the real world. The third is to control instrumentation. A 128-bit number is relevant if you can measure two quantities differing by only the last digit. But the fact is that there is no instrument constructed by the hand of man (or woman) that approaches that kind of accuracy. There is no quantity in any system of measurements that is defined to 128 bits of accuracy. If there is nothing defined to 128-bit accuracy and nothing that can measure it anyway, then it most certainly makes no sense to simulate anything to that kind of accuracy. And finally, there is no instrument on the face of the Earth that requires control by 128-bit strings.

Is it possible to map 128-bit numbers into something useful scientifically? Probably, but that doesn't mean that a scientific problem is naturally expressed in 128-bit terms or that 128-bit numbers are the most efficient way to do it. Isn't it true that 128-bits are more accurate than 32-bits or even 64-bits? And if that is the case, then what is the harm in using 128-bit numbers? Yes, it is true. However, the major use of large-precision numbers is to hide bad algorithms. Bad algorithms waste time. Calculations performed using large-precision numbers also waste time and memory. That is time better spent making simulations more representative of the real world.
post #37 of 115
Quote:
Originally posted by wizard69
Besides, there are even applications outside of science that can make use of large floats; just look at the national budget.

???
If you're running out of bits in double precision floating point math doing anything related to money, fire your programmer.

Yes, the US national debt exceeds $4B, it is still NOWHERE near something that needs floating point sizes beyond double precision. Yes, I grok compounded interest and other methods of getting "partial cents" etc - but _in_the_end_ it all has to be rounded. (Because we don't cut coins into pieces anymore). The other aspect is that something like the debt isn't really a 'single item'. It is a collection of a very large number of smaller items... each of which has its own individual interest rate, maturity etc. -> Each of those is _individually_ calculable with 100% accuracy on hand calculators (once you acknowledge the inherent discreteness + rounding rules).

On another note, the text in front of me tabulates only e & pi beyond 16 digits. Well, and pi/2, and a slew of other silliness like that.

You don't really mean "If you've got the more precise info, you're insane to chuck it, all calculations must proceed with all available information" do you? You really mean "I've assessed 1) the number of significant bits I need in the end, and 2) I've assessed how my algorithm will spread my error bars/reduce the number of significant bits, and 3) I've measured as accurately as I need to, with a fair chunk of extra accuracy"

Because if you really mean "never use approximations when better data is available", please call me when you've accurately and precisely calculated the circumference of a circle exactly 1 meter in diameter. In units of _meters_ please, not multiples of pi. Oh, and here's the first million digits of pi or so.
post #38 of 115
Quote:
Originally posted by wizard69
I have to disagree with this on several counts. You miss what is possibly the most important use of a computer in science, and that is as a communications tool. Floats don't often come into play here unless we are talking about visualization applications.

Second, resolution in measurement does not constrain the need for resolution in processing.

The advent and use of large-number processing libraries more or less discount your assertion that large-number capabilities are not needed. Much of what science deals with is not the product of mankind anyway.

Your statement about the definition of quantities is a bit ridiculous. Many an engineering math handbook has definitions for constants out past 20 digits. It is never a good thing to lose information simply because the current number system you are using can't handle it.

You want to use the largest number sizes that are consistent with not losing information. If your applications would work better with 68-bit floats, then the next logical size is 128-bit floats, though I suppose one could argue for 96-bit floats. Besides, there are even applications outside of science that can make use of large floats; just look at the national budget.

Thanks
Dave

You can express the national budget in terms of doubles. Because all computer numbers are discrete, you can never have exact precision for infinitely repeating numbers like the one produced by 1/3. A 64-bit double can represent integers exactly up to about 9*10^15 (2^53 = 9,007,199,254,740,992), and floating point values from roughly 2.2*10^-308 up to 1.8*10^308. That is quite a range of values, and you still get 15 significant decimal digits.

64-bit doubles are sufficiently accurate to describe the location of anything in the solar system to within a few millimeters. Any scientist who thinks his numbers are that accurate in the first place is an idiot. The error in measurement is larger than the error introduced by the inaccuracy of a 64-bit number.
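
If anyone wants the exact figures, they're all sitting in <float.h>:

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("smallest normal double : %e\n", DBL_MIN);   /* about 2.2e-308 */
    printf("largest double         : %e\n", DBL_MAX);   /* about 1.8e+308 */
    printf("decimal digits kept    : %d\n", DBL_DIG);   /* 15 */
    printf("largest exact integer  : %.0f\n", 9007199254740992.0);  /* 2^53 */
    return 0;
}
```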
King Felix
post #39 of 115
Quote:
Originally posted by wizard69
You want to use the largest number sizes that are consistant with not loosing information. If you applications would work better with 68 bit floats, then the next logical size it 128 bit floats, though I suppose one could argue for 96 bit floats.

Actually, IEEE specifies an 80 bit floating point type, which was implemented in the 68040 and lost in the transition to PowerPC.

For the national debt, you'd want one of those big IBMs that can do fixed-point math in hardware.

None of this is really relevant to a revision of VMX - G4s have always been able to do 64 bit floating point just fine (if a bit slowly). The issue is whether it's worth the extra bandwidth and silicon to handle 64 bit values in vectors, and currently the answer appears to be no. SSE2 can do 2x64 bit FP, but that's only to make up for the fact that the x86's built-in floating point unit is hilariously bad. That's not the case on any PowerPC (that has an FP unit in the first place).
"...within intervention's distance of the embassy." - CvB

Original music:
The Mayflies - Black earth Americana. Now on iTMS!
Becca Sutlive - Iowa Fried Rock 'n Roll - now on iTMS!
post #40 of 115
Kickaha and Amorph couldn't moderate themselves out of a paper bag. Abdicate responsibility and succumb to idiocy. Two years of letting a member make personal attacks against others, then stepping aside when someone won't put up with it. Not only that, but go ahead and shut down my posting privileges but not the one making the attacks. Not even the common decency to abide by their warning (after three days of absorbing personal attacks with no mods in sight), just shut my posting down and then say it might happen later if a certain line is crossed. Bullshit flag is flying, I won't abide by lying and coddling of liars who go off-site, create accounts differing in a single letter from my handle with the express purpose to deceive and then claim here that I did it. Everyone be warned, kim kap sol is a lying, deceitful poster.

Now I guess they should have banned me rather than just shut off posting privileges, because kickaha and Amorph definitely aren't going to like being called to task when they thought they had it all ignored *cough* *cough* I mean under control. Just a couple o' tools.

Don't worry, as soon as my work resetting my posts is done I'll disappear forever.