[quote]Originally posted by Nevyn:
<strong>
An Altivec calculation can rip through all of L1 & L2 in practically no time, particularly if it's streaming data of some kind.
All you have to do to guesstimate the 'max' amount of bandwidth the VPU can suck up is figure out 1) how many instructions/clock can be retired, 2) clock rate, 3) bits per instruction.
3) is 128 bits.
2) is 2.5 GHz.
1) is tougher, let's assume it's always just one.
I get _40_ GB/s. (That's _bytes_: 128 bits is 16 bytes, times 2.5 billion instructions per second.)
Realize that a chunk of work can be 'saved' in the L1/L2 cache (or the registers, wherever)... but the tasks that AV is used on are data heavy -> they don't _FIT_ in the caches.
Never mind that there are two integer units in the 970, and two FPUs -> 4x64 -> _another_ 256 bits/cycle.
I'm not saying that this is really how much bandwidth you _need_, just that more is always better, and a lot more is a lot better.

One of the key benefits of the PPC approach to FPUs & SIMD is that all the units can operate independently - it's just that actually doing this has been somewhat choked because lots of computing capacity can't get data fast enough. (If the AV unit is running full tilt, there's roughly zero bandwidth left to keep the integer units & FPUs fed. And the caches are filled with drek.)
Then there are duals, which need 2x the bandwidth.</strong><hr></blockquote>
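For anyone who wants to check the arithmetic in the quote, here is a rough sketch in Python. It uses only the figures Nevyn gives (128-bit vector ops, 2.5 GHz, one instruction retired per clock, four 64-bit scalar units) and assumes every instruction pulls fresh data from memory, so it is an upper-bound guess and not a measurement. The function and variable names are made up for illustration.

```python
def peak_bandwidth_gb_per_s(bits_per_instr, clock_hz, instrs_per_clock=1):
    """Upper-bound data appetite in GB/s (1 GB = 10**9 bytes).

    Assumes every retired instruction consumes bits_per_instr of fresh data,
    which is the simplification made in the quoted post.
    """
    bytes_per_second = bits_per_instr / 8 * clock_hz * instrs_per_clock
    return bytes_per_second / 1e9


CLOCK_HZ = 2.5e9  # 2.5 GHz, as in the quote

# One 128-bit vector instruction per clock.
altivec = peak_bandwidth_gb_per_s(128, CLOCK_HZ)

# Two integer units plus two FPUs, 64 bits each: 4 x 64 = 256 bits/cycle.
scalar = peak_bandwidth_gb_per_s(64, CLOCK_HZ, instrs_per_clock=4)

per_cpu = altivec + scalar  # everything running flat out on one CPU
dual = 2 * per_cpu          # a dual needs twice that

print(f"vector unit:  {altivec:.0f} GB/s")
print(f"scalar units: {scalar:.0f} GB/s")
print(f"one CPU:      {per_cpu:.0f} GB/s")
print(f"dual:         {dual:.0f} GB/s")
```

Under those assumptions the vector unit alone comes out at 40 GB/s, matching the quote, and the scalar units add another 80 GB/s on top.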
Who initiated Altivec anyway: Apple, IBM, or MOT? IBM didn't want it, but who thought up the idea?