Cell based server revealed.

brendon · May 30, 2005 11:13PM

Quote:

Originally posted by Programmer

Well then be happy -- it is 100% certain that the SPEs and the PPE do double precision floating point. The Cell's aggregate performance at DP, however, is only about 2.5x that of a single 2.7 GHz 970 so it's nothing to write home about. On the other hand, a great many algorithms that currently use double precision do so out of laziness instead of any actual need for that level of precision.

So do you think that Apple may use the Cell in a dual configuration if they do decide to do that??

addison · May 31, 2005 3:22AM

Is the Cell to the G5, what the G3 was to the G4?

new · May 31, 2005 4:06AM

Quote:

Originally posted by Addison

Is the Cell to the G5, what the G3 was to the G4?

no

pb · May 31, 2005 5:18AM

Quote:

Originally posted by Addison

Is the Cell to the G5, what the G3 was to the G4?

As said, no, the CELL is considerably more different from the G5 than the G4 is from the G3. Read here for example.

And by the way, CELL is not one specific CPU; it is the name of a highly modular technology. We are going to see several CPUs of this architecture, each adapted to a specific use. The prototype CPU we saw some time ago has one principal processing unit (64-bit PPC, but simpler than a G5) and eight vector processing units coordinated by the principal one. The package is designed for very high clock speeds and very high speed interfaces. It reaches a stratospheric performance peak on carefully written code, but this is no surprise.

brendon · May 31, 2005 10:58AM

Quote:

Originally posted by PB

As said, no, the CELL is considerably more different from the G5 than the G4 is from the G3. Read here for example.

And by the way, CELL is not one specific CPU; it is the name of a highly modular technology. We are going to see several CPUs of this architecture, each adapted to a specific use. The prototype CPU we saw some time ago has one principal processing unit (64-bit PPC, but simpler than a G5) and eight vector processing units coordinated by the principal one. The package is designed for very high clock speeds and very high speed interfaces. It reaches a stratospheric performance peak on carefully written code, but this is no surprise.

Strange that two days after the Wall Street Journal reported a rumor that Apple was in talks to use Intel processors, IBM and Sony announced that Cell would be open. And then went on to say just what they thought that they could give away for free. Maybe Steve just wanted to talk to Intel about working together on WiFi standards, and merging Firewire technologies and USB. And getting much lower costs for Cell technologies, all in one meeting, and one well timed phone call. Way to go Steve, time management at its best.

pb · May 31, 2005 2:18PM

Quote:

Originally posted by Brendon

Maybe Steve just wanted to talk to Intel about working together on WiFi standards, and merging Firewire technologies and USB.

Actually I don't see more than what you say in the alleged talks between Apple and Intel. Oh wait, perhaps this one too:

Quote:

And getting much lower costs for Cell technologies, all in one meeting, and one well timed phone call. Way to go Steve, time management at its best.

Intel is involved in a number of hardware technologies, other than CPU manufacturing, that could very well be of interest to Apple.

tht · May 31, 2005 2:56PM

Quote:

Originally posted by Programmer

Well then be happy -- it is 100% certain that the SPEs and the PPE do double precision floating point. The Cell's aggregate performance at DP, however, is only about 2.5x that of a single 2.7 GHz 970 so it's nothing to write home about. On the other hand, a great many algorithms that currently use double precision do so out of laziness instead of any actual need for that level of precision.

Just to remind everyone, a 3.2 GHz Cell would do ~25 giga-double-precision FPU ops on aggregate. "On aggregate" means spread across all of the processor cores in the Cell CPU. In single-threaded double precision app performance, Cell would perform more like a 400 MHz CPU, or 3 GFLOPS. It would take all cores working optimally to get the aggregate FLOPS number.

Most double precision applications can be threaded really well, so work can be spread across 8 cores to achieve something close to the theoretical numbers. I still have my doubts that a developer can achieve something closer to multiprocessor performance rules of thumb with Cell processors than with the traditional cluster setup. (It's a huge plus certainly, but it's also a lot of transistors!) Cell as a shared memory system would seem more memory starved than a NUMA-style cluster setup. Great for games and media, not that great for CFD.

Also Programmer, a great many algorithms may not need double precision FPU accuracy, but remember that the Cell SPE is not accurate in single precision FPU. Some developers may not have a choice!

programmer · May 31, 2005 11:09PM

Quote:

Originally posted by THT

I still have my doubts that a developer can achieve something closer to multiprocessor performance rules of thumb with Cell processors than with the traditional cluster setup. (It's a huge plus certainly, but it's also a lot of transistors!) Cell as a shared memory system would seem more memory starved than a NUMA-style cluster setup. Great for games and media, not that great for CFD.

I disagree -- I think it will do better than a comparable (in terms of price & power) shared memory system. In single precision it will do much better. In double precision things aren't nearly so clear, but I suspect that the DP GFLOPS number doesn't tell the whole story (benchmark numbers rarely do).

Quote:

Also Programmer, a great many algorithms may not need double precision FPU accuracy, but remember that the Cell SPE is not accurate in single precision FPU. Some developers may not have a choice!

That concern is overblown -- the SPE is no less accurate than "standard" FPUs, except that it does not support denormalized numbers and it has slightly different rules for how it comes up with its results. These rules are different than the IEEE rules, but are no less accurate. The lack of denormalized numbers will impact some algorithms, but those are in the minority and often a proper IEEE implementation will take a (possibly serious) performance hit when processing denorms (although PPC FPUs are typically quite efficient in this regard). Often the specific places where this issue matters can be dealt with directly by using doubles, leaving the rest of the calculations in singles.

This is all a distraction from the main point: for most performance critical algorithms that most of Apple's users care about, the SPEs do extremely well. Moving all graphics, audio, video and network processing onto the SPEs (and those things don't care about the SPEs' difference from standard IEEE FP) would relieve an enormous load currently shouldered by the main processor.

tht · June 1, 2005 11:11AM

Quote:

Originally posted by Programmer

That concern is overblown -- the SPE is no less accurate than "standard" FPUs, except that it does not support denormalized numbers and it has slightly different rules for how it comes up with its results. These rules are different than the IEEE rules, but are no less accurate.

How many digits of accuracy does the SPE have compared to other CPUs? It doesn't support denormalized numbers and Realworldtech says it rounds to zero. Those could be some pretty big exceptions for a lot of people.

Quote:

This is all a distraction from the main point: for most performance critical algorithms that most of Apple's users care about, the SPEs do extremely well. Moving all graphics, audio, video and network processing onto the SPEs (and those things don't care about the SPEs' difference from standard IEEE FP) would relieve an enormous load currently shouldered by the main processor.

I can agree with you here. It's a big win for media and graphics. How big of a win it is compared to some other design utilizing 200+ million transistors, I think is up for debate.

I'll be a skeptic about this until Apple comes out with the hardware, or perhaps IBM comes out with the hardware. Its usefulness as a laptop chip is even starting to wane now that it is coming out hotter than speculated.

splinemodel · June 1, 2005 11:23AM

Quote:

Originally posted by Henriok

A 970@2.7 GHz does peak 11 Gflops at scalar operations but 22 Gflops on SIMD operations (AltiVec).

Remember that the G5 is quite strong on floating point operations compared to other mainstream processors like Athlon/Opteron or Pentium/Xeon. But these Cell processors are just insanely strong on single precision floating point operations due to their 8 SPUs. Most applications seem to favour double precision though, and Cell isn't as good there.

If it's a 64bit chip, which I believe it is, there is no difference (as far as I know) between 64bit single and 32bit double float. Sure, there could be an 128bit float, but the current baseline is still 32bits, and I don't think anyone needs 128bit floating point precision in normal circumstances. So it would seem that their 200Gflop figure is in league with what we expect from double floats.

tht · June 1, 2005 1:58PM

Quote:

Originally posted by Splinemodel

If it's a 64bit chip, which I believe it is, there is no difference (as far as I know) between 64bit single and 32bit double float. Sure, there could be an 128bit float, but the current baseline is still 32bits, and I don't think anyone needs 128bit floating point precision in normal circumstances. So it would seem that their 200Gflop figure is in league with what we expect from double floats.

This doesn't make sense.

Single and double precision are terms indicating the accuracy, the number of digits of precision, typically for floating point math.

32 or 64 bit chips are terms indicating the width of the integer unit and how much memory it can support.

A single precision value is represented in the chip in binary by a "word" 32 bits long (or wide). A double precision value is represented by a binary word 64 bits long. Yes, it's confusing. Every single PowerPC chip Apple has ever shipped has an FPU that is 64 bits wide and is able to do 64 bit, double precision FPU ops in hardware.
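
To make the storage difference concrete, here is a minimal Python sketch. It only uses the standard `struct` module to store the same value as a 32-bit single and as a 64-bit double; the helper name `to_single` is made up for illustration and has nothing to do with Cell or PowerPC specifically.

```python
import struct

# Sizes of the two standard encodings: "f" is a 32-bit single,
# "d" is a 64-bit double.
SINGLE_BYTES = struct.calcsize("<f")
DOUBLE_BYTES = struct.calcsize("<d")


def to_single(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit single.

    Hypothetical helper: it packs the value into 4 bytes and reads it back.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]


third = 1.0 / 3.0
print(SINGLE_BYTES * 8, "bit single:", repr(to_single(third)))
print(DOUBLE_BYTES * 8, "bit double:", repr(third))
print("error introduced by storing as single:", abs(third - to_single(third)))
```

Values that fit exactly in a single (such as 0.5) survive the round trip unchanged; values like 1/3 lose digits, which is the accuracy difference being described here.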

Only the 970-based Macs are able to do 64 bit integer ops in hardware, and are able to address more than 4 GB of memory. All other Macs do 32 bit, integer ops, and address less than 4 GB of memory.

The 200+ GFLOP figure quoted for Cell is for single precision math using the 8 SPEs. The SPE can do 4 single precision multiply-add ops per cycle. A multiply-add is considered 2 floating point operations. A single precision value or "word" is 32 bits long. The Cell SPE uses a 128 bit wide SIMD word (much like AltiVec/VMX), able to hold 4 32 bit values, and do 4 single precision FMACs (multiply-adds) per cycle. The SPE is fully pipelined for single precision ops, meaning an instruction can be theoretically issued/completed every cycle.

So, 4 FMACs x 2 OPS/FMAC x 3.2 GHz = 25.6 GFLOPS per SPE. There are 8 SPEs in a Cell, 8 SPE/Cell x 25.6 GFLOPS/SPE = 204.8 GFLOPS per 3.2 GHz Cell. There will also be some ~20 GFLOPS from the PPE.
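
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the peak-rate formula used in this post; the function name `peak_gflops` and its parameters are made up for illustration, and the result is a theoretical peak, not a measured number.

```python
def peak_gflops(lanes: int, ops_per_lane: int, clock_ghz: float, units: int = 1) -> float:
    """Theoretical peak = SIMD lanes x flops per lane per cycle x clock x units."""
    return lanes * ops_per_lane * clock_ghz * units


# Figures taken from the post: 4 FMACs per cycle, 2 ops per FMAC, 3.2 GHz, 8 SPEs.
per_spe = peak_gflops(lanes=4, ops_per_lane=2, clock_ghz=3.2)
per_cell = peak_gflops(lanes=4, ops_per_lane=2, clock_ghz=3.2, units=8)

print(f"per SPE:  {per_spe:.1f} GFLOPS")
print(f"per Cell: {per_cell:.1f} GFLOPS (SPEs only, PPE not included)")
```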

For double precision FPU ops, the Cell SPE is not fully pipelined. That is, it cannot issue or complete a DP FPU op every cycle, but must wait some clock cycles before the DP FPU is ready to accept another one.

The only numbers seen regarding this are in the presentation linked from one of the other FH threads. DP FPU ops have issue rates of 14 or 7 cycles. If it is 7 cycles, it means an SPE can only issue/complete a DP FPU op every 7 cycles. Take the 25.6 GFLOPS/SPE and divide by 7 = 3.7 giga-DP-FLOPS/SPE. 3.7 x 8 = 30 giga-DP-FLOPS per Cell.
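
The same estimate, as a small Python sketch that follows this post's simplification of dividing the single precision peak by the issue interval. The function name `dp_estimate` is made up for illustration. Note the assumption baked in: it does not separately account for a 128 bit register holding only two 64 bit values instead of four 32 bit ones, so treat the output as an upper-end guess rather than a spec figure.

```python
SP_PEAK_PER_SPE = 25.6  # GFLOPS, the post's 4 FMACs x 2 ops x 3.2 GHz figure


def dp_estimate(sp_peak: float, issue_interval_cycles: int, spes: int = 8):
    """Return (per-SPE, per-Cell) DP GFLOPS under the post's simplification.

    Assumption: one DP op can be issued only every `issue_interval_cycles`
    cycles, and DP throughput is otherwise the same as the SP peak.
    """
    per_spe = sp_peak / issue_interval_cycles
    return per_spe, per_spe * spes


# The presentation mentioned in the post gave issue rates of 7 or 14 cycles.
for interval in (7, 14):
    per_spe, per_cell = dp_estimate(SP_PEAK_PER_SPE, interval)
    print(f"issue every {interval:2d} cycles: "
          f"{per_spe:.2f} per SPE, {per_cell:.1f} per Cell (giga-DP-FLOPS)")
```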

As far as people needing single precision or double precision, well that's a complicated question. IBM seems to be marketing Cell to content creation types, not science and engineering types. So, I think IBM knows it has a tough sell where DP FPU is used heavily, and where it is more difficult to utilize a lot of the SPEs.

splinemodel · June 1, 2005 2:18PM

The confusion is due to the fact that standards dictate "single" and "double" universality (i.e. the same bit count), but in practice it seems to differ. Maybe I was on crack, but I remember some UltraSparcs I worked on using float and double interchangeably. This of course boils down to the compiler, but float and double are defined at the compiler level.

I'll also admit that I don't know where I was going with the last post. The point is that a G5 rated at 11 Gflop will have much lower Gdflop, and cell will too, so the ~20x improvement probably holds for doubles as well as floats.

tht · June 1, 2005 3:39PM

Quote:

Originally posted by Splinemodel

I'll also admit that I don't know where I was going with the last post. The point is that a G5 rated at 11 Gflop will have much lower Gdflop, and cell will too, so the ~20x improvement probably holds for doubles as well as floats.

The two FPU units in the 970 execute single and double precision with the same latencies and are fully pipelined for both. It's 11 GFLOPS DP for a 2.7 GHz 970fx. The 22 GFLOPS Henriok quoted was an AltiVec number, and we all know AltiVec only does single precision.
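
For reference, the 11 and 22 GFLOPS figures fall out of the same kind of peak-rate arithmetic used for the SPE earlier in the thread. The sketch below assumes each of the two scalar FPUs can retire one fused multiply-add (counted as 2 flops) per cycle, and that AltiVec does one 4-wide single precision multiply-add per cycle; the constant names are made up for illustration.

```python
CLOCK_GHZ = 2.7          # 970fx clock quoted in the thread
FLOPS_PER_FMA = 2        # a multiply-add counts as 2 floating point ops

# Scalar: 2 FPUs, each assumed to retire one multiply-add per cycle.
# The post says single and double precision run at the same rate here.
scalar_gflops = 2 * FLOPS_PER_FMA * CLOCK_GHZ

# AltiVec: one 128 bit vector multiply-add per cycle over 4 single
# precision lanes (assumption), single precision only.
altivec_gflops = 4 * FLOPS_PER_FMA * CLOCK_GHZ

print(f"970 scalar (SP or DP): {scalar_gflops:.1f} GFLOPS")
print(f"970 AltiVec (SP only): {altivec_gflops:.1f} GFLOPS")
```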

brendon · June 1, 2005 3:50PM

Quote:

Originally posted by THT

The two FPU units in the 970 execute single and double precision with the same latencies and are fully pipelined for both. It's 11 GFLOPS DP for a 2.7 GHz 970fx. The 22 GFLOPS Henriok quoted was an AltiVec number, and we all know AltiVec only does single precision.

So a 970 or the like with a few, 2 or 4 SPEs attached would cover all bases for Apple, or what about a dual core that shares 6 to 8 SPEs, or is that too big?

amorph · June 1, 2005 5:02PM

Re: Emulating AltiVec on SPEs: No. You don't want to go there. That's why the PPE has AltiVec on board.

Re: 64-bit FP performance: As far as I've heard, the SPEs are not 64-bit capable in the same way that most PowerPCs are. They do 64-bit FP computations much, much more slowly than 32-bit FP. I've heard that they can do 10 32-bit FP ops per 1 64-bit op. If you get all 8 SPEs working at once, that's almost 1 instruction per clock. Fortunately, Cell clocks high.

Re: Having to code for the SPEs. That's inevitable, but given that Apple is pushing everyone toward frameworks, and given that many of those frameworks can now dynamically compile for multiple targets (e.g., Core Image) I think Apple has set things up so that a surprising number of apps could start taking some advantage of a platform like Cell with nothing more than an OS update.

The most interesting thing about Cell, to me, is the fact that it's going to be mass-produced on a gigantic scale for a cheap machine. This is not some huge, rare beast like the POWER4. If it is half as successful as IBM and Sony seem to think it will be (and remember, their hype machines are in full gear, so don't believe everything you read) it will turn the whole price/performance equation on its ear.

amorph · June 1, 2005 5:10PM

Quote:

Originally posted by Brendon

So a 970 or the like with a few, 2 or 4 SPEs attached would cover all bases for Apple, or what about a dual core that shares 6 to 8 SPEs, or is that too big?

I'm not sure that the 970 core can reasonably be dropped into Cell (where Cell refers to the arrangement of a PowerPC core attached to some number of SPEs via an on-chip token ring bus). Leaving aside feasibility, dropping the clockspeed that far on the SPEs would be painful for performance, and the resulting chip would be huge and hot.

Apple can play with dual PPE cores, or twiddling the exact number of SPEs, if they so choose.

brendon · June 1, 2005 7:04PM

Quote:

Originally posted by Amorph

I'm not sure that the 970 core can reasonably be dropped into Cell (where Cell refers to the arrangement of a PowerPC core attached to some number of SPEs via an on-chip token ring bus). Leaving aside feasibility, dropping the clockspeed that far on the SPEs would be painful for performance, and the resulting chip would be huge and hot.

Apple can play with dual PPE cores, or twiddling the exact number of SPEs, if they so choose.

OK how about a 970 with more than one VMX unit? What with GCC auto vectorizing and the advent of H.264, mo' VMX would be a good thing. Forgot to mention da' cores, audio, image, data, that would love to play in a bigger VMX sand-box.

mdriftmeyer · June 1, 2005 7:27PM

Everyone can salivate all they want over the possibilities of a CELL, but in terms of profit margins it boils down to cost, and Apple has neither the position nor the desire to absorb chip costs on hardware products at this time, especially on technology that hasn't even hit the general market.

I'd expect some announcement of Dual Core chips based on Power5 core technologies, much sooner than CELL.

programmer · June 1, 2005 11:58PM

Quote:

Originally posted by Brendon

OK how about a 970 with more than one VMX unit? What with GCC auto vectorizing and the advent of H.264, mo' VMX would be a good thing. Forgot to mention da' cores, audio, image, data, that would love to play in a bigger VMX sand-box.

The biggest limitation on VMX in the 970 is not the number of VMX units, it is the available bandwidth. Most algorithms are bound by the amount of data they can move through the processor. I've also found that some algorithms run out of registers too fast, which is why the SPE (and Microsoft's VMX128) have 128 registers (compared to the 970's 32 architectured and 80 physical registers).

The Cell was carefully designed to have huge bandwidth. The SPE local store gives those cores crazy bandwidth, the on-chip bus allows very high speed communications and data sharing, and the I/O and XDRAM interfaces provide very fast access to the large data store. Consider that the SPE's local store is about the same speed as the 970's L1 cache, but is 4 times bigger and completely controllable (the 970 cache is organized by splitting into separate instruction/data caches, has limited associativity, and is built around 128 byte cachelines). The 970's L2 is ten times slower than its L1, and has similar control issues -- and can't even be controlled by prefetch instructions.

programmer · June 2, 2005 12:06AM

Quote:

Originally posted by Splinemodel

I'll also admit that I don't know where I was going with the last post. The point is that a G5 rated at 11 Gflop will have much lower Gdflop, and cell will too, so the ~20x improvement probably holds for doubles as well as floats.

This turns out to not be the case -- the 970's FPUs are just as fast at double precision as they are at single, so they achieve 11 "GDFLOPS". The Cell, on the other hand, has significantly (~10x) lower DP performance. This first Cell chip simply isn't built with DP in mind. It'll be interesting to see if IBM feels that a DP-optimized SPE is a worthwhile investment for future Cells. Wild speculation: If all the instructions are in place for DP (and it sounds like they are) then they might be able to just drop in one that does pipelined DP and it'll run at some pretty staggering DP rates.

THT: the SPE's single precision has 32-bits of precision just like anybody else's single precision. Denormalization only comes into it when dealing with numbers so small the smallest exponent isn't low enough to represent the number. When this happens on the Cell, it rounds to zero. Those numbers are on the order of 1e-38, however... so they are pretty damn small.
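
A rough way to see what is and isn't lost: the Python sketch below emulates "denormals become zero" on top of ordinary IEEE single precision. It models only the denormal behaviour described in this post, not the SPE's other rounding rules, and the helper names `to_single` and `to_single_flush` are made up for illustration.

```python
import struct

# Smallest positive *normal* single precision value, about 1.18e-38.
FLT_MIN_NORMAL = 2.0 ** -126


def to_single(x: float) -> float:
    """Round to IEEE single precision, keeping denormals (standard behaviour)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_single_flush(x: float) -> float:
    """Round to single precision, but turn any denormal result into zero.

    Assumption: this is a simplified stand-in for hardware without
    denormal support; the sign of the flushed zero is ignored here.
    """
    y = to_single(x)
    if y != 0.0 and abs(y) < FLT_MIN_NORMAL:
        return 0.0
    return y


for value in (1e-30, 1e-38, 1e-40):
    print(f"{value:g}: ieee={to_single(value)!r}  flushed={to_single_flush(value)!r}")
```

Only inputs below roughly 1.18e-38 in magnitude are affected; anything larger comes back identical to the ordinary single precision result.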
