Cell based server revealed.

addison · May 29, 2005 7:42AM

Here is a page showing photographs of an IBM Cell-based server. Apparently it is running at 2.8 GHz, but the story says that 3 GHz has been achieved in lab conditions. There are no benchmarks, so it really gives us no clue as to how interesting this would be to us.

WWDC could be interesting.

new · May 29, 2005 7:52AM

"If operated at 3 GHz, Cell's theoretical performance reaches about 200 GFLOPS, which works out to about 400 GFLOPS per board,

doesn't the G5 do 8 Gflops?

henriok · May 29, 2005 8:29AM

Quote:

Originally posted by New

doesn't the G5 do 8 Gflops?

A 970@2.7 GHz does peak 11 Gflops at scalar operations but 22 Gflops on SIMD operations (AltiVec).

Remember that the G5 is quite strong on floating point operations compared to other mainstream processors like Athlon/Opteron or Pentium/Xeon. But these Cell processors are just insanely strong on single precision floating point operations due to their 8 SPUs. Most applications seem to favour double precision though and Cell isn't as good there.
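For what it's worth, peak numbers like these are just clock rate times floating point operations per cycle. Here is a rough sketch of that arithmetic in Python. The per-cycle counts are assumptions, not figures from the article: two scalar FPUs each doing a fused multiply-add on the 970, one 4-wide AltiVec multiply-add, and 8 SPUs each doing a 4-wide multiply-add on Cell.

```python
def peak_gflops(clock_ghz, flops_per_cycle):
    """Theoretical peak: every unit issues its widest operation every cycle."""
    return clock_ghz * flops_per_cycle

# Assumed per-cycle throughput (hypothetical round numbers, see lead-in):
G5_SCALAR_FLOPS_PER_CYCLE = 2 * 2      # 2 FPUs x multiply-add (2 flops)
G5_ALTIVEC_FLOPS_PER_CYCLE = 4 * 2     # 4 lanes x multiply-add
CELL_SPU_FLOPS_PER_CYCLE = 8 * 4 * 2   # 8 SPUs x 4 lanes x multiply-add

g5_scalar = peak_gflops(2.7, G5_SCALAR_FLOPS_PER_CYCLE)    # about 11
g5_altivec = peak_gflops(2.7, G5_ALTIVEC_FLOPS_PER_CYCLE)  # about 22
cell_single = peak_gflops(3.0, CELL_SPU_FLOPS_PER_CYCLE)   # about 200

print(f"G5 scalar:   {g5_scalar:.1f} GFLOPS")
print(f"G5 AltiVec:  {g5_altivec:.1f} GFLOPS")
print(f"Cell single: {cell_single:.1f} GFLOPS")
```

Under those assumptions the results line up with the 11 / 22 Gflops figures for the G5 and the "about 200 GFLOPS" quoted for a 3 GHz Cell. None of this says anything about sustained performance on real code.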

new · May 29, 2005 8:41AM

Quote:

Originally posted by Henriok

A 970@2.7 GHz does peak 11 Gflops at scalar operations but 22 Gflops on SIMD operations (AltiVec).

Remember that the G5 is quite strong on floating point operations compared to other mainstream processors like Athlon/Opteron or Pentium/Xeon. But these Cell processors are just insanely strong on single precision floating point operations due to their 8 SPUs. Most applications seem to favour double precision though and Cell isn't as good there.

so what tasks will the cell be strong at?

electric monk · May 29, 2005 9:40AM

Encoding/decoding for example: HD stuff - imagine 8 H.264 1080p streams at once just using the SPEs. Or 48 MPEG-2 streams.

And it's not bad at double precision floating point, still better than the G5, but single precision is where it shines.

It's perfectly fine at regular stuff using the PPE though. Well, OK, it sucks at out-of-order code and suffers a big branch mispredict penalty, but using SMT a 3.2 GHz PPE runs about as well as a pair of 1.4-1.6 GHz G4s. Run it at 4 GHz, stick a couple of PPEs together, toss in some SPEs and Cell will match or better G5s on regular stuff, and make them scream for mercy at floating point operations. Not to mention it has a heck of a lot more memory bandwidth, so anything that's bottlenecked there loves the Cell.

programmer · May 29, 2005 10:35AM

Quote:

Originally posted by New

so what tasks will the cell be strong at?

Video/image/audio/signal processing, 2D & 3D graphics, ray tracing, anything AltiVec does well, numerical processing, etc. In other words, most things that take a lot of compute time. The main trick is that software developers will have to code specifically for the Cell because it is a new execution model and existing software won't "just work".

new · May 29, 2005 11:10AM

Quote:

Originally posted by Programmer

Video/image/audio/signal processing, 2D & 3D graphics, ray tracing, anything AltiVec does well, numerical processing, etc. In other words, most things that take a lot of compute time. The main trick is that software developers will have to code specifically for the Cell because it is a new execution model and existing software won't "just work".

so could an Apple Cell server complement, say, the Virginia Tech supercluster in any effective way?

mi0im · May 29, 2005 3:49PM

Quote:

Originally posted by Henriok

Most applications seem to favour double precision though and Cell isn't as good there.

CELL design team won't agree with you.

http://www.epcc.ed.ac.uk/scicomp/abs...ay.php?Abst=77

henriok · May 29, 2005 4:16PM

Quote:

Originally posted by mi0im

CELL design team won't agree with you.

http://www.epcc.ed.ac.uk/scicomp/abs...ay.php?Abst=77

That link of yours.. Was that supposed to mean anything?

I didn't say Cell was bad; it seems to excel in that respect too. I did say that it wasn't _as_good_ on double floats.

Quoting from IBM's Cell pages:

Peak performance (single precision): > 256 GFlops

Peak performance (double precision): > 26 GFlops

I stand by my claim that Cell isn't as good on double precision floats as it is on single precision. Don't you agree?

But on the other hand, perhaps you were contesting my other statement, that applications seem to favour double precision floats. That's just an observation I've made; I can't back it up with a quote.

mi0im · May 29, 2005 5:08PM

Quote:

Originally posted by Henriok

I stand by my claim that Cell isn't as good on double precision floats as it is on single precision. Don't you agree?

It seems you love big numbers blindly. But HPC applications need an adequate B/F (bandwidth to flops) ratio. CELL's DP FP peak number balances its memory bandwidth.

programmer · May 29, 2005 7:47PM

Quote:

Originally posted by mi0im

It seems you love big numbers blindly. But HPC applications need an adequate B/F (bandwidth to flops) ratio. CELL's DP FP peak number balances its memory bandwidth.

This is extremely dependent on the algorithm(s) involved. Some algorithms do very little computation per memory operation (or they are just poorly coded to work that way). Better are those that manage to do a lot of ops per memory fetch/store. These machines can typically manage a theoretical peak rate of 50-100 single precision FLOPS per cacheline read from memory, assuming proper streaming and prefetching... that is a lot of computation and typically programmers don't come even remotely close to this.
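To put rough numbers on the B/F argument, here is a small Python sketch. The peak figures are the ones quoted from IBM earlier in the thread. The 25.6 GB/s main memory bandwidth is an assumption added for illustration, and the kernel is a hypothetical double precision `y = a*x + y` loop that counts one read of x, one read of y and one write of y per element.

```python
def hardware_bytes_per_flop(bandwidth_gb_s, peak_gflops):
    """Bytes of main memory bandwidth available per peak flop."""
    return bandwidth_gb_s / peak_gflops

def kernel_bytes_per_flop(bytes_moved, flops):
    """Bytes an algorithm needs to move per flop it performs."""
    return bytes_moved / flops

MEMORY_BANDWIDTH_GB_S = 25.6  # assumed, not stated in the thread

cell_sp_bf = hardware_bytes_per_flop(MEMORY_BANDWIDTH_GB_S, 256.0)
cell_dp_bf = hardware_bytes_per_flop(MEMORY_BANDWIDTH_GB_S, 26.0)

# y[i] = a * x[i] + y[i] in double precision:
# 2 flops per element, 3 accesses of 8 bytes each.
axpy_bf = kernel_bytes_per_flop(bytes_moved=3 * 8, flops=2)

print(f"Cell SP hardware B/F: {cell_sp_bf:.2f}")
print(f"Cell DP hardware B/F: {cell_dp_bf:.2f}")
print(f"axpy kernel B/F:      {axpy_bf:.1f}")
```

With these inputs the double precision B/F comes out close to 1 and the single precision B/F close to 0.1, while the naive kernel wants 12 bytes per flop. A kernel like that is bandwidth bound at either precision, which is why how the algorithm is coded matters so much.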

boemane · May 29, 2005 8:34PM

Quote:

Originally posted by Programmer

Video/image/audio/signal processing, 2D & 3D graphics, ray tracing, anything AltiVec does well, numerical processing, etc. In other words, most things that take a lot of compute time. The main trick is that software developers will have to code specifically for the Cell because it is a new execution model and existing software won't "just work".

So if the Cell does everything that AltiVec does, and does it well (better than AltiVec), could Apple utilise the Cell as a co-processor and divert code that the Cell CPU excels at to it (kind of like how the AltiVec unit is used now)?

I know the Cell CPU is probably a lot more expensive than the AltiVec unit in the G5, but I'm thinking of top-of-the-line PowerMacs. Having this kind of processing power would make it a LOT faster than its competitors on the WinTel side...

Dual 3 GHz G5s where each G5 has its own Cell co-processor running at 2.8 - 3 GHz.

I know this is probably way too expensive, but on the other hand PS3 and Xbox 360 throw in 6 of these Cell CPUs in a package for under 400 dollars...

Just a thought

programmer · May 29, 2005 9:35PM

Quote:

Originally posted by BoeManE

So if the Cell does everything that AltiVec does, and does it well (better than AltiVec), could Apple utilise the Cell as a co-processor and divert code that the Cell CPU excels at to it (kind of like how the AltiVec unit is used now)?

Why is everyone so fixated on the Cell as a coprocessor? It would be a lot better as a Mac central processor.

Quote:

I know the Cell CPU is probably a lot more expensive than the AltiVec unit in the G5, but I'm thinking of top-of-the-line PowerMacs. Having this kind of processing power would make it a LOT faster than its competitors on the WinTel side...

Actually the Cell is designed to go into low cost game consoles, TVs and other consumer electronics.

brendon · May 29, 2005 10:36PM

Quote:

Originally posted by Programmer

Why is everyone so fixated on the Cell as a coprocessor? It would be a lot better as a Mac central processor.

For me it is the fear that if Apple does use Cell for a main processor, it will be like or worse than Altivec, where it is great technology but few outside of Apple will use it for quite some time. It would be great if there was a way for Apple to put a 970 front end on a Cell-like processor. Barring that, it would be great for Apple to be able to use Cell technology to add AltiVec-like SPEs to a 970. That way, if a programmer had the time to learn about the Cell features and utilize them, that would be great for them. If time did not permit, then all would not be lost; they would have to worry about whether the program would work well given the knowledge that they have of 970 PPC programming. For me this is a transition step: a coprocessor, or better, incorporating Cell-like technology into the 970 line to get the ball rolling.

programmer · May 29, 2005 11:19PM

Quote:

Originally posted by Brendon

For me it is the fear that if Apple does use Cell for a main processor, it will be like or worse than Altivec, where it is great technology but few outside of Apple will use it for quite some time.

Three points:

1) There is no worry about nobody programming for Cell, given that Sony, Toshiba and IBM are pushing it. Game consoles and supercomputers... that will attract a fair bit of developer interest. Add Apple and you've got desktops as well.

2) Even if it is only used by the OS (i.e. only Apple codes for it) it will be a big win.

3) The developers who aren't going to use SPEs are the same ones that won't bother to learn how to make the 970 go fast, so 4-5 GHz PPE vs. 2-2.5 GHz 970 won't make much difference.

mi0im · May 30, 2005 6:15AM

Quote:

Originally posted by Programmer

These machines can typically manage a theoretical peak rate of 50-100 single precision FLOPS per cacheline read from memory, assuming proper streaming and prefetching... that is a lot of computation and typically programmers don't come even remotely close to this.

Two points:

(1) CELL SPE has no cache.

(2) I'm talking about DP math used for scientific applications. This presentation describes typical B/F ratio of such applications.

electric monk · May 30, 2005 7:43AM

Quote:

Originally posted by mi0im

(1) CELL SPE has no cache.

Technically yes. What they have are local stores, which are similar to, but not exactly like a cache.

programmer · May 30, 2005 10:43AM

Quote:

Originally posted by mi0im

Two points:

(1) CELL SPE has no cache.

(2) I'm talking about DP math used for scientific applications. This presentation describes typical B/F ratio of such applications.

Right -- slide 13 is what I'm talking about. The B/F of an algorithm. This can be strongly affected by how the algorithm is encoded, and often lazy/naive programmers over-utilize bandwidth. By improving how efficiently bandwidth is used, the B/F ratio is shifted toward FLOPS.

Keep in mind as well that the Cell's SPEs have an extremely high bandwidth local store, and inter-SPE bus. For algorithms that can be arranged to stream through these local stores, the B/F ratio of the hardware can be shifted significantly toward bandwidth... as long as the aggregate bandwidth to main memory doesn't exceed the XDRAM's capability (or the I/O ports, depending on where data is going to/from).

Oh, and the lack of cache in SPEs is considered a good thing. If bandwidth is what you're worried about then finer control over it lets you take maximum advantage of what you have. Caches do not give you fine control. An asynchronous DMA engine to local memory does.
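The "asynchronous DMA to local memory" idea is usually exploited with double buffering: start the transfer of the next chunk, then compute on the chunk that has already arrived. Below is a toy Python sketch of that pattern only. It is not SPE code. The `fetch` function is a hypothetical stand-in for a DMA transfer into a local store, and a worker thread plays the part of the DMA engine.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(memory, start, size):
    """Hypothetical stand-in for a DMA transfer from main memory to a local buffer."""
    return memory[start:start + size]

def stream_sum_of_squares(memory, chunk=4):
    """Process `memory` chunk by chunk, requesting chunk N+1 before computing on chunk N."""
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, memory, 0, chunk)   # kick off the first transfer
        for start in range(0, len(memory), chunk):
            local = pending.result()                    # wait for the current chunk
            next_start = start + chunk
            if next_start < len(memory):
                # Request the next chunk before doing any work on this one.
                pending = dma.submit(fetch, memory, next_start, chunk)
            total += sum(x * x for x in local)          # compute on the local copy
    return total

result = stream_sum_of_squares([1, 2, 3, 4, 5])
print(result)
```

The point of the structure is that the program, not a cache, decides what gets moved and when. On real hardware the transfer and the computation would genuinely overlap; here the thread only illustrates the ordering.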

mikenap · May 30, 2005 11:18AM

sounds like a bunch of B/F to me! hahahahahaha!

brendon · May 30, 2005 5:02PM

Quote:

Originally posted by Programmer

Right -- slide 13 is what I'm talking about. The B/F of an algorithm. This can be strongly affected by how the algorithm is encoded, and often lazy/naive programmers over-utilize bandwidth. By improving how efficiently bandwidth is used, the B/F ratio is shifted toward FLOPS.

Keep in mind as well that the Cell's SPEs have an extremely high bandwidth local store, and inter-SPE bus. For algorithms that can be arranged to stream through these local stores, the B/F ratio of the hardware can be shifted significantly toward bandwidth... as long as the aggregate bandwidth to main memory doesn't exceed the XDRAM's capability (or the I/O ports, depending on where data is going to/from).

Oh, and the lack of cache in SPEs is considered a good thing. If bandwidth is what you're worried about then finer control over it lets you take maximum advantage of what you have. Caches do not give you fine control. An asynchronous DMA engine to local memory does.

OK, FP is important to me and I think to the life of Apple in the scientific community. Double precision is where that is at and it would be great news for me if the SPEs could do double precision FP.

programmer · May 30, 2005 10:23PM

Quote:

Originally posted by Brendon

OK, FP is important to me and I think to the life of Apple in the scientific community. Double precision is where that is at and it would be great news for me if the SPEs could do double precision FP.

Well then be happy -- it is 100% certain that the SPEs and the PPE do double precision floating point. The Cell's aggregate performance at DP, however, is only about 2.5x that of a single 2.7 GHz 970, so it's nothing to write home about. On the other hand, a great many algorithms that currently use double precision do so out of laziness instead of any actual need for that level of precision.
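As a small illustration of the precision point, here is a Python sketch that simulates single precision by rounding through the standard `struct` module. The sample data (8-bit pixel values) and the helper names are hypothetical, chosen only to show one case where single precision loses nothing and one where it runs out.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mean_f32(values):
    """Average with every intermediate result rounded to single precision."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return f32(acc / len(values))

# Case 1: averaging 8-bit pixel values. Every partial sum is a small
# integer, which single precision represents exactly.
pixels = list(range(256))
mean_single = mean_f32(pixels)
mean_double = sum(pixels) / len(pixels)

# Case 2: counting past 2**24. Single precision has a 24-bit significand,
# so adding 1 to 16777216 rounds straight back to 16777216.
stuck = f32(16777216.0 + 1.0)

print(mean_single, mean_double, stuck)
```

In the first case the single and double precision means are identical, so the extra bits buy nothing. In the second the single precision counter stops advancing, which is the kind of situation where double precision is an actual need rather than a habit.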
