Official: Larrabee a pile of fail

lemon bon bon. · May 31, 2010 7:39AM

I'll believe it when I see it from Intel.

And what's more.

I'll believe it when I see it from Apple in their computer line.

They've a poor history of gpu updates.

From the legendary Rage (ATI) 16 MB to the Vanilla Nvidia cards in the 'Pro.'

...with 256-512 MB VRAM. *Blows party pheeper. 'Feee-ooOWW...'

1 gig as standard in my consumer gpu cards now. For very little money. Apple. Penny pinchers. *Mutters.

Lemon Bon Bon.

imacmatician · May 31, 2010 3:01PM

Larrabee announced as HPC part:

http://www.intel.com/pressroom/archi...100531comp.htm

marvin · June 1, 2010 1:58AM

Quote:

Originally Posted by iMacmatician

Larrabee announced as HPC part:

http://www.intel.com/pressroom/archi...100531comp.htm

The details at the following link list it as having 32 x 1.2GHz Xeon cores with 500 GFLOPs performance:

http://www.pcworld.com/article/19762..._32_cores.html

That's around the same as an NVidia 8800GTX.

It says the first commercial product will have 50 cores using the 22nm Sandy Bridge process and they have 4 threads per core instead of 2 with current chips. This would mean 200 threads per GPU, which would be like having 200 x 300MHz Stream processors, although not quite as the cores are more capable. Also, they have an unspecified number of vector units but it seems only a few of those so I'm not sure what part they play given that you could run vector code on the CPUs and Intel plan to merge them later on.
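One possible reading of the "200 x 300MHz" comparison, as a rough sketch: the thread count is cores times threads per core, and 300MHz is what each hardware thread gets if a core's cycles are split evenly. The 1.2GHz clock is an assumption borrowed from the 32-core development chip; no clock is given for the 50-core part, and the function names are made up for illustration.

```python
# Figures quoted in the post above. The clock is an ASSUMPTION: it is the
# 1.2 GHz of the 32-core development chip, not a stated spec of the
# 50-core commercial product.
CORES = 50
THREADS_PER_CORE = 4
ASSUMED_CLOCK_MHZ = 1200


def hardware_threads(cores, threads_per_core):
    """Total hardware threads on the chip."""
    return cores * threads_per_core


def clock_share_mhz(clock_mhz, threads_per_core):
    """Even split of one core's cycles among its hardware threads."""
    return clock_mhz / threads_per_core


print(hardware_threads(CORES, THREADS_PER_CORE))            # 200
print(clock_share_mhz(ASSUMED_CLOCK_MHZ, THREADS_PER_CORE))  # 300.0
```

An even split is only a crude model; real multithreaded cores mainly use the extra threads to hide memory latency.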

If it scales up, the commercial product should perform somewhere between a Radeon HD 5850 and a GeForce GTX 295.

This year's Fermi GPUs and AMD's latest are listed at over 2 TFLOPs though so Intel's offering in 2011 will be 1/3 of NVidia's/AMD's in 2010.

The TDP is noted as 125W for a 48-core chip here:

http://www.tcmagazine.com/tcm/news/h...knights-corner

That's pretty good considering the GTX 295 is 289W. Plus, the 295 is just two GPUs sandwiched together so Intel could do the same.

It'll be interesting to see what the pricing comes in at and also how they use this tech for IGPs.

imacmatician · June 1, 2010 7:09AM

Quote:

Originally Posted by Marvin

The details at the following link list it as having 32 x 1.2GHz Xeon cores with 500 GFLOPs performance:

http://www.pcworld.com/article/19762..._32_cores.html

That's around the same as an NVidia 8800GTX.

That chip looks like Larrabee silicon. If it is, it's (1.2 GHz)·(32 cores)·(16 FLOPS/cycle) = 614 GFLOPS performance and 1.23 TFLOPS with FMA. Intel missed clock targets by a lot. For years it was planned for ~1 TFLOPS w/o FMA and ~2 TFLOPS w/ FMA. I believe these are SP values; DP is half that.
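The peak figures above can be reproduced with a few lines. This is only a sketch of the arithmetic in the post; the function name is made up, and it assumes an FMA is counted as two floating-point operations, which is the usual convention for theoretical peaks.

```python
def peak_gflops(clock_ghz, cores, flops_per_cycle, fma=False):
    """Theoretical peak: clock x cores x FLOPs per cycle per core.

    With fma=True each fused multiply-add is counted as two FLOPs,
    which doubles the figure.
    """
    peak = clock_ghz * cores * flops_per_cycle
    return peak * 2 if fma else peak


without_fma = peak_gflops(1.2, 32, 16)
with_fma = peak_gflops(1.2, 32, 16, fma=True)

print(f"{without_fma:.1f} GFLOPS without FMA")   # 614.4 GFLOPS without FMA
print(f"{with_fma / 1000:.2f} TFLOPS with FMA")  # 1.23 TFLOPS with FMA
```

These are upper bounds from the spec sheet, not measured throughput.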

TDP is 300 W.

I don't know where the "Xeon cores" came from.

Quote:

Originally Posted by Marvin

This year's Fermi GPUs and AMD's latest are listed at over 2 TFLOPs though so Intel's offering in 2011 will be 1/3 of NVidia's/AMD's in 2010.

GTX 480 is 1.34 TFLOPS with FMAD.

marvin · June 1, 2010 3:36PM

Quote:

Originally Posted by iMacmatician

I don't know where the "Xeon cores" came from.

Knights Ferry is based on Larrabee but it's not Larrabee and it apparently uses the Xeon processor for the core design - the TDP shouldn't be the 300W noted for the original Larrabee:

http://www.zdnet.co.uk/news/desktop-...-hpc-40089093/

http://www.zdnet.co.uk/news/processo...89094/4/#story

"The silicon in the Knights Ferry development platform is code-named Aubrey Isle, a derivative of Larrabee, with a peak performance promised in excess of 1 teraflop. The cores are based on the Xeon 7500 architecture with 100 new MIC-specific instructions."

Quote:

Originally Posted by iMacmatician

GTX 480 is 1.34 TFLOPS with FMAD.

Yeah but Larrabee is around 500GFlops SP:

"At ISC, Skaugen showed a performance run on a Knights Ferry platform with LU factorization, which is used to implement Linpack. Running this code, the development chip hit 517 gigaflops, a mark Skaugen said was unmatched by any other platform. Skaugen later told me that this was single precision gigaflops, not double precision, which makes the "unmatched" claim somewhat questionable to me."

http://www.hpcwire.com/home/specialf...-95334544.html

It's actually less than 1/3 of what NVidia/AMD can do.
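As a quick check of the ratio, here is a sketch that compares the 517 GFLOPS run against the two GPU figures quoted in this thread. Note that 517 is a measured result while the GPU numbers are theoretical peaks, so the comparison is loose either way; the constant and function names are made up for illustration.

```python
KNIGHTS_FERRY_LU_GFLOPS = 517    # measured single-precision run quoted above
GTX_480_PEAK_GFLOPS = 1340       # theoretical peak with FMAD, quoted above
CLAIMED_GPU_PEAK_GFLOPS = 2000   # the "over 2 TFLOPs" figure from earlier


def fraction_of(reference_gflops, value_gflops=KNIGHTS_FERRY_LU_GFLOPS):
    """How large value_gflops is relative to a reference figure."""
    return value_gflops / reference_gflops


print(f"vs 2 TFLOPS: {fraction_of(CLAIMED_GPU_PEAK_GFLOPS):.2f}")  # 0.26
print(f"vs GTX 480:  {fraction_of(GTX_480_PEAK_GFLOPS):.2f}")      # 0.39
```

So the result is under 1/3 against the 2 TFLOPS figure, but a little over 1/3 against the GTX 480's 1.34 TFLOPS.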

The big advantage Larrabee might have is the development code. I don't think many developers want to learn vectorization languages like OpenCL, CUDA etc and it shows from how few people actually use them. People just want to run x86 code very fast.

I'm still not sure how Knights Ferry will improve IGPs though. NVidia is putting Fermi in a mobile chip but it's for desktop replacements. I guess if you can run x86 code on the IGP then it means a massive boost for normal apps on a laptop and no more lack of features as it's fully programmable but raw performance will probably still suck vs NVidia's/AMD's IGPs and those are already highly programmable.

imacmatician · June 1, 2010 5:55PM

Quote:

Originally Posted by Marvin

Knights Ferry is based on Larrabee but it's not Larrabee and it apparently uses the Xeon processor for the core design -

The die photos of Larrabee and Aubrey Isle are identical. If AI is indeed based on Xeon then there's a heck of a lot of modifications made because one AI core would be half the size of a Nehalem core, or even smaller.

Quote:

Originally Posted by Marvin

the TDP shouldn't be the 300W noted for the original Larrabee:

Well, that's the TDP, and it's consistent with previously rumored Larrabee GPU TDP.

Quote:

Knights Ferry, the first hardware based on MIC, is a co-processor 300w PCIe card with 32 cores running at up to 1.2GHz and managing 128 threads at four threads per core, with 8MB shared coherent cache and up to 2GB of GDDR5 memory. Described as a software development platform, it is intended to lead to Knights Corner, a new design due some time in 2011 or 2012.

Quote:

Originally Posted by Marvin

Yeah but Larrabee is around 500GFlops SP:

On SGEMM it's over a TFLOP. Also, does LU involve FMA or not?

The GTX 480 value I mentioned is theoretical value, just like 1.23 TFLOPS for Larrabee.

If it really is 500 GFLOPS w/ FMA then Intel probably removed a feature from Larrabee when making Aubrey Isle.

marvin · June 1, 2010 9:33PM

Quote:

Originally Posted by iMacmatician

The die photos of Larrabee and Aubrey Isle are identical. If AI is indeed based on Xeon then there's a heck of a lot of modifications made because one AI core would be half the size of a Nehalem core, or even smaller.

They may not be full Xeon cores, there will be parts of a standard Xeon that aren't needed in a co-processor but I imagine that they will be using up a fair amount of space. Here is an image of a GTX 280 die compared to Penryn:

http://rightshift.info/OLD/blog/wp-c...06/gpu_die.jpg

Quote:

Originally Posted by iMacmatician

Well, that's the TDP, and it's consistent with previously rumored Larrabee GPU TDP.

I missed that in the article. That's pretty high for this level of performance and the same as the GTX 295, which has two high-end GPUs inside. The 22nm process will help but that's very high. Maybe the 125W TDP rating on the 48-core was based on the 22nm model.

Quote:

Originally Posted by iMacmatician

On SGEMM it's over a TFLOP. Also, does LU involve FMA or not?

I don't think Knights Ferry will have FMA support because Intel aren't using it until 2011 so the commercial products will have support and we will see the improvement it makes then. Current high-end NVidia and AMD GPUs support it so the performance of Intel's shipping product should look a bit better so long as AMD/NVidia don't increase their performance significantly next year.

futurepastnow · June 1, 2010 10:43PM

"Xeon" is a meaningless marketing name.

1337_5l4xx0r · June 2, 2010 6:21AM

http://arstechnica.com/business/news...m_campaign=rss courtesy of Jon Stokes:

Quote:

You'll recall that Tesla is also a kind of "many-core," vector-heavy, GPU-derived coprocessor aimed at HPC workloads, so in this sense there's considerable overlap with MIC. And the standard thinking, which Intel is happy to promote, goes that MIC is better than Tesla because it's x86 and Tesla isn't, which means that it will be easier to port code to the new processor. So if both Tesla and Knight's Corner are GPU-derived, many-core, floating-point-centric processors with support for plenty of thread- and data-level parallelism, why am I suggesting that the MIC architecture in general is probably a greater danger to Itanium?

The first part of the answer lies in defining what you mean by "easy to port."

The hard part about porting from a multicore or single-core architecture to a many-core architecture is not the ISA transition, it's the fact that you have to redesign most apps from the ground up. Porting to many-core, whether it's Tesla or MIC, requires you to start over from scratch in the vast majority of cases. The end result is that going from x86 to MIC is, for many applications, about the same level of challenge as going from x86 to Tesla, because you have to start over from the application and algorithm design phase.

Note that none of this is to say that Intel MIC and Tesla are the same—they're different in some very fundamental respects, not the least of which are the facts that the MIC cores are better for general-purpose computing, and that MIC has a real virtual memory implementation. My only point is that regular x86, MIC, and Tesla represent three different architectures, and to go from x86 to either MIC or Tesla means that you have to start over.

Some may object to the claim that you absolutely must start over if you go from x86 to MIC, because MIC is a collection of x86 cores, which would seem to imply that you could just run some vanilla x86 code on a MIC machine. This is true, of course, but why would you ever want to do that? Why would you shell out for a giant, 50-core MIC chip to run some minimally parallel workload on three or four in-order cores, when Intel will sell you a pair of dual-core Atoms for next to nothing? You either rearchitect your application to use a very large number of cores, or you stick with the much cheaper multicore x86 options.

In the end, MIC's attractiveness vs. Tesla will have little to do with its being x86, and more to do with its relative performance per watt on the kinds of workloads that HPC customers care about.
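To make the "rearchitect your application" point concrete, here is a minimal sketch of the same computation written serially and then split into independent chunks. It is not from the article: the function names and the sum-of-squares workload are made up, and Python threads are used only to show the decomposition (CPython's GIL means this pure-Python work will not actually run faster).

```python
from concurrent.futures import ThreadPoolExecutor


def serial_sum_of_squares(values):
    """The 'vanilla' version: one loop, one core."""
    total = 0
    for v in values:
        total += v * v
    return total


def chunked(values, n_chunks):
    """Split values into at most n_chunks contiguous slices."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum_of_squares(values, workers=50):
    """Many-core shape: independent partial sums, then a final reduction."""
    chunks = chunked(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(serial_sum_of_squares, chunks))
    return sum(partials)


data = list(range(1000))
print(serial_sum_of_squares(data) == parallel_sum_of_squares(data))  # True
```

The work of finding chunks that do not depend on each other is the redesign the article is talking about, and it is the same whichever ISA the cores use.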

backtomac · June 2, 2010 7:04AM

Quote:

Originally Posted by 1337_5L4Xx0R

http://arstechnica.com/business/news...m_campaign=rss courtesy of Jon Stokes:

It seems to me that with OpenCL there is no, or little, advantage of x86 cores for GPGPU.

Am I wrong here?

imacmatician · June 2, 2010 10:59AM

Quote:

Originally Posted by Marvin

I missed that in the article. That's pretty high for this level of performance and the same as the GTX 295, which has two high-end GPUs inside. The 22nm process will help but that's very high. Maybe the 125W TDP rating on the 48-core was based on the 22nm model.

SSC is a totally different chip.

Quote:

Originally Posted by Marvin

I don't think Knights Ferry will have FMA support because Intel aren't using it until 2011 so the commercial products will have support and we will see the improvement it makes then. Current high-end NVidia and AMD GPUs support it so the performance of Intel's shipping product should look a bit better so long as AMD/NVidia don't increase their performance significantly next year.

That explains it. So assuming FMA capability wasn't hardware removed, then FMA wasn't one of the "100 new instructions" for Knights Ferry.

Quote:

Originally Posted by FuturePastNow

"Xeon" is a meaningless marketing name.

Figures.

marvin · June 2, 2010 1:34PM

Quote:

Originally Posted by backtomac

It seems to me that with OpenCL there is no, or little, advantage of x86 cores for GPGPU.

Am I wrong here?

There are a lot of advantages. Pretty much every parallel processing software in use today runs on x86 CPUs - mostly distributed/cluster/grid computing setups. When you change the form of parallelism then you'll have to rework the code too but you'll be able to reuse a lot more code from those areas than you can use for GPGPU computing and you should be able to share that code-base with a distributed network.

You get distributed GPU computing too of course but x86 computing has been around much longer and is more prevalent.

The ultimate aim is to run everything from one shared set of processing resources, that's the best way to maximize the use of the hardware. It's clear that we won't be transitioning all software to GPU compute kernels so accelerating x86 code is the better solution despite having to rewrite some of it.

Code familiarity is more significant than some people like to think too. This is an issue with Objective-C. If you write code in certain languages with common syntax for years and suddenly you have to migrate to another format, it can be very difficult to do. Same goes for debugging tools and stability. If your display runs off the same GPU that you do GPGPU tasks on and the driver crashes, you have to reboot. Intel's chips may offer better stability.

As with most of these types of development, you can only tell the real advantages and disadvantages when they are in commercial use. NVidia/AMD have a few years head start and don't seem to be having many problems but not many people are developing GPU computing software despite the fact that a huge number of people can use it now. The one important area that will affect most people is video encoding/decoding. If Intel can boost that significantly and NVidia/AMD cannot then they have a huge advantage, almost to the point that NVidia's/AMD's efforts will have been worthless.

bitemymac · June 2, 2010 3:16PM

Quote:

Originally Posted by Marvin

NVidia/AMD have a few years head start and don't seem to be having many problems but not many people are developing GPU computing software despite the fact that a huge number of people can use it now. The one important area that will affect most people is video encoding/decoding. If Intel can boost that significantly and NVidia/AMD cannot then they have a huge advantage, almost to the point that NVidia's/AMD's efforts will have been worthless.

The video encoding/decoding option is something that is available now. It's even built into ATI driver 10.4, which provides a video encoding option under advanced mode, and some third-party software takes advantage of GPU video encoding on the Windows platform.

It's just a matter of someone writing the application to use the GPU instead of the CPU for these tasks, even on OS X. Is Intel ready for this?

hiro · June 2, 2010 9:13PM

Quote:

Originally Posted by Marvin

There are a lot of advantages. Pretty much every parallel processing software in use today runs on x86 CPUs - mostly distributed/cluster/grid computing setups. When you change the form of parallelism then you'll have to rework the code too but you'll be able to reuse a lot more code from those areas than you can use for GPGPU computing and you should be able to share that code-base with a distributed network.

You get distributed GPU computing too of course but x86 computing has been around much longer and is more prevalent.

The ultimate aim is to run everything from one shared set of processing resources, that's the best way to maximize the use of the hardware. It's clear that we won't be transitioning all software to GPU compute kernels so accelerating x86 code is the better solution despite having to rewrite some of it.

Code familiarity is more significant than some people like to think too. This is an issue with Objective-C. If you write code in certain languages with common syntax for years and suddenly you have to migrate to another format, it can be very difficult to do. Same goes for debugging tools and stability. If your display runs off the same GPU that you do GPGPU tasks on and the driver crashes, you have to reboot. Intel's chips may offer better stability.

As with most of these types of development, you can only tell the real advantages and disadvantages when they are in commercial use. NVidia/AMD have a few years head start and don't seem to be having many problems but not many people are developing GPU computing software despite the fact that a huge number of people can use it now. The one important area that will affect most people is video encoding/decoding. If Intel can boost that significantly and NVidia/AMD cannot then they have a huge advantage, almost to the point that NVidia's/AMD's efforts will have been worthless.

Pretty much agreed. You can only wring so much improvement from GPGPU, even several hundred percent improvement in one thread is small potatoes compared to a well designed cloud-aware distributed application. GPGPU doesn't scale arbitrarily, and it requires thinking about implementing the algorithms in a completely new mindset. Scaling across ridiculously expanded numbers of garden variety general purpose CPU cores can scale far more aggressively. You need to consider the data structures and algorithms in the beginning of application development to do that well, but they are implemented in ways that most programmers are already familiar with. THEN you can sprinkle GPGPU goodness on top of that and get the best of both worlds -- something you cannot get if you only depend on GPGPU acceleration.

programmer · June 4, 2010 10:09AM

Quote:

Originally Posted by backtomac

It seems to me that with OpenCL there is no, or little, advantage of x86 cores for GPGPU.

Am I wrong here?

That the cores are x86 (or its 64-bit extension) is pretty much irrelevant. The important part is that they are fully functional general purpose processors with heavy duty vector units tacked on. No matter how hard Intel/AMD try, your existing x86 single-threaded (or lightly threaded) app is not going to take advantage of massively concurrent hardware, and thus you're going to have to rewrite it. Since nobody writes in assembly language anymore, it doesn't matter what the ISA is... except for when trying to leverage SIMD (and OpenCL hides that to a large extent). The advantage is that a large array of x86s is going to be more flexible and able to cope with a larger variety of workloads than a GPU. It should be easier to achieve peak GFLOP numbers on a massively parallel CPU than on a GPGPU. The latter's peak numbers are highly theoretical, except on a very small set of workloads, and ought to be regarded with skepticism.

SIMD utilization is very important, which you cannot achieve in practice in portable code without something like OpenCL. On Larrabee, for example, the difference between scalar code and efficient SIMD code is greater than an order of magnitude (i.e. 10x). So between organizing code into parallelized tasks and using SIMD, most existing code needs to be re-written for any of this hardware anyhow.
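A back-of-the-envelope sketch of why the gap is that large: if the 16 FLOPS/cycle figure quoted earlier in the thread corresponds to 16 single-precision SIMD lanes per core (an assumption here), scalar code that fills one lane can reach at most 1/16 of peak. The function name is made up, and 614.4 GFLOPS is the no-FMA peak computed earlier in the thread.

```python
SIMD_LANES = 16         # ASSUMPTION: 16 FLOPS/cycle read as 16 SP lanes
PEAK_GFLOPS = 614.4     # 1.2 GHz x 32 cores x 16, from earlier in the thread


def effective_gflops(peak_gflops, lanes_used, lanes=SIMD_LANES):
    """Upper bound on throughput when only some SIMD lanes do useful work."""
    if not 0 <= lanes_used <= lanes:
        raise ValueError("lanes_used must be between 0 and lanes")
    return peak_gflops * lanes_used / lanes


print(f"{effective_gflops(PEAK_GFLOPS, 1):.1f}")    # 38.4 (scalar code)
print(f"{effective_gflops(PEAK_GFLOPS, 16):.1f}")   # 614.4 (full SIMD)
```

A 16x ceiling between the two cases is consistent with "greater than an order of magnitude".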

programmer · June 4, 2010 10:13AM

Quote:

Originally Posted by Hiro

Pretty much agreed. You can only wring so much improvement from GPGPU, even several hundred percent improvement in one thread is small potatoes compared to a well designed cloud-aware distributed application. GPGPU doesn't scale arbitrarily, and it requires thinking about implementing the algorithms in a completely new mindset. Scaling across ridiculously expanded numbers of garden variety general purpose CPU cores can scale far more aggressively. You need to consider the data structures and algorithms in the beginning of application development to do that well, but they are implemented in ways that most programmers are already familiar with. THEN you can sprinkle GPGPU goodness on top of that and get the best of both worlds -- something you cannot get if you only depend on GPGPU acceleration.

This is all very heavily algorithm dependent. There are plenty of examples where compute clusters are very obviously the way to go, and there are many where the network connections cripple the algorithm and an array of GPUs is the only option. In either case, the data structures and algorithms need to be carefully considered from the start... "sprinkling GPGPU goodness on top" doesn't actually work that well, it needs to be factored in from the get-go.

hiro · June 4, 2010 12:30PM

Quote:

Originally Posted by Programmer

This is all very heavily algorithm dependent. There are plenty of examples where compute clusters are very obviously the way to go, and there are many where the network connections cripple the algorithm and an array of GPUs is the only option. In either case, the data structures and algorithms need to be carefully considered from the start... "sprinkling GPGPU goodness on top" doesn't actually work that well, it needs to be factored in from the get-go.

Of course it needs to be built in from the get-go, that's the whole point of my earlier post, you cannot do any sort of efficient bolt-on after-the-fact parallelism.

But it is unlikely to get much scalability in a single thread algorithm that can be accelerated through GPGPU compared to a multi-thread CPU distributed algorithm which someplace performs the same GPGPU accelerate-able calculation. A 1000x acceleration for an efficient single thread GPGPU algorithm still pales to a 1000 client * 200x GPGPU speedup in distribution. Sure we give up max per thread performance, who cares when you actually architect for scale across many not optimal but still very good clients.

Can you imagine the throughput difference any of the classics like SETI-at-home or Folding-at-home could gain with OpenCL-enabled clients that know how to use 64 local CPU cores at the same time? That's not far off, just a couple of years for high-end desktop systems. Less than a decade for low-end systems. Now imagine what new stuff we could do with access to that much distributable power. What can you do with an iPad or iPhone type device if you can access that flavor of non-local computation? Human I/O is slow enough to be able to do some pretty remarkable stuff behind the scenes.
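The arithmetic behind that comparison, as a sketch under the ideal assumption that the work splits perfectly and clients never wait on each other (the function name is made up):

```python
def aggregate_speedup(clients, per_client_speedup):
    """Ideal aggregate speedup: no communication cost, perfect work split."""
    return clients * per_client_speedup


single_machine = aggregate_speedup(1, 1000)   # one client, 1000x GPGPU
distributed = aggregate_speedup(1000, 200)    # 1000 clients, 200x each

print(single_machine)                  # 1000
print(distributed)                     # 200000
print(distributed / single_machine)    # 200.0
```

Under that assumption the distributed setup comes out 200 times ahead even though each client is five times slower.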

programmer · June 4, 2010 3:15PM

Quote:

Originally Posted by Hiro

But it is unlikely to get much scalability in a single thread algorithm that can be accelerated through GPGPU compared to a multi-thread CPU distributed algorithm which someplace performs the same GPGPU accelerate-able calculation. A 1000x acceleration for an efficient single thread GPGPU algorithm still pales to a 1000 client * 200x GPGPU speedup in distribution. Sure we give up max per thread performance, who cares when you actually architect for scale across many not optimal but still very good clients.

I agree with what you're saying overall, but my point was that some algorithms don't do well when distributed across client machines. There are plenty of cases where 1000 client machines with 1000x GPU acceleration are exactly the same speed as 1 client machine with 1000x GPU acceleration. Cloud computing is not just another annoying buzzword... it will enable great stuff, but it's not a panacea and doesn't obviate the need for computational power in the client node.
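One way to put numbers on this is Amdahl's law, which is not named in the post but models the same effect: only the fraction of the work that can be spread across clients benefits from adding clients. A minimal sketch, with a made-up function name and illustrative fractions:

```python
def distributed_speedup(clients, distributable_fraction):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n).

    p is the fraction of the work that can be spread across n clients;
    the remainder runs at single-client speed.
    """
    p = distributable_fraction
    if not 0.0 <= p <= 1.0:
        raise ValueError("distributable_fraction must be in [0, 1]")
    return 1.0 / ((1.0 - p) + p / clients)


for p in (0.0, 0.99, 1.0):
    print(p, round(distributed_speedup(1000, p), 1))
```

With nothing distributable, 1000 clients are exactly as fast as one, and even at 99% distributable the gain from 1000 clients stays below 100x.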

hiro · June 4, 2010 7:13PM

Quote:

Originally Posted by Programmer

I agree with what you're saying overall, but my point was that some algorithms don't do well when distributed across client machines. There are plenty of cases where 1000 client machines with 1000x GPU acceleration are exactly the same speed as 1 client machine with 1000x GPU acceleration. Cloud computing is not just another annoying buzzword... it will enable great stuff, but it's not a panacea and doesn't obviate the need for computational power in the client node.

I agree totally. It is the new stuff that will wow us in the cloud, not the stuff we already do. I'm working on one of those forward-looking, it-had-better-scale-massively kinds of projects. You may also be one of the few who understand my near daily pain of railing against "We can go thin client because we are going to use the cloud!". Ungh.

programmer · June 4, 2010 10:26PM

Quote:

Originally Posted by Hiro

I agree totally. It is the new stuff that will wow us in the cloud, not the stuff we already do. I'm working on one of those forward-looking, it-had-better-scale-massively kinds of projects. You may also be one of the few who understand my near daily pain of railing against "We can go thin client because we are going to use the cloud!". Ungh.

LOL... yeah, I feel your pain. The reality is that compute is needed on both sides of the latency/bandwidth bottleneck.
