Do the cores of a multicore chip have to be identical?

powerdoc · September 20, 2002 2:03PM

Many rumors say that the next high-end chip used by Apple will be an IBM-made chip with dual cores and an AltiVec unit.

Do you think it's necessary to have two identical cores, each with an AltiVec unit, or is it possible to have only one core that includes AltiVec?

One core with AltiVec and another without could save many transistors and lower the cost of the chip. On the other hand, the number of transistors saved might not be sufficient to make it an advantage.

What do you think? Is it possible, and if so, could it bring anything in terms of price and heat?

kecksy · September 20, 2002 2:14PM

Maybe one core is PPC and the other is x86.

I kid. I kid.

I would expect that both cores would have Altivec, since multi-threaded programs might use Altivec in multiple threads.

roduk · September 20, 2002 2:34PM

As I understand it, the existing G4 has more than one vector processing unit, I think it may be four. If a new dual core chip had two units for each core, or four units for one core and none for the other, would there be much difference in terms of the number of transistors or the heat generated?

[ 09-20-2002: Message edited by: RodUK ]

blabla · September 20, 2002 2:40PM

[quote]Originally posted by RodUK:

As I understand it, the existing G4 has more than one vector processing unit, I think it may be four. If a new dual core chip had two units for each core, or four units for one core and none for the other, would there be much difference in terms of the number of transistors?[/quote]

Well, those 4 units are not identical, and together they form a complete AltiVec core. It wouldn't be very economical to divide the AltiVec core between two Power4 cores, because you would need a double set of AltiVec registers. Remember, registers are really expensive (in number of transistors) to implement.

Using different cores would be possible, but why? Designing two different Power4 cores would require more engineering R&D.

roduk · September 20, 2002 2:44PM

I wonder whether the Altivec units are core specific, or whether they can be shared between cores?

airsluf · September 20, 2002 4:07PM

blabla · September 20, 2002 5:07PM

[quote]Originally posted by RodUK:

I wonder whether the Altivec units are core specific, or whether they can be shared between cores?[/quote]

Doubt it, and I doubt we will see two different cores.

How many transistors does AltiVec require? Well, about 10 million transistors in the 7400 design, and probably only slightly more in the 7450:

http://www.altivec.org/articles/simplify.cfm

"This microprocessor, which is designed to support both desktop computing and high-performance embedded applications contains 10.5 million transistors in a 83 mm^2 die using a .22 µm CMOS technology with six layers of copper metal for interconnect."

So, having AltiVec on a dual-core Power4 would require about 20 million more transistors. If you had two different cores, you would need additional transistors to ensure AltiVec code runs on the AltiVec-enabled core.

http://researchweb.watson.ibm.com/journal/rd/461/warnock.html

"The IBM POWER4 processor is a 174-million-transistor chip that runs at a clock frequency of greater than 1.3 GHz."

The Power4 itself contains 174 million transistors. So adding dual-core AltiVec support would require ~11-12%(?) more transistors. Of course, the Apple version could be a somewhat simpler chip (less cache) than the current Power4, but..

Two different cores would mean a lot of R&D for less than nothing....

So let's be realistic here: if Apple is going to use a _dual-core_ Power4, it's going to be 2 identical AltiVec-supporting cores.

[ 09-20-2002: Message edited by: blabla ]

tht · September 20, 2002 5:44PM

Originally posted by blabla:

How many transistors does AltiVec require? Well, about 10 million transistors in the 7400 design

That 10.5 million number is for the entire 7400 chip. The AltiVec unit takes about a third of that, so it's more like 3.5 million transistors for AltiVec in the 7400 and a little bit more in the 7450.

So, having a dual core Power4 would require about 20 million more transistors.

It'll be around 6 to 9 million transistors total, or 3 to 5 per core. Or about 10% of a hypothetical 60 to 80 million transistor chip. This is all in line with what a 0.13 micron process will give you for a 100 sq mm die.
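As a rough check on the two sets of figures in this thread, here is a minimal Python sketch of the arithmetic. The function name `altivec_share` is made up for illustration, and the revised range assumes 3 to 4.5 million transistors per core (that is, 6 to 9 million total) on a hypothetical 60 to 80 million transistor chip; none of these are measured values.

```python
def altivec_share(altivec_per_core, cores, chip_total):
    """Fraction of a chip's transistor count that AltiVec would represent."""
    return altivec_per_core * cores / chip_total


M = 1_000_000

# Earlier estimate in the thread: ~10M per core, relative to the 174M POWER4.
early = altivec_share(10 * M, 2, 174 * M)

# Revised estimate: 6 to 9M total (3 to 4.5M per core), relative to a
# hypothetical 60 to 80M transistor chip. Both chip sizes are guesses.
low = altivec_share(3 * M, 2, 80 * M)
high = altivec_share(4.5 * M, 2, 60 * M)

print(f"earlier estimate: {early:.1%}")           # 11.5%
print(f"revised range:    {low:.1%} to {high:.1%}")  # 7.5% to 15.0%
```

Either way the share stays in the same ballpark of roughly a tenth of the chip, which is why the per-core cost does not look like a strong argument for asymmetric cores.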

The Power4 itself contains 174 million transistors. So adding dual-core AltiVec support would require ~11-12%(?) more transistors. Of course, the Apple version could be a somewhat simpler chip (less cache) than the current Power4..

The Power4 has 1.5 MB L2 cache which is about 50 million transistors. It also has on-die logic for crossbar switches for the L2, L3 memory tags, GX bus, and chip-to-chip bus. The L3 tags and 1.5 MB L2 cache take up half of the die space of the Power4.

The addition of a separate SIMD unit is not a problem die-space- or transistor-wise. Since, in all likelihood, the GPUL will only have 0.5 MB of L2 and reduced support for L3 (<32 MB, probably around 4 to 8 MB), if it has L3 at all, there will be plenty of transistors for SIMD units.

So let's be realistic here: if Apple is going to use a _dual-core_ Power4, it's going to be 2 identical AltiVec-supporting cores.

Yes. Very true. The SIMD support will either be a separate unit or will make use of the integer and floating point units depending on the time and money invested. But asymmetric cores would be totally out of the question.

[ 09-20-2002: Message edited by: THT ]

blabla · September 20, 2002 5:55PM

[quote]It'll be around 6 to 9 million transistors total, or 3 to 5 per core.[/quote]

Oops, sorry.. [Embarrassed]

But if your transistor estimates are correct, adding full AltiVec support is definitely not going to make much difference in die size.

[ 09-20-2002: Message edited by: blabla ]

powerdoc · September 21, 2002 1:08AM

Thanks for the answers: it appears that everybody agrees that the cores have to be identical. 4 million transistors per core is not a big issue.

THT, by a separate AltiVec unit, do you mean an AltiVec unit like the one in the G4 chip, as opposed to the SIMD unit of the x86 chips?

tht · September 21, 2002 1:20PM

Originally posted by Powerdoc:

THT, by a separate AltiVec unit, do you mean an AltiVec unit like the one in the G4 chip, as opposed to the SIMD unit of the x86 chips?

Yes. The G4 has a specific execution unit that executes AltiVec instructions, while the x86 chips use their integer and floating point units to execute SIMD instructions. I get the feeling that this GPUL will use its integer and floating point units to execute its SIMD instructions...

blabla · September 22, 2002 7:01AM

[quote]I get the feeling that this GPUL will use its integer and floating point units to execute its SIMD instructions...[/quote]

Wouldn't that actually require more engineering effort? (Assuming IBM somehow already has the G4 AltiVec implementation, though I would imagine it's not as easy as just slapping on AltiVec.) The POWER4 integer unit probably doesn't support saturated arithmetic and such. And even if it did, in some cases AltiVec would still be much faster.

I actually suspect reusing the POWER FP units would be easier. And anyway, much of the complexity of AltiVec is caused by the permute unit. I'm pretty sure you have written about it somewhere in the looong Power4 thread, but life is too short to read long threads...
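For anyone unfamiliar with the term, saturated arithmetic clamps a result at the largest (or smallest) representable value instead of wrapping around. AltiVec offers both modulo and saturating forms of its integer adds; the Python sketch below only mimics the idea on unsigned 8-bit values, one element at a time. The function names and the sample pixel data are made up for illustration.

```python
def wrapping_add_u8(a, b):
    """Modulo add: the result wraps around past 255, as plain integer units do."""
    return (a + b) & 0xFF


def saturating_add_u8(a, b):
    """Saturating add: the result is clamped at 255 instead of wrapping."""
    return min(a + b, 0xFF)


pixels = [200, 250, 10, 128]   # illustrative 8-bit brightness values
brighten = 100

wrapped = [wrapping_add_u8(p, brighten) for p in pixels]
saturated = [saturating_add_u8(p, brighten) for p in pixels]

print(wrapped)    # [44, 94, 110, 228]   bright pixels turn dark
print(saturated)  # [255, 255, 110, 228] bright pixels stay white
```

Without hardware support, the clamp costs extra compare-and-select work on every element, which is the kind of overhead a plain integer unit would add.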

matsu · September 22, 2002 9:22AM

Bearing in mind that I know absolutely nothing about the magic electronic bits that make computers go, may I venture a theory.

AltiVec seems like a VERY good technology to me. The kinds of tasks that digital media requires -- encoding and decoding large sets of data and streaming media -- are fast becoming the core of the 'computing experience' as we know it. Voice, images, and video, each discretely or in combination, demand a fast, efficient way of turning 'codecs' into experiences. IBM didn't go for AltiVec initially, maybe through pride or some such politics, but they mentioned it in the original Sahara promotional groundwork and this new chip has "over 160" special instructions. I think they're going for it now.

Not that Altivec is some kind of magic cure for an antique FSB and last year's fab tech, but it works. Of course P4 works too, and quite well. They're the most powerful consumer desktop chips out there by a fair margin. The approach may be different though -- ramp up the clock and give it a FSB to match and it will stream data very well, and it does.

So what does IBM want? Do they want to depend on a fast FSB and lots and lots of clocks in order to get big throughput? It's a strategy that works, but it seems like a really good solution that doesn't depend on clock speed orthodoxy is sitting right there waiting for them. I think they'll use it 'cause, so long as the FSB is sufficient to keep it fed, they'll have an answer that gives them more for less in the long run, and that is a very PPC quality. Furthermore, Motorola has made enough of a mess at this point that IBM jumping in would be seen as a saviour and not an admission by IBM that, yes, Moto was right about AltiVec all along.

powerdoc · September 22, 2002 11:22AM

Another question:

Which is more expensive: one dual-core chip or two single-core chips?

Like Matsu, I do not know the magic of electronics, but I am ready to bet that the dual-core chip will be cheaper and more powerful (better communication between the two cores than between two chips, especially with the lame MPX bus).

A single high-end chip in the Power Mac line would seem logical. It would also be good for Apple's other products: they would not have to limit the performance of these products for lame marketing reasons, and they would get the fastest G4 available.

And for the AltiVec unit, let's guess that they'll choose to make one similar to that of the 745x series. As THT said, his bet of a 0.5 MB L2 cache and an L3 cache controller supporting between 2 and 8 MB sounds logical for a 0.13 micron SOI chip with fewer than 100 million transistors (60% of the transistors of the Power4). But saying that, I realise that this chip would be as hot as the 0.18 micron Athlon; perhaps IBM will go directly to a 0.09 micron GPUL. In that case we will not see it before MWNY.

matsu · September 22, 2002 11:46AM

Again, bear with me, 'cause I don't think it came out in my earlier post. I don't really know what I'm talking about, so my probing lacks a certain efficiency.

The P4 does its SIMD duties using its Int and FP units, while the G4's AltiVec is its own separate unit. Is this what allows AltiVec to be 128 bits wide on an otherwise 32-bit processor, and what Apple refers to in their little anti-Pentium propaganda graphic? You know, the one with the little blocks getting crushed in the Pentium's narrow data path? I'm probably missing something, but it seems to me that running the AltiVec functionality through the FP and Int units would effectively limit its width to that of the CPU. In this case 64 bits (not as bad as 32), but not as good as the 128 bits of parallelism in the original AltiVec either.

Somebody ???

powerdoc · September 22, 2002 11:58AM

[quote]Originally posted by Matsu:

Again, bear with me, 'cause I don't think it came out in my earlier post. I don't really know what I'm talking about, so my probing lacks a certain efficiency.

The P4 does its SIMD duties using its Int and FP units, while the G4's AltiVec is its own separate unit. Is this what allows AltiVec to be 128 bits wide on an otherwise 32-bit processor, and what Apple refers to in their little anti-Pentium propaganda graphic? You know, the one with the little blocks getting crushed in the Pentium's narrow data path? I'm probably missing something, but it seems to me that running the AltiVec functionality through the FP and Int units would effectively limit its width to that of the CPU. In this case 64 bits (not as bad as 32), but not as good as the 128 bits of parallelism in the original AltiVec either.

Somebody ???[/quote]

You are right (from what I have read in several forums, including Ars Technica). AltiVec is the only thing in a G4 that is better than in the x86 chips. The FP unit is also better, but there is only one. The FP unit of the P4 shares some area with the integer unit, unlike the Athlon, which has 3 fully pipelined FP units; but these units are different and specialised, and thus cannot do three multiplications per cycle, for example.

mmicist · September 22, 2002 1:06PM

[quote]Originally posted by Matsu:

Again, bear with me, 'cause I don't think it came out in my earlier post. I don't really know what I'm talking about, so my probing lacks a certain efficiency.

The P4 does its SIMD duties using its Int and FP units, while the G4's AltiVec is its own separate unit. Is this what allows AltiVec to be 128 bits wide on an otherwise 32-bit processor, and what Apple refers to in their little anti-Pentium propaganda graphic? You know, the one with the little blocks getting crushed in the Pentium's narrow data path? I'm probably missing something, but it seems to me that running the AltiVec functionality through the FP and Int units would effectively limit its width to that of the CPU. In this case 64 bits (not as bad as 32), but not as good as the 128 bits of parallelism in the original AltiVec either.

Somebody ???[/quote]

1) P4's SSE2 instructions do not use the integer and fp units or registers to do their work, they have separate units.

2) Using the existing units in POWER4 to try and do the VMX instructions would give you a tremendous slowdown relative to true vector units, especially in permute and masked instructions (precisely what makes AltiVec so much more powerful than SSE2), you might as well not bother, scalar code would be just as fast. (As an aside, do you think IBM would allow themselves to build a vector unit with *much* worse performance than Motorola's G4?)
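To make the permute point concrete, here is a minimal Python sketch of what a byte permute of the AltiVec `vperm` kind does: each byte of a control vector uses its low 5 bits to pick one byte out of the 32 bytes formed by two source vectors. The function name and the sample data are illustrative only, and the model says nothing about how the hardware implements it.

```python
def vperm(a, b, control):
    """Byte permute over two 16-byte vectors.

    Each control byte's low 5 bits index into the 32-byte
    concatenation of a and b.
    """
    assert len(a) == len(b) == len(control) == 16
    table = list(a) + list(b)
    return [table[c & 0x1F] for c in control]


a = list(range(0x00, 0x10))   # bytes 0..15
b = list(range(0x10, 0x20))   # bytes 16..31

# Reverse the bytes of a with a single permute.
reversed_a = vperm(a, b, list(range(15, -1, -1)))

# Interleave the first 8 bytes of a with the first 8 bytes of b.
interleave_ctl = [i // 2 + (16 if i % 2 else 0) for i in range(16)]
interleaved = vperm(a, b, interleave_ctl)

print(reversed_a)
print(interleaved)
```

A dedicated unit does all 16 selections at once; a scalar integer unit would need a load, mask, and store per byte, which is where the slowdown described above would come from.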

michael

[ 09-22-2002: Message edited by: mmicist ]

powerdoc · September 22, 2002 1:14PM

[quote]Originally posted by mmicist:

(As an aside, do you think IBM would allow themselves to build a vector unit with *much* worse performance than Motorola's G4?)

michael

[ 09-22-2002: Message edited by: mmicist ][/quote]

That's a good point.

matsu · September 22, 2002 2:36PM

I don't think so either. I was trying to say, if anything, that I expect IBM to use pretty much the same AltiVec as Motorola, only on a generally much faster CPU, with 64 bits, a faster FSB, and the accordingly greater throughput that goes with it.

tht · September 23, 2002 10:15AM

Originally posted by mmicist:

1) P4's SSE2 instructions do not use the integer and fp units or registers to do their work, they have separate units.

The SSE2 instructions, all SIMD instructions, are dispatched to the same execution unit used by floating point instructions. They just have different registers. You'll note that 64 bit SSE2 ops have a port latency of 2 cycles. That should make you go "hmmm" right there. However, architecture descriptions explicitly say that all SIMD instructions go to the same execution unit as FP instructions.

2) Using the existing units in POWER4 to try and do the VMX instructions would give you a tremendous slowdown relative to true vector units, especially in permute and masked instructions (precisely what makes AltiVec so much more powerful than SSE2), you might as well not bother, scalar code would be just as fast.

Yes, it'll be interesting to see how IBM implements the permute instructions and such, but I'm imagining the possibility of using 2x64 bit FP SIMD instructions as something too good to pass up. But tremendous slowdown, perhaps. Perhaps they'll permute instruction execution to the integer units.

(As an aside, do you think IBM would allow themselves to build a vector unit with *much* worse performance than Motorola's G4?)

If the presumption in the former is true, you may have a point, but what if it isn't true? This hypothetical GPUL has a few advantages that the G4 will not, so things may even out, and in some cases be considerably better.

mmicist · September 23, 2002 4:35PM

[quote]Originally posted by THT:

{Originally posted by mmicist:

1) P4's SSE2 instructions do not use the integer and fp units or registers to do their work, they have separate units.}

The SSE2 instructions, all SIMD instructions, are dispatched to the same execution unit used by floating point instructions. They just have different registers. You'll note that 64 bit SSE2 ops have a port latency of 2 cycles. That should make you go "hmmm" right there. However, architecture descriptions explicitly say that all SIMD instructions go to the same execution unit as FP instructions.[/quote]

I must learn not to post when I'm so tired. It's only some of the integer instructions that have their own execution units, not the fp ones.

[quote]{2) Using the existing units in POWER4 to try and do the VMX instructions would give you a tremendous slowdown relative to true vector units, especially in permute and masked instructions (precisely what makes AltiVec so much more powerful than SSE2), you might as well not bother, scalar code would be just as fast.}

Yes, it'll be interesting to see how IBM implements the permute instructions and such, but I'm imagining the possibility of using 2x64 bit FP SIMD instructions as something too good to pass up. But tremendous slowdown, perhaps. Perhaps they'll permute instruction execution to the integer units.[/quote]

Yes, the two FPUs would be able to do 2x64-bit SIMD instructions, but this would be an extension to AltiVec, and wouldn't be any faster than using standard FPU instructions except in rare cases, so I can't see any developer bothering. Unfortunately, using the FPUs to do the 4x32-bit FP SIMD instructions would slow things down by a factor of 2.

As far as integer instructions are concerned, you could mostly do it with modified integer units, but not the permute instructions.

The masking elements of the instructions would almost certainly slow down implementations using the scalar execution units.
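The factor of 2 can be seen with a deliberately crude throughput model, sketched below in Python. It assumes each execution unit handles one element per pass and ignores latency, pipelining, and register-port limits, so it only illustrates the lane-count argument; the function name is made up.

```python
import math


def passes_needed(lanes, units):
    """Passes required to process `lanes` elements when `units` run in parallel."""
    return math.ceil(lanes / units)


# 4x32-bit FP SIMD on a true 4-wide vector unit versus on two scalar FPUs.
vector_unit = passes_needed(lanes=4, units=4)
two_fpus = passes_needed(lanes=4, units=2)
slowdown = two_fpus / vector_unit

# 2x64-bit SIMD on two FPUs: one pass, the same as issuing two scalar FP ops.
double_simd = passes_needed(lanes=2, units=2)

print(vector_unit, two_fpus, slowdown, double_simd)  # 1 2 2.0 1
```

Under these assumptions the 2x64-bit case gains nothing over ordinary FPU code, and the 4x32-bit case takes twice as many passes as a real vector unit.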

[quote]{(As an aside, do you think IBM would allow themselves to build a vector unit with *much* worse performance than Motorola's G4?)}

If the presumption in the former is true, you may have a point, but what if it isn't true? This hypothetical GPUL has a few advantages that the G4 will not, so things may even out, and in some cases be considerably better.[/quote]

It has so many advantages it's difficult to list them all. But why go to all the bother of redesigning the existing units to execute SIMD instructions (they would need new ports to the SIMD registers at the very minimum, and probably a fair bit more) when it would be reasonably simple to add on a proper SIMD unit? You have to add the registers and at least a permute unit anyway, and a separate unit gives greater chances for instruction parallelism and improved performance, at the cost of a few million transistors.

michael
