970GX and low power 970s for PowerBooks


Comments

  • Reply 21 of 43
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Programmer

    You would only get that impression if you didn't bother reading what I wrote. AltiVec was an excellent addition because they did it right and they did it once. The cost/benefit of adding AV was very good because Apple invested in it heavily (as have other key developers), and it was well designed and implemented.



    The problem is that I have read what you have written and am bothered by it. It is a very good thing that they did a very good job with the first implementation of AltiVec; I think everybody can agree with that. The problem is that the world doesn't stop after the applause: just as the rest of PPC has been improved, there is an opportunity to do the same with the vector unit.

    Quote:



    Where do you get that? I thought I was quite clear in saying that developers need to weigh the cost against the benefits. AltiVec has been a substantial performance win from the day it was introduced, and Apple's continuing commitment to it has raised confidence in its longevity (unlike numerous other technologies). As a result more and more developers start using it, and it gains momentum. If there were multiple versions to support (yes, even counting extensions) that process would have to start all over again for each revision.



    I guess we will have to disagree here. If you really want to get buy-in from new developers, show them that there is a future in the unit. As it is with the 970, vector performance didn't really improve much at all clock for clock over the G4.

    Quote:



    Because changing the instruction set isn't necessary. The 601 used the original PPC instruction set and it essentially hasn't changed since then, yet we have the massively faster 970. AltiVec has room to improve without changing the ISA.



    AltiVec certainly can improve, but expanding the instruction set shouldn't be off limits with respect to those improvements. Even the FPU has benefited over time from new instructions, and most people have been pleased with that. As long as the programming model for existing instructions doesn't change, I don't really see a problem.

    Quote:

    Again you are just ignoring what I'm writing. SMT/more cores doesn't require anything different than supporting Apple's existing dual processor machines, and even if you don't do that the user still benefits from multiple threads due to OS X. More execution units don't change the ISA. And I'm not talking about saving IBM work... you'll note that POWER5 didn't change the ISA either. Adding SMT to the POWER5 sped up existing software. All existing software.



    Well, this is wrong: SMT on POWER5 did not speed up all existing code; IBM's own web site reviews cases where that simply isn't true. POWER5 is faster due to a number of improvements, not just SMT. SMT can be a performance negative, just as in the Intel world; apparently IBM's approach is much better than Intel's, but there are still situations where it just doesn't help.



    The point is that AltiVec can be extended to speed up future code without impacting current code. Some of those improvements could be the result of new instructions or data types.



    Thanks

    Dave
  • Reply 22 of 43
    marcuk Posts: 4,442 member
    I don't understand: if a new AltiVec was fully backwards compatible with the old ISA, why would it be a bad thing if there were new instructions for doubles?



    Existing AltiVec code still runs fine, but new code could get 2x performance, or 64-bit vectors.



    Where is the AltiVec thread you speak of? I can't find it.
  • Reply 23 of 43
    The fact that it is lower power means that it will be able to run at faster clock speeds at lower temps. That's the idea, guys! The lower power consumption allows it to go faster without frying the chip.
  • Reply 24 of 43
    Quote:

    Originally posted by MarcUK

    I don't understand: if a new AltiVec was fully backwards compatible with the old ISA, why would it be a bad thing if there were new instructions for doubles?



    Existing AltiVec code still runs fine, but new code could get 2x performance, or 64-bit vectors.



    Where is the AltiVec thread you speak of? I can't find it.




    The existing AltiVec code would still run fine on machines that have AltiVec 2. The problem is that if someone wrote code for AltiVec 2, only the processors supporting AV2 would be able to run the program, and the developer would have to write a fallback in AV or scalar code for older processors. It translates into additional work (in other words, investments of time and money) for the AltiVec programmer, as he/she'd have to write more processor-specific code, only to have the new code working for a tiny fraction of the installed base. This can only be worth it if performance gains are truly huge (as they are with regular AltiVec compared to scalar).
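
    To make that concrete, here is a minimal C sketch of the dispatch boilerplate that kind of split implies. The hw.optional.altivec sysctl is the real Mac OS X feature flag; the hw.optional.altivec2 flag and the AV2 routine are purely hypothetical, since no AltiVec 2 exists:

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/sysctl.h>

/* Query a Mac OS X boolean feature flag; 0 if absent or unknown. */
static int has_feature(const char *name)
{
    int val = 0;
    size_t len = sizeof(val);
    return (sysctlbyname(name, &val, &len, NULL, 0) == 0) ? val : 0;
}

/* Scalar fallback -- runs on any PPC. */
static void scale_scalar(float *dst, const float *src, int n, float k)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* The AltiVec and hypothetical AltiVec 2 versions would go here;
 * stubs keep the sketch self-contained. */
static void scale_altivec(float *dst, const float *src, int n, float k)
{
    scale_scalar(dst, src, n, k);   /* imagine a vec_madd() loop here */
}

static void scale_av2(float *dst, const float *src, int n, float k)
{
    scale_scalar(dst, src, n, k);   /* imagine AV2 instructions here */
}

typedef void (*scale_fn)(float *, const float *, int, float);

/* Pick the best version once at startup. */
scale_fn pick_scale(void)
{
    if (has_feature("hw.optional.altivec2"))    /* hypothetical flag */
        return scale_av2;
    if (has_feature("hw.optional.altivec"))     /* real flag */
        return scale_altivec;
    return scale_scalar;
}
```

    Every routine a developer vectorizes needs a row in that table, and AltiVec 2 would add a third column to write, maintain, and test.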



    As I'm sure somebody else has said: tweak the implementation of existing AltiVec on the processor for more effectiveness and remove practical limitations. Neither old nor new code will stop working anywhere.



    As for SMT or an IMC: use the transistors on these technologies instead. Both will earn speed increases. Writing code for SMT will not alienate processors without SMT.
  • Reply 25 of 43
    dfiler Posts: 3,420 member
    Interesting discussion. Perhaps an abstraction of the specifics would be enlightening...



    I would hazard to assert that for every technological platform, there is an optimal balance between advancement and compatibility.



    If the platform stagnates, users lose, in that they might have been more productive with additional improvements. Yet, at the same time, developers can concentrate on delivering code optimized for a rather homogeneous platform.



    If the platform is constantly evolving, users should theoretically have more powerful tools in their hands sooner. But... developers will need to acquire the skills for each variation and tailor code to take advantage of each revision.



    With AltiVec, I think that currently we have more to gain from uniformity and experienced programmers than from an incremental improvement that fragments the platform.
  • Reply 26 of 43
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Zapchud

    The existing AltiVec code would still run fine on machines that have AltiVec 2. The problem is that if someone wrote code for AltiVec 2, only the processors supporting AV2 would be able to run the program, and the developer would have to write a fallback in AV or scalar code for older processors.



    Unlike some here, I don't see this as an additional burden. Developers already have to do this for PPC without AltiVec, or they phase out support for older processors. It is not an issue.

    Quote:



    It translates into additional work (in other words, investments of time and money) for the AltiVec programmer, as he/she'd have to write more processor-specific code, only to have the new code working for a tiny fraction of the installed base. This can only be worth it if performance gains are truly huge (as they are with regular AltiVec compared to scalar).



    Any performance gains from an AltiVec 2, no matter what the update implements, will likely be very domain specific. For those applications where additional functionality in the vector unit improves performance, there will be no resistance whatsoever to adopting the new technology. Look at it this way: if whole industries switch to PPC simply because of its performance running certain genomics codes, do you really think leaving older processors behind is a concern? The point is that performance on yesterday's processors isn't a big issue considering where AltiVec is being used extensively.

    Quote:



    As I'm sure somebody else has said: tweak the implementation of existing AltiVec on the processor for more effectiveness and remove practical limitations. Neither old nor new code will stop working anywhere.



    This is as much a part of AltiVec 2 as anything. The reality is that if you are going to add execution units or tweak other things, you might as well consider new operations and data types. Either way, old code isn't a problem.

    Quote:

    As for SMT or an IMC: use the transistors on these technologies instead. Both will earn speed increases. Writing code for SMT will not alienate processors without SMT.



    Again I have to disagree: code written specifically for a machine supporting SMT is very likely to alienate processors without SMT. At the very least you will see a huge difference in performance.



    In any event there is a huge surplus of transistors right now in IBM's 970 implementations. One could implement SMT, an integrated memory controller, and a host of other functionality and still not be at the size of an Intel chip. We could argue about finding the right balance, but it is already clear from POWER5 that simply adding SMT will not fill the chip to the same area as one Prescott.



    It will be rather sad to have the main core improve continuously and not see any attention paid to the vector side of the chip. On the 970, vector performance is already lagging the G4 in some respects (thankfully other parts of the chip compensate). I just can't see why we have this big resistance to doing better, or at least to paying the vector unit the same attention we see being applied to the other components of the processor.



    Thanks

    dave
  • Reply 27 of 43
    dfiler Posts: 3,420 member
    Quote:

    Originally posted by wizard69

    I just can't see why we have this big resistance to doing better, or at least to paying the vector unit the same attention we see being applied to the other components of the processor.



    There is absolutely no resistance here "to doing better". Do you really characterize this discussion that way?



    This discussion is obviously about the tradeoffs involved with changing or extending the current implementation of AltiVec. Some are arguing that the proposed additions are not worth it at this point in time. Why assume that they want to keep AltiVec the same for all of eternity?
  • Reply 28 of 43
    Quote:

    Originally posted by wizard69

    Unlike some here, I don't see this as an additional burden. Developers already have to do this for PPC without AltiVec, or they phase out support for older processors. It is not an issue.



    They already have to do it once, if at all. Having to do it twice is an issue.



    Quote:

    Any performance gains from an AltiVec 2, no matter what the update implements, will likely be very domain specific. For those applications where additional functionality in the vector unit improves performance, there will be no resistance whatsoever to adopting the new technology. Look at it this way: if whole industries switch to PPC simply because of its performance running certain genomics codes, do you really think leaving older processors behind is a concern? The point is that performance on yesterday's processors isn't a big issue considering where AltiVec is being used extensively.



    I'm not sure what you're trying to say here.



    Quote:

    This is as much a part of AltiVec 2 as anything. The reality is that if you are going to add execution units or tweak other things, you might as well consider new operations and data types. Either way, old code isn't a problem.



    It has nothing to do with AltiVec 2. Refining and tweaking execution units does not change the programming interface. See the difference between the 7400 and 7450 classes of CPUs.



    Quote:

    Again I have to disagree: code written specifically for a machine supporting SMT is very likely to alienate processors without SMT. At the very least you will see a huge difference in performance.



    Why is it likely to alienate processors without SMT? As Programmer said, creating SMT-optimized code is no different from creating SMP-optimized code. Multithreaded code does work fine on single-threaded processors.

    The difference is not likely to be huge, but well worth the transistor and development cost.
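
    To illustrate the point, here is a minimal sketch using the real Mac OS X hw.ncpu sysctl and plain pthreads (error handling trimmed): the same binary sizes its thread pool to whatever the machine reports, so a single-core G4, a dual 970, or a future SMT chip all run identical code.

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

#define MAX_THREADS 32

static void *worker(void *arg)
{
    /* ... each thread processes its share of the work ... */
    return NULL;
}

int main(void)
{
    pthread_t threads[MAX_THREADS];
    int ncpu = 1;
    size_t len = sizeof(ncpu);
    int i;

    /* hw.ncpu reports logical processors, so an SMT core counts twice. */
    if (sysctlbyname("hw.ncpu", &ncpu, &len, NULL, 0) != 0 || ncpu < 1)
        ncpu = 1;
    if (ncpu > MAX_THREADS)
        ncpu = MAX_THREADS;

    for (i = 0; i < ncpu; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < ncpu; i++)
        pthread_join(threads[i], NULL);

    printf("ran %d worker thread(s)\n", ncpu);
    return 0;
}
```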



    Quote:

    In any event there is a huge surplus of transistors right now in IBM's 970 implementations. One could implement SMT, an integrated memory controller, and a host of other functionality and still not be at the size of an Intel chip. We could argue about finding the right balance, but it is already clear from POWER5 that simply adding SMT will not fill the chip to the same area as one Prescott.



    What's up with having to create a chip as physically large as the Prescott? Add some cache if you aren't satisfied with the die size.



    I'd say implementing the above on the 970 would create much more of a performance increase than AltiVec 2 overall, while having none of the disadvantages.



    Quote:

    It will be rather sad to have the main core improve continuously and not see any attention paid to the vector side of the chip. On the 970, vector performance is already lagging the G4 in some respects (thankfully other parts of the chip compensate). I just can't see why we have this big resistance to doing better, or at least to paying the vector unit the same attention we see being applied to the other components of the processor.



    No one has said or suggested that the vector unit should be forgotten about and left in the dust. The 970 vector unit is lagging the G4's in some respects because it is less refined and tweaked. But you can still have one AltiVec code base that works on both CPUs. You don't need to extend or change the ISA in any way to "fix" the 970 vector unit.
  • Reply 29 of 43
    Quote:

    Originally posted by wizard69

    The problem is that I have read what you have written and am bothered by it. It is a very good thing that they did a very good job with the first implementation of AltiVec; I think everybody can agree with that. The problem is that the world doesn't stop after the applause: just as the rest of PPC has been improved, there is an opportunity to do the same with the vector unit.



    My point is that there is no need to change the ISA to accomplish this. The 970's vector implementation isn't nearly as strong as it could be, even without changing the ISA. The nature of most vector code tends to mean that throwing more execution units at it will yield a nearly linear speedup.



    Quote:

    AltiVec certainly can improve, but expanding the instruction set shouldn't be off limits with respect to those improvements. Even the FPU has benefited over time from new instructions, and most people have been pleased with that. As long as the programming model for existing instructions doesn't change, I don't really see a problem.



    There have been something like three instructions added to the FPU since the 601, and they are rarely used. Since all processors since the 601 (except perhaps the 603, I can't remember offhand) have implemented them, developers can, at this point, use them without worrying about compatibility.



    Quote:

    Well, this is wrong: SMT on POWER5 did not speed up all existing code; IBM's own web site reviews cases where that simply isn't true. POWER5 is faster due to a number of improvements, not just SMT. SMT can be a performance negative, just as in the Intel world; apparently IBM's approach is much better than Intel's, but there are still situations where it just doesn't help.



    I phrased that badly... SMT works with all existing code, and in cases where it can help performance the OS can adjust the thread priorities so that it does. Where it doesn't, it can effectively be turned off. This logic is built into AIX, and it uses the POWER5's rather prodigious self-monitoring capabilities.



    Even better: SMT, IMC, and more cache can be left out of future processors without impacting the software installed base.



    Quote:

    The point is that AltiVec can be extended to speed up future code without impacting current code. Some of those improvements could be the result of new instructions or data types.



    Yes, it could. My objection is that doubling the register sizes and/or adding double precision support is hugely expensive in terms of transistors, and it provides marginal benefits to most applications (and only those that are rewritten to use the new instructions... not likely to happen until the number of machines in the market with these capabilities has reached a level that makes it practical). Double precision support is only an improvement over the dual FPUs if you also go to 256-bit registers... a very significant expense for a small fraction of potential applications.
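
    A rough sketch of why (my own illustration, not anything from IBM's documentation): the 970's two scalar FPUs already retire two independent double-precision operations per iteration if you unroll by hand, which is exactly what a 128-bit vector of doubles would give you.

```c
/* Dot product unrolled by two so the 970's dual FPUs each get an
 * independent fused multiply-add per iteration. A 128-bit vector
 * register holds only two doubles, so a hypothetical "vector double"
 * instruction would retire the same two results per op -- no win
 * unless the registers grow to 256 bits (four doubles at a time). */
double dot(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;

    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i]     * b[i];       /* feeds FPU 1 */
        s1 += a[i + 1] * b[i + 1];   /* feeds FPU 2 */
    }
    if (i < n)                       /* odd-length tail */
        s0 += a[i] * b[i];
    return s0 + s1;
}
```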



    And don't discount the importance of not forcing complexity on all following chips. AltiVec, as it stands, is a fairly hefty investment that Apple is stuck with -- fortunately it has proven to have a substantial payoff and significant software investment has already been made. Doing the 256-bit registers + double precision math would saddle all of Apple's future chips with this heavy cost. If Apple wants to, for example, do a 4-core SMT processor with really long pipelines and a super high clock rate, they can't if they've tied themselves to an AltiVec unit that is 2-3 times as complex.



    Perhaps there is an instruction or two that could be added cheaply, but unless they have some really revolutionary instructions (possible, but unlikely) the potential benefit is hardly something to get excited over (just as those extra FPU instructions were of only passing interest).





    Oh, and by the way.
  • Reply 30 of 43
    onlooker Posts: 5,252 member
    Damn, you guys are at each other and I'm not even talking for once. Makes me feel better.
  • Reply 31 of 43
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by onlooker

    Damn, you guys are at each other and I'm not even talking for once. Makes me feel better.



    We aren't going to let you off that easy; time to weigh in with your perspective!



    It is probably worth noting that we are not going to come to an agreement here. Even if they don't expand the instruction set, I think everyone does agree on one thing: we would love to see effort put into ratcheting up vector performance on the 970.



    Then again I suppose there is an element that would not want to see better performance.



    Dave
  • Reply 32 of 43
    Quote:

    Originally posted by wizard69

    Even if they don't expand the instruction set, I think everyone does agree on one thing: we would love to see effort put into ratcheting up vector performance on the 970.



    Of course.
  • Reply 33 of 43
    Quote:

    Originally posted by wizard69

    Then again I suppose there is an element that would not want to see better performance.



    There's always a stick in the mud somewhere.





    I am really curious to see what IBM is going to do in the POWER6. If they can't make headway in terms of clock rate, what will they turn to? Adding a vector unit to the POWER family would be interesting, but a bit of an also-ran. Will IBM go out on a limb in their flagship server product and try something radical? What would that look like?



    And what comes after the next thing in the 970 family? The next thing is apparently going to be a refined 90nm process, larger caches, maybe slightly stretched pipelines, a small clock rate bump, and a version with twin cores. After that, however, is a more interesting question. Will we get SMT? IMC? Or something more radical/unconventional? On-chip I/O perhaps?
  • Reply 34 of 43
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Programmer

    There's always a stick in the mud somewhere.





    I am really curious to see what IBM is going to do in the POWER6. If they can't make headway in terms of clock rate, what will they turn to? Adding a vector unit to the POWER family would be interesting, but a bit of an also-ran. Will IBM go out on a limb in their flagship server product and try something radical? What would that look like?



    Well, I'm not convinced that clock rate growth is completely gone, but there is an obvious need to increase performance through other alternatives. One place to look for that payoff would be additional execution units.



    This is beyond AltiVec optimizations and would involve special-purpose execution units that enhance things such as networking and cryptography. I also still believe that enhanced instructions will play a role in the future. After all, if you can't speed up the clock rate and you have gotten as wide as possible with the cores, then the only thing really left is to implement instructions that do more. The FPU is one place such enhancements would pay off; on the other hand, a whole new execution unit (or an adapted vector unit) to do BCD math could pay off for some usage.



    It was my understanding at one time that IBM had a POWER variant that was extended to do BCD math. I'm sure they have a number of things up their sleeves. I still believe one of those would be an improved vector component optimized more for scientific applications than signal processing.

    Quote:



    And what comes after the next thing in the 970 family? The next thing is apparently going to be a refined 90nm process, larger caches, maybe slightly stretched pipelines, a small clock rate bump, and a version with twin cores.



    For me the question is how soon we will see these. I don't really see all that small of a clock rate bump either; I think a 500 MHz gain to 3 GHz should be easy on a reoptimized process/core.



    For small systems, though, one thing I have to think IBM and Apple must seriously be looking at is high-integration devices beyond simply dual core. Here I'm talking about SoC devices, with the driver being higher performance for certain I/O.

    Quote:

    After that, however, is a more interesting question. Will we get SMT? IMC? Or something more radical/unconventional? On-chip I/O perhaps?



    As noted above, I don't see on-chip I/O as being radical at all. The drivers in this area will be higher performance and lower cost, in that order. I do suspect, though, that we will see IBM/Apple grow into this slowly, possibly with an integrated memory controller sitting next to high-speed buses such as HyperTransport or PCI Express. This would actually be a nice machine, with the low-pin-count I/O buses going directly to the I/O chips.



    Right now I see the 970's bus as a big drag on low-cost, high-performance machinery. Put the DMA/memory interface on chip along with the I/O bus and you have an avenue to low cost and high performance. The current arrangement with the 970 really doesn't permit that and is not likely ever to be useful in low-power devices.



    Of course others would see these sorts of ideas as questionable. They will be implemented in part in the near future, though. Things such as an IMC have payoffs in both performance and power usage, so there is an existing need.



    Dave
  • Reply 35 of 43
    Ironically, if IBM goes for a SIMD unit in the POWER6 it might very well be AltiVec 2. What is reasonable for a high-end server with no legacy of vector code is quite different from what is appropriate to a desktop/laptop machine.



    The specialized units you describe, along with on-chip I/O, are appealing because their existence can be entirely hidden behind the OS. These things aren't radical in the embedded market, but the desktop market hasn't seen them yet. It'll be interesting to see if Apple moves their chipset IP onto the processor by working closely with IBM's designers.



    500 MHz? That's only a 20% increase from today, and I don't consider it particularly dramatic.
  • Reply 36 of 43
    Programmer,



    Please note that SIMD in next-generation microprocessors has a different role from the one it used to have. Present high-performance processors consume too much power in the instruction-sequencing units that manage deep OoOE. To achieve both high IPC and low power consumption, several companies plan to use in-order or simple out-of-order execution pipelines with SIMD. In such processors, the native ISA will be converted to internal SIMD instructions by software or hardware, so there is no need to change the ISA.

    IBM's ultra-high-frequency microprocessor research and Intel's PARROT are two examples of such architectures.
  • Reply 37 of 43
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Programmer

    500 MHz? That's only a 20% increase from today, and I don't consider it particularly dramatic.



    Well, yeah, it may only be a 20% increase in performance, but that shouldn't be condemned when one considers that 500 MHz used to represent the maximum clock rate of whole computers. In other words, that 500 MHz is equivalent to the computing performance of a machine that would still be useful today. Such a boost would not go unnoticed by the average user.



    If Apple could manage this much of a boost every six months, I'd be very happy with them. As we have seen, this hasn't happened consistently at all with Apple products. It is the thought that 3 GHz is still doable that has me excited about that 20% increase. Sure, a 50% increase would be fantastic, but that doesn't look promising at all.



    Dave
  • Reply 38 of 43
    Quote:

    Originally posted by mi0im

    Programmer,



    Please note that SIMD in next-generation microprocessors has a different role from the one it used to have. Present high-performance processors consume too much power in the instruction-sequencing units that manage deep OoOE. To achieve both high IPC and low power consumption, several companies plan to use in-order or simple out-of-order execution pipelines with SIMD. In such processors, the native ISA will be converted to internal SIMD instructions by software or hardware, so there is no need to change the ISA.

    IBM's ultra-high-frequency microprocessor research and Intel's PARROT are two examples of such architectures.




    Thanks for the links; those are interesting papers. I didn't read them in depth, but it's not clear to me how the conversion to SIMD would be accomplished unless it was some form of recompilation (a la Transmeta -- and vectorizers have rarely been effective to date) or a simple remapping to avoid duplicating execution-unit functionality (i.e. the operations don't become true SIMD; the extra results are just discarded). Anything more complex would make for a huge decoder, which is expressly what they are attempting to eliminate.



    The IBM paper on in-order high-frequency processors brings up some interesting issues. The value of that compared to lower-frequency OOOE superscalar processors is far from clear, however. All I can tell you is that from my perspective the OOOE processor is much easier to deal with and achieve decent performance with. The in-order, high-frequency/high-latency processor can very easily be made to perform very poorly. On carefully crafted code and problems well suited to its design, it can be made to perform very well... but the majority of code does not fall into that camp.





    Particularly interesting is the Intel admission that their x86 decoder is a huge power hog.
  • Reply 39 of 43
    Quote:

    Originally posted by wizard69

    Well, yeah, it may only be a 20% increase in performance, but that shouldn't be condemned when one considers that 500 MHz used to represent the maximum clock rate of whole computers. In other words, that 500 MHz is equivalent to the computing performance of a machine that would still be useful today. Such a boost would not go unnoticed by the average user.



    If Apple could manage this much of a boost every six months, I'd be very happy with them. As we have seen, this hasn't happened consistently at all with Apple products. It is the thought that 3 GHz is still doable that has me excited about that 20% increase. Sure, a 50% increase would be fantastic, but that doesn't look promising at all.




    You misunderstand me -- I think they'll get another 500 MHz or so, and that's it. The extra 20% will be a nice little boost but will deliver noticeably less than a 20% improvement in system performance for most tasks. And it will be water cooled. After that, further clock rate increases will come only from designs like those discussed in the papers that mi0im linked to, which means we'll lose the benefits of OOOE and various other things.



    More interesting is sticking to 2.5 GHz and adding a few more cores. The first multi-core chip out of the gate will double the number of cores and net us something like a 60-90% performance improvement at a system level, or on software that is multithreaded. Right now most software (and in particular most benchmarks) is not multithreaded, but that will change. The majority of things people do that require high performance these days can be parallelized fairly well, and if a task can't be, at least your machine will still operate smoothly while you're running it at full speed.
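
    The kind of parallelization I mean is usually just data decomposition. A minimal pthreads sketch (my own toy example, not from any Apple documentation): each core scales its own half of an array, with no sharing, so the speedup on a dual-core part approaches 2x.

```c
#include <pthread.h>

/* One slice of a data-parallel loop: scale v[lo..hi) by k. */
typedef struct { float *v; int lo, hi; float k; } slice_t;

static void *scale_slice(void *arg)
{
    slice_t *s = (slice_t *)arg;
    int i;
    for (i = s->lo; i < s->hi; i++)
        s->v[i] *= s->k;
    return NULL;
}

/* Split the loop across two cores; the main thread takes one half. */
void scale_parallel(float *v, int n, float k)
{
    pthread_t t;
    slice_t top = { v, n / 2, n, k };
    slice_t bottom = { v, 0, n / 2, k };

    pthread_create(&t, NULL, scale_slice, &top);
    scale_slice(&bottom);           /* runs on the main thread */
    pthread_join(t, NULL);
}
```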
  • Reply 40 of 43
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Programmer

    Thanks for the links; those are interesting papers. I didn't read them in depth, but it's not clear to me how the conversion to SIMD would be accomplished unless it was some form of recompilation (a la Transmeta -- and vectorizers have rarely been effective to date) or a simple remapping to avoid duplicating execution-unit functionality (i.e. the operations don't become true SIMD; the extra results are just discarded). Anything more complex would make for a huge decoder, which is expressly what they are attempting to eliminate.



    I'm wondering how many of you consider Transmeta's approach and hardware to be a success? Really!



    Transmeta's processors are really VLIW machines, not SIMD machines, anyway, but to me that is not the point. I see Transmeta's greatest failing as building a processor that cannot be run in a native mode. There may very well be an approach that allows the processor itself to translate a PPC instruction stream on the fly into VLIW instructions, but that just seems like a complete waste of effort. Why not produce a VLIW processor that interprets legacy code yet can run native VLIW code unrestricted?



    An interesting thought just popped into my head - stand back. A VLIW engine would be a good replacement for a SIMD engine. That is, a VLIW engine should be able to emulate a SIMD instruction stream rather easily, maybe with little cost in hardware. A SIMD unit, however, could not emulate a VLIW unit with any sort of performance one would want to write home about. Think about this: AltiVec 2 is simply an expansion of the vector engine into a VLIW engine, with the old SIMD words emulated in hardware. This would vastly improve the utility of the "vector" engine while leaving the rest of the processor unencumbered with stuff that doesn't fit. OK, you can step back in place now!



    Dave