I thought I would take a bit of time out and try to speculate on what there might be in the G5 that would make it a considerably faster chip than the G4. Why not, it's a holiday.
My suspicion is that the rumours of about 58 million transistors and a very large die may be close to the mark. Apple would have said to themselves: let's move away from the attitude of embedded device designers for this thing, as it seems to be limiting the performance unacceptably, and allow ourselves a huge area and a lot of power dissipation from the start. The power dissipation and cost can be dealt with in later revisions for portables and low cost devices (or they can use G4 class devices).
There are two main ways of making a faster chip: increase the clock frequency, and increase the work done each clock cycle.
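As a rough way of seeing how the two combine, here is a minimal sketch that treats overall speed as clock rate multiplied by work per clock. The function name and the ratios fed into it are made up purely for illustration; they are not predictions for the G5.

```python
def relative_speed(clock_ratio, work_per_clock_ratio):
    """Speed-up over a baseline chip, treating throughput as
    clock rate x work done per clock cycle."""
    return clock_ratio * work_per_clock_ratio

# Hypothetical ratios, for illustration only:
# a 1.5x faster clock combined with 1.3x more work per clock.
print(relative_speed(1.5, 1.3))
```

The point is simply that the two gains multiply, so modest improvements on both fronts add up to a large overall gain.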
First off, I don't actually know what the limiting factor on the scaling of the G4 is. Something (or several things) in the chip is causing it to have problems at high frequencies, and I will just assume this has been mended (at least to a certain extent) in the G5.
From what I can see, the G5 has a 10 stage pipeline instead of the 7 stages of the latest G4s. This will certainly help raise the clock, since each stage has less work to do per cycle. Otherwise I suspect speed improvements will come mainly from process changes and corrections of errors in other areas.
These changes require relatively little addition to the chip area: a few extra rename registers to cope with the deeper pipeline and execution units.
For the increase in work per clock cycle, I suspect the most important element would be an improved memory system. This seems to include an enlarged L2 cache of 512K, which would mean an extra 13 million or so transistors, but little increase in area. I really hope they have an embedded memory controller using DDR RAM, as this would massively increase the bandwidth and reduce the latency (delay) of the memory system, and of itself give a considerable increase in work per clock cycle. If the memory controller could handle dual channels (128 bits wide), this would be killer bandwidth: up to 5 times the maximum of current machines, which appear to have usable bandwidth well below their low theoretical limit. That is vital for things like streaming video processing and indeed scientific modelling.
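The numbers above can be sanity-checked with a bit of arithmetic. This is only a sketch under assumptions of mine that are not stated in the post: that the L2 grows by 256K to reach 512K, that the cache uses standard 6-transistor SRAM cells (tags and control logic ignored), that current machines use a 64 bit single data rate bus at 133MHz, and that the dual channel DDR runs at 166MHz. The function names are just for illustration.

```python
def sram_transistors(kilobytes, transistors_per_bit=6):
    """Transistors in the data array of an SRAM cache
    (assumes 6T cells; tags and control logic are ignored)."""
    return kilobytes * 1024 * 8 * transistors_per_bit

def peak_bandwidth_mb_s(bus_bits, clock_mhz, transfers_per_clock):
    """Theoretical peak bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return bus_bits // 8 * clock_mhz * transfers_per_clock

# Assumption: the L2 grows by 256K to reach 512K.
extra_l2 = sram_transistors(256)             # 12582912, i.e. about 13 million

# Assumption: 64 bit single data rate SDRAM at 133MHz today.
current = peak_bandwidth_mb_s(64, 133, 1)    # 1064 MB/s

# Assumption: dual channel (128 bit) DDR at 166MHz.
dual_ddr = peak_bandwidth_mb_s(128, 166, 2)  # 5312 MB/s

print(extra_l2, current, dual_ddr, dual_ddr / current)
```

With those assumptions the extra cache comes to roughly 12.6 million transistors, and the dual channel DDR peak is about 5 times the current theoretical maximum, which lines up with the figures quoted above. Slower DDR would give a smaller ratio.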
The other increase in work per clock cycle would presumably come from increasing the number of execution units. As far as the integer units are concerned, there probably isn't much point in putting in more units, as it is unlikely much more instruction level parallelism can be found, although another unit for address generation to take full advantage of the new memory system would maybe be worthwhile.
Increasing the number of floating point units would almost certainly provide a worthwhile speed-up, as most FPU intensive code is not heavily branched, but looped. I would expect to see at least one more multiply/add unit in the FPU, and the new memory system would be able to keep it fed with the data it needs.
As far as Altivec is concerned, rumours that they have had to work hard to get the same per clock efficiency as the G4 units would suggest that they have not significantly changed the number of execution units here, although I personally would like it if they have a double precision vector FPU to match the Pentium 4's capabilities in this regard.
At a system level, the inclusion of point to point busses would considerably improve multi-processor performance, and I expect to see at least two system busses, either RapidIO or HyperTransport. I would opt for HyperTransport as the more likely, especially if they have been able to tap into AMD work on the busses for their upcoming Hammer series chips. (Note, I discount the possibility of AMD fabbing the G5 for Apple, as they currently have no excess capacity at their Dresden plant, and are looking to outsource the production of some of their own devices. Sharing design work on peripherals etc., however, I think quite likely.)
So my guess is two 16 bit HyperTransport busses, 400MHz, double pumped and bidirectional, or maybe one 32 bit one. This is limited by the number of pins required for each bus (each 16 bit bus uses 103 pins).
The only problem with this analysis is that it doesn't leave enough pins (if you believe the approx. 550 pin package rumour) for the L3 cache interface, which Motorola have definitely talked about. Personally, I would drop this interface as unnecessary: with 512K of level 2 cache and an onboard memory controller, a ridiculously large and fast L3 would be required in order to see any significant benefit in performance.
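The bus and pin figures above can be checked with a quick sketch. It uses only the numbers from this post (16 bit links, 400MHz, double pumped, 103 pins per bus, roughly 550 pins on the package). The function name is made up for illustration, and I make no attempt to count the pins needed for memory, power and ground, since I don't have figures for those.

```python
def link_bandwidth_mb_s(width_bits, clock_mhz, pumps=2):
    """Peak bandwidth of one link in one direction, in MB/s
    (1 MB = 10**6 bytes); pumps=2 means double pumped."""
    return width_bits * clock_mhz * pumps // 8

per_direction = link_bandwidth_mb_s(16, 400)  # 1600 MB/s each way per bus
two_busses_total = per_direction * 2 * 2      # 6400 MB/s, both directions, two busses

PINS_PER_16BIT_BUS = 103  # figure quoted in the post
PACKAGE_PINS = 550        # rumoured package size, approximate

bus_pins = 2 * PINS_PER_16BIT_BUS             # 206
pins_left = PACKAGE_PINS - bus_pins           # 344 for everything else

print(per_direction, two_busses_total, bus_pins, pins_left)
```

So two busses would eat up 206 of the roughly 550 pins, leaving about 344 for the memory interface, power, ground and everything else, which is why an L3 interface on top looks like a squeeze.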
I'll stop here for now, and put what I think this all means for the performance of the new chip in another post if anybody wants it.
Note that I'm not making any guesses as to when this new chip may be appearing.