Faster G4 - MOTO 7470


Comments

  • Reply 121 of 147
    razzfazz Posts: 728 member
    [quote]Originally posted by Brendon:

    Um... Yes, I was not clear, sorry about that. In the 8500-series chips, or Book E, the design is modular: Apple can choose what execution units they want on the chip without Moto having to do a total redesign. With this in mind, as the transistor count goes up, so does the heat. I think Apple would be better off keeping the memory controller off-chip and trading that functionality for another FPU.

    [/quote]



    Just to clarify, I don't think the memory controller is part of the e500 core's approach to modularity; i.e., unlike an FPU or maybe an AltiVec unit, it is not a mere APU that can be added or removed more or less at will. It's too much of a key component of the processor, and as such too tightly integrated for that, at least as far as I understand it.





    [quote]If MaxBus is not going DDR, then another option would be RapidIO, but if Apple could have it be HT it would be better, in that implementing HT on the CPU for I/O would also allow the CPU access to the system chip, which would give access to the whole system.[/quote]



    RapidIO can do that too.





    [quote]In this design the system chip would act as a traffic cop, allowing all system resources access to each other.[/quote]



    I believe all the new topologies (RIO, HT) are going away from that and moving to switched fabrics instead.



    Bye,

    RazzFazz



    [ 05-28-2002: Message edited by: RazzFazz ]
  • Reply 122 of 147
    razzfazz Posts: 728 member
    [quote]Originally posted by Programmer:

    You're probably right, RazzFazz, but there are some sticky issues with the on-chip memory controller. The biggest one is how to manage a memory pool per processor, both from an OS point of view and from a user-upgrade point of view.

    [/quote]



    Good point, I didn't think of that. For individual threads it shouldn't be that hard (just bind each thread to a single processor), but for the OS itself and any shared libs etc. it's probably quite a challenge - certainly doable, though, as this is exactly the direction AMD is moving in with the Opteron, and Sun and others have been doing NUMA for a long time.
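
    Just to illustrate the "bind each thread to a single processor" part, here's a minimal sketch using the Linux pthread affinity extension as a stand-in (Mac OS X doesn't expose an equivalent public API, so this is purely an illustration of the idea, not how OS X would actually do it):

    [code]
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one CPU so that, in a NUMA setup, the OS
       can allocate its pages from that CPU's local memory pool. */
    static int bind_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        if (bind_to_cpu(0) == 0)
            printf("thread bound to CPU 0\n");
        else
            fprintf(stderr, "binding failed\n");
        return 0;
    }
    [/code]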



    EDIT: Also, note that the NUMA approach allows the RAM throughput to scale linearly with the number of processors.





    [quote]Also a factor is that RapidIO is brand new and there are no devices (or very few at least) which speak that "language". Apple builds all of its devices into a single chip anyhow, so they are going to gain very little from a RapidIO-based system... and they could gain more from the faster and simpler HyperTransport.[/quote]



    I don't know enough about RIO and HT to really judge how much they differ in complexity, but it seems to me that Motorola is more likely to go the RIO route than the HT one, at least as far as the CPU-northbridge interconnect is concerned, since Apple would most likely be the one and only customer of a hypothetical HT variant of such a chip.



    Also, in the case of an on-chip memory controller, the system bus speed issue loses a lot of its relevance, as most of the FSB traffic nowadays is probably made up of CPU-to-RAM transfers.



    Still, I think it's quite possible for Apple to use something like this, similar to the nForce:

    CPU <-RIO-> NB <-HT-> SB

    (if there is going to be a distinction between north- and southbridge in future Macs at all, that is).





    [quote]What my suggestion boils down to is replacing MPX with HyperTransport. In a single processor system this is very straightforward. In an MP system the chipset would need multiple HyperTransport ports, and would have to make each CPU's transactions somehow visible to the other CPUs... a little ugly but doable.

    [/quote]



    Well, dunno, all those AMD CPUs that are going to have HT as their FSB seem to be using on-chip memory controllers, so I'm not quite sure HT is even designed to be used the way you're proposing.





    [quote]Again, you are probably right and Apple will just use what Motorola hands them. Given Moto's track record (and emphasis on embedded designs) I wouldn't mind seeing Apple having more independence in terms of how they implement their systems. I suspect Apple wouldn't mind having that independence either -- most companies don't like having their bread-and-butter depend on another company's R&D so heavily, especially when they are so obviously behind.[/quote]



    I don't think a design based on integrated memory controllers and RapidIO as system interconnect would necessarily be behind a design using HyperTransport and a northbridge-based memory controller. In fact, some of the stuff in the 8540 presentation looks quite nice to me (once you add FP and AltiVec APUs, that is).



    Bye,

    RazzFazz



    [ 05-28-2002: Message edited by: RazzFazz ]
  • Reply 123 of 147
    spooky Posts: 504 member
    Once again, if it's not a G5, what does it matter anyway?
  • Reply 124 of 147
    brendon Posts: 642 member
    OK, I'll venture out on the dumb-*** limb. Please critique this: use TWO MaxBus pipes, one for each CPU, with the chips still able to communicate with each other. The MaxBus pipes would tie into the system chip. This would allow two independent 1+ GB/s pipes into the CPUs, where currently they have to share a single 133 MHz bus. I still think that HT and/or RapidIO will be used with the G5; I just don't think that time is here yet. I guess the system chip would still have to sort out which chip gets what data. Is something like this possible? Could both chips make simultaneous calls and still have the system chip able to respond? OK, enough, I'm out too far on the limb again.
  • Reply 125 of 147
    programmer Posts: 3,458 member
    [quote]Originally posted by Brendon:

    OK, I'll venture out on the dumb-*** limb. Please critique this: use TWO MaxBus pipes, one for each CPU, with the chips still able to communicate with each other. The MaxBus pipes would tie into the system chip. This would allow two independent 1+ GB/s pipes into the CPUs, where currently they have to share a single 133 MHz bus. I still think that HT and/or RapidIO will be used with the G5; I just don't think that time is here yet. I guess the system chip would still have to sort out which chip gets what data. Is something like this possible? Could both chips make simultaneous calls and still have the system chip able to respond? OK, enough, I'm out too far on the limb again.[/quote]



    A wide bus like MPX is harder to build motherboards for, so building two or more of them would be expensive. The chipset would have zillions of pins, and each CPU would still be limited to 1 GB/sec -- a single fast G4 can saturate that in a lot of tasks. It also doesn't let you leverage DDR in a single-processor system.



    The reason I suggested HyperTransport is its high bandwidth (up to 12 times that of MPX), low pin count, and the fact that it has been in shipping products for some time now. That Apple is on the committee is an added bonus.
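
    Rough numbers behind the "up to 12 times" figure -- a back-of-the-envelope sketch, where the MPX side is the familiar 64-bit/133 MHz bus and the HT side assumes a 32-bit link at 800 MHz, double-pumped, counting both directions (my assumptions, not published specs):

    [code]
    #include <stdio.h>

    int main(void)
    {
        /* MPX: 64 bits wide, 133 MHz, one transfer per clock */
        double mpx_gbs = 64.0 / 8.0 * 133e6 / 1e9;            /* ~1.06 GB/s */

        /* Assumed HT config: 32 bits per direction, 800 MHz clock,
           double data rate, both directions active at once */
        double ht_gbs  = 32.0 / 8.0 * 800e6 * 2.0 * 2.0 / 1e9; /* ~12.8 GB/s */

        printf("MPX: %.2f GB/s\n", mpx_gbs);
        printf("HT : %.2f GB/s (%.1fx MPX)\n", ht_gbs, ht_gbs / mpx_gbs);
        return 0;
    }
    [/code]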
  • Reply 126 of 147
    brendon Posts: 642 member
    [quote]Originally posted by Programmer:

    A wide bus like MPX is harder to build motherboards for, so building two or more of them would be expensive. The chipset would have zillions of pins, and each CPU would still be limited to 1 GB/sec -- a single fast G4 can saturate that in a lot of tasks. It also doesn't let you leverage DDR in a single-processor system.

    The reason I suggested HyperTransport is its high bandwidth (up to 12 times that of MPX), low pin count, and the fact that it has been in shipping products for some time now. That Apple is on the committee is an added bonus.[/quote]



    I realize all that, except for the MPX stuff -- thanks. I just needed to venture out on that limb. I think that limb does my soul good; if I go out too far, reality will break the fall! I agree with you about HT: I think you are on the right track, not a limb.
  • Reply 127 of 147
    outsider Posts: 6,008 member
    I think that RapidIO plus an on-die memory controller is probably the best solution for speed in single-processor systems. The problem comes when you introduce multiple processors, as I understand it. But microprocessor designers have already solved a similar problem for multiprocessing with on-die L2 and L3 cache controllers. As I understand it, the cache tags keep an inventory of where certain information is in cache, so each processor can have access to the cache contents of the other processors. I'm sure this could be applied to main memory, although it may increase the size of the processor a lot, because instead of just keeping track of 1-2 MB of cache it would have to keep track of gigabytes of main memory. That may be the obstacle to using some sort of tags for main memory, unless you have a separate chip, dedicated to main-memory tags, that all processors share and update as their contents change.
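
    Roughly what such a tag (directory) scheme would cost, sketched with made-up numbers -- one entry per 32-byte cache line of main memory, a presence bit per CPU plus a state byte:

    [code]
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical directory entry tracking one cache line of main memory:
       which CPUs hold a copy, plus a coherence state. */
    struct dir_entry {
        uint8_t sharers;   /* one bit per CPU (up to 8 CPUs here) */
        uint8_t state;     /* e.g. invalid / shared / modified */
    };

    int main(void)
    {
        double line_bytes = 32.0;                      /* G4 cache line size */
        double ram_bytes  = 2.0 * 1024 * 1024 * 1024;  /* 2 GB of main memory */
        double entries    = ram_bytes / line_bytes;
        double dir_mb     = entries * sizeof(struct dir_entry) / (1024 * 1024);

        printf("directory for 2 GB of RAM: %.0f MB of tag storage\n", dir_mb);
        return 0;
    }
    [/code]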
  • Reply 128 of 147
    programmer Posts: 3,458 member
    [quote]Originally posted by Outsider:

    I think that RapidIO plus an on-die memory controller is probably the best solution for speed in single-processor systems. The problem comes when you introduce multiple processors, as I understand it. But microprocessor designers have already solved a similar problem for multiprocessing with on-die L2 and L3 cache controllers. As I understand it, the cache tags keep an inventory of where certain information is in cache, so each processor can have access to the cache contents of the other processors. I'm sure this could be applied to main memory, although it may increase the size of the processor a lot, because instead of just keeping track of 1-2 MB of cache it would have to keep track of gigabytes of main memory. That may be the obstacle to using some sort of tags for main memory, unless you have a separate chip, dedicated to main-memory tags, that all processors share and update as their contents change.[/quote]



    One way to do this with existing mechanisms is to use the paged memory management unit. Unfortunately it stores its tables in memory, and each processor would need its own copy. You'd end up in the unfortunate situation of the page tables occupying large amounts of memory in each processor's memory pool.
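
    To put a rough number on that, here's a sketch with assumed parameters (a simple flat table, 4 KB pages, 8-byte entries -- nothing like the PowerPC's actual hashed page tables, just an order-of-magnitude illustration):

    [code]
    #include <stdio.h>

    int main(void)
    {
        /* Assumed parameters, purely illustrative */
        double ram_bytes   = 16.0 * 1024 * 1024 * 1024; /* 16 GB of physical RAM */
        double page_bytes  = 4096.0;                    /* 4 KB pages */
        double entry_bytes = 8.0;                       /* 8 bytes per table entry */
        int    cpus        = 4;

        double table_mb = ram_bytes / page_bytes * entry_bytes / (1024 * 1024);

        printf("page tables, one copy   : %.0f MB\n", table_mb);
        printf("replicated on %d CPUs    : %.0f MB total\n", cpus, table_mb * cpus);
        return 0;
    }
    [/code]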



    A more viable alternative is to divide the available physical memory space by processor. The G4 has 36-bit physical addresses (64 GB), so each processor could be given, say, 4 GB of space. A 64-bit processor could be given a much larger chunk of physical space (4000 TB, for example, which would still allow a thousand processors in the same system!). All memory transactions outside of a processor's local physical space would be broadcast on the system's interconnect bus (RapidIO or HT) for the appropriate processor to respond to. In this way no tables would be required. The operating system would then use DMA and the virtual memory system to remap VMM pages into the physical space of the processor that needs them for the best possible performance. Each processor would likely hold the page tables for its own local memory space.
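
    The routing decision in that partitioned scheme is trivial. A tiny sketch, with the 4 GB-per-processor slicing from the example above (names and numbers are made up for illustration):

    [code]
    #include <stdint.h>
    #include <stdio.h>

    #define SLICE_BITS 32                /* 4 GB of physical space per processor */

    /* Which processor "owns" a physical address under the partitioning scheme. */
    static unsigned home_cpu(uint64_t phys_addr)
    {
        return (unsigned)(phys_addr >> SLICE_BITS);
    }

    /* A local access is serviced by the on-chip memory controller; anything
       else would be broadcast on the interconnect (RapidIO or HT) for the
       owning processor to claim. */
    static int is_local(uint64_t phys_addr, unsigned my_cpu)
    {
        return home_cpu(phys_addr) == my_cpu;
    }

    int main(void)
    {
        uint64_t addr = 0x1C0000000ULL;   /* an address in the second 4 GB slice */
        printf("addr 0x%llx -> home CPU %u, local to CPU 0: %s\n",
               (unsigned long long)addr, home_cpu(addr),
               is_local(addr, 0) ? "yes" : "no");
        return 0;
    }
    [/code]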



    Or you could go with a much simpler scheme where all memory is managed by an Apple-controlled chipset and connected via HT. This scheme doesn't require any changes to the OS, and the chipset controller would just need to broadcast all memory transactions to all the HT ports to ensure that all processors knew who had what cache lines. BTW: recent additions to the HT protocol make this kind of use of HT entirely feasible... even likely. :cool:



    [ 05-29-2002: Message edited by: Programmer ]
  • Reply 129 of 147
    outsider Posts: 6,008 member
    Another solution for desktop processors is having two versions of the chip with on-die memory controllers: one with a single core and the other with dual cores. This would limit motherboards to two processor cores, but with some extra engineering you might be able to design one with four cores on a card. And dual processors isn't half bad (it's what we max out at now).
  • Reply 130 of 147
    programmer Posts: 3,458 member
    [quote]Originally posted by Outsider:

    Another solution for desktop processors is having two versions of the chip with on-die memory controllers: one with a single core and the other with dual cores. This would limit motherboards to two processor cores, but with some extra engineering you might be able to design one with four cores on a card. And dual processors isn't half bad (it's what we max out at now).[/quote]



    Good point, a multi-core or hyperthreaded processor effectively gives you two (or more) chips "behind the memory controller". I'd be surprised if this arrived before the G5 though.





    I was just reading about AMD's Hammer chips. They claim that the on-chip memory controller gives about 10-15% speed improvement due to the reduced latencies and higher clock rate in the controller (i.e. processor speed). It is also just an Athlon with a 64-bit instruction decoder and 2 extra pipe stages. Hmmm... add 64-bit, lengthen pipes slightly, and add a memory controller. That sounds kind of familiar...



    Now if only Motorola would add 2 FPUs and 2 more integer units, the 10-stage-pipe G4 could duke it out with the best of them.



    [ 05-29-2002: Message edited by: Programmer ]
  • Reply 131 of 147
    outsider Posts: 6,008 member
    Well, maybe using the e500 core's modularity you could build a processor with two or more of those cores, a shared AltiVec unit, a shared memory controller, and separate L2 caches for each core.
  • Reply 132 of 147
    lowb-ing Posts: 98 member
    OK, here's my take on the 7470:



    It'll be ready for an MWNY introduction, though actual delivery will have to wait a month or two after that. Top speed will be 1.4 GHz. It will support a faster bus, either HT or MPX with DDR, and Apple will have paid for the extra R&D, since Moto doesn't really need it. It will not have an on-chip memory controller (that's for the G5), but will be connected to an Xserve-like memory controller. It will be fabbed on a 0.13-micron process.



    I think NUMA will have to wait until the 64-bit version of the G5 (sure hope there's going to be one) comes along. That's because NUMA and 64-bit both require the OS to be rewritten somewhat in order to take full advantage (that's what I heard, anyway). Better to do it all at once.



    What do you guys think?
  • Reply 133 of 147
    razzfazz Posts: 728 member
    [quote]Originally posted by Programmer:

    This scheme doesn't require any changes to the OS, and the chipset controller would just need to broadcast all memory transactions to all the HT ports to ensure that all processors knew who had what cache lines.[/quote]



    You'd most certainly not want to have the northbridge just broadcast data to all available ports, though. Otherwise, you'd end up with pretty much the same situation as the current shared bus again (at least for transactions from the northbridge to the CPUs), since all the point-to-point NB->CPU (note direction) HT connections would carry exactly the same data.



    Bye,

    RazzFazz
  • Reply 134 of 147
    programmer Posts: 3,458 member
    [quote]Originally posted by RazzFazz:

    You'd most certainly not want to have the northbridge just broadcast data to all available ports, though. Otherwise, you'd end up with pretty much the same situation as the current shared bus again (at least for transactions from the northbridge to the CPUs), since all the point-to-point NB->CPU (note direction) HT connections would carry exactly the same data.[/quote]



    No need to send all the data to anybody except the intended recipient -- but the chipset would need to send the addresses of the cache lines loaded by the other processors. This is required by the bus snoop to avoid cache conflicts. HT supports variable-sized transactions, though, and I think there was something added to the protocol recently to handle exactly this situation (but I'm no expert on HT protocols, so I could be wrong). I did somewhere specifically see a mention of processor <-> chipset interconnects as an HT application, however.
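
    A crude sketch of that routing rule from the chipset's point of view -- all names and structures invented for illustration, and real HT packet handling is obviously far more involved:

    [code]
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CPUS 2

    /* Hypothetical per-port send functions; stand-ins for real HT packets. */
    static void send_data(int cpu, uint64_t addr)  { printf("CPU%d: data  @0x%llx\n", cpu, (unsigned long long)addr); }
    static void send_probe(int cpu, uint64_t addr) { printf("CPU%d: probe @0x%llx\n", cpu, (unsigned long long)addr); }

    /* On a memory read: the full cache line goes only to the requester,
       while the other CPUs get just the address so they can snoop it. */
    static void service_read(int requester, uint64_t addr)
    {
        for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
            if (cpu == requester)
                send_data(cpu, addr);
            else
                send_probe(cpu, addr);
        }
    }

    int main(void)
    {
        service_read(0, 0x2000);   /* CPU 0 reads a line; CPU 1 only snoops it */
        return 0;
    }
    [/code]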
  • Reply 135 of 147
    razzfazz Posts: 728 member
    [quote]Originally posted by Programmer:

    No need to send all the data to anybody except the intended recipient -- but the chipset would need to send the addresses of the cache lines loaded by the other processors.

    [/quote]



    Oh, guess I just misinterpreted your saying "the chipset controller would just need to broadcast all memory transactions" here.





    [quote]I did somewhere specifically see a mention of processor <-> chipset interconnects as an HT application, however.[/quote]



    Certainly. AMD's forthcoming Opteron will connect to the chipset by means of HyperTransport. Note though that the chipset in that case will only handle AGP, PCI, and the other peripherals, and especially will not contain the memory controller (which is on-chip).



    Bye,

    RazzFazz
  • Reply 136 of 147
    programmer Posts: 3,458 member
    [quote]Originally posted by RazzFazz:

    Certainly. AMD's forthcoming Opteron will connect to the chipset by means of HyperTransport. Note though that the chipset in that case will only handle AGP, PCI, and the other peripherals, and especially will not contain the memory controller (which is on-chip).

    [/quote]



    There really isn't much difference between memory and memory-mapped I/O (which is pretty much the whole point of memory-mapped I/O).
  • Reply 137 of 147
    outsider Posts: 6,008 member
    Here is something I found interesting. It's on page 3 of the MPC8540 slideshow presentation PDF on Motorola's web site (can be found here: http://e-www.motorola.com/webapp/sps/site/prod_summary.jsp?code=MPC8540&nodeId=01M98655). It shows the bandwidth of an 8-bit RapidIO connection at 16 Gbit/s, or 2 GB/s. A 64-bit 133 MHz bus (like MPX) is limited to 8 Gbit/s, or 1 GB/s. For RapidIO to get these throughputs it needs to be running at least at 2 GHz! Is this possible over a PCB? A 16-bit RIO connection should then be able to achieve 4 GB/s at 2 GHz.
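
    A quick back-of-the-envelope check of those figures, using just the widths and clocks quoted above:

    [code]
    #include <stdio.h>

    int main(void)
    {
        /* 8-bit parallel RapidIO at a 2 GHz effective data rate per pin */
        double rio8_gbs  = 8.0 / 8.0 * 2e9 / 1e9;        /* 2 GB/s */
        /* MPX-style bus: 64 bits wide at 133 MHz */
        double mpx_gbs   = 64.0 / 8.0 * 133e6 / 1e9;     /* ~1.06 GB/s */
        /* 16-bit RapidIO at the same per-pin rate */
        double rio16_gbs = 16.0 / 8.0 * 2e9 / 1e9;       /* 4 GB/s */

        printf("8-bit RIO : %.2f GB/s\n", rio8_gbs);
        printf("64-bit MPX: %.2f GB/s\n", mpx_gbs);
        printf("16-bit RIO: %.2f GB/s\n", rio16_gbs);
        return 0;
    }
    [/code]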
  • Reply 138 of 147
    kecksy Posts: 1,002 member
    Did you know the MaxBus spec allows for a 128-bit variant? This would be a far better choice than a 64-bit DDR bus. All current G4s should support it, since it's still standard MaxBus protocol. I'll have to double-check Motorola's documentation.
  • Reply 139 of 147
    programmer Posts: 3,458 member
    No current G4s support the 128-bit MPX variant. They don't have the pins for it. This isn't a better solution than DDR because it is a lot more expensive to run all those extra traces across the motherboard and to build all the extra pins into the chipset and the processor.
  • Reply 140 of 147
    Due to the different architectures of the Intel Pentium and the Motorola PowerPC, I can't even tell if Apple is behind at all. Who wouldn't like a faster G4? At a snapshot glance, the P4 has numbers in every department that supersede every category of the G4 except AltiVec; they range from MHz to bus speed, with much greater bandwidth too. But is Apple really behind?

    I recently looked at some performance figures published by Barefeats on the new PowerBook. Remember, the new Ti only comes with an 800 MHz chip, yet across a variety of tests its percentage increase in performance comes close to the desktop Macs. The L3 cache also translates to much better performance. Although no comparison numbers against an Intel-based notebook were made available, it seems that MHz is becoming less important nowadays. And if you look at the stats provided by Apple, especially for apps that take advantage of the G4, such as Final Cut Pro, G4-based Macs not only take advantage of the Velocity Engine but clearly demonstrate their performance edge when running on MP Power Macs. Numbers are just confusing; they never give you the whole story. But apps developed by Apple have clearly shown that, when optimized for the G4, performance stands out.

    Now, back to the numbers. Intel has made quantum leaps in using much faster buses to reduce latencies. But when I think about it, they need to rush the data in to minimize the bubbles in the pipeline, and the P4 and G4 handle their work completely differently. With the lower frequencies of G4 processors, instructions are also scheduled differently, so even with a lower-bandwidth system bus, say between 1 and 2 GB/s, Apple can still achieve great performance, and a larger L3 cache is just another way to enhance it.

    I know a lot of people continue to call the MHz myth bull ****, but it does make some sense if you think about it logically: instructions have to go through 20 pipeline stages on the P4, while the G4 only takes 7. I haven't heard of other CPU vendors using that many pipeline stages; if you have seen one, I'd like to know. Even if we had a 2 GHz G4 with a 20-stage pipe, Mac users would complain just as much, because the trade-off in pipeline depth is so significant that the percentage increase in MHz would be nothing more than marketing. When you look at the mobile P4 running at 1.4 to 1.5 GHz, there is basically little or no increase in actual performance compared to the P3. What a rip-off! It's like 400 extra horsepower with zero increase in performance. Do you call that technology?

    Perhaps we (Mac users) should not be so depressed when there is no G5 or 2 GHz G4. They are just numbers. Performance is more important.