[quote]Technical question:

Is it possible to have more AltiVec co-processors, or to widen the registers on a single unit from 128 bits to, say, 256 (2x) or 512 (4x)?

Assuming the main processor and memory could feed and retrieve from them efficiently, would this not be a low-cost way to gain linear performance growth in all AltiVec-coded APIs?

A 512-bit-wide AltiVec unit that could run Photoshop filters (plus QuickTime and MPEG encoding) at 4x current speeds would be pretty nice.[/quote]
If the register size is changed then the programming model is broken (or an additional one is introduced and you start having backward-compatibility problems). Personally, having seen the bloody mess in the x86 world, I think that is a bad idea, at least until a better way to write vector code is developed. The existing AltiVec unit is already outrunning the available bandwidth by quite a bit at 1 GHz, so this shouldn't be an issue unless there is a huge bandwidth increase with no corresponding increase in processor clock rate. If that happens, then adding another couple of vector execution units should solve the "problem" (oh, if only that were the problem! what a fine problem to have indeed!). A much, much better idea would be to leave the AltiVec unit as-is and add another scalar FPU, so that existing double-precision code would run twice as fast (or nearly so) at the same clock rate. This would require no code changes and would immediately benefit math-heavy non-AltiVec applications. Combined with a bandwidth increase, the PowerPC would once again be a contender for the top performance honours, despite Intel's silly clock rates.
As has been discussed many times before: quad processors don't make any sense until the bandwidth problem is fixed. The best known fix for this is a NUMA-style architecture where each processor primarily uses its own local memory pool.
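To put the programming-model point in code: here is a minimal sketch (assuming GCC-style AltiVec C support via altivec.h; the function and names are mine, purely for illustration) of how the 128-bit width gets baked into source today:

[code]
/* Scale an array with classic AltiVec C intrinsics.  The 128-bit
 * width -- four floats per vector register -- is hard-wired into both
 * the types and the loop stride.  Assumes src/dst are 16-byte aligned
 * and n is a multiple of 4. */
#include <altivec.h>

void scale4(float *dst, const float *src, float s, int n)
{
    const vector float vs   = (vector float){s, s, s, s}; /* splat scalar */
    const vector float zero = (vector float){0, 0, 0, 0};
    for (int i = 0; i < n; i += 4) {            /* stride: 4 floats/vector */
        vector float v = vec_ld(0, &src[i]);    /* one 16-byte load        */
        v = vec_madd(v, vs, zero);              /* v * s + 0.0             */
        vec_st(v, 0, &dst[i]);                  /* one 16-byte store       */
    }
}
[/code]

If the registers silently grew to 256 bits, this code would still only use the low 128 bits of each; getting any benefit would take new types, new intrinsics, and at least a recompile -- exactly the compatibility mess described above.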
[quote]As has been discussed many times before: quad processors don't make any sense until the bandwidth problem is fixed. The best known fix for this is a NUMA-style architecture where each processor primarily uses its own local memory pool.[/quote]
I know I am getting in over my head, but is there no distributed memory architecture available? That is what I am envisioning: quad processors hooked to a distributed memory architecture. I am not sure what the limitations are, though there obviously are some, or this would have been implemented earlier. The reason I ask is that I am more familiar with the way network switches share a distributed memory architecture: each blade contains a SPARClite processor which is plugged into the backplane, thus sharing available memory with all the boards in the chassis. Why isn't there something similar in the desktop world?
[quote]Originally posted by GardenOfEarthlyDelights:

There's more evidence that Britney Spears has singing talent than that the G5 is imminent. But I guess that's what rumor boards are for.[/quote]
It all depends on how you interpret what "G5" means. It literally means "a 5th-generation PowerPC processor". Since the G4 is a 4th-generation PowerPC processor, the next processor Apple uses that has a new internal architecture is a G5. That's a pretty fuzzy line, especially considering the fairly minor change from G2 to G3. Slapping an on-chip memory controller and RapidIO bus on the G4 core would certainly qualify by that metric. Alternatively, a brand new core (64-bit, POWER4- or e500-derived, whatever) would also qualify.
I have no idea how imminent any such processor is, but prevailing opinion seems to be '03 ... probably mid to late.
[quote]I know I am getting in over my head, but is there no distributed memory architecture available? That is what I am envisioning: quad processors hooked to a distributed memory architecture. I am not sure what the limitations are, though there obviously are some, or this would have been implemented earlier. The reason I ask is that I am more familiar with the way network switches share a distributed memory architecture: each blade contains a SPARClite processor which is plugged into the backplane, thus sharing available memory with all the boards in the chassis. Why isn't there something similar in the desktop world?[/quote]
Because it is expensive. I don't know the specs on Sun's bus offhand, but it is probably fast and wide, which makes the boards and chips quite costly to produce. The bus interface has to be built into each processor in a shared-bus system. Also, the farther you spread a bus, the slower it is going to get. Do you have a link to the specs on Sun's bus? I'm curious what they have.

A NUMA system will generally be faster as long as a processor can work out of its local pool most of the time.
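A quick back-of-envelope model of why locality decides this (the latency figures below are assumptions of mine purely for illustration, not measurements of any real machine):

[code]
/* Effective memory latency on a NUMA node as a function of how often
 * the processor hits its own local pool.  100 ns local / 300 ns remote
 * are made-up round numbers, just to show the shape of the curve. */
#include <stdio.h>

int main(void)
{
    const double local_ns  = 100.0;   /* assumed local-pool latency  */
    const double remote_ns = 300.0;   /* assumed remote-pool latency */

    for (int pct = 50; pct <= 100; pct += 10) {
        double hit = pct / 100.0;
        double eff = hit * local_ns + (1.0 - hit) * remote_ns;
        printf("local hit rate %3d%% -> effective latency %5.0f ns\n",
               pct, eff);
    }
    return 0;
}
[/code]

At a 90% local hit rate the effective latency is only 20% worse than pure local access; at 50% it has doubled. "Most of the time" is doing a lot of work in that sentence.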
[quote]If the register size is changed then the programming model is broken (or an additional one is introduced and you start having backward-compatibility problems). ... A much, much better idea would be to leave the AltiVec unit as-is and add another scalar FPU, so that existing double-precision code would run twice as fast (or nearly so) at the same clock rate.[/quote]
Actually, widening AltiVec to 256 bits might be useful. I realise it creates some backward-compatibility problems, although not enormous ones, as many of the algorithms could soon use greater resolution (i.e. going from 16-bit to 32-bit integers for colour, etc.), and I personally could use double-precision units in AltiVec very effectively in Fourier transforms and the like. It would, of course, need at least a recompile to take advantage.

Whether the added complexity is worthwhile or not, I am not in a position to say, and I suspect that, for most uses, extra execution units might be a better solution at the moment. Note, though, that extra execution units increase the complexity of the scheduler, register access, etc., and may slow down the processor, whereas wider units just need lots of extra bandwidth.
Notwithstanding, greater bandwidth is a must, and I'm right behind you on needing a new approach to vector programming.
Michael
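To make the double-precision/FFT point concrete (my own sketch, not the poster's code): the inner step of an FFT butterfly is a complex multiply, and in double precision none of it can go through AltiVec, which only handles 32-bit floats. It all lands on the single scalar FPU:

[code]
/* A double-precision complex multiply, the core of an FFT butterfly.
 * Classic AltiVec is single-precision only, so every one of these
 * multiplies and adds runs on the lone scalar FPU -- which is why a
 * second FPU, or a hypothetical 256-bit unit holding two doubles per
 * register, would speed this kind of code up. */
typedef struct { double re, im; } cplx;

static inline cplx cmul(cplx a, cplx b)
{
    cplx r;
    r.re = a.re * b.re - a.im * b.im;   /* 2 multiplies, 1 subtract */
    r.im = a.re * b.im + a.im * b.re;   /* 2 multiplies, 1 add      */
    return r;
}
[/code]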
[quote]The memory management of the POWER4 is far more sophisticated as well: ever heard of a 32 MB L3 cache?[/quote]
Actually, it has 128 MB of shared L3 cache per MCM.
[quote]Because it is expensive. I don't know the specs on Sun's bus offhand, but it is probably fast and wide, which makes the boards and chips quite costly to produce. The bus interface has to be built into each processor in a shared-bus system. Also, the farther you spread a bus, the slower it is going to get. Do you have a link to the specs on Sun's bus? I'm curious what they have.

A NUMA system will generally be faster as long as a processor can work out of its local pool most of the time.[/quote]
I have nothing from Sun, but here is a blurb about one of my company's routers and its distributed memory:

A Parallel Access Shared Memory™ switching architecture delivers wire-speed performance within very high-capacity routing switches. This architecture is not only fast, it provides the ideal foundation for QoS and wire-speed multicast delivery on frame-based networks.

Parallel Access Shared Memory is a shared memory architecture design: all ports share a central memory location. However, unlike traditional bus-based shared memory architectures, Parallel Access Shared Memory gives every port on every module a dedicated, simultaneous path into and out of a central memory fabric, eliminating the need for a bus arbitration device. Parallel Access Shared Memory, coupled with a completely non-blocking switching fabric, enables the routing switches to guarantee full wire-speed performance on all ports, even during periods of high congestion.

The switching fabric in this case is 52 Gbps, which would be the equivalent of the DDR memory in a different scenario.
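For scale, a quick units check on that comparison (the DDR figure is my assumption for a single PC2100 channel of the period, not from the router spec):

[code]
/* 52 Gbps of switch fabric expressed in memory-bandwidth terms. */
#include <stdio.h>

int main(void)
{
    double fabric_gbps = 52.0;               /* quoted fabric capacity   */
    double fabric_GBps = fabric_gbps / 8.0;  /* bits -> bytes: 6.5 GB/s  */
    double ddr_GBps    = 2.1;                /* assumed PC2100 channel   */

    printf("fabric: %.1f GB/s, roughly %.1fx one DDR channel\n",
           fabric_GBps, fabric_GBps / ddr_GBps);
    return 0;
}
[/code]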
[quote]Actually, widening AltiVec to 256 bits might be useful. I realise it creates some backward-compatibility problems, although not enormous ones...[/quote]
Any compatibility problems Apple introduces are simply going to reduce the effectiveness of developers writing code for Apple's machines. They have a hard enough time already getting people to optimize for the PPC/AltiVec; if they complicate matters at all, the problem will just get worse. This matters far more than being able to eke a bit more performance out of a new kind of dedicated functional unit. Increasing the register size would also increase the cost of stacking the registers, require wider internal pathways, and mean that you only get one register per cache line (and require 32-byte alignment, not just 16-byte). All manageable, of course, but I just don't see the payback being worth it.
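Putting numbers on that cache-line point (the 32-byte line is the G4's L1 line size; the example address is mine):

[code]
/* Registers per cache line and the alignment consequence of going
 * from 128-bit to 256-bit vector registers, assuming the G4's
 * 32-byte L1 cache line. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const int line = 32;                                    /* bytes */
    printf("128-bit registers per line: %d\n", line / 16);  /* 2     */
    printf("256-bit registers per line: %d\n", line / 32);  /* 1     */

    uintptr_t p = 0x1010;  /* example: 16-byte aligned, not 32-byte */
    printf("16-byte aligned: %s\n", (p & 15) ? "no" : "yes");
    printf("32-byte aligned: %s\n", (p & 31) ? "no" : "yes");
    return 0;
}
[/code]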
[quote]A Parallel Access Shared Memory™ switching architecture delivers wire-speed performance within very high-capacity routing switches. ... The switching fabric in this case is 52 Gbps, which would be the equivalent of the DDR memory in a different scenario.[/quote]
Sounds a lot like reading the RapidIO propaganda, what with all the switches and fabrics. This is probably the kind of thing we'll see in the future, but I don't really expect to see MPX run through this kind of fabric (although I wouldn't mind being wrong).
[quote]Sounds a lot like reading the RapidIO propaganda, what with all the switches and fabrics. This is probably the kind of thing we'll see in the future, but I don't really expect to see MPX run through this kind of fabric (although I wouldn't mind being wrong).[/quote]
Yeah, in the future I agree, but it is nice to know that the possibility exists for the MPX to evolve into this type of arrangement. Thanks.
[quote]I will eat a pro speaker and my pro mouse if a G5 comes out at Macworld. If it is available immediately, I will eat my FireWire cable and dustcloth for dessert. If the G6 is announced I will perform an intimate act with JYD. I don't think we will see a G5 until MWSF '03, and anyone who thinks otherwise is delusional.[/quote]
Mac journalists and users were shocked when Steve Jobs announced that the Power Macintosh models at Macworld will be powered by what he called the "G6". Apple insiders were nonplussed, considering the new processors are Motorola 7470s. One report claims the strange terminology jump had to do with a rumor-site posting. "Steve read the rumor-site report like he does every Monday, and all of a sudden he got this evil look and started saying 'YES! YES! YES!' It was really freaky," reported one insider. John Dvorak claimed the "word bump" is a desperate act by Apple to conceal its continuing loss of the speed race...
[quote]Because it is expensive. ... A NUMA system will generally be faster as long as a processor can work out of its local pool most of the time.[/quote]
Question: would it be possible to address some of these design expenses and challenges by using daughter cards? Each daughter card would have its own memory and cache and a shared pipe to the system chip for disk access, FireWire access, and other services. Adding capacity would mean adding more daughter cards, thereby avoiding having to design one board with multiple memory paths, and just using RapidIO between the system chip and the daughter cards for communication. The system chip could have some memory as well, to boost disk performance and other communication tasks.

Why doesn't Apple just stick a 2.4 GHz P4 in their boxes, unconnected to anything else on the motherboard, and do a commercial that says "we've got GHz galore"? That way they could take their time and develop the G5 without worry.

I'm only 3/4 joking. All right, 7/8.

Thoth
[quote]Why doesn't Apple just stick a 2.4 GHz P4 in their boxes, unconnected to anything else on the motherboard, and do a commercial that says "we've got GHz galore"? That way they could take their time and develop the G5 without worry.

I'm only 3/4 joking. All right, 7/8.

Thoth[/quote]
Better yet, they could make AirPort standard and tell people that there is "2.4 GHz in these towers!"
[quote]Better yet, they could make AirPort standard and tell people that there is "2.4 GHz in these towers!"[/quote]
Yeah, but it would still be slower than my 2.5 GHz phone.
[quote]Since my topic was closed, I will ask this here.

Quad processors?

This is complete and pure speculation on my part, but I would like to hear from the hardware gurus here about the feasibility of doing this in a desktop machine.[/quote]
See my topic "Blade Runner - Modular PowerMac" for my "speculation" on how Apple might implement its blade technology in a modular, workstation-class tower configuration. These blades are coming soon, and they will use a NUMA-like technique to manage memory bandwidth. It's a small step to move from the 3U server they will be introduced in to the killer workstation we have all been waiting for.
[quote]...it would still be slower than my 2.5 GHz phone[/quote]

BIG ol' difference between UHF frequencies & CPU clock cycles....
I've always thought Apple should just put in an 8 GHz oscillator or something, and either leave it unconnected or run the output through a frequency divider before going to the G4 itself.

If Motorola ever has another 18-month hiatus, then Apple can just bump up the oscillator frequency but increase the amount of frequency division.
[quote]BIG ol' difference between UHF frequencies & CPU clock cycles....[/quote]
I guess the sarcasm flew clear over your head.