TOE NICs in All Future PowerMacs?

Posted in Future Apple Hardware, edited January 2014
Try this on.



Given that Apple is doing way-snazzy things in the realm of video, audio and networking (I'm personally still happily freaked over Logic Nodes) - with the X-SAN stuff and, no doubt, future "plug-and-play" (definitely for lack of a better term!) distributed-processing integration with most future versions of their multimedia apps ...



... there seems to be one big thing missing from the mix, hardware-wise: TCP/IP offload (TOE) NICs on the mobo of the PowerMacs themselves - rather than forcing folks who want to go with a SAN to buy and install them separately after the fact. I realize TOE NICs are generally thought of only in the realm of SAN access, but I suspect their benefit may be felt in distributed processing too.



I especially think Apple could be the first company to successfully bring this high-end workstation technology down to the ease of use of virtually any moderately skilled user. Which, my guess says, would be a big bonus for Macs in the high-end and mid-range multimedia world - particularly if Apple can offer a simple "plug-and-play" SAN option (software AND hardware), especially to the medium-range video market. (Pick up a few X-RAIDs, an X-Serve, X-SAN and a few new PowerMacs with FCP - boom, you're in business, without ever opening a single case, just like folks who had to pay 10x as much before, but without anywhere near the hassle - and it's really easy to administer.)



I also suspect (with my limited knowledge on this) that sticking TOE NICs on the mobo of future PowerMacs may beef up much of the distributed processing Apple almost certainly has coming down the pipe in their apps - much as they currently do with Logic Nodes. The advantage here is that you never even need to think about a SAN; as long as you have other Macs on the network operating as CPU nodes, you can borrow their unused cycles to greatly enhance your productivity over a standard NIC, SAN X-RAID/SERVE hardware or not.



Also, Apple, to the best of my knowledge, was either the first, or one of the first, to put Gigabit Ethernet on the mobo of their machines - so they have a bit of a tradition of being ahead of the curve in that regard.



Anyway, it's a thought with a huge upside - I guess the only considerations are cost and heat. Would it cost an arm and a leg to put TOE NICs on the mobo of all future PowerMacs? If not, and if heat isn't much of an issue (these things are basically small parallel processors), then I would not be at all surprised if Uncle Steve boasts of Apple being the only computer company with a complete distributed-processing SAN solution, where the end user never needs to crack a case, which doesn't require a trained technician on site to maintain, and which can leverage its TOE NICs for all kinds of distributed processing beyond SAN use alone.



If nothing else, it'll help intelligent, self-aware creative types (as Mac users generally are) get more content on the air, so that things such as this can be avoided in the future.



Thoughts?

Comments

  • Reply 1 of 33
    rhumgod Posts: 1,289 member
    Quote:

    Originally posted by OverToasty

    Thoughts?



    A TOE NIC's only benefit is for high-bandwidth, sustained connections. A SAN storage server, perhaps - but a PowerMac? I don't see the benefit for the end user.



    For those not too informed on the matter, follow this link.
  • Reply 2 of 33
    Quote:

    Originally posted by Rhumgod

    A TOE NIC's only benefit is for high-bandwidth, sustained connections. A SAN storage server, perhaps - but a PowerMac? I don't see the benefit for the end user.



    For those not too informed on the matter, follow this link.




    I guess the issue I'm getting at above is that even if you have Gigabit Ethernet on your Mac, you can't edit video over it - since your Mac is spending too much time dealing with the IP stack to properly handle cycle-hungry video at the same time; to get around this, folks typically have to buy a separate PCI TOE card to edit video over a SAN. Pain in the butt.
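To put a rough number on that overhead - a back-of-envelope sketch where the per-packet cycle cost and CPU clock are my own illustrative assumptions, not measurements from any real Mac:

```python
# Back-of-envelope load estimate for per-packet TCP/IP processing on a
# saturated GigE link. All constants are illustrative assumptions.

LINK_BPS = 1_000_000_000        # 1 Gb/s Ethernet
FRAME_BYTES = 1500              # MTU-sized frames
CPU_HZ = 2_000_000_000          # a ~2 GHz workstation CPU
CYCLES_PER_PACKET = 5_000       # assumed interrupt + stack + copy cost

packets_per_sec = LINK_BPS / (FRAME_BYTES * 8)   # ~83,000 packets/s
cpu_fraction = packets_per_sec * CYCLES_PER_PACKET / CPU_HZ

print(f"{packets_per_sec:,.0f} packets/s")
print(f"{cpu_fraction:.0%} of the CPU spent just moving packets")
```

Smaller frames make it far worse: at 64-byte frames the same arithmetic demands several times the CPU's total cycle budget, which is why a saturated link can starve real work entirely.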



    This kinda defeats a lot of the ease of use, especially when Apple went to so much trouble to create the user-friendly X-SAN - it also means users have to go to the trouble of paying the money after the fact to reap the benefit, which they probably won't, since all this cycle sharing won't be something they'll get into unless it's made very easy and the benefits are clear to developers and users alike.



    So, stick TOE cards in all new PowerMacs (hoping it can be done cheaply enough to make it worthwhile), and reap the benefits - people will just think "Oh, Macs can do that?" - with SAN and other high-bandwidth side benefits too - much as Logic Node currently does, but pushing far more over the pipe.



    All at the cost of a mobo TOE NIC.



    It's a winner if they can do it cheaply enough.
  • Reply 3 of 33
    mmmpie Posts: 628 member
    Quote:

    Originally posted by OverToasty

    folks typically have to buy a separate PCI TOE card to edit video over a SAN.



    Aren't SANs typically defined by running over a medium other than IP/Ethernet? Isn't that the world of Fibre Channel and InfiniBand? Is the Xserve RAID doing IP over FC, or a custom SAN protocol?



    I thought that using IP/Ethernet you were running a NAS. I certainly wouldn't consider a NAS an appropriate solution for high-bandwidth tasks like video editing. I can see how putting a NAS on its own subnet/VLAN/cable with a TOE NIC would improve the situation. Is that cheaper than just using a SAN?



    It seems to me that, as the computing environment improves, SANs will likely be replaced by NASes, and in the future you will see TOE NICs showing up in all machines. I also think that that is some time away. But the ubiquity of IP/Ethernet always seems to overcome its limitations.
  • Reply 4 of 33
    POWER5 has FastPath technology, one part of which is handling the TCP stack without requiring processor interrupts. That gets you most of the benefit of a TOE without requiring an extra part, and (best of all) it runs at the processor's clock rate and uses the processor's fast memory subsystem and cache resources.
  • Reply 5 of 33
    rhumgod Posts: 1,289 member
    I also remember reading somewhere that AltiVec can handle this as well. Not sure how effectively, however. I think it was in their Developer section.
  • Reply 6 of 33
    Quote:

    Originally posted by mmmpie

    Aren't SANs typically defined by running over a medium other than IP/Ethernet? Isn't that the world of Fibre Channel and InfiniBand? Is the Xserve RAID doing IP over FC, or a custom SAN protocol?



    I thought that using IP/Ethernet you were running a NAS. I certainly wouldn't consider a NAS an appropriate solution for high-bandwidth tasks like video editing. I can see how putting a NAS on its own subnet/VLAN/cable with a TOE NIC would improve the situation. Is that cheaper than just using a SAN?



    It seems to me that, as the computing environment improves, SANs will likely be replaced by NASes, and in the future you will see TOE NICs showing up in all machines. I also think that that is some time away. But the ubiquity of IP/Ethernet always seems to overcome its limitations.




    Yeup, you're correct - Fibre Channel is the way it's done now, and perhaps eventually it'll work its way over to NAS ... but I wasn't so concerned with the server end of things - nice if it comes along for the inexpensive ride too (thanks for the clarification, though) ... I was more concerned with the workstation, and getting the most out of it vis-a-vis piping huge amounts of data over IP while doing lots of heavy internal processing at the same time. I can see a need for this as two worlds appear to be converging - network storage and cycle sharing. Storage, to greatly help workgroup workflows, such as in the video realm, where keeping all the video data on a single local machine is just silly ... and cycle sharing across machines, which I suspect is going to grow greatly in importance, especially for Apple, since they can leverage it so easily - which, BTW, is particularly helpful in the media realm (think Apple, again).
  • Reply 7 of 33
    Quote:

    Originally posted by Programmer

    POWER5 has FastPath technology, one part of which is handling the TCP stack without requiring processor interrupts. That gets you most of the benefit of a TOE without requiring an extra part, and (best of all) it runs at the processor's clock rate and uses the processor's fast memory subsystem and cache resources.



    WOW!



    Well, this is good news; I certainly hope Apple takes advantage of this. I was at a conference a few years back on Pooch technology (simple multi-computer cycle sharing/clustering), and one of the big advantages Macs had was that they were a known quantity - you could simply install Pooch on a group of standard Macs and, boom, instant cluster; as opposed to x86 boxes, where who knows what they're made of and what's on them. In short, for Microsoft to try to play on this field would be a hell of a lot more work - just ask the Linux Beowulf dudes.



    So - if TOE technology (or its FastPath counterpart) is as useful to heavy-media processor clustering and storage as I think it might be ... then Apple should play this card.



    Hmmmmmmm
  • Reply 8 of 33
    Quote:

    Originally posted by Rhumgod

    I also remember reading somewhere that AltiVec can handle this as well. Not sure how effectively, however. I think it was in their Developer section.



    Not exactly -- AltiVec can be used to make the processor's main thread do this work faster, but the processor's main thread still does the work. The FastPath is an additional section of the processor that you can think of as a dedicated thread which operates completely in parallel with the main thread (or threads, in the case of POWER5). This is a huge advantage because the cost of interrupting a modern pipelined processor to handle an incoming message is very significant, and that overhead alone can bring a processor to its knees in a busy network environment.
  • Reply 9 of 33
    rhumgod Posts: 1,289 member
    Quote:

    Originally posted by Programmer

    Not exactly -- AltiVec can be used to make the processor's main thread do this work faster, but the processor's main thread still does the work. The FastPath is an additional section of the processor that you can think of as a dedicated thread which operates completely in parallel with the main thread (or threads, in the case of POWER5). This is a huge advantage because the cost of interrupting a modern pipelined processor to handle an incoming message is very significant, and that overhead alone can bring a processor to its knees in a busy network environment.



    Ah, found the article I was referring to - it was IBM's library, not Apple's.



    Link here.
  • Reply 10 of 33
    Quote:

    Originally posted by Rhumgod

    Ah, found the article I was referring to - it was IBM's library, not Apple's.



    Link here.




    Yes, that allows the conventional method of having the processor handle network messages to run somewhat faster. Unfortunately, the processor still has to handle the network messages. That means it has to interrupt whatever it was doing, process the message, and then go back to what it was doing before -- and all of that takes time and tends to pollute caches and such. The TOEs that this thread is about (and the POWER5's FastPath) handle the network messages without requiring processor attention at all. Much better.
  • Reply 11 of 33
    Quote:

    Originally posted by Programmer

    Yes, that allows the conventional method of having the processor handle network messages to run somewhat faster. Unfortunately, the processor still has to handle the network messages. That means it has to interrupt whatever it was doing, process the message, and then go back to what it was doing before -- and all of that takes time and tends to pollute caches and such. The TOEs that this thread is about (and the POWER5's FastPath) handle the network messages without requiring processor attention at all. Much better.



    Which leads me to suspect that, for a central point feeding a cluster, there's a balance point between the offloading of cycles and the interrupts hitting the processor - after a certain point, no matter how many other processors you've got kicking around, the interrupts coming back at you (with the cache clearing and pipeline bubbles they cause) would completely outweigh the benefit of the offloading.





    Naturally, there's more to this phenomenon (and, in fact, there's almost certainly a name for it); however, with some form of TOE-ish NICs, Macs could push that balance point down to the next bottlenecking factor - which could perhaps be better by a factor of 2, perhaps an order of magnitude (beats me). Whatever it may be, the nice thing is: if you've got the processors kicking around to take advantage of the new pipe, you'll be able to do things easily with clustered Macs that most non-Mac shops could only dream of.
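A toy model of that balance point - every number here is invented for illustration, nothing comes from a real Xserve or TOE datasheet:

```python
# Toy model of the balance point described above: a head node farms work
# out to N helpers, but each helper costs it a fixed fraction of its time
# in interrupt/coordination overhead. All overhead fractions are assumed.

def speedup(n_nodes, overhead):
    """Helpers-plus-head speedup, scaled by the head CPU left for real work."""
    useful = 1.0 - n_nodes * overhead   # head time not burned on interrupts
    if useful <= 0:
        return 0.0
    return (1 + n_nodes) * useful

# Plain NIC (assume 2% of head CPU per helper) vs TOE-ish NIC (assume 0.5%):
plain = max(range(1, 200), key=lambda n: speedup(n, 0.02))
toe = max(range(1, 200), key=lambda n: speedup(n, 0.005))
print("best cluster size, plain NIC:", plain)
print("best cluster size, TOE NIC:", toe)
```

Under these made-up numbers, quartering the per-node overhead pushes the optimum from a couple dozen nodes out to roughly a hundred - the shape of the curve, not the exact values, is the point.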



    If so, this would give one hell of a competitive advantage to a Mac shop.

    8)
  • Reply 12 of 33
    Quote:

    Originally posted by Rhumgod

    A TOE NIC's only benefit is for high-bandwidth, sustained connections. A SAN storage server, perhaps - but a PowerMac? I don't see the benefit for the end user.



    For those not too informed on the matter, follow this link.




    I'm a bit confused at the whole wonderment behind "TOE." Maybe I'm incorrect about the nature of the concept, but it doesn't seem like a big deal to me. A fairly cheap microcontroller can easily handle a TCP/IP stack, even at gigabit speeds.
  • Reply 13 of 33
    Quote:

    Originally posted by Splinemodel

    I'm a bit confused at the whole wonderment behind "TOE." Maybe I'm incorrect about the nature of the concept, but it doesn't seem like a big deal to me. A fairly cheap microcontroller can easily handle a TCP/IP stack, even at gigabit speeds.



    Not if that gigabit network is at maximum capacity with all the packets coming at the machine in question. At that kind of load level, even high end servers start to buckle under the strain. Conventional systems get to the point that they are interrupted so often they can't manage to do any real work.
  • Reply 14 of 33
    Quote:

    Originally posted by Programmer

    Not if that gigabit network is at maximum capacity with all the packets coming at the machine in question. At that kind of load level, even high end servers start to buckle under the strain. Conventional systems get to the point that they are interrupted so often they can't manage to do any real work.



    They buckle because they aren't designed specifically to manage a TCP/IP stack. The actual rate of stack update is irrelevant as long as the maximally long path in the code is less than the required time, which incidentally is a requirement for it being GigE. It just means that the chip won't get much idle time. More specifically, you select a high clock rate MCU with a simple set of integer and conditional instructions. Presumably you only need to make sure that it can do a data I/O every 30ns, perhaps even slower, and then mate it to a really high speed shift register. What we have now are people who are doing just this, and then padding up the cost a ton because they know how overpriced anything in the server market is.
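The 30 ns figure is easy to sanity-check with pure arithmetic - no claims about any particular MCU, just line rate divided by bus width:

```python
# How often must a datapath of a given width accept data to keep up with
# a saturated 1 Gb/s link? Pure arithmetic, no hardware claims.

LINK_BPS = 1_000_000_000

def ns_per_transfer(bus_bits):
    """Nanoseconds available per bus transfer at full GigE line rate."""
    transfers_per_sec = LINK_BPS / bus_bits
    return 1e9 / transfers_per_sec

print(f"32-bit bus: {ns_per_transfer(32):.0f} ns per transfer")    # 32 ns
print(f"128-bit bus: {ns_per_transfer(128):.0f} ns per transfer")  # 128 ns
```

So a 32-bit path has about 32 ns per word, consistent with the ~30 ns budget above, and widening the bus to 128 bits relaxes it fourfold.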



    Anyway, the bottom line is that you could put it all on a relatively tiny piece of silicon, at least compared to what you see in servers. Eventually, I bet all Ethernet controllers will be single-chip TOE, and it's not unlikely that we'll see them in PowerMacs first, since PCs still don't tend to have auto-sensing ports, DVI out (in laptops), or anything that would make you think you're living in 2005.
  • Reply 15 of 33
    programmer Posts: 3,458 member
    Quote:

    Originally posted by Splinemodel

    It just means that the chip won't get much idle time.



    Well this is pretty much exactly the point -- what is the use in handling all the packets if you can't actually do any work on them because you're so busy handling them?



    Sounds like we agree, however... implementing this isn't terribly complex and it could easily be integrated into either the chipset or the processor with a cost delta of approximately zero.
  • Reply 16 of 33
    As usual, Programmer, you are very astute. I guess my point is that I think all ethernet should be TOE, and that I'm a big fan of decentralization of processing. Additionally, I don't think it's out of the question for some party to pursue a single chip TOE ethernet solution, and I bet something of that nature is in development now.
  • Reply 17 of 33
    mmmpie Posts: 628 member
    Quote:

    Originally posted by Splinemodel

    They buckle because they aren't designed specifically to manage a TCP/IP stack. The actual rate of stack update is irrelevant as long as the maximally long path in the code is less than the required time, which incidentally is a requirement for it being GigE. It just means that the chip won't get much idle time. More specifically, you select a high clock rate MCU with a simple set of integer and conditional instructions. Presumably you only need to make sure that it can do a data I/O every 30ns, perhaps even slower, and then mate it to a really high speed shift register. What we have now are people who are doing just this, and then padding up the cost a ton because they know how overpriced anything in the server market is.





    I went and read some interesting articles about the performance of TCP/IP stacks and TOE.



    * Saturating a GigE link requires a 2.4 GHz Pentium 4.

    * Memory bandwidth is quite high when receiving packets, due to 3 memory copies.
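The memory-copy point can be put in numbers - a rough sketch, assuming each copy is one read plus one write over the memory bus (the copy count is the thread's figure, not verified against any particular stack):

```python
# Rough memory-bus traffic for receiving at line rate: the NIC DMAs each
# packet into RAM once, then every software copy re-reads and re-writes it.

def memory_traffic_bps(link_bps, copies):
    dma_write = link_bps                  # initial DMA from the NIC
    copy_traffic = 2 * copies * link_bps  # each copy = one read + one write
    return dma_write + copy_traffic

GIGE = 1_000_000_000
traffic = memory_traffic_bps(GIGE, 3)
print(f"{traffic / 1e9:.0f} Gb/s of memory traffic to receive 1 Gb/s")  # 7 Gb/s
```

That 7x multiplier is why a well-designed TOE that delivers data directly to its final buffer can cut memory bandwidth so dramatically.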



    Using a TOE relieves the load on the CPU, and if it is well designed it can dramatically reduce the memory bandwidth. The silicon cost depends on how much of the stack is implemented; maintaining the state of TCP connections is expensive.



    * 16 MB of RAM to maintain the state for all TCP ports.

    * RAM is required to buffer the incoming data; how much is pretty much an open question. iSCSI has a very precise definition of its memory requirements, and it seems that TOE chips are going to be targeted at those needs.

    * 10 GigE is going to require 10 times as much memory for its buffers.



    It is pretty easy to envisage a TOE card having 256 or 512 MB of RAM.
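Reproducing the memory arithmetic above (the 256-bytes-per-port state size is the figure quoted in this thread, not something verified against a real TOE; the buffer estimate uses the standard bandwidth-delay-product rule of thumb):

```python
# State memory: the thread's claimed per-port TCP state times every port.
STATE_BYTES_PER_PORT = 256   # figure quoted in this thread, unverified
PORTS = 65_536               # all possible TCP ports

state_mb = STATE_BYTES_PER_PORT * PORTS / 2**20
print(f"Full-port TCP state: {state_mb:.0f} MB")   # 16 MB

# Buffer memory: to keep a link full you must buffer roughly one
# round-trip's worth of in-flight data (the bandwidth-delay product).
def bdp_bytes(link_bps, rtt_s):
    return link_bps * rtt_s / 8

for gbps in (1, 10):
    kib = bdp_bytes(gbps * 1e9, 0.001) / 1024
    print(f"{gbps} GigE at 1 ms RTT: {kib:.0f} KiB in flight")
```

The in-flight figure scales linearly with both link speed and round-trip time, which is where the "10 GigE needs 10x the buffers" line above comes from.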



    I don't think any of this is too expensive, but it certainly isn't going to make it into a consumer machine any time soon.



    For the original context of the discussion (streaming media for editing over TCP): FC was intended as a replacement for SCSI. It is the appropriate medium for this sort of work. It can do 2 Gb/s without any of the issues that TCP has (for that matter, I'm sure you could do SCSI over Ethernet - but I haven't seen any mention of it; people want routable protocols). The issue is really one of asset management. It doesn't make a whole lot of sense to have hundreds of workstations with their own terabyte FC RAIDs. Not only is it a maintenance nightmare, but it's also an asset-management nightmare. The solution is to use the workstations as a front end to a compute server, which is what I think most graphics-effects companies do. The desire then is to create the compute server out of off-the-shelf products (Xserves), which brings you full circle.



    Conclusion: TOE is inevitable. Although technologies like FC are cheaper to implement, their poor market penetration burdens them with high costs. Eventually TOE is going to be integrated into all Ethernet chips, simply because it is possible (initially as a value-add). I don't see TOE being a huge benefit for the general market - its memory requirements are large - but for people running iSCSI, or Fibre Channel over Ethernet (there are about four protocols for that), it will be invaluable.



    Will Apple get an early start on this? They don't create their own Ethernet chips; they will use TOE when their suppliers integrate it into their products. Apple has a good history of early adoption of Ethernet technology, and I do expect to see it shipping TOEs as standard before the general x86 workstation market does. The large amount of memory required makes it an obvious choice for a shared-memory architecture. When you look at it that way, it may well be cheaper (initially) to just use dual-CPU machines: one CPU does the TCP stack while the other does the compute. With dual core this becomes even more likely.



    TOE seems like a technology that is between a rock and a hard place. High-end machines can just throw CPUs at the problem, low-end machines can't afford it to begin with, and by the time it is cheap enough, CPUs will be dual core. 10 GigE will change that; I guess I see the TOE vendors waiting for 10 GigE to really hit the market.
  • Reply 18 of 33
    Quote:

    Originally posted by mmmpie

    I went and read some interesting articles about the performance of TCP/IP stacks and TOE.



    * Saturating a GigE link requires a 2.4 GHz Pentium 4.

    * Memory bandwidth is quite high when receiving packets, due to 3 memory copies.



    yadda yadda yadda







    You're thinking about this the wrong way. A TCP/IP stack can be written comfortably in 64K of program space. That is, no OS, no overcomplex BS. Using a Pentium 4, or for that matter a G4, to do TOE is akin to killing a frog with a tank rather than a BB gun. As I said, all you need to do is make sure that a complete I/O loop and the maximally long stack operation can be performed in 30 ns, perhaps even 120 ns if you want to use a 128-bit bus rather than a 32-bit bus. Now, doing it the easy way - that is, using higher-level coding paradigms and then giving it enough clock and memory until it works - may require a P4 with 16 MB of RAM. Doing it the elegant way will not.
  • Reply 19 of 33
    Hence the advantage of IBM's choice to put that functionality into the POWER5 -- leverage the processor's clock rate and bus(es). No extra chips in the system either. Just a bit of hardware added to the CPU that replaces some software that used to run on the programmable part of the CPU.
  • Reply 20 of 33
    mmmpie Posts: 628 member
    Quote:

    Originally posted by Splinemodel

    You're thinking about this the wrong way. A TCP/IP stack can be written comfortably in 64K of program space. That is, no OS, no overcomplex BS. Using a Pentium 4, or for that matter a G4, to do TOE is akin to killing a frog with a tank rather than a BB gun. As I said, all you need to do is make sure that a complete I/O loop and the maximally long stack operation can be performed in 30 ns, perhaps even 120 ns if you want to use a 128-bit bus rather than a 32-bit bus. Now, doing it the easy way - that is, using higher-level coding paradigms and then giving it enough clock and memory until it works - may require a P4 with 16 MB of RAM. Doing it the elegant way will not.



    I'm not discussing the complexity of the stack.

    I'm not a TCP expert; the article I read claims that it takes ~256 bytes to maintain the state information for one open port (sliding window, sequence number, missing packets, etc.).

    There are 65,536 ports, giving you a total of 16 MB of space required to maintain the state information for a system with all available ports open.

    Now, it is fair to say that most systems won't encounter a situation where all their ports are open, but it certainly could happen.



    Note that the high performance cost is for receiving data.



    That 16 MB is just state, and doesn't cover the amount of memory required to buffer the incoming data while it is assembled.