Concept: A Cluster faster than Big Mac

Posted:
in Future Apple Hardware edited January 2014
This was spurred by recent commentary over at Mac OS Rumors about Apple putzing around with the latest FireWire 1600 and 3200 chipsets.



The ingredients of this speculation are:

The inevitable availability of FireWire 1600 and 3200

TCP/IP over FireWire, which is available in Panther now. For those who didn't notice: it's in Network preferences under Port Configurations; select New and look at the list.



One of the things that slowed the installation of the Big Mac cluster was that the team had to install 1100 2Mb InfiniBand fiber channel network cards, because 1000Base-T wasn't fast enough. This ate up several days of hardware installation, not to mention the cost of the additional hardware. One other issue to be aware of was that the cards were optimized in a way that nearly saturated the PCI-X slot they were plugged into, a potential bottleneck for the future.



Disclaimer: not being familiar with fiber channel, I have no idea what kind of throughput is possible with it. Are there faster flavors of it available?



Now let the "What if's" begin.



If FireWire 1600 is 50% faster than 1000Base-T, and FireWire 3200 is 60% faster than 2Mb fiber channel, what kind of performance would this kind of network fabric offer? And what sort of network topology would be necessary to prevent any one node from saturating the 400 MB/s limit of FireWire 3200, or the 200 MB/s limit of FireWire 1600?
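As a sanity check on those percentages, here is a back-of-the-envelope sketch. The rates are nominal signaling rates only (real sustained throughput would be lower for all three, and FireWire 1600/3200 never actually shipped in volume); on raw numbers, FireWire 1600 would actually be 60% faster than gigabit Ethernet, not 50%:

```python
# Hypothetical peak rates in MB/s, taken from the nominal bit rates.
GIGABIT_ETHERNET = 125   # 1000Base-T: 1 Gbit/s
FIREWIRE_1600 = 200      # 1.6 Gbit/s
FIREWIRE_3200 = 400      # 3.2 Gbit/s

def pct_faster(new, old):
    """Percent improvement of `new` over `old`."""
    return 100.0 * (new - old) / old

print(f"FW1600 vs 1000Base-T: {pct_faster(FIREWIRE_1600, GIGABIT_ETHERNET):.0f}% faster")
print(f"FW3200 vs FW1600:     {pct_faster(FIREWIRE_3200, FIREWIRE_1600):.0f}% faster")
```

None of this says anything about latency, which (as noted below) is where cluster interconnects actually earn their keep.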



One downside would be the lack of FireWire switches for this kind of network. Now there's an idea for a startup business.



Have fun.



Discuss.

Comments

  • Reply 1 of 29
    g-news Posts: 1,107
    there's more to fiberchannel than just pure bandwidth.

    Plus, they used infiniband interconnects, NOT Fiberchannel.

    Infiniband is again a LOT faster than fiberchannel (talking about several GB/sec being moved, not just gbit/sec).

    I don't know what latencies firewire 3200 would have, but I doubt it's anywhere near infiniband or similar clustering interfaces.
  • Reply 2 of 29
    Also, Apple needs to ship a model with ECC RAM; then they could be taken a bit more seriously (or so that is my understanding). Some have speculated that a G5 Xserve will have ECC RAM...
  • Reply 3 of 29
    flounder Posts: 2,674
    third place seems pretty serious to me.....



    I don't know anything about this topic. What is "serious" about this ram?
  • Reply 4 of 29
    Quote:

    Originally posted by Flounder

    third place seems pretty serious to me.....



    I don't know anything about this topic. What is "serious" about this ram?




    Some quotes from MacNN on the subject of ECC:



    Quote:

    The G5 has non-ECC memory, so calculation failures can occur. That's no problem for HPL, because HPL verifies correctness at the end of the calculation, but it could be a fatal point in a "real" calculation. During the HPL tunings there were some defective nodes, and they were pretty difficult to find (sometimes an incorrect calculation, sometimes a deadlocked thread...). If the G5 had ECC memory, it would be much easier...



    Quote:

    If anything it would slow it down slightly. However, ECC will self-correct some memory errors that non-ECC won't. ie. If there is a silent error, ECC will correct it silently and the calculations will merrily continue, correctly. OTOH, non-ECC memory will simply let the error propagate and the whole calculation will be garbage. This isn't a big deal for a person running 4 GB RAM in Photoshop. Indeed, Dr. V said it's not a problem for their cluster running 4 Terabytes of RAM doing calculations for a few hours either. However, it's potentially a huge deal for somebody running 4 Terabytes of memory in a cluster and trying to do a calculation that's supposed to go on for a week or whatever.
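The failure mode described in the quote above, where a single silent flip poisons an entire run and is caught only by an end-of-run check like HPL's residual test, can be sketched in a few lines (the data and the flipped bit are made up for illustration):

```python
def long_calculation(data):
    # Stand-in for a days-long run in which every value feeds the result.
    return sum(data)

data = list(range(1000))
clean_result = long_calculation(data)

# Simulate one silent bit flip in memory mid-run, which non-ECC RAM
# will neither detect nor correct.
corrupted = data[:]
corrupted[500] ^= 1 << 20            # flip bit 20 of one element

bad_result = long_calculation(corrupted)

# An end-of-run verification detects the damage, but only after the
# whole run is wasted, and without locating the faulty node.
print(bad_result != clean_result)    # True: one flip corrupted the result
```

With ECC, the flip would have been corrected in hardware and the run would have continued correctly.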



    And this, from Dr. Srinidhi Varadarajan, Director of the Terascale Computing Facility at VT:



    Quote:

    However, I do agree that lack of ECC is an issue, particularly when our fault tolerance work now enables large runs that can last as long as the application chooses.. We are switching to an ECC platform soon, so this will not be a long-term issue either.



    Which sounds like the G5s are a temporary measure. Either VT knows that new machines from Apple are coming with ECC, or they are planning on switching to Opterons or something. It seems odd that they would spend $5 million on this setup and then scrap it shortly after...
  • Reply 5 of 29
    g-news Posts: 1,107
    they just bought a cluster for 5.2 million plus infrastructure and are switching to an ECC platform SOON?

    WTF?
  • Reply 6 of 29
    Quote:

    Originally posted by G-News

    they just bought a cluster for 5.2 million plus infrastructure and are switching to an ECC platform SOON?

    WTF?




    That is what I was thinking. To paraphrase someone over at MacNN: they basically think this was used to get on the list; then they will resell the G5s to students and purchase new G5 Xserves, which will have ECC support. The InfiniBand gear and all that should just work on the Xserves (right?); the only thing is that the racks they use now wouldn't be compatible.



    All in all, the post by the Director makes absolutely no sense.
  • Reply 7 of 29
    flounder Posts: 2,674
    thanks for the info, Kupan
  • Reply 8 of 29
    g-news Posts: 1,107
    I simply don't believe that, I'm sorry.
  • Reply 9 of 29
    chucker Posts: 5,089
    Quote:

    Originally posted by kupan787

    To paraphrase someone over at MacNN: they basically think this was used to get on the list; then they will resell the G5s to students and purchase new G5 Xserves, which will have ECC support.







  • Reply 10 of 29
    krassy Posts: 595
    Quote:

    Originally posted by G-News

    I simply don't believe that, I'm sorry.



    i 2nd that.
  • Reply 11 of 29
    powerdoc Posts: 8,123
    Quote:

    Originally posted by Krassy

    i 2nd that.



    So do I. I doubt that the university will give them more money. If they had, they would have chosen the Opteron.



    I am not a mathematical expert, but there is a way to write software that detects errors using checksums. If they want secure calculations, they can do it.

    Even ECC memory does not prevent all errors, and it is no guarantee of a 100% correct calculation.
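A minimal sketch of the kind of software check being described here: keep a checksum alongside each block of data and verify it before use. Unlike ECC this detects corruption but cannot correct it, and it costs extra CPU time (the data and function names are invented for the example):

```python
import zlib

def store(block: bytes):
    # Keep a CRC-32 alongside each data block...
    return block, zlib.crc32(block)

def load(block: bytes, crc: int) -> bytes:
    # ...and verify it before use: any single flipped bit shows up
    # as a checksum mismatch.
    if zlib.crc32(block) != crc:
        raise ValueError("corruption detected; recompute or refetch")
    return block

payload, crc = store(b"one tile of the matrix")
flipped = bytes([payload[0] ^ 0x01]) + payload[1:]   # simulate a bit flip
```

Calling `load(payload, crc)` succeeds, while `load(flipped, crc)` raises, which is exactly the detect-but-not-correct trade-off versus ECC.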
  • Reply 12 of 29
    Quote:

    Originally posted by G-News

    I simply don't believe that, I'm sorry.



    Do you think I am making it up? Here is the link to the page over at MacNN:



    http://forums.macnn.com/showthread.p...0&pagenumber=3



    It is about half way down, and posted by Eug.



    And one more post from over there:



    Quote:

    It probably means they're gonna simply buy Opterons or G5 Xserves or something in the near future.



    I'm thinking they got the G5 Power Macs as an interim solution, to get on the list, and because they can be used in a number of different situations. ie. Use them for now, esp. for smaller calculations and then use them in small clusters or as desktops when they get more money to build a dedicated supercluster with ECC-endowed servers, which could in fact be G5 Xserves. It would make sense for Apple to go with ECC, esp. now that VT has proven clustering G5s is a viable option in terms of functionality, provides a good price performance ratio, and that Apple is a good reliable hardware vendor.



    The question many are asking is why they didn't go with Linux Opteron right off the bat, but it seems they had issues with supply and support, as well as quoted costs. The G5 was the best solution at the time for their needs, but it's not a perfect solution either. Fortunately for them, they seem to have a lot of money to play with.



  • Reply 13 of 29
    krassy Posts: 595
    Quote:

    Originally posted by kupan787

    Do you think I am making it up?



    no, but i don't believe the authors of the quotes you mentioned... the only thing i can imagine is that they're switching to newer G5s some day, but not earlier than 12 months from now - they bought a whole cooling system and made up these places for the towers... and they are porting (and already have ported) their software to Mac OS X, so they won't take Opterons for sure. but anyway - i don't know this for sure... and i don't believe this whole ECC-RAM stuff is such a big problem...
  • Reply 14 of 29
    g-news Posts: 1,107
    this "in the near future" may be relative to supercomputer lifecycles. In that case "the near future" could also be in 5 years.



    I think this is simply a rumormonger going postal.
  • Reply 15 of 29
    stoo Posts: 1,490
    A cluster doesn't need insane years-long uptimes. Physical memory errors are sufficiently rare that 1100 G5s should last a couple of days or weeks. Even if one machine does return incorrect results, that would presumably just show up as an erroneous datapoint, as happens in empirical experiments.
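Whether "sufficiently rare" holds depends entirely on the per-machine error rate, which nobody in this thread actually knows. Assuming, purely for illustration, one soft error per machine per year (published rates vary by orders of magnitude):

```python
machines = 1100
errors_per_machine_per_year = 1.0   # assumed for illustration only

cluster_errors_per_day = machines * errors_per_machine_per_year / 365.0
mean_hours_between_errors = 24.0 / cluster_errors_per_day

print(f"{cluster_errors_per_day:.1f} errors/day across the cluster")
print(f"one error every {mean_hours_between_errors:.1f} hours on average")
```

At that assumed rate the cluster as a whole sees an error roughly every eight hours, which is why long unchecked runs worry people even when any single box seems rock solid.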
  • Reply 16 of 29
    chagi Posts: 284
    Virginia Tech is going to be building a brand new center on campus, with completion around 2005/2006 - Institute for Critical Technology and Applied Science (ICTAS).



    So when Dr. Varadarajan is talking about ECC "soon", I think it's logical to assume that they are already beginning to look ahead to the system(s) that will be implemented in the new facility.
  • Reply 17 of 29
    Quote:

    Originally posted by Stoo

    A cluster doesn't need insane years-long uptimes. Physical memory errors are sufficiently rare that 1100 G5s should last a couple of days or weeks. Even if one machine does return incorrect results, that would presumably just show up as an erroneous datapoint, as happens in empirical experiments.





    As I understood it, the software error correction they are using just reissues the suspect data packet to a different node for reprocessing.



    If one or two RAM errors occurred per week, as I have heard others state is the "normal frequency of occurrence", I don't see how this could have much effect on the total processing time of a long run of data.
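A sketch of that reissue strategy, assuming each work unit's result can be verified cheaply. VT's actual scheduler isn't public, so every name here is invented, and the "nodes" are toy functions:

```python
def run_with_reissue(work_unit, nodes, verify, max_tries=3):
    """Send a work unit to a node; if the result fails verification,
    reissue it to a different node instead of aborting the whole run."""
    for node in nodes[:max_tries]:
        result = node(work_unit)
        if verify(work_unit, result):
            return result
    raise RuntimeError("work unit failed on every node tried")

# Toy demo: a node with a memory fault that corrupts results, plus a
# healthy one. The verifier stands in for a checksum or residual test.
flaky = lambda xs: sum(xs) + 1          # silently wrong
healthy = lambda xs: sum(xs)
verify = lambda xs, r: r == sum(xs)

print(run_with_reissue([1, 2, 3], [flaky, healthy], verify))  # 6
```

If errors really are one or two per week, the cost of an occasional reissue is noise compared to the length of the run, which is the point being made above.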
  • Reply 18 of 29
    Quote:

    Originally posted by Plague Bearer

    As I understood it, the software error correction they are using just reissues the suspect data packet to a different node for reprocessing.



    If one or two RAM errors occurred per week, as I have heard others state is the "normal frequency of occurrence", I don't see how this could have much effect on the total processing time of a long run of data.




    All I know is that over at Ars, people were up in arms about the fact that VT was using a machine with no ECC. I heard stories that the whole cluster couldn't be used for more than an hour before a memory error would screw up the calculation. However, smaller segments would run fine (say 25-50 machines out of the 1100).
  • Reply 19 of 29
    g-news Posts: 1,107
    Well, the battlefront in Ars is often more flamed than factual.

    I don't think Varadarajanrananan or whatever his wicked name is would have bought an ECC-less system if he didn't have a solution for the issue.
  • Reply 20 of 29
    krassy Posts: 595
    Quote:

    Originally posted by kupan787

    All I know is that over at Ars, people were up in arms about the fact that VT was using a machine with no ECC. I heard stories that the whole cluster couldn't be used for more than an hour before a memory error would screw up the calculation. However, smaller segments would run fine (say 25-50 machines out of the 1100).



    as i said - i don't believe this... this would be a waste of money for VT. do you think they're that stupid? me not.