What's the big deal about ECC memory in the G5 Xserve?

Posted:
in Genius Bar edited January 2014
What's the big deal about ECC memory in the G5 Xserve?



Virginia Tech is stoked!



Will this RAM be more expensive?

Comments

  • Reply 1 of 6
    ECC memory has special hardware on the RAM that checks that the information coming out of the DIMM is the same as what went in (and corrects it in some cases). Occasionally memory locations can change their value, sometimes due to cosmic rays, but far more often because of defects in the materials of the RAM chips themselves.



    About 10 years ago this was a major problem in RAM chips, but it has by and large been corrected. Now most ECC systems are employed because the systems engineers have bad memories of those old systems and feel much safer with ECC RAM (think of it as a blanky for systems engineers).



    With systems as big as Virginia Tech's I can understand the desire for this again. It will probably not make a difference in the accuracy of the calculations, but it will make the system administrators worry less about that potential source of errors. And with a system that complex you have enough other problems to worry about.



    But they are probably far more interested in the smaller form factor and easier maintenance-in-place of the Xserves. You can pull them out like a drawer and make changes, there is a really nice hardware monitoring system, and there are nice little touches like a light on the front that can be turned on from the monitoring software, which makes finding a particular Xserve in a cluster much easier.
  • Reply 2 of 6
    akac Posts: 512 member
    Actually, VTech has said that because of the non-ECC RAM they will run some calcs TWICE to ensure their correctness.



    That's a 50% speed decrease.



    Also, another engineer said that ECC systems log memory bit corrections, and that on their large-scale server cluster they see about one correction a week across the entire cluster with current hardware.



    So the ECC RAM means they can trust the results of their calculations.
  • Reply 3 of 6
    cosmonut Posts: 4,872 member
    Quote:

    Originally posted by Karl Kuehn

    About 10 years ago this was a major problem in RAM chips,



    Was that the reason for parity RAM?
  • Reply 4 of 6
    Parity is the simplest kind of memory error check. It adds up the bits in a stored word (keeping only the last binary digit of the sum) and compares the result to a parity bit stored alongside the word. If they match, it assumes the data is fine. This catches a single-bit error (or any odd number of errors), which is the most common sort of error, but if two bits flip, parity does not catch it. Parity can also only detect an error, not correct it. There are more complex schemes that catch many more situations.
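
    To make the parity check described above concrete, here is a minimal Python sketch. It assumes even parity over a single 8-bit word; the function names (`parity_bit`, `passes_check`) and the sample word are made up for illustration and do not reflect how any particular memory controller does it.

```python
def parity_bit(bits):
    """Even parity: sum the bits and keep only the last binary digit."""
    return sum(bits) % 2


def passes_check(bits, stored_parity):
    """True if the recomputed parity matches the parity bit stored with the word."""
    return parity_bit(bits) == stored_parity


word = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical 8-bit word, four 1s
stored = parity_bit(word)         # parity bit written alongside the word

# One bit flips while the word sits in memory.
one_flip = word.copy()
one_flip[3] ^= 1

# Two bits flip: the two changes cancel out in the sum.
two_flips = word.copy()
two_flips[3] ^= 1
two_flips[5] ^= 1

single_detected = not passes_check(one_flip, stored)
double_detected = not passes_check(two_flips, stored)
print(single_detected, double_detected)
```

    The single flip changes the parity and is noticed. The double flip leaves the parity unchanged, so the check passes even though the data is wrong.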



    And with a once-a-week error, I would think it safe to say that that level of error fades into the background when you compare it to all the other points where errors are going to creep in: logic mistakes, bit flips in transit or inside the processor (cache or working registers), bugs in system or program design, data problems, hard disk degradation, etc.



    Like I said before, it is a nice blanky to ease the system administrator's mind so that he/she/they can be deathly afraid of every other source of errors...
  • Reply 5 of 6
    wmf Posts: 1,164 member
    Quote:

    Originally posted by Karl Kuehn

    And on a once-a-week error I would think it safe to say that that level of error would fade into the background when you compare all the other points where errors are going to creep in: logic mistakes, bit flips in transit or inside the processor (cache or working), bugs in system or program design, data problems, hard disk degradation, etc...





    Actually no. The caches and buses in the system also have ECC, so they don't get errors. Hard disks have even more sophisticated error-correction. People have done the math on this and the weak spot is the RAM, simply because there's so much of it (e.g. Big Mac has maybe 1100 hard disks, 2200 CPUs, and 140800 RAM chips).
  • Reply 6 of 6
    123 Posts: 278 member
    Quote:

    Originally posted by Karl Kuehn

    This works if there is only a single bit error (or an odd number of errors). This is the most common sort of error, but if you have two errors, parity does not catch it.



    ECC, on the other hand, corrects all one-bit errors and detects all two-bit errors (plus some larger ones).
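
    For anyone curious how "correct one, detect two" can work, here is a toy Python sketch. It assumes an extended Hamming code over just 4 data bits (7 Hamming positions plus one overall parity bit), which is a textbook scheme chosen for illustration. ECC DIMMs typically protect 64 data bits with 8 check bits, and the names `encode` and `decode` are my own.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1..7 are the usual
    Hamming positions, with check bits at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4            # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4            # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4            # covers positions 5, 6, 7
    code = [p1, p2, d1, p4, d2, d3, d4]
    p0 = 0
    for bit in code:
        p0 ^= bit                # overall parity over positions 1..7
    return [p0] + code


def decode(word):
    """Return (status, data); data is None when the error is uncorrectable."""
    word = list(word)
    syndrome = 0                 # XOR of the positions of all set bits
    overall = 0                  # XOR of all eight bits
    for pos, bit in enumerate(word):
        overall ^= bit
        if bit:
            syndrome ^= pos
    if overall == 0 and syndrome != 0:
        # Overall parity looks fine but the syndrome does not: two bits flipped.
        return "double error detected", None
    if overall == 1:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        status = "ok"
    return status, [word[3], word[5], word[6], word[7]]


codeword = encode([1, 0, 1, 1])

damaged_once = codeword.copy()
damaged_once[5] ^= 1             # single bit flip

damaged_twice = codeword.copy()
damaged_twice[2] ^= 1            # two bit flips
damaged_twice[5] ^= 1

print(decode(codeword))
print(decode(damaged_once))
print(decode(damaged_twice))
```

    The single flip is located by the syndrome and flipped back. With two flips the overall parity comes out even while the syndrome is nonzero, so the decoder knows something is wrong but cannot tell which bits to fix.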