Facebook says 'faulty configuration change' to blame for 6-hour outage

Posted in General Discussion, edited October 2021
Facebook late Monday apologized for a six-hour outage that impacted the company's flagship social network, as well as ancillary services, blaming the downtime on a "faulty configuration change."

Facebook and its related services, including Instagram, WhatsApp, Messenger and Oculus VR, went offline at around 11:30 a.m. Eastern and remained inaccessible for about six hours. Subsequent reports suggested that a bad Border Gateway Protocol (BGP) update was to blame for the outage, and a new statement from Facebook seemingly confirms the theory.

In a blog post, Facebook VP of Engineering and Infrastructure Santosh Janardhan apologized for the "inconvenience" and explained that router configuration changes caused an interruption between its data centers.

"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication," Janardhan said. "This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt."

The explanation jibes with information provided by Cloudflare, which earlier in the day traced the issue back to a BGP mishap that impacted traffic routing. At the time, some speculated that a simple DNS configuration error was behind the downtime, though that explanation was abandoned after certain DNS services were found to be functional but unresponsive.

Janardhan also confirmed reports that Facebook's internal tools were impacted by the outage, complicating efforts to diagnose and solve the problem. According to The New York Times, security engineers were unable to gain physical access to affected servers because their digital badges were rendered inoperable.

Apparently fearful of rumors that its system was hacked, Facebook in the blog post reiterates that the outage was caused by a "faulty configuration change" and notes that no user data was compromised as a result of the downtime.


Comments

  • Reply 1 of 25
    Can we make that config permanent? Thank you.    
  • Reply 2 of 25
    For six hours we were able to have meaningful in person conversations.   Maybe we need to start a GoFundMe for the poor IT guy that blew it today?  It was probably his last day.  
  • Reply 3 of 25
    22july2013 | Posts: 3,571 | member
    For six hours we were able to have meaningful in person conversations.   Maybe we need to start a GoFundMe for the poor IT guy that blew it today?  It was probably his last day.  
    I think that IT guy will never make that mistake again, so he would be a great employee now.
  • Reply 4 of 25
    Fred257 | Posts: 237 | member
    This outage was done on purpose. On Sunday evening 60 minutes did a major story on Facebook from an actual whistleblower.  To get that story out of the media what better way than to create a major diversion.  Zuckerberg is a covert psychopath narcissist and I know how exactly he thinks because I have a lifetime experience with them.  Zuckerberg is a sick person who extracts energy and money from destroying our personal lives with his extraction tool called Facebook. 
  • Reply 5 of 25

    security engineers were unable to gain physical access to affected servers because their digital badges were rendered inoperable. 

    That is why most places use separate networks for access control and building maintenance/HVAC.
  • Reply 6 of 25
    fred1 | Posts: 1,112 | member
    Facebook: making the world a better place. And then they came back online. At least we can reminisce. 
  • Reply 7 of 25
    dewme | Posts: 5,362 | member
    The concept behind Facebook is glorious if all people were nice. But all people aren’t nice. Some people are evil. Facebook has found a way to monetize evil and reap incalculable financial rewards from doing so. But as many others have said, if Facebook wasn’t doing it someone else would. The social media genie can never be put back in the bottle.
  • Reply 8 of 25
    Fred257 said:
    This outage was done on purpose. On Sunday evening 60 minutes did a major story on Facebook from an actual whistleblower.  To get that story out of the media what better way than to create a major diversion.  Zuckerberg is a covert psychopath narcissist and I know how exactly he thinks because I have a lifetime experience with them.  Zuckerberg is a sick person who extracts energy and money from destroying our personal lives with his extraction tool called Facebook. 
    This theory is insanity - the downtime not only draws more attention to Facebook, but also costs Facebook the advertising revenue it would have earned during those hours.

    Furthermore, the news about Facebook is already widely published, and this topic will continue to be news due to the ongoing congressional review of Facebook's activities.
  • Reply 9 of 25
    lkrupp | Posts: 10,557 | member
    So the engineers couldn’t get into the server room because their badges went down like the rest of the company? Wow. That sounds like an Austin Powers movie in the making. I can see the engineers peering through the server room windows and frantically pounding on the doors as the servers went berserk. “Somebody get me a fire ax,” yelled the head engineer while Zuck sat in his lair petting a white cat and intoning, “Give me a frickin break, will you.”
    edited October 2021
  • Reply 10 of 25
    GeorgeBMac | Posts: 11,421 | member
    For six hours we were able to have meaningful in person conversations.   Maybe we need to start a GoFundMe for the poor IT guy that blew it today?  It was probably his last day.  
    I think that IT guy will never make that mistake again, so he would be a great employee now.

    The one forbidden word for surgeons and network engineers is:   "Ooops!"

    I can only imagine the "conversations" going on today between FaceBook engineers, managers, etc...
    The blame game will have reached new, unheard of levels by now....
  • Reply 11 of 25
    dewme | Posts: 5,362 | member
    For six hours we were able to have meaningful in person conversations.   Maybe we need to start a GoFundMe for the poor IT guy that blew it today?  It was probably his last day.  
    I think that IT guy will never make that mistake again, so he would be a great employee now.

    The one forbidden word for surgeons and network engineers is:   "Ooops!"

    I can only imagine the "conversations" going on today between FaceBook engineers, managers, etc...
    The blame game will have reached new, unheard of levels by now....

    It does seem odd that they would not employ some sort of checkpointing scheme on their configuration database to allow them to roll back to the last known good state. This is a very common technique for high availability systems and even some individual products, e.g., take a snapshot of the configuration settings before performing a software or firmware update. While I have zero love for Facebook, I'm sure that its stakeholders don't appreciate the financial losses incurred during the protracted downtime.
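
    The checkpointing idea described above can be sketched roughly as follows. This is only an illustration: `ConfigStore`, `update_with_rollback`, and the `backbone_enabled` setting are hypothetical names, and nothing here reflects how Facebook's configuration system actually works.

```python
import copy


class ConfigStore:
    """Hypothetical in-memory configuration store with checkpoints."""

    def __init__(self, settings):
        self._settings = dict(settings)
        self._checkpoints = []  # stack of earlier known-good snapshots

    def checkpoint(self):
        # Deep-copy so later changes cannot alter the saved snapshot.
        self._checkpoints.append(copy.deepcopy(self._settings))

    def apply(self, changes):
        self._settings.update(changes)

    def rollback(self):
        if not self._checkpoints:
            raise RuntimeError("no checkpoint to roll back to")
        self._settings = self._checkpoints.pop()

    def get(self, key):
        return self._settings[key]


def update_with_rollback(store, changes, health_check):
    """Snapshot, apply the change, and revert if the health check fails."""
    store.checkpoint()
    store.apply(changes)
    if health_check(store):
        return True
    store.rollback()
    return False
```

    A failed health check leaves the store in its previous state. The catch, of course, is that this only helps while whatever performs the rollback can still reach the thing being configured.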
  • Reply 12 of 25
    sdw2001 | Posts: 18,016 | member
    Couldn't have happened to a more evil, corrupt company.  Sorry for all the small biz that was impacted.  I personally got off FB early in the year and it is the BEST decision ever.  
  • Reply 13 of 25
    GeorgeBMac | Posts: 11,421 | member
    sdw2001 said:
    Couldn't have happened to a more evil, corrupt company.  Sorry for all the small biz that was impacted.  I personally got off FB early in the year and it is the BEST decision ever.  

    I'm not disagreeing.   But they also do a lot of good.
    I use it to keep in touch with my local running community on organized runs and events.  I also belong to a "runners over 70" group with older runners from all over the world:  We seem to be a pretty rare breed of older adults working hard to stay fit and healthy (and mostly succeeding!) so the group provides encouragement, support and advice that simply would not be available anywhere else.
  • Reply 14 of 25
    GeorgeBMac | Posts: 11,421 | member
    dewme said:
    For six hours we were able to have meaningful in person conversations.   Maybe we need to start a GoFundMe for the poor IT guy that blew it today?  It was probably his last day.  
    I think that IT guy will never make that mistake again, so he would be a great employee now.

    The one forbidden word for surgeons and network engineers is:   "Ooops!"

    I can only imagine the "conversations" going on today between FaceBook engineers, managers, etc...
    The blame game will have reached new, unheard of levels by now....

    It does seem odd that they would not employ some sort of checkpointing scheme on their configuration database to allow them to roll back to the last known good state. This is a very common technique for high availability systems and even some individual products, e.g., take a snapshot of the configuration settings before performing a software or firmware update. While I have zero love for Facebook, I'm sure that its stakeholders don't appreciate the financial losses incurred during the protracted downtime.

    The problem may have been that they locked themselves out of their own buildings and systems along with everybody else!

  • Reply 15 of 25
    sdw2001 | Posts: 18,016 | member
    sdw2001 said:
    Couldn't have happened to a more evil, corrupt company.  Sorry for all the small biz that was impacted.  I personally got off FB early in the year and it is the BEST decision ever.  

    I'm not disagreeing.   But they also do a lot of good.
    I use it to keep in touch with my local running community on organized runs and events.  I also belong to a "runners over 70" group with older runners from all over the world:  We seem to be a pretty rare breed of older adults working hard to stay fit and healthy (and mostly succeeding!) so the group provides encouragement, support and advice that simply would not be available anywhere else.

    I'm not saying it has no uses.  I used to be pretty involved in all kinds of groups, from humor to politics to special interest, to neighborhood, etc.  But the fact that it has some use does not excuse the company's behavior.  Their data mining and tracking alone is terrible.  Then we get into banning, shadow banning, dethrottling, suppressing information, etc.  
  • Reply 16 of 25
    GeorgeBMac | Posts: 11,421 | member
    sdw2001 said:
    sdw2001 said:
    Couldn't have happened to a more evil, corrupt company.  Sorry for all the small biz that was impacted.  I personally got off FB early in the year and it is the BEST decision ever.  

    I'm not disagreeing.   But they also do a lot of good.
    I use it to keep in touch with my local running community on organized runs and events.  I also belong to a "runners over 70" group with older runners from all over the world:  We seem to be a pretty rare breed of older adults working hard to stay fit and healthy (and mostly succeeding!) so the group provides encouragement, support and advice that simply would not be available anywhere else.

    I'm not saying it has no uses.  I used to be pretty involved in all kinds of groups, from humor to politics to special interest, to neighborhood, etc.  But the fact that it has some use does not excuse the company's behavior.  Their data mining and tracking alone is terrible.  Then we get into banning, shadow banning, dethrottling, suppressing information, etc.  

    Or being used to spread misinformation and as a forum for terrorists to recruit more terrorists and plan their attacks.
  • Reply 17 of 25
    docno42 | Posts: 3,755 | member
    GeorgeBMac said:
    Or being used to spread misinformation and as a forum for terrorists to recruit more terrorists and plan their attacks.
    lol - the new "but think of the children". 

    Yes Zuck, save us from ourselves!  

    Pathetic...
  • Reply 18 of 25
    sflocal | Posts: 6,093 | member
    Fred257 said:
    This outage was done on purpose. On Sunday evening 60 minutes did a major story on Facebook from an actual whistleblower.  To get that story out of the media what better way than to create a major diversion.  Zuckerberg is a covert psychopath narcissist and I know how exactly he thinks because I have a lifetime experience with them.  Zuckerberg is a sick person who extracts energy and money from destroying our personal lives with his extraction tool called Facebook. 
    You should consider a better-quality tinfoil for your next hat.  The silliness factor of your conspiracy-post is high.
  • Reply 19 of 25
    zimmie | Posts: 651 | member
    dewme said:
    The concept behind Facebook is glorious if all people were nice. But all people aren’t nice. Some people are evil. Facebook has found a way to monetize evil and reap incalculable financial rewards from doing so. But as many others have said, if Facebook wasn’t doing it someone else would. The social media genie can never be put back in the bottle.
    Is it, though? The concept behind Facebook was to rate female Harvard students' hotness.

    dewme said:
    For six hours we were able to have meaningful in person conversations.   Maybe we need to start a GoFundMe for the poor IT guy that blew it today?  It was probably his last day.  
    I think that IT guy will never make that mistake again, so he would be a great employee now.

    The one forbidden word for surgeons and network engineers is:   "Ooops!"

    I can only imagine the "conversations" going on today between FaceBook engineers, managers, etc...
    The blame game will have reached new, unheard of levels by now....

    It does seem odd that they would not employ some sort of checkpointing scheme on their configuration database to allow them to roll back to the last known good state. This is a very common technique for high availability systems and even some individual products, e.g., take a snapshot of the configuration settings before performing a software or firmware update. While I have zero love for Facebook, I'm sure that its stakeholders don't appreciate the financial losses incurred during the protracted downtime.
    They do have the ability to undo changes ... at the level of the management platform. When the management platform can no longer reach the device to manage it, it needs manual intervention to restore that communication. Then the configuration can be restored to what it was before the change. They also lost external access to the management platform, but that's simple enough to fix (if somebody doesn't confirm the change, roll it back automatically). The access from the management to the managed devices is more difficult.
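
    The "if somebody doesn't confirm the change, roll it back automatically" idea can be sketched as below. It is similar in spirit to the confirm-or-revert commit feature found on some router platforms, but the class name, the states, and the `advertise_routes` setting are all hypothetical, and time is passed in explicitly so the sketch stays deterministic.

```python
class ConfirmedChange:
    """Hypothetical device-side change that reverts unless confirmed in time."""

    def __init__(self, config, changes, timeout_s, now):
        self.config = config
        self._previous = dict(config)      # snapshot taken before the change
        self.config.update(changes)        # change takes effect immediately
        self._deadline = now + timeout_s
        self.state = "pending"

    def confirm(self, now):
        # An operator who can still reach the device confirms the change.
        if self.state == "pending" and now <= self._deadline:
            self.state = "confirmed"
        return self.state == "confirmed"

    def tick(self, now):
        # Run periodically on the device itself, so the rollback does not
        # depend on the management platform still being reachable.
        if self.state == "pending" and now > self._deadline:
            self.config.clear()
            self.config.update(self._previous)
            self.state = "rolled_back"
        return self.state
```

    The point of the design is that the timer lives with the thing being changed: if the change cuts off whoever made it, nobody can confirm, and the old configuration comes back on its own.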
  • Reply 20 of 25
    dewme | Posts: 5,362 | member
    zimmie said:
    dewme said:
    The concept behind Facebook is glorious if all people were nice. But all people aren’t nice. Some people are evil. Facebook has found a way to monetize evil and reap incalculable financial rewards from doing so. But as many others have said, if Facebook wasn’t doing it someone else would. The social media genie can never be put back in the bottle.
    Is it, though? The concept behind Facebook was to rate female Harvard students' hotness.

    dewme said:
    For six hours we were able to have meaningful in person conversations.   Maybe we need to start a GoFundMe for the poor IT guy that blew it today?  It was probably his last day.  
    I think that IT guy will never make that mistake again, so he would be a great employee now.

    The one forbidden word for surgeons and network engineers is:   "Ooops!"

    I can only imagine the "conversations" going on today between FaceBook engineers, managers, etc...
    The blame game will have reached new, unheard of levels by now....

    It does seem odd that they would not employ some sort of checkpointing scheme on their configuration database to allow them to roll back to the last known good state. This is a very common technique for high availability systems and even some individual products, e.g., take a snapshot of the configuration settings before performing a software or firmware update. While I have zero love for Facebook, I'm sure that its stakeholders don't appreciate the financial losses incurred during the protracted downtime.
    They do have the ability to undo changes ... at the level of the management platform. When the management platform can no longer reach the device to manage it, it needs manual intervention to restore that communication. Then the configuration can be restored to what it was before the change. They also lost external access to the management platform, but that's simple enough to fix (if somebody doesn't confirm the change, roll it back automatically). The access from the management to the managed devices is more difficult.

    Good information. That makes sense. So they basically severed the primary communication link between the management platform and the managed devices. So what they also needed was a remote out-of-band communication channel, or what they probably ended up doing, a bunch of engineers with laptops and a bunch of physical security keys to bypass the badge readers protecting the servers. Interesting cascading failure mode.