Facebook says 'faulty configuration change' to blame for 6-hour outage
Facebook late Monday apologized for a six-hour outage that impacted the company's flagship social network, as well as ancillary services, blaming the downtime on a "faulty configuration change."

Facebook and its related services, including Instagram, WhatsApp, Messenger and Oculus VR, went offline at around 11:30 a.m. Eastern and remained inaccessible for about six hours. Subsequent reports suggested that a bad Border Gateway Protocol (BGP) update was to blame for the outage, and a new statement from Facebook seemingly confirms the theory.
In a blog post, Facebook VP of Engineering and Infrastructure Santosh Janardhan apologized for the "inconvenience" and explained that router configuration changes caused an interruption between its data centers.
"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication," Janardhan said. "This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt."
The explanation jibes with information provided by Cloudflare, which earlier in the day traced the issue back to a BGP mishap that impacted traffic routing. At the time, some speculated that a simple DNS configuration error was behind the downtime, though that explanation was abandoned after certain DNS services were found to be functional but unresponsive.
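For readers curious how observers separated the two theories, here is a minimal sketch (not Facebook's or Cloudflare's actual tooling) of the kind of check that distinguishes unreachable nameservers from misconfigured DNS records. It assumes the third-party dnspython package; the domain and timeout are illustrative.

```python
# Resolve a zone's authoritative nameservers, then query each one directly.
# A timeout on the direct query suggests the server itself is unreachable
# (e.g. its BGP routes were withdrawn) rather than misconfigured.
import dns.exception
import dns.message
import dns.query
import dns.resolver

DOMAIN = "facebook.com"

# Ask a public resolver for the domain's authoritative nameservers.
ns_records = dns.resolver.resolve(DOMAIN, "NS")

for ns in ns_records:
    ns_name = str(ns.target)
    try:
        ns_ip = str(dns.resolver.resolve(ns_name, "A")[0])
    except dns.exception.DNSException as exc:
        print(f"{ns_name}: could not resolve nameserver address ({exc})")
        continue

    query = dns.message.make_query(DOMAIN, "A")
    try:
        dns.query.udp(query, ns_ip, timeout=2)
        print(f"{ns_name} ({ns_ip}): reachable and answering")
    except dns.exception.Timeout:
        print(f"{ns_name} ({ns_ip}): no response -- likely unreachable")
```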
Janardhan also confirmed reports that Facebook's internal tools were impacted by the outage, complicating efforts to diagnose and solve the problem. According to The New York Times, security engineers were unable to gain physical access to affected servers because their digital badges were rendered inoperable.
Apparently fearful of rumors that its system was hacked, Facebook in the blog post reiterates that the outage was caused by a "faulty configuration change" and notes that no user data was compromised as a result of the downtime.
Comments
That is why most places use separate networks for access control and building maintenance/HVAC.
Furthermore, the news about Facebook is already widely published, and this topic will continue to be news due to the ongoing congressional review of Facebook's activities.
It does seem odd that they would not employ some sort of checkpointing scheme on their configuration database to allow them to roll back to the last known good state. This is a very common technique for high availability systems and even some individual products, e.g., take a snapshot of the configuration settings before performing a software or firmware update. While I have zero love for Facebook, I'm sure that its stakeholders don't appreciate the financial losses incurred during the protracted downtime.
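A minimal sketch of that checkpoint-before-change idea, assuming a hypothetical device object (its name, get_config and apply_config are placeholders, not any real vendor SDK):

```python
# Snapshot the current configuration before applying a change, and restore
# the snapshot if a post-change health check fails.
import json
import pathlib
import time

CHECKPOINT_DIR = pathlib.Path("/var/lib/netops/checkpoints")


def checkpoint(device_name: str, config: dict) -> pathlib.Path:
    """Save the current (known-good) configuration before touching anything."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    path = CHECKPOINT_DIR / f"{device_name}-{int(time.time())}.json"
    path.write_text(json.dumps(config, indent=2))
    return path


def apply_with_rollback(device, new_config: dict, healthy) -> bool:
    """Apply new_config; roll back to the snapshot if the health check fails."""
    snapshot = checkpoint(device.name, device.get_config())
    device.apply_config(new_config)
    if healthy(device):
        return True
    # Health check failed: restore the last known good state.
    device.apply_config(json.loads(snapshot.read_text()))
    return False
```

The point is simply that every change is preceded by a snapshot, so a failed health check always has something known-good to fall back to.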
I'm not saying it has no uses. I used to be pretty involved in all kinds of groups, from humor to politics to special interest, to neighborhood, etc. But the fact that it has some use does not excuse the company's behavior. Their data mining and tracking alone is terrible. Then we get into banning, shadow banning, dethrottling, suppressing information, etc.
Or being used to spread misinformation and as a forum for terrorists to recruit more terrorists and plan their attacks.
Yes Zuck, save us from ourselves!
Pathetic...
They do have the ability to undo changes ... at the level of the management platform. When the management platform can no longer reach the device it manages, manual intervention is needed to restore that communication; only then can the configuration be rolled back to what it was before the change. They also lost external access to the management platform, but that's simple enough to fix (if somebody doesn't confirm the change, roll it back automatically). Restoring access from the management platform to the managed devices is the harder problem.
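A rough sketch of that confirm-or-revert idea, similar in spirit to a router's commit-confirmed mode; the device object and its methods here are assumptions, not any real management platform's API:

```python
# Apply a configuration change, then revert it automatically unless an
# operator confirms it within the timeout window.
import threading


def commit_confirmed(device, new_config, previous_config, timeout_s=300):
    """Apply new_config, but revert automatically unless confirm() is called."""
    device.apply_config(new_config)

    def revert():
        # No confirmation arrived in time: assume the operator lost access
        # (or the change broke something) and restore the old configuration.
        device.apply_config(previous_config)

    timer = threading.Timer(timeout_s, revert)
    timer.start()

    def confirm():
        # Operator verified the change from outside the management plane.
        timer.cancel()

    return confirm
```

The caveat is the one raised above: the automatic revert only helps if the path used to push the rollback can still reach the device.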
Good information. That makes sense. So they basically severed the primary communication link between the management platform and the managed devices. What they also needed was a remote out-of-band communication channel, or, as they probably ended up doing, a bunch of engineers with laptops and physical security keys to bypass the badge readers protecting the servers. Interesting cascading failure mode.
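For illustration, a small sketch of that kind of fallback, where the in-band management address is tried first and a separate out-of-band console network is used if it is unreachable (hostnames and ports are placeholders):

```python
# Prefer the in-band management path; fall back to an out-of-band channel
# that lives on physically separate infrastructure.
import socket


def reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Basic TCP reachability probe."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def pick_management_path(inband_host: str, oob_host: str, port: int = 22) -> str:
    """Return whichever management address is currently usable."""
    if reachable(inband_host, port):
        return inband_host
    # In-band path is down (e.g. routes withdrawn); use the OOB channel.
    return oob_host
```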