Facebook says 'faulty configuration change' to blame for 6-hour outage
Facebook late Monday apologized for a six-hour outage that impacted the company's flagship social network, as well as ancillary services, blaming the downtime on a "faulty configuration change."

Facebook and its related services, including Instagram, WhatsApp, Messenger and Oculus VR, went offline at around 11:30 a.m. Eastern and remained inaccessible for about six hours. Subsequent reports suggested that a bad Border Gateway Protocol (BGP) update was to blame for the outage, and a new statement from Facebook seemingly confirms the theory.
In a blog post, Facebook VP of Engineering and Infrastructure Santosh Janardhan apologized for the "inconvenience" and explained that router configuration changes caused an interruption between its data centers.
"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication," Janardhan said. "This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt."
The explanation jibes with information provided by Cloudflare, which earlier in the day traced the issue back to a BGP mishap that impacted traffic routing. At the time, some speculated that a simple DNS configuration error was behind the downtime, though that explanation was abandoned after certain DNS services were found to be functional but unresponsive.
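For readers curious how observers separated the two theories, here is a minimal sketch (not Facebook's or Cloudflare's actual tooling) of the kind of check that distinguishes unreachable nameservers from misconfigured DNS records. It assumes the third-party dnspython package; the domain and timeout are illustrative.

```python
# Resolve a zone's authoritative nameservers, then query each one directly.
# A timeout on the direct query suggests the server itself is unreachable
# (e.g. its BGP routes were withdrawn) rather than misconfigured.
import dns.exception
import dns.message
import dns.query
import dns.resolver

DOMAIN = "facebook.com"

# Ask a public resolver for the domain's authoritative nameservers.
ns_records = dns.resolver.resolve(DOMAIN, "NS")

for ns in ns_records:
    ns_name = str(ns.target)
    try:
        ns_ip = str(dns.resolver.resolve(ns_name, "A")[0])
    except dns.exception.DNSException as exc:
        print(f"{ns_name}: could not resolve nameserver address ({exc})")
        continue

    query = dns.message.make_query(DOMAIN, "A")
    try:
        dns.query.udp(query, ns_ip, timeout=2)
        print(f"{ns_name} ({ns_ip}): reachable and answering")
    except dns.exception.Timeout:
        print(f"{ns_name} ({ns_ip}): no response -- likely unreachable")
```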
Janardhan also confirmed reports that Facebook's internal tools were impacted by the outage, complicating efforts to diagnose and solve the problem. According to The New York Times, security engineers were unable to gain physical access to affected servers because their digital badges were rendered inoperable.
Apparently fearful of rumors that its system was hacked, Facebook in the blog post reiterates that the outage was caused by a "faulty configuration change" and notes that no user data was compromised as a result of the downtime.
Comments
That is why most places use separate networks for access control and building maintenance/HVAC.
Furthermore, the news about Facebook is already widely published, and this topic will continue to be news due to the ongoing congressional review of Facebook's activities.
It does seem odd that they would not employ some sort of checkpointing scheme on their configuration database to allow them to roll back to the last known good state. This is a very common technique for high availability systems and even some individual products, e.g., take a snapshot of the configuration settings before performing a software or firmware update. While I have zero love for Facebook, I'm sure that its stakeholders don't appreciate the financial losses incurred during the protracted downtime.
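A minimal sketch of that checkpoint-before-change idea, assuming a hypothetical device object (its name, get_config and apply_config are placeholders, not any real vendor SDK):

```python
# Snapshot the current configuration before applying a change, and restore
# the snapshot if a post-change health check fails.
import json
import pathlib
import time

CHECKPOINT_DIR = pathlib.Path("/var/lib/netops/checkpoints")


def checkpoint(device_name: str, config: dict) -> pathlib.Path:
    """Save the current (known-good) configuration before touching anything."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    path = CHECKPOINT_DIR / f"{device_name}-{int(time.time())}.json"
    path.write_text(json.dumps(config, indent=2))
    return path


def apply_with_rollback(device, new_config: dict, healthy) -> bool:
    """Apply new_config; roll back to the snapshot if the health check fails."""
    snapshot = checkpoint(device.name, device.get_config())
    device.apply_config(new_config)
    if healthy(device):
        return True
    # Health check failed: restore the last known good state.
    device.apply_config(json.loads(snapshot.read_text()))
    return False
```

The point is simply that every change is preceded by a snapshot, so a failed health check always has something known-good to fall back to.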
I'm not saying it has no uses. I used to be pretty involved in all kinds of groups, from humor to politics to special interest, to neighborhood, etc. But the fact that it has some use does not excuse the company's behavior. Their data mining and tracking alone is terrible. Then we get into banning, shadow banning, dethrottling, suppressing information, etc.
Or being used to spread misinformation and as a forum for terrorists to recruit more terrorists and plan their attacks.
Yes Zuck, save us from ourselves!
Pathetic...
They do have the ability to undo changes ... at the level of the management platform. When the management platform can no longer reach the device it manages, manual intervention is needed to restore that communication; only then can the configuration be rolled back to what it was before the change. They also lost external access to the management platform, but that's simple enough to fix (if somebody doesn't confirm the change, roll it back automatically). Restoring access from the management platform to the managed devices is the harder problem.
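A rough sketch of that confirm-or-revert idea, similar in spirit to a router's commit-confirmed mode; the device object and its methods here are assumptions, not any real management platform's API:

```python
# Apply a configuration change, then revert it automatically unless an
# operator confirms it within the timeout window.
import threading


def commit_confirmed(device, new_config, previous_config, timeout_s=300):
    """Apply new_config, but revert automatically unless confirm() is called."""
    device.apply_config(new_config)

    def revert():
        # No confirmation arrived in time: assume the operator lost access
        # (or the change broke something) and restore the old configuration.
        device.apply_config(previous_config)

    timer = threading.Timer(timeout_s, revert)
    timer.start()

    def confirm():
        # Operator verified the change from outside the management plane.
        timer.cancel()

    return confirm
```

The caveat is the one raised above: the automatic revert only helps if the path used to push the rollback can still reach the device.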
Good information. That makes sense. So they basically severed the primary communication link between the management platform and the managed devices. What they also needed was a remote out-of-band communication channel, or, as they probably ended up doing, a bunch of engineers with laptops and physical security keys to bypass the badge readers protecting the servers. Interesting cascading failure mode.
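For illustration, a small sketch of that kind of fallback, where the in-band management address is tried first and a separate out-of-band console network is used if it is unreachable (hostnames and ports are placeholders):

```python
# Prefer the in-band management path; fall back to an out-of-band channel
# that lives on physically separate infrastructure.
import socket


def reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Basic TCP reachability probe."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def pick_management_path(inband_host: str, oob_host: str, port: int = 22) -> str:
    """Return whichever management address is currently usable."""
    if reachable(inband_host, port):
        return inband_host
    # In-band path is down (e.g. routes withdrawn); use the OOB channel.
    return oob_host
```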