The Facebook, Instagram and WhatsApp Outage: how it happened, and what to do when your online life vanishes

The three pillars of Facebook crumbled on Monday. Some users screamed ‘Hallelujah!’ Others realized how many of their social connections rely on a big company’s servers.

But more than The Wall Street Journal’s Facebook Files series revealing the impact of the social network on society, democracy, children and more, yesterday’s outage made a point clearer than ever … one giant tech company controls the bulk of our online social lives. Just look at how much of our data and personal connections run through Mark Zuckerberg’s empire. This isn’t about missing a few cute cat photos or the latest celebrity gossip: For many, the bulk of their interpersonal communication – essential contact lists, chats and groups – is maintained in channels controlled by Facebook Inc.

 

5 October 2021 (Milos, Greece) – Facebook went down yesterday. It took Instagram and WhatsApp along with it. Perhaps you heard.

Starting around 9AM PT, and lasting until around 2:30PM, the company’s services were completely inaccessible around the world. The challenge of rebooting a global communications network meant that connectivity issues persisted for a while afterwards. The issues appeared to relate to the Domain Name System (DNS), a kind of phonebook that helps your browser find websites, and the Border Gateway Protocol (BGP), which networks use to tell each other how to reach them. Yesterday, Facebook’s BGP routes disappeared, so its DNS servers could no longer be reached.

NOTE: last night I was able to follow the doings through Twitter, following Brian Krebs, Doug Madory (the brilliant internet analyst at Kentik) plus a few others who were tweeting and drawing on their “trusted sources”, who were spot-on, plus a long-time contact of mine at Facebook who, though a bit lower level, also had sterling information.

As I will explain in a bit more detail below, any browser that went looking for a Facebook service could no longer find it. The services still existed, but the series of tubes that would normally take you there had ceased to function. Or as explained on one social media site:

Basically, the outage was caused by changes made to the Facebook network infrastructure. Many of the recent high-profile outages you’ve read about (including Facebook) have been caused by similar network-level events. The network changes also prevented engineers from remotely connecting to resolve the issues, delaying resolution, and employees from entering data centers (the latter confirmed by my Facebook contact).

NOTE: As I have written before, organizations now define their physical infrastructure as code, but do not apply the same level of testing rigour when they change that code as they would when changing their core business logic. So, crash and burn. Happens every day.

Of course, Mark was on the case:

And according to Reuters, with Facebook’s stock price going down 6% yesterday, taking the total decline over the past month to 15%, it has cost Mark Zuckerberg $7 billion – barely a blip in his total wealth, hardly an inconvenience to him financially in any way.

But it still astounds me how many people (including many of my colleagues who “do tech”) still do not understand that the internet is held together with bubblegum and string. Also duct tape, badly spliced copper, poorly run fiber lines, sometimes carrier pigeons, as well as the tears of many, many support engineers. Although Crazy Glue has made it into the toolbox. At a minimum … a minimum … I urge you to read Tubes: A Journey to the Center of the Internet and follow Brian Krebs and Steve King. Oh, there is lots more you can read and folks to follow. Email me and I’ll send you my reading/follow list.

And of course there were the conspiracy theories: a cyber attack of some sort; a false flag operation designed to distract from 60 Minutes; something related to a data breach; or Facebook using it to erase more incriminating evidence. As best as anyone can tell, none of these are true. The primary cause? Here is the one I loved:

So what you had was a case of Facebook network engineers pushing a configuration change that inadvertently locked them out, meaning that the fix had to come from data center technicians with local, physical access to the routers in question. Complicating matters was the fact that the incident took down many of Facebook’s internal systems, including the ones that power office badges, in some cases preventing engineers from accessing physical servers.

Or, in bullet point fashion:
• Facebook’s routers mistakenly announced to the rest of the Internet (which is itself just a bunch of networks) that the Facebook network was no longer online; this meant that none of Facebook’s hostnames could resolve to an IP address.
• Facebook’s internal network is entirely self-contained; this meant that Facebook employees could no longer access anything inside of the Facebook network, including the routers with the bad config file.
• Facebook engineers had to gain physical access to the routers to reset their configuration, which was made extra challenging by the fact that things like badges no longer worked to give access.
It’s Hanlon’s Razor: never attribute to malice that which is adequately explained by human error or stupidity.

Let’s dig a bit into the tubes and bubblegum and string.

Yesterday’s events were a gentle reminder that the Internet is a very complex and interdependent system of millions of systems and protocols working together, and that trust, standardization, and cooperation between entities are at the center of making it work for almost five billion active users worldwide.

Facebook published a blog post giving some details of what happened internally. Externally, we saw the BGP and DNS problems (I’ll get to those two terms shortly) outlined in its post, but the problem actually began, as I noted above, with a configuration change that affected the entire internal backbone. That cascaded into Facebook and its other properties disappearing, and into Facebook’s own staff having difficulty getting service going again.

The company’s family of apps effectively fell off the face of the internet at 11:40 am ET, according to when its Domain Name System records became unreachable. DNS is often referred to as the internet’s phone book; it’s what translates the host names you type into your browser’s address bar – like facebook.com – into IP addresses, which is where those sites live.
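
For the tinkerers, here is a minimal Python sketch of that phone-book lookup. It leans on the standard library's socket.getaddrinfo; the resolve helper and its lookup parameter are names I made up for illustration, not anything taken from Facebook's post.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or [] if the lookup fails.

    `lookup` is a hypothetical hook (defaulting to the real resolver) so the
    failure case can be simulated without waiting for an actual outage.
    """
    try:
        results = lookup(hostname, None)
    except socket.gaierror:
        # The name could not be turned into an address: no phone-book entry.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first item of sockaddr.
    return sorted({info[4][0] for info in results})
```

On an ordinary day, resolve("facebook.com") should hand back one or more addresses. While the DNS records were unreachable, a lookup like this would most likely have come back empty.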

DNS mishaps are common enough, and when in doubt, they’re the reason why a given site has gone down. (Too often, too many people scream “HACK!!”) They can happen for all kinds of wonky technical reasons, often related to configuration issues, and can be relatively straightforward to resolve. In this case, though, something more serious was afoot.

Facebook’s outage showed up as a DNS failure; however, that’s just a symptom of the problem. The fundamental issue is that Facebook had withdrawn the Border Gateway Protocol (BGP) routes that cover the IP addresses of its DNS nameservers. If DNS is the internet’s phone book, BGP is its navigation system; it decides what route data takes as it travels the information superhighway. Last night, on Twitter, Angelique Medina, director of product marketing at the network monitoring firm Cisco ThousandEyes, explained it this way:

“You can think of it like a game of telephone, but instead of people playing, it’s smaller networks letting each other know how to reach them. They announce this route to their neighbor and their neighbor will propagate it out to their peers.”
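
That game of telephone can be sketched in a few lines of Python. This is a toy model and not real BGP, which also applies policies and chooses among competing paths; the network names and the propagate function are invented for illustration.

```python
from collections import deque


def propagate(neighbors, origin):
    """Spread a route announcement from origin, neighbor to neighbor.

    Returns a dict mapping each network that heard the announcement to the
    chain of networks it would pass traffic through to reach the origin.
    """
    paths = {origin: [origin]}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for peer in neighbors.get(current, ()):
            if peer not in paths:  # first announcement heard wins in this toy
                paths[peer] = [peer] + paths[current]
                queue.append(peer)
    return paths


# Hypothetical topology: who is directly connected to whom.
neighbors = {
    "facebook": ["transit-a", "transit-b"],
    "transit-a": ["facebook", "isp-1"],
    "transit-b": ["facebook", "isp-1", "isp-2"],
    "isp-1": ["transit-a", "transit-b"],
    "isp-2": ["transit-b"],
}
paths = propagate(neighbors, "facebook")
print(paths["isp-2"])
```

If the origin never announces, or takes its announcement back, nobody downstream has a path to it, which is the situation described next.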

It’s a lot of jargon, and you need to spend some time with it to “get it”, but it is easy to put plainly: Facebook had fallen off the internet’s map. If you had tried to ping those IP addresses, the packets ended up in a black hole. Lost. Gone.
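
To see why the packets vanish, here is a rough Python sketch of a router's table using longest-prefix matching. The RoutingTable class, the next-hop label and the 198.51.100.0/24 prefix (a range reserved for documentation) are all assumptions of mine, not Facebook's actual addresses or anyone's real router code.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps announced prefixes to a next-hop label."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # Withdrawing a route that is not present is silently ignored.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the next hop for the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the packet has nowhere to go
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("198.51.100.0/24", "toward-facebook")
before = table.lookup("198.51.100.53")  # a hypothetical nameserver address
table.withdraw("198.51.100.0/24")
after = table.lookup("198.51.100.53")
print(before, after)
```

Before the withdrawal the lookup finds a next hop; afterwards it finds nothing, even though the server at that address may be running perfectly well.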

The obvious and still unresolved question was why those BGP routes disappeared in the first place. It’s not a common ailment, especially at this scale or for this duration. During the outage, Facebook didn’t say much beyond a tweet that it was “working to get things back to normal as quickly as possible.” After service came trickling back late Monday night and this morning, the company finally issued a statement, but it lacked any real technical detail, so the pipes-tubes-string-and-glue experts who know how this stuff works opined on Twitter.

As I noted above, all the internet infrastructure experts I spoke with last night and this morning told me it was all due to misconfiguration on Facebook’s part. Facebook did something to their routers, the ones that connect the Facebook network to the rest of the internet. The internet is essentially a network of networks, each advertising its presence to the other. For once, Facebook has stopped advertising.

Which also means that more than just Facebook’s external services were affected. You could not use “Login with Facebook” on third-party sites, for instance. And since the company’s own internal networks couldn’t reach the outside internet, its employees reportedly were not getting much done yesterday either.

This was the reason it took so long to get back up and running. Brian Krebs, in a Tweet, reminded us that in 2019 a Google Cloud outage prevented Google engineers from getting online to fix the Google Cloud outage keeping them offline. Facebook was stuck in a similar catch-22, unable to reach the internet to fix the BGP routing issue that would let it reach the internet.
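
That catch-22 is, at heart, a loop in a dependency graph, and loops like it can be hunted for before they bite. Here is a minimal Python sketch; the dependency names are my own simplification of the story as told above, and find_cycle is a hypothetical helper, not a tool Facebook or Google uses.

```python
def find_cycle(depends_on, start):
    """Follow dependencies depth-first from start.

    Returns the first loop found as a list that begins and ends with the
    same item, or None if there is no loop reachable from start.
    """
    path, finished = [], set()

    def visit(node):
        if node in finished:
            return None
        if node in path:
            return path[path.index(node):] + [node]
        path.append(node)
        for dep in depends_on.get(node, ()):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        finished.add(node)
        return None

    return visit(start)


# Assumed, simplified picture of what each step needed in order to happen.
depends_on = {
    "fix router config": ["remote access", "physical access"],
    "remote access": ["working network"],
    "physical access": ["badge system"],
    "badge system": ["working network"],
    "working network": ["fix router config"],
}
cycle = find_cycle(depends_on, "fix router config")
print(" -> ".join(cycle))
```

The way out of a loop like this is a path that does not depend on the thing that is broken, which in this story meant people physically getting to the routers.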

Meanwhile, the rest of the internet felt Facebook’s absence. And talk about addiction. Last night, in the U.S., despite all that is happening in the world, the Facebook breakdown was the lead story on every news network. Well, maybe addiction is not the key word here. More like global integration. As I noted above, so many websites and apps use Facebook’s backend APIs to let you log in to their systems that the “integration” is enormous.

“The Great Social Outage of 2021” isn’t without its lessons. Not only can we better prepare ourselves for the next one, but we can also identify and manage where our data lives, and “spread the love” of our personal networks across more services. I’m not on Facebook so most of this does not apply to me.

#1 Keep Your Own Contacts

Facebook has become the telephone book (and birthday calendar) of the Internet. It’s time to stop depending on someone’s Facebook or Instagram profile as a resource of information. Instead, we should refresh and maintain our own contact databases with the phone numbers and email addresses of people we care about. Perhaps you already do this, but there are lots of teens who depend on Insta DMs.

Now, I realize there’s some irony in recommending this option because the solutions tend to rely on Apple, Microsoft and Google – three even bigger tech giants – and the cloud, that mysterious invisible thing that failed for Facebook. But on an iPhone, when you sync contacts using Apple’s iCloud, or with Microsoft or Google, you can generally access important contact info, even if there’s an outage or your device has no internet access. Me? I have all that data in hard copy, sitting on my desk and in a safe.

Another thing you should all do now that Facebook is back online: download your Facebook data, including a list of Facebook friends – this might include alternative contact information if they have listed it. (Here are the company’s instructions.) You can do the same for Instagram and WhatsApp. Even offline, my wife was able to at least check out her WhatsApp chats and contacts and access certain information including phone numbers.

#2 Bring Back Email and Texting

There’s a lot wrong with those ancient forms of communication, the email and text message, but at least they are reliable private messaging alternatives. Yes, Gmail can occasionally go down, but most people I know have at least two email addresses, and you can always make extra.

The phone-number part is more crucial to our messaging future. That string of 10 digits has become our main username across other messaging apps, including Signal. Instead of creating specific usernames, the phone number allows you to log in and then see if you know others on the service.

This is a good moment to set up alternative messaging apps, and not only to diversify where you communicate. For one thing, experts recommend Signal because it’s considered more secure and private.

For larger group chats, one up-and-coming service is Discord. While once primarily for gamers, it’s expanded to host groups with a range of interests.

#3 Take Bloody Breaks

Here comes the totally-cliché-part-of-a-social-media-advice-column, the part where I impart the true lesson of the outage: Maybe it’s time to spend less time on Facebook, Instagram and WhatsApp, and more time doing, well, anything else. You probably do not have the kind of free time I do, but goddamn – take some time off!

Anyway, it’s best to prepare for the next big outage. You know it’s coming.
