(updated Oct 5, 2021, 9:30AM PDT)
On Monday, Facebook and its sister networks Instagram and WhatsApp suffered their largest outage since 2008. By midday, The Verge speculated that DNS had caused the problem, and referred back to Slack’s outage last week to claim that “it’s always DNS.” We’re not going to speculate on what caused Facebook’s misfortune, but we will answer some of the most common questions about DNS.
What is DNS?
DNS stands for Domain Name System, which is akin to the internet’s phone book. More precisely, it is a single worldwide, massively distributed, and highly scaled application that implements that phone book.
Let’s break that down. Normally, systems on the internet, whether they are running websites, email systems, or other applications such as Facebook, need an address. Physically, the address is an “IP address,” and it is a series of numbers that looks something like this: “192.168.15.21.” These are too hard for people to remember, so DNS was invented to give the address a name that’s easier to read, like “www.mywebsite.com.” This is called a domain name.
The easiest way to think about it is that the domain name is the human-readable name, and the IP address is the real address of the website or application.
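To make the name-to-address idea concrete, here is a minimal Python sketch that asks the operating system's resolver for the IP addresses behind a name. The helper name `resolve` is my own for illustration; `socket.getaddrinfo` is the standard library call doing the work.

```python
import socket

def resolve(name):
    """Return the distinct IP addresses the system resolver reports for `name`."""
    # getaddrinfo returns tuples of (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the address as a string.
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    addresses = []
    for info in infos:
        address = info[4][0]
        if address not in addresses:
            addresses.append(address)
    return addresses
```

Calling `resolve("www.mywebsite.com")` on a machine with internet access would return whatever addresses DNS currently lists for that name, or raise `socket.gaierror` if the name cannot be resolved.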
How does DNS work?
DNS, which has been around since the 1980s—long before the internet was popularly and widely deployed—is like a phone book. It maps the human-readable name of a system to the actual IP address, just like your phone book maps the name of a friend to their actual phone number. If I want to call my friend John Smith, I look him up in my phone book, and find that his number is 425-555-1234. Dialing the number makes the call.
Now, what happens if someone or something changes your phone book, either by mistake or intentionally? What happens if they change the phone number of John Smith to 425-555-4321? If you try to dial John Smith, you’ll either get the wrong person answering, or you’ll get a message saying that the number is out of service.
That’s what happens when a DNS entry gets corrupted. If someone is trying to contact “facebook.com” and the IP address in the DNS phone book is incorrect, you can’t connect to Facebook.
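The phone book analogy can be sketched as a toy model in Python. Everything here is hypothetical and only for illustration: the `phone_book` dictionary stands in for DNS, `live_hosts` stands in for the machines that actually exist, and the names and addresses are made up. Real DNS is a distributed system, not a dictionary.

```python
# Toy "phone book": a hypothetical name mapped to a hypothetical address.
phone_book = {"www.mywebsite.com": "192.168.15.21"}

def lookup(book, name):
    """Return the address recorded for `name`, or None if there is no entry."""
    return book.get(name)

def connect(book, name, live_hosts):
    """Look the name up, then try to 'dial' the address that came back."""
    address = lookup(book, name)
    if address is None:
        return "no such name"
    if address not in live_hosts:
        # The book answered, but the answer points at nothing.
        return f"{address} is unreachable"
    return f"connected to {address}"
```

With `live_hosts = {"192.168.15.21"}`, the original book connects. If the entry is changed to `"192.168.15.99"`, the lookup still succeeds, but the connection fails because the address is wrong. The website itself never changed.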
Who owns or runs DNS?
There is no single owner of DNS. Each organization manages the records for its own domain names, and the shared infrastructure above them (the root and top-level domain servers) is operated by a number of independent organizations around the world. With no single point of failure, DNS is a tribute to the concepts of “architected for scale” and “architected for high availability.”
But how do DNS entries get changed?
DNS is highly reliable and rarely has issues beyond small, localized problems. Massive worldwide issues are extremely rare. Still, there are many ways a DNS entry can get changed. It could be changed accidentally or intentionally, by someone inside the company or by an outside bad actor. It could be a valid change that went bad, or it could simply be a mistake.
As of right now (2:00 p.m. Pacific Time on Monday, Oct. 4), we just don’t know. But we will find out more as time goes on, and we’ll update this post once we have more information. Stay tuned.
Update: It appears that an incorrect change to the BGP routing rules was potentially the root of the DNS problem. I know, this is a nerdy answer. BGP is a protocol used on the internet to determine how internet traffic is routed from one location to another. In other words, it’s the thing that routes the IP address to the right device. Or in our analogy above, it is the thing that says that the phone number “425-555-1234” is to go to this specific phone. In the case of BGP, a change by Facebook to the BGP routing rules probably prevented DNS from working, causing the DNS entries for Facebook to effectively disappear from the internet.
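To see how a routing change can make DNS entries effectively disappear even though the entries themselves are untouched, here is a toy Python model. It is a sketch under heavy assumptions, not how BGP actually works: the `RouteTable` class, the `resolve_via` helper, the `site-edge` label, and the name `www.example.test` are all hypothetical, and the addresses come from ranges reserved for documentation. Real BGP involves peering sessions and path selection that are omitted here.

```python
import ipaddress

class RouteTable:
    """Toy routing table: announced prefixes mapped to a next-hop label."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def next_hop(self, address):
        """Pick the most specific announced prefix containing `address`."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the address is unreachable
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]

# Hypothetical name server address and the records it holds.
NAME_SERVER = "192.0.2.53"
RECORDS = {"www.example.test": "198.51.100.10"}

def resolve_via(table, name):
    """Answer a lookup only if the name server can be reached at all."""
    if table.next_hop(NAME_SERVER) is None:
        return None  # records still exist, but nobody can ask for them
    return RECORDS.get(name)
```

After `announce("192.0.2.0/24", "site-edge")`, a lookup for `www.example.test` returns its address. After `withdraw("192.0.2.0/24")`, the same lookup returns `None`, even though `RECORDS` never changed. The phone book is intact, but the road to it is gone.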
Bottom line: the root of the problem was BGP, not DNS. But DNS was the most visibly impacted piece at the time. However, even if DNS hadn’t failed, Facebook would still have been down, since the BGP problem effectively took Facebook off the internet.
System problems often have multiple layers, as this one did. The details aren’t really important here (unless of course you are working in the low-level networking space, in which case they matter immensely). Suffice it to say that at the highest level, it’s the same core issue: it appears to have been a human error of some form at Facebook. Whether that human error was an accident or intentional will only be known as deeper investigations occur inside Facebook. At this point, there is absolutely no reason to assume it wasn’t just a bad accident.
Why did it take so long to fix the problem?
The reason it took so long for them to recover is that the problem took down everything at Facebook, including the tools needed by the teams that were trying to fix the problem. Oops. This actually is a common problem that many enterprises fail to sufficiently plan for, and it’s something I talk about in my book Architecting for Scale in chapter 2 titled “Two Mistakes High–Having Room to Recover from Mistakes”. This chapter is primarily about controlling availability in highly scaled web applications, such as Facebook. If you want to avoid long duration problems, such as the Facebook problem, you need to plan “Two Mistakes High” in all aspects of your application architecture.