On 4 October, Facebook (FB) – along with Instagram, WhatsApp, and Messenger – went down globally. Some have speculated that the timing was quite interesting, since the FB whistleblower interview had aired on 60 Minutes the night before. Conspiracy theories have ranged from a disgruntled employee to a cyber-attack, such as a distributed denial-of-service (DDoS) attack, all tied back to the whistleblower’s comments. For the time being, FB is stating that the outage was due to a routine BGP update gone wrong, which in turn took its DNS infrastructure offline. More details usually come out post-incident, but that’s the story FB is sticking with.
Border Gateway Protocol (BGP) is the mechanism by which the Internet service providers (ISPs) of the world share information about which providers are responsible for routing traffic to which specific groups of Internet addresses. Today, essentially every network of global scale runs BGP as both its external and internal routing protocol. Both BGP and DNS allow an organization to control how traffic reaches its systems. Essentially, FB’s BGP update withdrew all of its BGP routes – no packets could reach FB’s servers because there were no announced routes to them.
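To make that failure mode concrete, here is a minimal sketch – not FB’s actual routing code – of the longest-prefix-match lookup a router performs against the routes it has learned. The route table and peer names are illustrative; 157.240.0.0/16 merely stands in for an FB-announced block.

```python
import ipaddress

def longest_prefix_match(routes, addr):
    """Return the next hop for the most specific prefix containing addr,
    or None when no route exists (the packet has nowhere to go)."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in routes if ip in net]
    if not matches:
        return None
    return routes[max(matches, key=lambda net: net.prefixlen)]

# Toy route table: announced prefix -> next hop, as a peer router
# might learn it over BGP sessions.
routes = {
    ipaddress.ip_network("157.240.0.0/16"): "peer-fb",
    ipaddress.ip_network("198.51.100.0/24"): "peer-other",
}

print(longest_prefix_match(routes, "157.240.1.35"))  # a route exists

# The bad update effectively withdraws FB's announcements...
del routes[ipaddress.ip_network("157.240.0.0/16")]

# ...and with no matching prefix left, packets to FB are simply dropped.
print(longest_prefix_match(routes, "157.240.1.35"))  # None
```

Note that the withdrawal also cut off FB’s own authoritative DNS servers, which sit inside those same address blocks – so name resolution failed alongside packet delivery.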
Additionally, some of FB's internal systems were also knocked offline, adding another obstacle to the company's efforts to get its services working again. For example, employee keycards reportedly wouldn't work, preventing the FB engineers working on the outage from gaining physical access to the facilities they needed.
FB quickly apologized to users and investors alike for its “worst ever downtime”, and it spent last week in full crisis-management mode trying to minimize the monetary impact of this disaster (as well as the impact of the whistleblower’s testimony to Congress). The revenue impact of the outage is difficult to ascertain, but there have been attempts to extrapolate FB’s losses. Assuming the services were down for close to 14 hours, the loss to FB revenue can be put somewhere in the ballpark of $90 million.
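As a back-of-envelope check, using only the article’s own figures (not FB’s actual financials, which vary by region and time of day):

```python
# Ballpark figures from the estimate above -- not official FB numbers.
outage_hours = 14
estimated_loss_usd = 90_000_000

loss_per_hour = estimated_loss_usd / outage_hours
print(f"~${loss_per_hour / 1e6:.1f}M of revenue per hour")  # ~$6.4M
```

That implied $6.4M-per-hour run rate is why even a few hours of downtime moves the needle at this scale.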
Since the FB outage appears to be a configuration management or human error issue and not a cyber-attack, why, you may ask, am I writing about it?
FB has built a scalable, reliable, global service to support the more than 3.5 billion people across the world who use its apps. These are not used just to share gossip and watch cat videos (though an occasional cat video is fun to watch). These applications are hardwired into how we communicate, how businesses operate, and how we access other services (via OAuth single sign-on), and in some parts of the world – for example India and much of Southeast Asia – FB is synonymous with the Internet.
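To see how tight that single-sign-on coupling is, consider “Log in with Facebook”. The sketch below builds the first redirect of an OAuth flow; the client ID and redirect URI are hypothetical placeholders. The point is that the user’s browser must reach facebook.com before anything else happens – when FB is down, every app that delegates sign-in this way locks its users out too.

```python
from urllib.parse import urlencode

# Hypothetical app registration -- placeholder values, not a real app.
CLIENT_ID = "1234567890"
REDIRECT_URI = "https://example.com/auth/callback"

def facebook_login_url(state: str) -> str:
    """Build the authorization redirect that starts a 'Log in with
    Facebook' OAuth flow. The browser is sent to facebook.com, so the
    entire login depends on FB being up and routable."""
    params = {
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "state": state,             # anti-CSRF token, echoed back on return
        "response_type": "code",
        "scope": "public_profile,email",
    }
    return "https://www.facebook.com/dialog/oauth?" + urlencode(params)

print(facebook_login_url("opaque-csrf-token"))
```

With FB’s routes withdrawn, that first redirect never resolves, and the relying app has no fallback unless it also offers its own credentials.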
The outage has put a spotlight on the complex network of functions and services reliant on the availability and resilience of a single service provider. According to the New York Times, users reported being unable to access Internet-connected smart devices like smart TVs and thermostats – devices not provided by FB but accessed via FB credentials. FB and Instagram are part of the economic fabric too: businesses around the world that rely on FB platforms to drive orders essentially ceased to trade while the platforms were offline.
FB is not unique in its complexity. Most large enterprises operate hybrid architectures that have grown organically over time, resulting in this kind of complexity-driven vulnerability. Enterprises now run thousands of applications across thousands of workloads, making it incredibly difficult to identify, resolve, and ultimately prevent an issue.
Last week’s outage shows how easy it is for enterprises like FB to fail on a global scale, with wide-reaching ripple effects. These are the kinds of outages that trouble regulators and are the motivation behind potential new regulations governing resilience, especially for businesses that serve critical infrastructure.