Fastly CDN Server Issue Brings Down Swathes of Websites [Update: Solved]

ururk · Jun 8, 2021

Stephen.R said:
Which amazon site was affected? (Not being facetious I don't really use most of the sites listed in the original post so I don't know which of them may be Amazon owned)

Amazon.com images weren't loading in the iOS app, some actions had no effect (clicking on a button did nothing) and at a minimum their .com website was not loading css files and images, but probably also js files and other assets. I'm a bit surprised, actually, that they weren't using their own cloud services for CDN (or do they own a stake in Fastly?).

---

Aside from Amazon, a firebase-based site I run was also down during this time. Using Fastly for Google might be a holdover from the original owners, but it surprised me. I could have sworn they used a different CDN provider.

JosephAW · Jun 8, 2021

I wonder if sites will have to recache with Fastly?

Stephen.R · Jun 8, 2021

ururk said:
Amazon.com images weren't loading in the iOS app, some actions had no effect (clicking on a button did nothing) and at a minimum their .com website was not loading css files and images, but probably also js files and other assets. I'm a bit surprised, actually, that they weren't using their own cloud services for CDN (or do they own a stake in Fastly?).

---

Aside from Amazon, a firebase site we run was also down during this time. Using Fastly for Google might be a holdover from the original owners, but it surprised me. I could have sworn they used a different CDN provider.

I'd be surprised if Amazon isn't using all AWS, from a dogfooding POV, but from a "we just want to sell stuff" POV the smart setup would be to use several. It's possibly just a coincidence, or it may be more nuanced. Sellers may choose to use images they host themselves (and thus might be served by a different CDN), or the site itself may be using one of the "JS CDN" services, which could also be backed by Fastly.

The firebase thing is curious, but given how rube-goldbergy most "cloud" offerings from Amazon/Google/Microsoft are getting, it's not impossible that they had their own separate outage at the time.

kc9hzn · Jun 8, 2021

LiE_ said:
Can someone explain how this has killed the internet? What is fastly?

It’s a content distribution network. What that means is that (let’s use Apple as an example and the CDN it uses, Akamai, and let’s say you live in NY state) instead of your HTTP request for Apple.com traveling to Apple servers in an Apple-owned data center or in Cupertino, your request goes to an Akamai server that might be located in your metro area (NYC or the NJ part of the NYC metro) or at least in a closer metro area than the nearest Apple data center. It also means that the content is cached, so, if the site goes down, you can still get the last content update at the URL you’re visiting.

The reason so many sites break because of this issue is that (to continue the Apple example), the Apple.com domain doesn’t point to Apple servers, it points to Akamai servers, at least when caching is on. In other words, it’s a lot like when an individual site goes down, but, because it’s a CDN, it brought down all the sites depending on the CDN.
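To make the caching idea concrete, here is a minimal Python sketch of what an edge server does. It is a toy under stated assumptions, not how Akamai or Fastly actually implement anything: `EdgeCache`, `fetch_from_origin`, and the 60-second TTL are all made-up names and values for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Toy CDN edge node (hypothetical): caches origin responses per URL."""

    def __init__(
        self,
        fetch_from_origin: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        self._clock = clock
        # url -> (time the copy was stored, response body)
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            # Fresh hit: answered from the edge, the origin is not contacted.
            return cached[1]
        try:
            body = self._fetch(url)
        except ConnectionError:
            if cached is not None:
                # Origin is down: serve the last copy we have, even if stale.
                return cached[1]
            raise
        self._store[url] = (now, body)
        return body
```

Note what the sketch does not cover: if the edge itself fails, as happened with Fastly, there is nothing left to answer the request, because the site's domain points at the edge rather than at the origin.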

jonnysods · Jun 8, 2021

It really is crazy how delicate our entire internet ecosystem is. Things like this just bring that to light!

ururk · Jun 8, 2021

Stephen.R said:
I'd be surprised if Amazon isn't using all AWS, from a dogfooding POV, but from a "we just want to sell stuff" POV the smart setup would be to use several. It's possibly just a coincidence, or it may be more nuanced. Sellers may choose to use images they host themselves (and thus might be served by a different CDN), or the site itself may be using one of the "JS CDN" services, which could also be backed by Fastly.

It shocked me but I didn't have enough time to investigate the image origins; I might look into it later today. I know they used to host their own images (they ran them through a routine to dynamically add badges, resize, etc.). I suppose they could use Fastly in the background for their edge services that do the resizing (i.e., I suppose Fastly could be "supplying" the raw images)... that being said, who knows. An S3->Lambda->S3->CloudFront sort of arrangement would make more sense.

Stephen.R said:
The firebase thing is curious, but given how rube-goldbergy most "cloud" offerings from Amazon/Google/Microsoft are getting, it's not impossible that they had their own separate outage at the time.

Quite possible - Firebase had some downtime a week or two ago, and perhaps this was just coincidence... but other than that other time and this morning, its uptime for me has been stellar.

ripe_banana · Jun 8, 2021

******* was not affected, so a big chunk of the internet remained intact after all...

MarkC426 · Jun 8, 2021

The problem was too many people charging their cars at once, blew the power grid...... 😆

btrach144 · Jun 8, 2021

Stephen.R said:
Because what passes for 'good architecture' in modern ops/infra practices tends to be a list of tickboxes with "are we using an external service provider and thus don't need to hire anyone experienced in this technology?" next to them all.

You have no idea what you’re talking about. Quit trying to armchair online services.

Stephen.R · Jun 8, 2021

btrach144 said:
You have no idea what you’re talking about

What a compelling counter point you make.

Krizoitz · Jun 8, 2021

Stephen.R said:
I love how every time I point out that relying on single external vendors for something is a problem, someone with zero ****ing clue about the topic - in this case infrastructure in general, or load balancing and failover in particular - gets an internet boner about telling me I'm wrong.

You are still wrong. If the cost of maintaining an entirely separate failover to a separate CDN network exceeds the cost of the downtime, it would be a terrible business decision to pay for that. You're arguing that websites should DOUBLE their CDN costs to pay for a service they would use, what, for maybe 30 minutes or less in an entire year (or more) to reduce the odds of downtime? That's insane.

There is no way to guarantee 100% uptime. You can add redundancy after redundancy and make the chances get better and better, but it's still never going to be 100%. Something catastrophic can happen. A rational business balances the cost of uptime vs. the odds of downtime.

So yeah, you are wrong. Sometimes a single vendor is the wise financial and strategic decision. You might want to look up something called diminishing returns.
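That trade-off can be written down as a simple break-even check. This is only a sketch: the function name and every figure you would feed it are hypothetical, and it ignores reputational damage, SLA credits, and anything else that is hard to price.

```python
def redundancy_pays_off(
    outage_hours_per_year: float,
    loss_per_outage_hour: float,
    failover_cost_per_year: float,
    fraction_of_outage_avoided: float = 1.0,
) -> bool:
    """Return True if the loss a backup CDN would avoid exceeds what it costs.

    All inputs are estimates the business has to supply; none come from
    real Fastly or customer data.
    """
    if not 0.0 <= fraction_of_outage_avoided <= 1.0:
        raise ValueError("fraction_of_outage_avoided must be between 0 and 1")
    avoided_loss = (
        outage_hours_per_year * loss_per_outage_hour * fraction_of_outage_avoided
    )
    return avoided_loss > failover_cost_per_year
```

With half an hour of outage a year, a site losing 10,000 an hour avoids at most 5,000 in losses, so a failover setup costing 50,000 a year does not pay off; a site losing 1,000,000 an hour avoids up to 500,000, and the same setup does.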

subjonas · Jun 8, 2021

ruka.snow said:
Fastly is one of the big CDNs on the internet. Basically they give websites DDoS protection, cache assets around the world, and allow webpages to be cached (with smaller calls to the server for dynamic areas). Almost all of the traffic on the internet goes through Fastly and Cloudflare. Then you have the big old one: Akamai.

The only part of this I kind of understood was “allow webpages to be cached”

senttoschool · Jun 8, 2021

subjonas said:
The only part of this I kind of understood was “allow webpages to be cached”

You want fast performance and low load for your servers. Fastly lets you easily do that by caching your server's responses across geolocations.

Many companies also store things like streaming videos on CDNs.

subjonas · Jun 9, 2021

senttoschool said:
You want fast performance and low load for your servers. Fastly lets you easily do that by caching your server's responses across geolocations.

Many companies also store things like streaming videos on CDNs.

Thanks, that helped. I don’t know what CDN stands for but I suppose I’ll learn that when I need to lol.

Stephen.R · Jun 9, 2021

Krizoitz said:
You are still wrong. If the cost of maintaining an entirely separate failover to a separate CDN network exceeds the cost of the downtime, it would be a terrible business decision to pay for that. You're arguing that websites should DOUBLE their CDN costs to pay for a service they would use, what, for maybe 30 minutes or less in an entire year (or more) to reduce the odds of downtime? That's insane.

Most CDNs charge primarily based on transfers - some combination of number of requests, and bytes transferred.

So, if you're using them purely in an active/warm 'fail over' setup - that is, you use one all the time, and the other only when the first has a failure, you're maybe paying the minimum account fee on the second service, but for most that's going to be $0 or close to it.

If you're using them in an active/active 'load sharing + fail over' setup - that is, traffic is delivered by both of them, and re-routed to just one if the other has failures, you're paying half the bandwidth costs for each of them.
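A rough worked example of that arithmetic, as a sketch only. It assumes pure per-GB pricing with no volume discounts or committed-spend contracts, and every number and name here (`monthly_cdn_cost`, the 0.10 per GB price, the 10,000 GB of traffic) is hypothetical.

```python
def monthly_cdn_cost(
    total_gb: float,
    price_a_per_gb: float,
    price_b_per_gb: float,
    share_on_b: float,
    minimum_fee_b: float = 0.0,
) -> float:
    """Combined monthly bill when traffic is split between CDN A and CDN B.

    share_on_b is the fraction of traffic served by B:
    0.0 models active/warm (B idle until A fails), 0.5 models active/active.
    minimum_fee_b is a hypothetical account minimum charged by B.
    """
    if not 0.0 <= share_on_b <= 1.0:
        raise ValueError("share_on_b must be between 0 and 1")
    cost_a = total_gb * (1.0 - share_on_b) * price_a_per_gb
    cost_b = max(total_gb * share_on_b * price_b_per_gb, minimum_fee_b)
    return cost_a + cost_b


if __name__ == "__main__":
    traffic_gb = 10_000  # hypothetical monthly traffic
    price = 0.10         # hypothetical price per GB, same for both CDNs
    print("active/warm:  ", monthly_cdn_cost(traffic_gb, price, price, share_on_b=0.0))
    print("active/active:", monthly_cdn_cost(traffic_gb, price, price, share_on_b=0.5))
```

Under these assumptions both setups bill the same total as a single CDN carrying all the traffic; the only extra is whatever minimum fee the second provider charges.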

Stephen.R · Jun 9, 2021

subjonas said:
Thanks, that helped. I don’t know what CDN stands for but I suppose I’ll learn that when I need to lol.

Content Delivery Network.

senttoschool · Jun 9, 2021

Stephen.R said:
Most CDNs charge primarily based on transfers - some combination of number of requests, and bytes transferred.

So, if you're using them purely in an active/warm 'fail over' setup - that is, you use one all the time, and the other only when the first has a failure, you're maybe paying the minimum account fee on the second service, but for most that's going to be $0 or close to it.

If you're using them in an active/active 'load sharing + fail over' setup - that is, traffic is delivered by both of them, and re-routed to just one if the other has failures, you're paying half the bandwidth costs for each of them.

No one does this.

First, when you're this big and at this scale, it's not simply cost/transfer. These companies have large multi-million dollar contracts. It's not $0. If it's $0, your company/app is too small to warrant backups.

Second, you can't just magically have your applications work with two or more CDNs. Each CDN has its own unique implementation that usually integrates deeply with the deployment and software stack. Having two CDNs would exponentially increase software development complexity. Businesses do not think this is worth it.

Third, the time that it takes you to re-route your application, get global DNS to populate new addresses, etc. is usually slower than for the CDN company to fix its issues. It took an hour for Fastly to fix this issue. You're not going to be able to just switch to a backup CDN that much quicker when you're at this scale.

Fourth, CDNs have plenty of redundancy built-in already. Whatever knocked out Fastly was likely a fluke incident.

Fifth, companies do have backup plans and some are better than others which is why some companies went back up faster. No company this big depends purely on Fastly.

Stephen.R · Jun 9, 2021

senttoschool said:
No one does this.

No company practices realistic HA. Ok then, sure.

senttoschool said:
First, when you're this big and at this scale, it's not simply cost/transfer. These companies have large multi-million dollar contracts. It's not $0. If it's $0, your company/app is too small to warrant backups.

If you're spending millions on a glorified distributed proxy server, the amount lost due to an hour's downtime is going to be significant.

senttoschool said:
Each CDN has its own unique implementation that usually integrates deeply with the deployment and software stack.

At its core a CDN is a caching proxy. Sure, they all add on other bells and whistles that you can use, just like AWS has its own specific internal IaaS stacks that you can use in place of installing a service.

You absolutely do not need to use those 'extra' add-ons to get the benefit of a CDN.

senttoschool said:
Third, the time that it takes you to re-route your application, get global DNS to populate new addresses, etc. is usually slower than for the CDN company to fix its issues.

The minimum recommended TTL on a DNS record is 30 seconds; below that value, some intermediary resolvers may ignore it. Some may still ignore a 30 second TTL too, but that's a big difference from 60 minutes.
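As a sketch of why a short TTL matters for DNS-based failover: the worst case is roughly the time to notice the primary is down plus the time for cached DNS answers to expire. The function names, hostnames, and check settings below are hypothetical, and resolvers that ignore the TTL will stretch the real number.

```python
from typing import Callable, Sequence


def pick_cdn(targets: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy target in priority order.

    Falls back to the primary (first entry) if nothing looks healthy,
    on the assumption that a broken answer beats no answer.
    """
    for target in targets:
        if is_healthy(target):
            return target
    return targets[0]


def worst_case_switch_seconds(
    check_interval_s: float, failed_checks_needed: int, dns_ttl_s: float
) -> float:
    """Rough upper bound: detection time plus one full TTL of cached answers."""
    detection = check_interval_s * failed_checks_needed
    return detection + dns_ttl_s


if __name__ == "__main__":
    # Hypothetical hostnames; pretend the primary just failed its health check.
    cdns = ["primary-cdn.example.net", "backup-cdn.example.net"]
    print(pick_cdn(cdns, lambda host: host != "primary-cdn.example.net"))
    print(worst_case_switch_seconds(10, 3, 30))
```

With a 10 second check interval, three failed checks before switching, and a 30 second TTL, that bound comes to about a minute for resolvers that honor the TTL.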

senttoschool said:
Fourth, CDNs have plenty of redundancy built-in already. Whatever knocked out Fastly was likely a fluke incident.

AWS has "plenty of redundancy" and sees regular massive outages due to "fluke" incidents.

senttoschool said:
Fifth, companies do have backup plans and some are better than others which is why some companies went back up faster. No company this big depends purely on Fastly.

So... your final "point" to argue that me saying "companies relying on just one CDN is a stupid design", is to say "big companies don't just depend on this one CDN".

Unsupported · Jun 9, 2021

One Fastly customer triggered internet meltdown

The unnamed customer was not to blame but changing settings triggered a software bug, Fastly says.

www.bbc.com

Krizoitz · Jun 9, 2021

Stephen.R said:
Most CDNs charge primarily based on transfers - some combination of number of requests, and bytes transferred.

So, if you're using them purely in an active/warm 'fail over' setup - that is, you use one all the time, and the other only when the first has a failure, you're maybe paying the minimum account fee on the second service, but for most that's going to be $0 or close to it.

If you're using them in an active/active 'load sharing + fail over' setup - that is, traffic is delivered by both of them, and re-routed to just one if the other has failures, you're paying half the bandwidth costs for each of them.

Even in the case where they aren’t paying for the bandwidth costs, there is a non-trivial amount of work and cost that goes into setting up your entire operation to failover to a completely different provider, testing the system, maintaining the system, etc. All for handling a very infrequent event like this one? When was the last time Fastly had a similar outage?

Further, the time and risk associated with switching over to a secondary CDN might be greater than simply waiting for the primary to come back online. And that risk and cost is going to go up the bigger and more complex an organization’s operation is. So again, it’s not as simple as “always have redundancy or you’re dumb”.

kc9hzn · Jun 10, 2021

Krizoitz said:
Even in the case where they aren’t paying for the bandwidth costs, there is a non-trivial amount of work and cost that goes into setting up your entire operation to failover to a completely different provider, testing the system, maintaining the system, etc.

That’s true, it’s a lot like trying to run a microservice on both AWS and Azure, even if you’ve got Docker and the same container pushed to both services. It means you have to do double the DevOps work, troubleshoot and maintain the build process on two different environments, replicate any Elastic Beanstalk or Lambda functionality in Azure, not to mention keeping the databases in sync between the two providers, all as insurance on the off chance that AWS goes down. That’s a lot of ongoing work for very minimal gain; it’s like installing redundant telephone lines in an office building that are serviced by a different phone company just in case some catastrophic failure happens at your main phone company that doesn’t impact the other.

Any company with uptime requirements that justify that kind of Frankenstein environment can afford to have a backup system they can switch to at will (like the BBC’s main site, which switched soon after the outage). For most businesses, even the cost of downtime doesn’t justify these extreme efforts at guaranteeing uptime (plus, such firms could pay for higher availability service tiers with their service providers). But very few firms (especially outside of the real-time financial clearing or equities industries or perhaps the health services industry) even have uptime requirements this extreme.
