Which Amazon site was affected? (I'm not being facetious; I don't really use most of the sites listed in the original post, so I don't know which of them may be Amazon-owned.)
Amazon.com images weren't loading in the iOS app, some actions had no effect (clicking on a button did nothing), and at a minimum their .com website was not loading CSS files and images, and probably not JS files and other assets either. I'm a bit surprised, actually, that they weren't using their own cloud services for CDN (or do they own a stake in Fastly?).

---

Aside from Amazon, a Firebase-based site I run was also down during this time. Google using Fastly for Firebase might be a holdover from the original owners, but it surprised me. I could have sworn they used a different CDN provider.
 
I'd be surprised if Amazon isn't using all AWS from a dogfooding POV, but from a "we just want to sell stuff" POV the smart setup would be to use several CDNs. It's possibly just a coincidence, or it may be more nuanced: sellers may choose to use images they host themselves (which might then be served by a different CDN), or the site itself may be using one of the "JS CDN" services, which could also be backed by Fastly.


The Firebase thing is curious, but given how Rube Goldberg-y most "cloud" offerings from Amazon/Google/Microsoft are getting, it's not impossible that they had their own separate outage at the time.
 
Can someone explain how this has killed the internet? What is fastly?
It’s a content distribution network (CDN). Let’s use Apple and its CDN, Akamai, as an example, and let’s say you live in NY state. Instead of your HTTP request for Apple.com traveling to Apple servers in an Apple-owned data center or in Cupertino, your request goes to an Akamai server that might be located in your metro area (NYC or the NJ part of the NYC metro), or at least in a closer metro area than the nearest Apple data center. It also means that the content is cached, so if the origin site goes down, you can still get the last cached copy of the URL you’re visiting.

The reason so many sites broke because of this issue is that (to continue the Apple example) the Apple.com domain doesn’t point to Apple servers; it points to Akamai servers, at least when caching is on. In other words, it’s a lot like when an individual site goes down, except that because the failure was in a CDN, it brought down all the sites depending on that CDN.
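
To make the caching part concrete, here is a minimal Python sketch of an edge node that serves a cached copy while it is fresh and falls back to the last copy it has if the origin is unreachable. The `EdgeCache` class and the `fetch_from_origin` callback are hypothetical names for illustration only, not Akamai's or Fastly's actual API.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Toy stand-in for a CDN edge node (hypothetical, not a real CDN API)."""

    def __init__(self, fetch_from_origin: Callable[[str], str], ttl_seconds: float) -> None:
        self.fetch_from_origin = fetch_from_origin
        self.ttl_seconds = ttl_seconds
        self.store: Dict[str, Tuple[str, float]] = {}  # url -> (body, time fetched)

    def get(self, url: str) -> str:
        now = time.monotonic()
        cached = self.store.get(url)
        if cached is not None and now - cached[1] < self.ttl_seconds:
            return cached[0]  # fresh copy: the origin is never contacted
        try:
            body = self.fetch_from_origin(url)
        except ConnectionError:
            if cached is not None:
                return cached[0]  # origin is down: serve the last copy we have
            raise  # nothing cached, so the failure reaches the visitor
        self.store[url] = (body, now)
        return body
```

Note what the sketch cannot do: if the edge node itself is unreachable, the visitor never gets as far as `get()`, which is the situation described above where every site behind the CDN goes down together.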
 
I'd be surprised if Amazon isn't using all AWS from a dogfooding POV, but from a "we just want to sell stuff" POV the smart setup would be to use several CDNs. It's possibly just a coincidence, or it may be more nuanced: sellers may choose to use images they host themselves (which might then be served by a different CDN), or the site itself may be using one of the "JS CDN" services, which could also be backed by Fastly.
It shocked me, but I didn't have enough time to investigate the image origins; I might look into it later today. I know they used to host their own images (they ran them through a routine to dynamically add badges, resize, etc.). I suppose they could use Fastly in the background for their edge services that do the resizing (i.e., Fastly could be "supplying" the raw images)... that being said, who knows. An S3->Lambda->S3->CloudFront sort of arrangement would make more sense.

The Firebase thing is curious, but given how Rube Goldberg-y most "cloud" offerings from Amazon/Google/Microsoft are getting, it's not impossible that they had their own separate outage at the time.
Quite possible - Firebase had some downtime a week or two ago, and perhaps this was just coincidence... but other than that other time and this morning, its uptime for me has been stellar.
 
The problem was too many people charging their cars at once, blew the power grid...... 😆
 
Because what passes for 'good architecture' in modern ops/infra practices tends to be a list of tickboxes with "are we using an external service provider and thus don't need to hire anyone experienced in this technology?" next to them all.
You have no idea what you’re talking about. Quit trying to armchair online services.
 
I love how every time I point out that relying on single external vendors for something is a problem, someone with zero ****ing clue about the topic - in this case infrastructure in general, or load balancing and failover in particular - gets an internet boner about telling me I'm wrong.
You are still wrong. If the cost of maintaining an entirely separate failover to a separate CDN exceeds the cost of the downtime, it would be a terrible business decision to pay for that. You're arguing that websites should DOUBLE their CDN costs to pay for a service they would use, what, for maybe 30 minutes or less in an entire year (or more) to reduce the odds of downtime? That's insane.

There is no way to guarantee 100% uptime. You can add redundancy after redundancy and make the chances get better and better, but it's still never going to be 100%. Something catastrophic can happen. A rational business balances the cost of uptime vs. the odds of downtime.

So yeah, you are wrong. Sometimes a single vendor is the wise financial and strategic decision. You might want to look up something called diminishing returns.
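
The trade-off argued above can be sketched as simple arithmetic. This is a rough Python illustration, not a real costing model: the function names are made up, every number fed into them is hypothetical, the backup is assumed to remove the whole outage loss, and `combined_availability` assumes providers fail independently of each other.

```python
def expected_outage_loss(outage_hours_per_year: float, loss_per_hour: float) -> float:
    """Money expected to be lost to downtime in a year (hypothetical inputs)."""
    return outage_hours_per_year * loss_per_hour


def backup_is_worth_it(outage_hours_per_year: float, loss_per_hour: float,
                       backup_cost_per_year: float) -> bool:
    """True only if the downtime the backup would prevent costs more than the backup."""
    return expected_outage_loss(outage_hours_per_year, loss_per_hour) > backup_cost_per_year


def combined_availability(single: float, providers: int) -> float:
    """Availability of N providers, ASSUMING their failures are independent."""
    return 1.0 - (1.0 - single) ** providers


# Hypothetical figures: one hour of outage a year, a backup costing 250,000 a year.
print(backup_is_worth_it(1.0, 10_000.0, 250_000.0))     # small site
print(backup_is_worth_it(1.0, 1_000_000.0, 250_000.0))  # very large site
for n in (1, 2, 3):
    print(n, combined_availability(0.999, n))  # each extra provider adds less
```

Under these made-up numbers the answer flips depending on how much an hour of downtime costs, and each added provider buys a smaller improvement that still stops short of 100%.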
 
Fastly is one of the big CDNs on the internet. Basically, they give websites DDoS protection, cache assets around the world, and allow webpages to be cached (with smaller calls to the server for dynamic areas). A large share of the traffic on the internet goes through Fastly and Cloudflare. Then you have the big old one: Akamai.
The only part of this I kind of understood was “allow webpages to be cached”
 
You want fast performance and low load for your servers. Fastly lets you easily do that by caching your server's responses across geolocations.

Many companies also store things like streaming videos on CDNs.
 
Thanks, that helped. I don’t know what CDN stands for but I suppose I’ll learn that when I need to lol.
 
You are still wrong. If the cost of maintaining an entirely separate failover to a separate CDN exceeds the cost of the downtime, it would be a terrible business decision to pay for that. You're arguing that websites should DOUBLE their CDN costs to pay for a service they would use, what, for maybe 30 minutes or less in an entire year (or more) to reduce the odds of downtime? That's insane.

Most CDNs charge primarily based on transfers - some combination of number of requests, and bytes transferred.

So, if you're using them purely in an active/warm 'fail over' setup - that is, you use one all the time, and the other only when the first has a failure, you're maybe paying the minimum account fee on the second service, but for most that's going to be $0 or close to it.

If you're using them in an active/active 'load sharing + fail over' setup - that is, traffic is delivered by both of them, and re-routed to just one if the other has failures, you're paying half the bandwidth costs for each of them.
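
As a rough sketch of that billing argument, assuming purely usage-based pricing, the same per-GB price at both providers, and an optional minimum account fee (all hypothetical; large negotiated contracts may look nothing like this):

```python
def usage_cost(gb: float, price_per_gb: float, minimum_fee: float = 0.0) -> float:
    """Bill for one provider: pay for bytes moved, but never less than the minimum fee."""
    return max(gb * price_per_gb, minimum_fee)


def active_warm_cost(total_gb: float, price_per_gb: float, failover_share: float,
                     minimum_fee: float = 0.0) -> float:
    """Primary carries traffic; the standby only carries the share sent during failures."""
    primary = usage_cost(total_gb * (1.0 - failover_share), price_per_gb, minimum_fee)
    standby = usage_cost(total_gb * failover_share, price_per_gb, minimum_fee)
    return primary + standby


def active_active_cost(total_gb: float, price_per_gb: float,
                       minimum_fee: float = 0.0) -> float:
    """Traffic is split evenly, so each provider bills for half of it."""
    return 2.0 * usage_cost(total_gb / 2.0, price_per_gb, minimum_fee)


# Hypothetical year: 1,000,000 GB at 0.05 per GB, 0.1% of traffic served during failover.
print(active_warm_cost(1_000_000, 0.05, 0.001))
print(active_active_cost(1_000_000, 0.05))
```

Under those assumptions both setups bill for the same total number of bytes, so the transfer cost is not doubled. The sketch says nothing about engineering effort or contract minimums.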
 
No one does this.

First, when you're this big and at this scale, it's not simply cost/transfer. These companies have large multi-million dollar contracts. It's not $0. If it's $0, your company/app is too small to warrant backups.

Second, you can't just magically have your applications work with two or more CDNs. Each CDN has its own unique implementation that usually integrates deeply with the deployment and software stack. Having two CDNs would exponentially increase software development complexity. Businesses do not think this is worth it.

Third, the time it takes you to re-route your application, get global DNS to populate new addresses, etc. is usually longer than the time it takes the CDN company to fix its issues. It took an hour for Fastly to fix this issue. You're not going to be able to just switch to a backup CDN that much quicker when you're at this scale.

Fourth, CDNs have plenty of redundancy built-in already. Whatever knocked out Fastly was likely a fluke incident.

Fifth, companies do have backup plans and some are better than others which is why some companies went back up faster. No company this big depends purely on Fastly.
 
No one does this.
No company practices realistic HA. Ok then, sure.

First, when you're this big and at this scale, it's not simply cost/transfer. These companies have large multi-million dollar contracts. It's not $0. If it's $0, your company/app is too small to warrant backups.
If you're spending millions on a glorified distributed proxy server, the amount lost to an hour's downtime is going to be significant.

Each CDN has its own unique implementation that usually integrates deeply with the deployment and software stack.
At its core, a CDN is a caching proxy. Sure, they all add on other bells and whistles that you can use, just like AWS has its own specific internal IaaS stacks that you can use in place of installing a service.

You absolutely do not need to use those 'extra' add-ons to get the benefit of a CDN.

Third, the time it takes you to re-route your application, get global DNS to populate new addresses, etc. is usually longer than the time it takes the CDN company to fix its issues.
The minimum recommended TTL on a DNS record is 30 seconds. Some intermediary resolvers may ignore TTLs below that value, and some may still ignore the 30-second limit too, but that's a big difference from 60 minutes.
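
A back-of-the-envelope Python sketch of that timing argument. The function and parameter names are hypothetical, and the resolver floor models a resolver that refuses to honor TTLs below its own minimum:

```python
def worst_case_switch_seconds(detect_seconds: float, record_ttl: float,
                              resolver_min_ttl: float = 0.0) -> float:
    """Upper bound on how long a client keeps resolving to the failed CDN.

    detect_seconds:   time for monitoring to notice the outage and update DNS (assumed)
    record_ttl:       TTL published on the DNS record
    resolver_min_ttl: floor applied by a resolver that ignores very low TTLs (assumed)
    """
    effective_ttl = max(record_ttl, resolver_min_ttl)
    return detect_seconds + effective_ttl


# Hypothetical: 60 s to detect the failure, 30 s TTL on the record.
print(worst_case_switch_seconds(60, 30))                        # well-behaved resolver
print(worst_case_switch_seconds(60, 30, resolver_min_ttl=300))  # resolver enforcing a 5-minute floor
```

Even with a resolver that enforces a five-minute floor, this bound stays far below the 60 minutes mentioned above, though it ignores everything outside DNS, such as warming the second CDN's cache.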

Fourth, CDNs have plenty of redundancy built-in already. Whatever knocked out Fastly was likely a fluke incident.
AWS has "plenty of redundancy" and sees regular massive outages due to "fluke" incidents.

Fifth, companies do have backup plans and some are better than others which is why some companies went back up faster. No company this big depends purely on Fastly.

So... your final "point" against me saying "companies relying on just one CDN is a stupid design" is to say "big companies don't just depend on this one CDN"?

Ahuh.gif
 
Most CDNs charge primarily based on transfers - some combination of number of requests, and bytes transferred.

So, if you're using them purely in an active/warm 'fail over' setup - that is, you use one all the time, and the other only when the first has a failure, you're maybe paying the minimum account fee on the second service, but for most that's going to be $0 or close to it.

If you're using them in an active/active 'load sharing + fail over' setup - that is, traffic is delivered by both of them, and re-routed to just one if the other has failures, you're paying half the bandwidth costs for each of them.
Even in the case where they aren’t paying for the bandwidth costs, there is a non-trivial amount of work and cost that goes into setting up your entire operation to fail over to a completely different provider, testing the system, maintaining the system, etc. All for handling a very infrequent event like this one? When was the last time Fastly had a similar outage?

Further, the time and risk associated with switching over to a secondary CDN might be greater than simply waiting for the primary to come back online. And that risk and cost is going to go up the bigger and more complex an organization’s operation is. So again, it’s not as simple as “always have redundancy or you’re dumb”.
 
Even in the case where they aren’t paying for the bandwidth costs, there is a non-trivial amount of work and cost that goes into setting up your entire operation to fail over to a completely different provider, testing the system, maintaining the system, etc.
That’s true. It’s a lot like trying to run a microservice on both AWS and Azure, even if you’ve got Docker and the same container pushed to both services. It means you have to do double the DevOps work, troubleshoot and maintain the build process on two different environments, and replicate any Elastic Beanstalk or Lambda functionality in Azure, not to mention keeping the databases in sync between the two providers, all as insurance on the off chance that AWS goes down. That’s a lot of ongoing work for very minimal gain. It’s like installing redundant telephone lines in an office building that are serviced by a different phone company, just in case some catastrophic failure happens at your main phone company that doesn’t impact the other.

Any company with uptime requirements that justify that kind of Frankenstein environment can afford to have a backup system they can switch to at will (like the BBC’s main site, which switched soon after the outage). For most businesses, even the cost of downtime doesn’t justify these extreme efforts at guaranteeing uptime (plus, such firms could pay for higher availability service tiers with their service providers). But very few firms (especially outside of the real-time financial clearing or equities industries, or perhaps the health services industry) even have uptime requirements this extreme.
 