ATT's problems today stem from its inability, and lack of history, in dealing with sudden traffic events. Apple, it appears, has weathered the event fine, save for its interface to ATT account information. This, despite the fact that Apple's traffic has likely been several times what it would have been had ATT's systems worked perfectly.
Vicious Cycles
ATT has fallen into what can be called a "vicious cycle". A customer goes to the ATT (or Apple) website, tries to upgrade their phone, and encounters problems. So, they repeat. And repeat. And repeat. Thus multiplying traffic, both external and internal, several times over.
It should be noted that Apple would suffer the same thing, as long as they are dependent on ATT's account verification services. But, overall, Apple has suffered no widespread outage of its online services - it just fails at the point of account verification, with no other ill effect. (Apple could have done better with its error messages, though.)
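As a rough illustration (not anything ATT or Apple actually does, as far as I know), here's the difference between the customer's natural behavior - retry immediately, over and over - and what a well-behaved client would do: back off exponentially, with a little random jitter so retries don't arrive in synchronized waves. The attempt_upgrade callable below is just a stand-in for whatever call hits the carrier's account-verification service.

```python
import random
import time

def submit_order(attempt_upgrade, max_attempts=5):
    """Retry an upgrade attempt with exponential backoff and jitter.

    attempt_upgrade is a stand-in for whatever call hits the carrier's
    account-verification service; it returns True on success, False on failure.
    """
    for attempt in range(max_attempts):
        if attempt_upgrade():
            return True
        # A frustrated customer (or a naive client) retries immediately,
        # multiplying load on a backend that is already struggling.
        # Backing off exponentially, with random jitter so the retries
        # don't arrive in synchronized waves, keeps that traffic bounded.
        time.sleep((2 ** attempt) + random.uniform(0, 1))
    return False
```

With immediate retries, one failed order easily becomes five or ten requests; with backoff, the extra traffic stays bounded and the cycle has a chance to break.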
Triage
So, what do IT organizations do in such a situation? The answer is triage. Shut down unnecessary services. Bring on additional resources. Shut off geographical regions, or even sections of the alphabet. ("ATT has announced that those whose last names begin with A-C can pre-order tomorrow..." would be a workable, though shocking and not-gonna-happen, response.)
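To make the triage idea concrete, here is a minimal sketch of a load-shedding gate. The feature names and the alphabet bucket are invented for illustration - nothing here comes from ATT's actual systems.

```python
ALLOWED_INITIALS = set("ABC")  # hypothetical: today's admitted slice of customers
NONESSENTIAL_FEATURES = {"accessory_upsell", "usage_history", "plan_comparison"}

def admit_request(last_name: str, feature: str) -> bool:
    """Decide whether to serve a request at all while the system is overloaded.

    Nonessential features are shut off outright; core ordering traffic is
    admitted only for the currently allowed slice of the alphabet.
    """
    if feature in NONESSENTIAL_FEATURES:
        return False  # triage: drop anything not needed to take orders
    if not last_name:
        return False
    return last_name[0].upper() in ALLOWED_INITIALS
```

Crude, yes - but a crude gate that keeps the core ordering path alive beats a polished system that falls over entirely.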
Having a plan
The most important thing, though, is to have a plan. You have to plan out in advance what you will do if your systems are overloaded. Without a plan, you are lost.
I was present at the (beta) bring-up of a major console game with on-line playing capability. The team had done benchmarks and knew how many servers were needed for how many players. There were server problems for several hours, but they were overcome. They did lots of planning, but one question was missing from their plan: "What do we do if we were wrong?"
What they failed to grasp was the magnitude of the initial onslaught and the corresponding vicious cycles. To their credit, the team triaged by turning off unnecessary services and announcing that certain features would be initially unavailable. Nobody thought, though (at 4 in the morning, with groggy heads), to simply spin up more servers. (And, frankly, the team leader was too bull-headed to have accepted that his projections were off, and it was the one thing people were afraid to suggest. So, as low man on the totem pole, I certainly didn't say anything, especially when my stuff Just Worked...) Yes, it was cloud-hosted. And none of this was planned out in advance. Had there been a plan, what turned out to be a several-hour outage would likely have been minimized.
It's pretty clear that ATT lacked a plan. And it's as clear that Apple had one.
Steve: get another carrier. One worthy of your company.