In-Depth
Worst-Case Scenarios: When Disaster Strikes
Leading IT experts help answer the question: What if?
When Hurricane Katrina devastated the Gulf Coast and New Orleans in August
2005, federal officials said no one had ever conceived of such large-scale destruction,
much less planned for it. From the painfully slow response to the galling red
tape, it seemed as if the scope and scale of Katrina took the government by
surprise.
It didn't. In August 2004, a FEMA-led exercise sought to model the effects
of a slow-moving, Category 3 hurricane striking New Orleans. The five-day project,
which involved local, state, and federal agencies, as well as numerous private
enterprises, forecast many of the most destructive aspects of the fictional
storm, named Hurricane Pam. By all accounts, it was a forward-thinking exercise
that should have prepared both the region and the nation for the fury of Katrina.
That exercise prompted Redmond magazine to run an exercise of its own.
We formed a panel of leading IT experts to help us simulate four likely IT-related
disasters. The chief information officer at Intel Corp., Stacy Smith, has concocted
a timeline that reveals how a large financial services firm in San Francisco
might weather a magnitude 6.8 earthquake. Johannes Ullrich, chief technical
officer of IT training and security monitoring firm The SANS Institute, walks
us through a savvy bot-net attack against an online retailer. Jack Orlove, vice
president of professional services at IT consultancy Cyber Communication, shows
how a localized bird flu outbreak might convulse a regional manufacturer. Finally,
Madhu Beriwal, CEO of Innovative Emergency Management, offers a chilling glimpse
at how a carefully crafted worm-and-virus assault could cripple an enterprise,
and leave millions of users without Internet access.
Each scenario models a viable threat, and offers IT planners insight into what
to expect when the worst case becomes the real case.
Shaken Up
Distributed infrastructure and robust planning help ride out a major earthquake
Feb. 20 - Feb. 23, San Francisco, Calif.
Stacy Smith, CIO, Intel Corp.
The scene: A 6.8 magnitude earthquake, lasting 4 minutes and 20 seconds, strikes
San Francisco at 4:20 a.m. on Monday, Feb. 20, 2006, causing major damage to
the area. Hundreds of buildings and thousands of homes in the downtown and surrounding
area suffer damage. Several water mains burst, disrupting service to major parts
of downtown. Fires ignite in places where natural gas lines have broken. Power
is out in many areas. Several highway overpasses on major commuter routes are
damaged and others need to be inspected. The city government pleads with citizens
to stay off roads so emergency services can get through. More than 3,700 people
are injured and 81 are dead.
Acme Financial maintains three office buildings within the hardest hit area.
The only people in the buildings at the time of the quake were security guards.
Trained in earthquake drills, they stop, drop and take cover, avoiding
injury. After the quake subsides, they use their computers' auto-dialer function
to send alerts to the company's local Emergency Response Team (ERT).
Acme's two earthquake-resistant buildings sustain minimal damage, though the
skybridge between them has fallen. No power is available to either building,
but backup generators restore power to the data center and keep critical business
functions running. Emergency lighting enables the ERT to assess the
situation.
The older building sustains major damage. The backup generator fails, leaving
the basement-located data center without power. Fortunately, the company houses
data centers in three cities as part of its disaster recovery strategy. Data
is continuously mirrored in real time between data centers in San Francisco,
Los Angeles and Tokyo. The biggest issues facing IT and the Business Continuity
team are the closure of the city to commuters, overloaded communications networks,
and the need to locate and verify the safety of local employees. In a stroke
of good fortune, none of the bridges (Golden Gate, Oakland Bay and San Mateo)
used for carrying high-speed data communication lines are damaged.
For backup communications, the company has developed Internet and intranet
sites, a toll-free 800 number, and a team of amateur radio operators trained
to move into emergency operations centers (EOCs) and establish communications.
An emergency alternative workspace outside the city is equipped with Internet
access and telecommunications services, including a WiMAX (802.16) access point
through which any 802.11g wireless access point in the building can connect
to a T1 service.
On Monday at 4:30 a.m., crisis management team members access local copies
of encrypted business continuity documents on their notebooks. Using DSL and
cable services, team members begin communicating from their homes via Microsoft
Instant Messaging.
The crisis management team leader is stranded in another city, so the second
in command steps in. IT, as part of the crisis management team, posts a notice
on the Web and in a recorded phone message telling all employees to stay out
of the area and work on laptops from home. Corporate policy requiring employees
to take laptops home every night means most employees should have their computers
with them.
Fallout: Shaken Up
- Losses in equipment and IT infrastructure: $1.1 million
- Distributed infrastructure saved the day, thanks to remote data backup and Web services-based applications.
- Joint business continuity and IT earthquake planning enabled post-quake transition for home-based workforce.
- Company will consider satellite broadband as a backup alternative for EOC and homes of key employees.
Employees needing computing equipment are directed to report to the alternative
workspace the next day or as soon as they can. Directions are posted on the
intranet. At 6:20 a.m., an EOC is established in the alternative workspace and
one of the amateur radio operators has arrived to supply communications.
By 8 a.m. the regional phone system and local ISPs are overloaded. E-mail messages
from employees trickle in through the Internet, but few phone calls make it
through. By the end of the day, 50 percent of local employees have contacted
the ERT to confirm that they can work from home. The next morning, nearly all
members of IT and business continuity have been contacted and, unless in personal
crisis, have either mobilized for recovery efforts or are performing normal
job functions from home.
At 8:30 a.m. on Feb. 21, the backup generators are down to about 24 hours of
fuel. Using amateur radio, IT employees warn other company data center locations
to prepare for a server and application failover. They also request replacement
backups of all critical San Francisco data from mirrored sites as additional
insurance. The company uses Microsoft Cluster Server and the Application Center/Network Load Balancing Service to distribute the extra load across the remote servers.
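For illustration, here is a minimal, generic sketch of the idea behind that failover step: route traffic to the first data center that still answers a health check. It is not the Microsoft Cluster Server/NLB configuration itself, and the site names and health-check URLs below are hypothetical.

import urllib.request

# Hypothetical health-check endpoints for the three mirrored data centers.
SITES = {
    "san_francisco": "https://sf.example-acme.com/health",
    "los_angeles": "https://la.example-acme.com/health",
    "tokyo": "https://tokyo.example-acme.com/health",
}

def healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the data center's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False

def pick_active_site(priority=("san_francisco", "los_angeles", "tokyo")) -> str:
    """Fail over to the first data center in priority order that still responds."""
    for name in priority:
        if healthy(SITES[name]):
            return name
    raise RuntimeError("no data center is reachable")

print("Routing traffic to:", pick_active_site())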
At 9 a.m., the company experiences heavy Internet traffic loads as credit card
and brokerage customers check to see if their accounts are still available,
and suppliers on the extranet post inquiries to their direct contacts. IT routes
calls and Internet traffic to the company's Los Angeles offices. Employees working
from home struggle with slow ISP connections and an overloaded phone system.
At 11 a.m., reports of fires, water outages, street damage and looting downtown
make it obvious it will be several days before more members of the ERT can enter
the city to further assess damage. Freight and messenger companies are notified
via the Web to make deliveries to the EOC.
On Wednesday morning, Feb. 22, ERT members report that they can't find fuel
and the generators have run out. They are able to properly shut down the network,
however. Around 4 p.m., ERT members are finally able to use local
streets to access company buildings. The older headquarters is taped off due
to severe structural and water damage. The other two buildings have sustained
minor damage (broken windows, cracked drywall), though utility services need
to be restored. They refuel the backup generator and restart it, taking the
transactional load off servers in other locations.
Company officers decide on Thursday to rent another building to temporarily
replace the older headquarters building. IT begins assessing power, telecommunications and asset requirements, including the use of WiMAX to distribute wireless service
in the facility. Remote IT staff from both the Los Angeles and Tokyo offices
plan and assist in backup recovery, load balancing and other tasks. The server
OS and application configurations mirrored at other locations will help ensure
a faster restore of computer operations.
By 10 a.m. Thursday, Feb. 23, local telecommunications and Internet begin to
return to normal as operators add more capacity. More San Francisco staff successfully
work at home, relieving the Los Angeles staff that has been covering for them.
Power has been restored to the main downtown buildings and IT is organizing
shifts to work on-site.
Acme continues over the next several weeks to set up a replacement facility
for its old headquarters. It moves back into the other San Francisco buildings
two weeks after the earthquake.
Ransom Demand
Sophisticated bot-net attack staggers online retailer
Nov. 1 - Dec. 13, Chicago, Ill.
Johannes Ullrich, CTO, SANS Institute Internet Storm Center
The scene: Throughout November 2006, an Eastern European organized crime syndicate uses a new zero-day exploit for Microsoft Internet Explorer to install bot software on a large number of systems. The bot spreads via three primary vectors: spam e-mail carrying a malicious URL, an instant messenger worm and a number of hacked
Web sites considered trusted by many users.
After collecting about 100,000 systems, the group on Nov. 29 sends a note to an e-commerce company demanding a $50,000 ransom payment in order to avoid an attack. On Dec. 1, to prove its ability to carry out the attack, the syndicate places 100 orders from 100 different systems for random items.
The e-commerce company uses the 100 test orders to try to identify a pattern. Common anti-DDoS (distributed denial of service) technology, which focuses on simple “packeting” attacks, proves unable to stop the threat -- the attackers use sophisticated scripts to simulate regular browsing and ordering. The e-commerce company considers validating orders with a “captcha” solution, which foils automated systems by requiring users to transcribe a distorted graphic. This should deter the attack, but initial tests reveal that it will also cause 20 percent of valid orders to be aborted.
On Dec. 4 the team decides to develop and test the captcha software. It takes two days to write the software and two more to perform load and regression testing. The company is fortunate that it upgraded its test systems in August, which helps minimize response time.
The next day -- Dec. 5 -- the retailer enables real-time order volume alerts to warn of an attack as quickly as possible. The timing of the attack -- during the holiday shopping season -- complicates matters, because order volumes can be volatile. The retailer typically sees 10,000 orders each day, but peak holiday season rates can spike to 30,000 orders during limited-time promotions. The team calculates an average volume of about 20 orders per minute and decides to set the alert threshold at 200 orders per minute to account for legitimate spikes. The team also considers adopting order volume change thresholds, so if orders spike anomalously from one minute to the next, the system will alert IT managers to a possible attack. The captcha solution, meanwhile, can go live within hours of an attack being detected.
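As a rough illustration of that alerting logic, the sketch below checks each minute's order count against the 200-orders-per-minute threshold and against a minute-over-minute spike rule. The three-times multiplier and the print-style alert hook are assumptions for illustration, not figures from the retailer.

RATE_THRESHOLD = 200     # orders per minute, roughly 10x the typical ~20/min average
SPIKE_MULTIPLIER = 3     # assumed: alert if volume triples from one minute to the next

class OrderVolumeMonitor:
    def __init__(self):
        self.previous = None   # order count recorded for the prior minute

    def record_minute(self, order_count: int) -> list[str]:
        """Call once per minute with that minute's order count; returns any alerts."""
        alerts = []
        if order_count > RATE_THRESHOLD:
            alerts.append(f"absolute threshold exceeded: {order_count} orders/min")
        if self.previous and order_count > SPIKE_MULTIPLIER * self.previous:
            alerts.append(f"anomalous spike: {self.previous} -> {order_count} orders/min")
        self.previous = order_count
        return alerts

# Example: a slow ramp like the Dec. 12 attack eventually trips both checks.
monitor = OrderVolumeMonitor()
for minute, count in enumerate([25, 30, 60, 210]):
    for alert in monitor.record_minute(count):
        print(f"minute {minute}: {alert}")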
Fallout: Ransom Demand
- Losses due to:
  - Staff time on task: $10,000-$20,000
  - Lost business: $50,000
  - Return of erroneously shipped orders: $100,000+
- Good change management and strong situational awareness enabled rapid response.
- Early action allowed the staff to get ahead of the threat and defend against it.
The actual attack starts on Dec. 12 at 9 a.m. EST. The bots are controlled from
an IRC channel, enabling them to vary their assault and update their attack software
from a Web site. As the company typically sees an increase in orders during this
time of day, and the fake orders ramp up slowly, they go undetected for an hour.
During that time, the attacker is able to place 2,000 fake orders. The company
had expected about 3,000 valid orders during that time.
Once the attack is detected, the company moves the captcha solution into place. The switch occurs within a minute. Some customers placing orders or using shopping carts during the switchover experience odd behavior, but the disruption is limited and temporary. The captcha solution filters out fake orders, but the attacker now contacts the company demanding another $100,000.
The attack then morphs into a DDOS assault, where the bots just browse the site, add items to the cart, and search the large site for specific items. One particular search causes large database loads as it returns most of the available inventory. As a result, the site is no longer accessible to regular users.
A patch for the IE vulnerability is released on Patch Tuesday, Dec. 12.
On Dec. 13, the company analyzes logs in more detail and disables the site search. The action reduces server loads by an order of magnitude and enables the site to resume responsive operation. To further limit the DDoS attack, the team requires captcha input to add items to the cart. Once the captcha is used to identify a “human” shopper, a cookie is placed on the user's system to avoid the need for repeated captcha identification.
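A minimal sketch of that captcha-then-cookie gate might look like the following. The token format and helper names are assumptions, and a production system would tie the signed value to an expiry and a persistent server-side secret.

import hashlib
import hmac
import secrets

# In production this would be a persistent, shared server-side secret.
SECRET_KEY = secrets.token_bytes(32)

def issue_human_token(session_id: str) -> str:
    """Issue a signed cookie value once the shopper has passed the captcha."""
    sig = hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()
    return f"{session_id}:{sig}"

def token_is_valid(token: str) -> bool:
    """Verify the cookie so repeat cart actions can skip the captcha."""
    try:
        session_id, sig = token.split(":", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

def add_to_cart(captcha_cookie, session_id: str) -> str:
    """Accept the cart action only for sessions that have proven they are human."""
    if captcha_cookie and token_is_valid(captcha_cookie):
        return "item added"
    return "captcha challenge required"   # the front end would render the captcha here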
After an hour of tweaking and log analysis, the site is again usable for valid
users. However, search functionality remains down until the team can implement
search rate limiting, so that each customer may only place one search every
30 seconds. The attack soon fades as the attackers realize the company won't
pay the ransom and has strengthened its defenses.
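That per-customer limit of one search every 30 seconds could be enforced with something as simple as the sketch below; the in-memory dictionary keyed by customer or session ID is an assumption for illustration, and a real deployment would likely use a shared store across Web servers.

import time

SEARCH_INTERVAL_SECONDS = 30
_last_search = {}   # customer or session ID -> timestamp of that customer's most recent search

def allow_search(customer_id: str) -> bool:
    """Permit a search only if this customer's last search was 30 or more seconds ago."""
    now = time.monotonic()
    last = _last_search.get(customer_id)
    if last is not None and now - last < SEARCH_INTERVAL_SECONDS:
        return False
    _last_search[customer_id] = now
    return True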
Hand in Glove
Internet worm propagates lethal zero-day exploit
April 4 - April 11, Washington, D.C.
Madhu Beriwal, President & CEO, Innovative Emergency Management Inc.
The scene: On April 4 at 8:13 a.m. EST, a terrorist organization releases an
Internet worm into the wild that uses remote code execution vulnerabilities
in Apache and IIS to inject downloadable code into the default homepage. The worm
carries a lethal payload -- a virus that exploits an unpublished vulnerability
in Microsoft Windows to randomly delete files and disable systems. The malware
combines stealthy propagation with a sudden, lethal attack for maximum effect.
To maximize impact and delay detection, the group targets specific Web servers,
including high-traffic sites such as Yahoo!, Google, CNN and ESPN, as well as
default sites like MSN and AOL. The sites of system makers like Dell and HP, and popular destinations such as Fark and Slashdot, are also targeted. The targeted systems -- about 4,000
in all -- are successfully infected within a few minutes of each other. The
virus begins downloading to visitors' machines and will infect millions of Web browsers before
a patch becomes available on April 11.
Fallout: Hand in Glove
- Losses due to:
  - Productivity lost: $6 million
  - Staff time on task: $10,000+
  - Broken contracts: $100 million
- Failure to address a critical vulnerability led to unacceptable exposure.
- Defense in depth and diverse security solutions can help protect against emerging threats.
- Security must be regarded as a process that engages administrators, management and users.
There is scant warning. On March 20, a Microsoft security bulletin warns of
a flaw in recent versions of IE that could allow remote code execution under
the right circumstances. BoBobble Corp. security staff review the bulletin and
decide that the recommended workaround -- disabling ActiveX -- would be overly
disruptive. Having avoided exposure to similar issues with IE in the past, the
BoBobble staff decide to take no action. It's a stance taken by corporations
worldwide, leaving the large majority of the world's PCs exposed.
The first signs of an attack emerge by April 7. On the day the virus is released, the help desk fields a roughly normal number of calls about corrupted files, but attempts to find a rogue system fail. By day three, however, the number of infected
PCs at BoBobble Corp. spikes. Several PCs are completely disabled. The team
gets an early break when an administrator notices that one of the corrupted
files is actually a screenshot of one of his user's desktops. The company's
chief information security officer (CISO) decides a virus is the culprit and
immediately orders all network switches, routers, hubs and firewalls powered
off. All PCs and servers are also powered off to preserve data until the systems
can be disinfected. The shutdown stops the spread and the effects of what appears
to the CISO to be a particularly nasty virus.
Mainstream news media pick up the story on April 8, as almost every PC with
Internet access is now infected. Most users have noticed something is wrong
with their computer, but don't know what to do about it. Internet usage grinds
to a halt. E-commerce transactions approach zero for the first time in the Internet
age.
First to recover are anti-virus vendors, who are able to concoct a cure on April 9. Their systems come back online and start distributing the software and anti-virus signatures. BoBobble Corp. initiates around-the-clock recovery operations on April 10 to get the systems back up, disinfected and reconnected. Full recovery occurs on April 11, about the same time Microsoft releases a patch to fix the IE vulnerability.
The damage, however, is done. More than 50 percent of the corporation's files
have been corrupted. Restoring files from tape or shadow copy fails, as the
sheer volume of corrupted data overwhelms the help desk. Contact lists, cost
estimates, multi-million dollar proposals, and other business-critical files
are lost, essentially halting BoBobble operations. What's more, the firm's disaster recovery plan proves ineffective once network access and PCs are turned off. Business processes that had been streamlined through the use of IT can't be conducted manually, and productivity falls to zero. At least one BoBobble office, which was piloting an IP telephony program, is completely cut off. Cell phones become the only means of communication.
Ultimately, it's deemed most cost-effective to completely restore every file
to the network file servers. This effectively sets BoBobble Corp. clocks back
from April 11 to April 3, and erases a full business week of activity. Other corporations are not so lucky: their disaster recovery systems prove inadequate, their backup media turns out to be corrupted, or their backups weren't functioning correctly in the first place.
Worldwide, travel becomes more difficult as the internal systems for major
airlines, as well as online booking systems and travel agency systems become
infected. Only a few communication, water and electric power systems are affected,
because most control systems for this type of equipment aren't linked to the
Internet. The government's ability to provide services is hurt, but Continuity of
Operations (COOP) and Continuity of Government (COG) programs and disaster recovery
efforts reduce the impact somewhat.
Pandemic Panic
Suspected bird flu outbreak rocks a regional manufacturer
Feb. 19 - March 3, Eureka, Calif.
Jack Orlove, Vice President of Professional Services, Cyber Communication Inc.
The scene: On Sunday Feb. 19, St. Joseph Hospital in Eureka, Calif., confirms
the first fatal case of H5N1 avian flu (bird flu) in the United States. Broadcast
media cover the isolated case aggressively and describe the risks of an almost
“certain” pandemic event. The Governor's office acknowledges that if human-to-human
transmission is verified, the plan will call for an immediate quarantine, freezing all traffic in and out of the city and causing major disruptions in all forms of commerce.
TV stations report on school absenteeism and an “unusually light” Monday commute
across Oregon and Northern California. Acme Manufacturing CEO Jim Brown tracks
reports from management in Eureka about high absenteeism and missed outbound
shipments.
Facing unusually strong demand, the Eureka plant has been running at 115 percent of capacity
with most employees working overtime. Production goals are being exceeded, but
getting the product to market and managing customers is another matter. Only
one driver shows up on Monday, and the call center has about one-third of its
full staff, stalling many operations. Goods in transit continue to arrive at
the Eureka facility, stacking up at Acme's shipping and receiving docks as employees
struggle to find space.
That afternoon, Eureka facility manager Mike Smith asks for Sacramento volunteers
to drive up and augment the call center and warehousing operations, but management
is hesitant to authorize the move. Smith and executive staff decide to activate
the Crisis Management Team (CMT).
Absentee rates in Eureka reach 56.7 percent on Monday, with most of those employees
indicating that they won't be in the next day. Business operations are disrupted
and customers experience hour-plus hold times. Many callers don't get through
at all.
The Acme warehouse in Eureka becomes overcrowded with pallets of finished goods,
as contracted drivers fail to report to work. The company attempts to delay
shipments from suppliers, but vendors insist that Acme accept goods already in transit. The CMT seeks to contract warehouse space, but most warehouse suppliers are not
answering phones or have no space available. Management authorizes overtime
for employees willing to work.
Tuesday morning starts with reports of sick people filling hospitals as far north as Seattle and as far south as Los Angeles. As anticipated, absenteeism grows to over 70 percent, and most of the employees who do report to work wear protective masks.
The warehouse situation worsens as trucks continue to arrive. Workers report
some fistfights with delivery drivers, and many drivers simply unhook their
loads and leave their trailers in the parking lot, causing further frustration.
Acme's distributors still demand delivery to accommodate their own customers
and insist that Acme meet its service-level agreements.
With business processes breaking down, Brown activates the formal Business
Continuity Plan and convenes the team in the Sacramento emergency operations
center (EOC). There is discussion of activating a center in Reno.
The general situation worsens. Supermarkets are emptied and lines at gas stations
make it nearly impossible to fill gas tanks. All schools have officially closed,
and there's wide speculation that the National Guard will be activated to keep
the peace. There have been no other “official” cases of bird flu, but many people
are skeptical of the government reports. Rumors abound of new cases in Eureka.
On Tuesday afternoon, management decides to close the Eureka location and relocate
operations to a temporary building complex in Lincoln, Calif., 15 miles east
of the Sacramento location. A private carrier has been contracted to fly employees
from Eureka to Lincoln, but officials won't authorize flights into the Arcata-Eureka
airport.
IT attempts to establish remote employee operations, but because the company lacks a remote access infrastructure, the effort is limited to key executives and managers. Technicians
deploy VPN and VNC remote software so key employees can use broadband links
to access internal PCs from their homes. Labor-intensive installation of the
remote client and VPN software slows the effort.
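One small way to speed that rollout is to script the verification step. The sketch below, with hypothetical internal hostnames, simply checks that each key employee's office PC answers on the standard VNC port before that person is told the VPN/VNC path is ready.

import socket

# Hypothetical internal hostnames for key employees' office PCs.
OFFICE_PCS = ["pc-exec01.acme.internal", "pc-mgr07.acme.internal"]
VNC_PORT = 5900

def vnc_reachable(host: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the VNC port succeeds over the VPN."""
    try:
        with socket.create_connection((host, VNC_PORT), timeout=timeout):
            return True
    except OSError:
        return False

for host in OFFICE_PCS:
    status = "ready" if vnc_reachable(host) else "not reachable"
    print(f"{host}: {status}")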
Tuesday evening, the FBI and Centers for Disease Control (CDC) announce that
no new cases of H5N1 flu have been seen in the last 72 hours. The only death
is traced back to an infected flock of birds handled by the victim. Local authorities
are optimistic that the virus is not human-to-human transmissible and that restrictions
may be relaxed.
Fallout: Pandemic Panic
- Losses due to:
  - Lost sales and revenue: $300,000
  - Staff time on task and related costs: $150,000
- With 75 percent of operating capacity in one location, Acme should distribute operations.
- Remote and home-based working alternatives using Web-based applications can help improve business continuity.
- Staff cross-training eliminates single points of knowledge.
The Acme EOC is now operating two 12-hour rotating shifts and an 800 telephone
number has been set up to accommodate the flood of calls from employees and
customers. In Eureka, IT efforts to maintain business processes ultimately fail,
due in large part to the loss of employees in the warehouse and other areas.
Business essentially grinds to a halt. While the IT recovery plan performs well, the broader impacts are overwhelming. Some critical operations have been recovered at the Acme hot site, and the makeshift Lincoln operation is handling some warehousing and call center operations.
The crisis winds down Wednesday morning, when the California governor's office,
CDC, FEMA and the FBI begin assuring the public that the infection was an isolated
event and poses no immediate pandemic threat. A sense of normalcy begins to
return as employees are welcomed back to work at the Eureka site on Thursday
morning.
Over the next several weeks Acme works to recover its primary Eureka site.
Full call center operations are quickly restored. The Lincoln emergency site
is scheduled to shut down on March 2, after running partial operations for two
weeks.
More Information
"Worst-Case Scenarios" is now available for download as an 18-page PDF from the Tech Library. Check it out. |