The IT team at one of the nation’s largest health systems is still working through the problems caused by cybersecurity company CrowdStrike's botched software update.
On Friday, major health systems lost access to critical technology systems including electronic health records, billing tools and imaging applications when CrowdStrike's defective update for devices running Microsoft Windows caused a widespread outage affecting multiple industries. At Renton, Washington-based Providence, which operates 51 hospitals across seven states, some elective procedures were rescheduled while clinicians were forced to work with pen-and-paper charts.
Providence wasn't alone. Large systems including Salt Lake City-based Intermountain Health, Cleveland Clinic and Chicago-based CommonSpirit Health all faced significant IT system downtime that lasted for days.
Wasif Jamal, Providence's global chief technology officer, said the IT teams at its headquarters, its remote employees based in India and volunteers from other departments worked through the weekend to bring critical systems such as its EHR from Epic Systems back online.
As Providence continues to work through the issues, here's a timeline of events, according to Jamal, Global Chief Information Security Officer Adam Zoller and Chief Information Officer B.J. Moore.
Thursday 10:00 PM PST
Providence's IT department begins to hear initial reports of issues affecting Epic and its Citrix virtual desktop applications. The department also receives reports from employees that their personal laptops are not working.
Thursday 10:30 PM PST
More problems are reported to the IT mission control team, which declares the issue a major incident level one, the highest level. Around 70 employees across various IT and cybersecurity teams begin working on the problem.
"[Incident level one] means all the bells and whistles will go off," Jamal said. "[Major incident] notifications go out through email. It also sends an alert through text message."
Senior IT leaders and Providence CEO Dr. Rod Hochman are notified. It’s clear to the health system’s leadership team that the event is significant.
Thursday 10:50 PM PST
Once Providence determines the event is affecting all locations, it establishes an application recovery bridge, which brings in a separate incident commander to communicate with clinical staff. This allows one team to communicate with clinical staff and bring applications critical to patient care back online while the other team focuses on the technical problems.
Within a short period of time, IT help desks are overwhelmed with problem reports from users who are still online.
"We diverted all of those to our app recovery bridge," Jamal said. "Anybody from the front line who's reporting an issue, go to the application recovery page."
Thursday 11:00 PM PST
Providence reports all 15,000 Windows servers running CrowdStrike are down. This cripples the system's first-tier applications including its Epic EHR, Citrix, scheduling and payroll application Kronos and healthcare payments tool Change Healthcare. Leaders are uncertain whether the issue is related to a cyberattack, an issue caused by a change at Providence or a third-party issue at CrowdStrike, Jamal said.
Between media reports, conversations with leaders at other health systems and examining the error messages, it soon becomes clear to the Providence team that the issue was caused by CrowdStrike.
CrowdStrike's problematic code caused computers to crash and present the "blue screen of death," which is a nickname for a common error display.
"CrowdStrike was running on pretty much everything," Zoller said. "About 50% to 60% of our ecosystem was blue screened pretty much at the same time."
Thursday 11:20 PM PST
Providence prioritizes getting Epic back online.
"At that point we jumped right into the crown jewel, which is Epic," Jamal said. "Once you get the crown jewel up then you follow strictly the recovery through the app ranking."
Friday 12:00 AM PST
CrowdStrike pulls back the update causing the issue, but much of the damage is already done. Providence begins reopening its applications in a way that bypasses CrowdStrike.
Friday 7:30 AM PST
Providence calls on its off-site teams in India to help bring Epic back online for all hospitals.
"We have 1,500 employees in India, and then we have our U.S.-based workforce," Moore said. "So yeah, we've been working this around the clock on the server side."
Friday 9:00 AM PST
Epic is back online. IT employees shift their focus to Citrix and get it back online later in the morning.
Throughout Friday, Providence continues to bring applications back online. Teams focus on restoring Microsoft Azure and on-premises services hosted within individual hospitals that affect maternal care, imaging and labs.
While Providence continues to deliver emergency care throughout Friday, it is forced to delay some elective procedures. As word of the outage's impact spreads, leaders at Providence find out other health systems are even less operational.
Saturday 8:00 AM PST
By Saturday morning, Providence confirms all critical applications are back online.
Providence shifts its focus to frontline computers, including desktops and laptops. A majority of IT staff, along with employees from other departments, are mobilized to bring more computers online. Restoring these individual computers is a time-consuming process.
"A human being has to physically touch every computer, and it takes between five and 20 minutes to remediate each computer," Moore said. "We estimate about 1,000 employees are working on these desktops at any point in time. That includes IT employees as well as [employee] volunteers."
Sunday 5:00 PM PST
Providence continues to manually fix computers showing the "blue screen of death." Approximately 30% of impacted computers still display the error message.
Monday 7:00 AM PST
Elective procedures and surgeries resume at most of Providence's hospitals while the IT team continues to fix affected computers.
What’s ahead?
Moore said it could be up to four weeks before all devices are functional. As of Wednesday morning, Providence reported around 10%, or 6,700 computers, remained down. Jamal said Providence will purchase new laptop and desktop computers if necessary. So far the system has purchased only 60 new machines, Jamal said, but it has the option to replace 5,000 to 7,000 if needed.
Moore said the scope of Friday’s event made it more problematic than some cybersecurity attacks.
"I've been in the tech industry for 30 years," Moore said. "This is by far the biggest and most widespread information technology impact I've ever seen in my professional career."