Should We Abandon the Cloud?

It's been a bad month for the cloud.

First there was the major Amazon EC2 (Elastic Compute Cloud) outage of April 21-22, which brought down many businesses and websites. Some of the data was unrecoverable and transactions were lost.

Next, the May 10-13 outage of Microsoft's cloud-based email and Office services (Business Productivity Online Suite) caused major angst among its customers, who thought that the cloud offered increased reliability.

Then we had the May 11-13 Google Blogger outage, which brought down editing, commenting, and content for thousands of blogs.

Outages at the 3 largest providers of cloud services within a 2-week period do not bode well.

Yesterday, Twitter went down as well.

Many have suggested we abandon a cloud-only strategy.

Should we abandon the cloud for healthcare? Absolutely not.

Should we reset our expectations that highly reliable, secure computing can be provided at very low cost by "top men" in the cloud? Absolutely yes.

I am a cloud provider. At my Harvard Medical School Data Center, I provide 4000 cores and 2 petabytes of data to thousands of faculty and staff. At BIDMC, I provide 500 virtualized servers and a petabyte of data to 12,000 users. Our BIDPO/BIDMC Community EHR Private Cloud provides electronic health records to 300 providers.

I know what it takes to provide 99.999% uptime. Multiple redundant data centers, clustered servers, arrays of tiered storage, and extraordinary power engineering.
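To put "five nines" in perspective, the downtime budget it implies can be worked out directly. The sketch below is only illustrative arithmetic: it assumes a 365-day year, and the function name is made up for this example.

```python
def downtime_budget_minutes(availability_percent, days=365):
    """Minutes of downtime allowed over `days` at the given availability."""
    total_minutes = days * 24 * 60  # 525,600 minutes in a 365-day year
    return total_minutes * (1 - availability_percent / 100)


# Compare three, four, and five nines of availability.
for pct in (99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.999%, the whole year's allowance is a little over five minutes, so a single bad change or runaway job can use up the budget many times over.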

With all of this amazing infrastructure comes complexity. With complexity comes unanticipated consequences, change control challenges, and human causes of failure.

Let's look at the downtime I've had this year.

1.  BIDMC has a highly redundant, geographically dispersed Domain Name System (DNS) architecture. In theory it should not be able to fail. In practice it did. The vendor was attempting to add features that would make us even more resilient. Instead of making changes to a test DNS appliance, they accidentally made changes to a production DNS appliance. We experienced downtime in several of our applications.

2.  HMS has clustered thousands of computing cores together to create a highly robust community resource connected to a petabyte of distributed storage nodes. In theory it should be invincible. In practice it went down. A user with limited high-performance computing experience launched a poorly written job to 400 cores in parallel that produced a core dump every second, all contending for the same disk space. Storage was overwhelmed and went offline for numerous applications.