Amazon EC2 and Sony PSN Failures Highlight Need for Education

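The outage of Amazon Elastic Compute Cloud (EC2) two weeks ago left businesses scrambling, and the same thing happened to consumers during last week's data breach of the Sony PlayStation Network (PSN). Between these two incidents, consumers and businesses alike have seen the fundamental nature of cloud computing's disruptive potential, and we are reminded that reliance on the cloud should never be a matter of blind faith. It requires a sound understanding of the ordinary web issues of performance, reliability, scalability, and security.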

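The issues surrounding each incident are important to understand: Amazon lost a number of hard drives (volumes), which had a ripple effect on many other machines in a single data center, while Sony's security breach exposed millions of passwords and thousands of credit cards.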

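Sony's issue is as much a public relations one as a security breach, with the company withholding information about the exposure of passwords, credit cards, and other personal information for several days. Since then, according to the company, it has been "rebuilding the network by hand" in terms of data storage and security features. Separate services, such as Netflix viewing on the PlayStation 3, were not affected by the PSN breach and subsequent outage because they are delivered from different cloud-based services, such as Amazon's EC2.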

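Amazon's EC2 issue was a bit more complicated, and goes to the heart of reliance on heavy-computing cloud-based services. One question, raised by what one pundit dubbed the "cloud hater" crowd, challenges the premise that cloud computing is better suited to scalability and redundancy than the average enterprise server farm.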


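A large part of the blame, according to "cloud hater" logic, lies with the marketing and sales approach to the cloud. One strong selling point of the cloud is that data is held reliably in the cloud, with no localized backup required, and is accessible at any time. It is almost an "upload and forget" approach: the message is that customers' data is ready and available whenever they need it, and that customers shouldn't really worry about the inner workings of the cloud.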

The counter to this argument has typically centered on the claim that network outages and intermittent network connectivity are consumer problems, and that few businesses face complete downtime on their internet connections. The FCC's broadband studies, however, tell a story of uneven and, at times, unavailable connectivity. Few consumers live in an always-on world; most go through frequent periods of intermittent connectivity, and both virtual and rural businesses face similarly inconsistent connectivity.

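In the streaming world, we have always prepared content for delivery across intermittent networks, from the early days of true RTP streaming to the more recent HTTP delivery of MPEG-4 fragments or adaptive bitrate video.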

What hadn't been discussed at length, at least not until the Amazon EC2 outage, is the intermittent availability of data centers and the intentional lack of redundancy between data centers.

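Given the marketing around the cloud, one could easily assume that the "upload and forget" model gave every EC2 customer built-in redundancy. Amazon's response dispelled that myth, although the company is working to address the problem.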

The Amazon EC2 Outage: What Happened

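EC2 itself didn't go down completely, but the effect on stored website data could be seen across a number of websites because of a problem that, according to Amazon,

"primarily involved a subset of the Amazon Elastic Block Store ("EBS") volumes in a single Availability Zone (e.g., data center) within the U.S. East Region that became unable to service read and write operations."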



These incapacitated drives caused a ripple effect throughout the EC2 infrastructure, as each affected node (and node cluster) searched for other nodes with enough storage space to replicate its data. During the time that content is being replicated, access to it is locked out. Amazon said in its post-mortem report that

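"the nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly. There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication."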
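To illustrate what "backing off" means in the quote above, here is a minimal Python sketch of capped exponential backoff with random jitter, a common retry pattern. It is not Amazon's code: the function names, the `try_find` callback, and the default delays are all hypothetical, chosen only for illustration.

```python
import random


def backoff_delays(base=0.5, cap=60.0, attempts=8, rng=None):
    """Yield one randomized delay (in seconds) per retry attempt."""
    if rng is None:
        rng = random.Random()
    for attempt in range(attempts):
        # The ceiling doubles after every failed attempt, up to `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Jitter: pick a random delay below the ceiling so that many
        # nodes retrying at once do not all search again in lockstep.
        yield rng.uniform(0.0, ceiling)


def find_replica_target(try_find, sleep, **backoff_options):
    """Call `try_find` until it returns a target or the attempts run out.

    `try_find` is a hypothetical callback that returns a node with free
    space, or None when no space was found. `sleep` is injected so the
    caller can pass time.sleep in production or a stub in tests.
    """
    for delay in backoff_delays(**backoff_options):
        target = try_find()
        if target is not None:
            return target
        sleep(delay)  # wait before searching again
    return None  # give up rather than search forever
```

In real use, `sleep` would typically be `time.sleep`. The point of the pattern is that a node that cannot find space waits progressively longer between searches, rather than adding to the load on a cluster that is already short of capacity.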

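As with any redundant system, one would assume that content is stored in multiple off-site locations, a common practice in enterprise server solutions. For all the cloud marketing, however, redundancy across multiple locations or Availability Zones is not necessarily a given with EC2, since Amazon charges more for storage across multiple Availability Zones.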

In its report, the company appeared to place part of the blame on customers for not choosing the multi-zone option, or not writing applications to take advantage of these multiple zones.

Still, if the marketing about redundancy and reliability is to be believed, customers should not need to understand or work across multiple Availability Zones.
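For readers wondering what writing an application to take advantage of multiple zones might involve, the following Python sketch shows the simplest form of the idea: try one zone, and fall back to the next if it cannot serve the request. Everything here is an assumption for illustration. The `ZoneUnavailable` exception, the zone names, and the reader callbacks are hypothetical stand-ins, not part of any Amazon API.

```python
class ZoneUnavailable(Exception):
    """Raised by a (hypothetical) zone reader when its zone cannot serve I/O."""


def read_with_failover(key, zone_readers):
    """Return (zone_name, value) from the first zone that can serve `key`.

    `zone_readers` is an ordered list of (zone_name, reader) pairs, where
    each reader is a callable taking a key and returning the stored value.
    """
    failed = []
    for zone_name, reader in zone_readers:
        try:
            return zone_name, reader(key)
        except ZoneUnavailable:
            failed.append(zone_name)  # remember the failure, try the next zone
    raise ZoneUnavailable(f"no zone could serve {key!r}; tried {failed}")
```

This only works if the data was written to more than one zone in the first place, which is exactly the option that carried an extra charge.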

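Shortly before Amazon's service went down, I moderated a discussion at a new media/broadcast roundtable at the recent National Association of Broadcasters show in Las Vegas. The roundtable was sponsored by Microsoft, iStreamPlanet, and Interxion, the latter a data center facility provider that covers European cities using a two-data-centers-per-city approach.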

One of the issues raised at the roundtable was that of cloud reliability. Even prior to the Amazon outage, questions were raised about delivery speeds and the reliability of cloud services for mission-critical applications. One participant noted that, while they relied on their technology partners to recommend tried-and-tested solutions, an issue with the cloud was establishing liability in the event of a cloud outage.

"We can't sue the cloud," the participant quipped.


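In Amazon's case, however, the company understands the impact the outage had on its customers. While refunds for several days of downtime will not make up for the revenue losses many companies faced, it appears Amazon will relax its policy of charging extra to store data across multiple Availability Zones.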

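Amazon also has its own work to do, both in educating potential EC2 customers and in correcting and extending its software code, and admitted as much when it announced a series of webinars:

"The first topics we will cover will be Designing Fault-tolerant Applications, Architecting for the Cloud, and Web Hosting Best Practices. The webinars over the next two weeks will be hosted several times daily to support our customers around the world in multiple time zones. We will set aside a significant portion of the webinars for detailed Q&A. Follow-up discussions for customers or partners will also be arranged."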

In addition to the webinars, Amazon is making available whitepapers on AWS architecting best practices, and will also modify its services to allow multi-zone balancing automatically, without customer intervention.

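In other words, Amazon hopes to address the outage through a series of action items that bring automatic recovery and redundancy to the cloud in a way most enterprise customers have been accustomed to for years.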

Rather reminds one of a variation on the old nursery rhyme: when it works, it is very, very good, but when it doesn't, it is horrid.

