Sunday, July 20, 2008

Service Outage

Today we experienced a major service outage of about 7 hours because the Amazon Simple Storage Service (S3) went offline. We chose Amazon S3 for its reliability and scalability, which lets us use the robust Amazon back end to manage the hosting part of the Showit Sites service. We apologize for this problem. Ultimately we cannot point fingers, since it was our choice to go with Amazon, and we will be looking very closely at the questions about Amazon's stability.

Up until today, the Amazon S3 service had only been down a few hours over the course of two years. I don't know how many web services can say the same. It just happened at a very unfortunate time for us, as we are launching version 1.0 of our Showit Sites software.

Creating our own backups of the S3 service would be difficult because of the amount of work Amazon has put into building a scalable solution for storing data online. Part of what you get with the service is redundancy, so duplicating that ourselves would be a huge operation. The downside is that when their network goes down in a crazy fluke like this, their redundancy doesn't help. Keep in mind that this type of thing is a network failure. The cause is not entirely known, but it was probably hardware, switches, or something along those lines that caused major communication problems. There was no loss of data, meaning that once the network was fixed, all of your stored data was still there. They typically have 99.9% uptime, but as with anything involving computers and the web, unexpected stuff comes up and hardware will fail.
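To put the 99.9% figure in perspective, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the function name and the choice of a 365-day year are our own assumptions, not how Amazon measures or reports uptime.

```python
# Rough sketch: how much downtime a given uptime percentage allows.
# Assumption: a plain 365-day year; this is not Amazon's own accounting.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent, period_hours):
    """Hours of downtime that still meet the given uptime percentage."""
    return period_hours * (100 - uptime_percent) / 100


yearly_budget = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
print(f"99.9% uptime allows about {yearly_budget:.2f} hours of downtime per year")
```

Under these assumptions, 99.9% works out to a little under 9 hours of downtime per year, so an outage of about 7 hours uses up most of a year's allowance in a single day.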

We know from the past 4 years of doing business online that managing servers and networks is a nightmare, and when stuff goes down it can take a while to fix. Our servers have been down a couple of times now, and in one case it took a week to get our store back up and there was nothing we could do about it. At least with Amazon I know that every minute of downtime is costing thousands of customers, and they have their best people working on getting it restored.

The outage did raise a lot of issues for us about how our software and your sites respond when this happens. We will be looking at better ways of dealing with it, so that you don't have a blank page on your site with no helpful info, and at better ways to let you know when things happen.

Again, thanks for the feedback. We will continue to work on making our service reliable enough for your website to depend on. Thanks,

Todd

1 comment:

John Rayl said...

Ok, that was a month ago.

C'mon, let's get this blog updated with some fresh info, like previews of soon-to-be-released features!