Sorry to my three readers out there for not posting recently; I've been pretty busy! And there haven't been many events in the tech world over the past month or so that interested me enough to write about them.
I was talking with a friend of mine recently; he was thinking about either throwing a 1U server into a local co-location facility or playing around with one of the cloud service providers. Since I'm still doing both (I've been too lazy to completely move out of the co-lo...), I gave him my own thoughts, and it got me thinking more about the cloud in general.
What do I expect from a cloud?
When I'm talking cloud I'm mainly referring to IaaS, or Infrastructure as a Service. Setting aside cost modeling for a moment, I expect IaaS to more or less just work. I don't want to have to care about:
- Power supply failure
- Server failure
- Disk drive failure
- Disk controller failure
- Scheduled maintenance (e.g. host server upgrades, whether software or hardware, fixes, etc.)
- Network failure
- UPS failure
- Generator failure
- Dare I say it? A fire in the data center?
- And I absolutely want to be able to run whatever operating system I want, and manage it the same way I would manage it if it were sitting on a table in my room or office. That means booting from an ISO image and installing it like I would anything else.
Hosting it yourself
I've been running my own servers for my own personal use since the mid 90s. I like the level of control it gives me and the amount of flexibility I have running my own stuff. It also gives me a playground on the internet where I can do things. After multiple power outages over the first part of the decade, one of which lasted 28 hours, and the acquisition of my DSL provider for roughly the fifth time, I decided to go co-lo. I already had a server, and I put it in a local Tier 2 or Tier 3 data center; I could not find a local Tier 4 data center that would lease me 1U of space. So I lacked:
- Redundant Power
- Redundant Cooling
- Redundant Network
- Redundant Servers (if my server chokes hard, I'm looking at days to a week+ of downtime)
For the most part I guess I had been lucky: the facility had one, maybe two outages since I moved in about three years ago. The bigger issue was that my server was aging and the disks were failing. It was a pain to replace them, and it wasn't going to be cheap to replace the system with something modern and capable of running ESXi in a supported configuration (my estimates put the cost at a minimum of $4k). Add to that the fact that I need such a tiny amount of server resources.
Doing it right
So I had heard of Terremark from my friends over at 3PAR, and you know I like 3PAR; they use VMware, and I like VMware. So I decided to go with them rather than the other providers out there. They had a decent user interface, and I got up and going fairly quickly.
So I've been running it for almost a year with pretty much no issues. I wish they had a bit more flexibility in the way they provision networking, but nothing is perfect (well, unless you have the ability to do it yourself).
From a design perspective, Terremark has done it right, whether it's providing an easy-to-use interface to provision systems, using advanced technology such as VMware, 3PAR, and NetScaler load balancers, or building their data centers to be even fire proof.
Having the ability to do things like vMotion or Storage vMotion is absolutely critical for a service provider; I can't imagine anyone being able to run a cloud without such functionality, at least with a diverse set of customers. Having things like 3PAR's persistent cache is critical as well, to keep performance up in the event of planned or unplanned downtime on the storage controllers.
I look forward to the day when the level of instrumentation and reporting in the hypervisors allows billing based on actual usage, rather than what is provisioned up front.
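To make the distinction concrete, here's a toy sketch contrasting the two billing models. The rates and sample data are entirely made up for illustration and have nothing to do with Terremark's actual pricing:

```python
# Toy comparison of provisioned vs. usage-based billing.
# RATE_PER_GB_HOUR is a hypothetical price, not any provider's real rate.

RATE_PER_GB_HOUR = 0.01

def provisioned_bill(provisioned_gb, hours):
    """Pay for what was allocated up front, used or not."""
    return provisioned_gb * hours * RATE_PER_GB_HOUR

def metered_bill(samples_gb):
    """Pay for sampled actual usage (one sample per hour)."""
    return sum(samples_gb) * RATE_PER_GB_HOUR

# A mostly idle VM with one busy hour, on a host provisioned with 8 GB
samples = [1.0, 1.2, 0.8, 4.0]
print(provisioned_bill(8, hours=4))  # 0.32 -- billed for 8 GB regardless
print(metered_bill(samples))         # 0.07 -- billed for 7 GB-hours used
```

The gap between the two numbers is exactly why better hypervisor instrumentation would matter to customers with bursty workloads.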
In case you're a less technical user, I wanted to outline a few of the abilities the technology Terremark uses offers its customers.
Memory Chip Failure (or any server component failure or change)
Most modern servers have sensors on them and, for the most part, are able to accurately predict when a memory chip is behaving badly and warn the operator of the machine to replace it. But unless you're running on some very high end specialized equipment (which I assume Terremark is not, because it would cost too much for their customers to bear), the operator needs to take the system off line in order to replace the bad hardware. So what do they do? They tell VMware to move all of the customer virtual machines off the affected server onto other servers. This is done without customer impact; the customer never knows it is going on. The operator can then take the machine off line, replace the faulty components, and reverse the process.
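The drain-the-host workflow can be sketched in miniature. This is not the VMware API, just a toy model of what the operator's tooling effectively does; the host names and the dict-based placement are invented for illustration:

```python
# Toy model of evacuating a host before hardware maintenance.
# Real systems do this with live migration (vMotion); here VM placement
# is just a dict mapping host name -> list of VM names.

def evacuate_host(placement, bad_host):
    """Move every VM off bad_host onto the least-loaded healthy host."""
    healthy = [h for h in placement if h != bad_host]
    if not healthy:
        raise RuntimeError("no healthy hosts to receive VMs")
    for vm in list(placement[bad_host]):
        target = min(healthy, key=lambda h: len(placement[h]))
        placement[bad_host].remove(vm)
        placement[target].append(vm)  # live migration happens here in reality
    return placement

placement = {
    "esx01": ["vm-a", "vm-b", "vm-c"],  # host with the failing DIMM
    "esx02": ["vm-d"],
    "esx03": ["vm-e"],
}
evacuate_host(placement, "esx01")
print(placement["esx01"])  # [] -- esx01 is now safe to power off
```

Once the list is empty, the operator can pull the box, swap the bad DIMM, and run the process in reverse, all while the VMs kept serving traffic.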
The same applies if you need to:
- Perform firmware or BIOS updates/changes
- Perform Hypervisor updates/patches
- Maybe you're retiring an older type of server and moving to a more modern system
Disk drive failure
This one is pretty simple: a disk fails in the storage system and the vendor is dispatched to replace it, usually within four hours. They may opt to wait longer for whatever reason, but with 3PAR it doesn't really matter. There are no dedicated hot spares, so you're really in no danger of losing redundancy; the system rebuilds quickly using a many-to-many RAID relationship and is fully redundant once again in a matter of hours (vs. days with older systems and whole-disk-based RAID).
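Some rough arithmetic shows why the many-to-many rebuild matters. The drive count and per-drive rebuild rate below are my own made-up numbers, not 3PAR's:

```python
# Back-of-the-envelope rebuild-time comparison.
# Assumption: rebuild throughput scales with the number of drives that
# can absorb rebuild writes in parallel.

def rebuild_hours(data_gb, writers, mb_per_sec_per_drive):
    """Hours to reconstruct data_gb when `writers` drives share the work."""
    total_mb_per_sec = writers * mb_per_sec_per_drive
    return (data_gb * 1024) / total_mb_per_sec / 3600

# 1 TB of used space on the failed drive, ~30 MB/s of rebuild I/O per drive
classic     = rebuild_hours(1000, writers=1,  mb_per_sec_per_drive=30)   # one hot spare
distributed = rebuild_hours(1000, writers=40, mb_per_sec_per_drive=30)   # many:many
print(round(classic, 1), round(distributed, 2))  # 9.5 0.24
```

With a single dedicated spare, the one replacement drive is the bottleneck; spread the rebuild across dozens of drives and the redundancy window shrinks from most of a working day to minutes.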
Storage controller software upgrade
There are fairly routine software upgrades on modern storage systems; the feature set seems to just grow and grow. So the ability to perform the upgrade without disrupting users for more than a few seconds is really important with a diverse set of customers, because there will probably be no good time when all customers say "OK, I can take some downtime." Having highly available storage that can maintain performance with a controller off line, by mirroring the cache elsewhere, is a very useful feature to have.
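Here's a toy model of the idea (my own sketch, not how any real array is implemented): every cached write is mirrored to a partner controller, so when one controller goes off line for an upgrade the survivor still holds the cache contents and keeps acknowledging writes:

```python
# Toy sketch of a mirrored write cache across two storage controllers.
# Class and attribute names are invented for illustration.

class Controller:
    def __init__(self, name):
        self.name = name
        self.cache = {}       # block -> data, the write cache
        self.partner = None   # the other controller in the pair
        self.online = True

    def write(self, block, data):
        owner = self if self.online else self.partner  # fail over if down
        if owner is None or not owner.online:
            raise RuntimeError("both controllers down")
        owner.cache[block] = data                 # cached write, fast ack
        if owner.partner and owner.partner.online:
            owner.partner.cache[block] = data     # mirror to the partner

a, b = Controller("ctl-a"), Controller("ctl-b")
a.partner, b.partner = b, a
a.write("blk1", "x")
a.online = False        # ctl-a taken down for a software upgrade
a.write("blk2", "y")    # transparently handled by ctl-b
print(sorted(b.cache))  # ['blk1', 'blk2']
```

Without the mirror, losing a controller would force the survivor into slow write-through mode until the upgrade finished; that is the performance drop persistent cache avoids.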
Storage system upgrade (add capacity)
Being able to add capacity without disruption and dynamically re-distribute all existing user data across all new as well as current disk resources on-line to maximize performance is a boon for customers as well.
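A minimal sketch of the wide-striping idea, with invented disk names and extent counts: re-running the same round-robin layout over a larger disk list spreads the existing extents across the new spindles too:

```python
# Toy illustration of re-striping data after adding a disk.
# Extents are just integers; disks are just names.

def stripe(extents, disks):
    """Round-robin extents across the disk list."""
    layout = {d: [] for d in disks}
    for i, ext in enumerate(extents):
        layout[disks[i % len(disks)]].append(ext)
    return layout

extents = list(range(12))
before = stripe(extents, ["d0", "d1", "d2"])        # 4 extents per disk
after  = stripe(extents, ["d0", "d1", "d2", "d3"])  # 3 extents per disk
print(len(before["d0"]), len(after["d0"]))          # 4 3
```

The point is that after the expansion every disk, old and new, carries an equal share of the data, so all spindles contribute to performance instead of the new ones sitting idle until new data arrives.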
UPS failure (or power strip/PDU failure)
Unlike the small, dinky UPS you may have in your house or office, UPSs in data centers typically power up to several hundred machines, so if one fails you may be in for some trouble. But with redundant power feeds you have little to worry about; the other feed takes over without interruption.
If a server power supply blows up, it has the ability to take out the entire branch, or even the whole circuit, it's connected to. But once again, redundant power saves the day.
Uh-oh I screwed up the network configuration!
Well, now you've done it: you hosed the network (or maybe your system just dropped off the network, perhaps a flaky network driver or something) and you can't connect to your system via SSH or RDP or whatever you were using. Fear not: establish a VPN to the Terremark servers and you can get console access to your system. If only the console worked from Firefox on Linux... can't have everything, I guess. Maybe they will introduce support for vSphere 4.1's virtual serial concentrators soon.
It just works
There are some applications out there that don't need the level of reliability the infrastructure Terremark uses can provide, and they prefer to distribute things over many machines or many data centers; that's fine too. But most apps, almost all apps in fact, make the same common assumption, perhaps you could call it the lazy assumption: they assume it will just work. That shouldn't surprise anyone, because achieving that level of reliability at the application layer alone is an incredibly complex task to pull off. So instead you have multiple layers of reliability under the application, each handling a subset of availability, layers that have been evolving for years, or in some cases even decades.
Terremark just works. I'm sure there are other cloud service providers out there that work too; I haven't used them all by any stretch (nor am I seeking them out, for that matter).
Public clouds make sense, as I've talked about in the past, for a subset of functionality; they have a very long way to go in order to replace what you can build yourself in a private cloud (assuming anyone ever gets there). For my own use case, this solution works.