Hitek Homeless May 11th, 2015
I’ve spent the last couple of weeks neck deep in new technology. That’s always awesome. The end result is that a couple of our humble projects are going to be running on a considerably more robust platform than the one we administered professionally for a regional telco a decade ago. Also, I spent about $10 learning. Ouch!
As a telco, we ran a very large file server that handled all of our users’ web sites in one central location. It was nice because we could scale out to multiple web servers for redundancy and scalability. However, when the NFS server went down, the ENTIRE cluster had to be massaged in the right order to bring it back online. Talk about your nightmare scenario: send a new tech over to trace cables and just wait for the network-wide outage to occur.
Our new infrastructure is a much smaller filesystem – we only have a couple of websites on it. However, we have three (3!!!) servers acting as redundant file servers. As they are all mirrors and geographically separated from each other, we can lose two entire regions before we have a real problem.
MySQL has been the de facto standard for SQL servers on a budget for years. We ran this in an ISP environment and had the forethought to have a master-slave setup. In the budget telco world of a decade past, when we lost a SQL server, everything crapped out until someone manually logged into the slave server and told it to become master. Yay… we didn’t lose everything. Boo, it was still an outage.
Our new infrastructure has clustered master-master SQL servers. I can bring a new one online without even telling the existing servers about it! We also have three of these running in different geographical areas. We can lose an entire region and keep chugging along. If we lose two regions, someone has to log in and tell the remaining server that it’s ok to think for itself. It’s quite an upgrade from the old days, but not quite as nice as the fileserver.
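Why does losing one region not matter while losing two needs a human? Here is a minimal sketch, assuming the cluster uses a simple majority rule (which is how Galera-based clusters such as Percona XtraDB Cluster generally behave). The function name and the three-node size are just for illustration, not our actual config.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if the nodes that can still see each other form a strict majority.

    Assumption: plain majority voting with equal node weights.
    """
    return reachable_nodes > cluster_size // 2


CLUSTER_SIZE = 3  # one SQL server per region, as in the post

for regions_lost in range(CLUSTER_SIZE):
    survivors = CLUSTER_SIZE - regions_lost
    if has_quorum(survivors, CLUSTER_SIZE):
        status = "keeps serving"
    else:
        status = "stops until someone logs in and overrides it"
    print(f"lost {regions_lost} region(s): {survivors} survivor(s) -> {status}")
```

A lone survivor can’t tell whether the other two died or it just got cut off from them, so it plays it safe and waits for a person.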
Then there’s the whole web server thing. It needs to talk to the SQL server and file server. Thanks to modern magic, it can seamlessly talk to either service even if its own closest server crashes. It would be hard to mention all the differences in our new implementation versus the old way of doing things, but our new web server software a) is f*cking fast as sh!t, b) has built-in caching to reduce the overall work load and c) is really good on memory usage.
Back in the telco days, budget was always a thing. We simply couldn’t get the money for load balancing hardware and didn’t have the people/time to do it properly in software. Today, that’s not even a concern. I can get multi-region load balancing that lets me put all of the above infrastructure in 4 different data centers for less than the cost of a 12 pack of microbrew once a month. Compare that to round robin DNS – where an outage means that users can keep hitting ‘reload’ and get a working server if they are persistent. I’ll take dirt cheap load balancing over DNS hackery!
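To show the difference, here is a toy comparison. The server names are made up, and the "health check" is just a fixed set of dead servers rather than a real probe, so treat it as a sketch of the idea and not of any particular load balancer.

```python
from itertools import cycle, islice

BACKENDS = ["web-a", "web-b", "web-c", "web-d"]  # hypothetical server names
DOWN = {"web-b"}                                 # pretend this one just crashed


def round_robin(backends):
    """Round robin DNS style: hand out every address in turn, dead or alive."""
    return cycle(backends)


def health_checked(backends, down):
    """Load balancer style: only rotate through servers that pass the check."""
    return cycle([b for b in backends if b not in down])


REQUESTS = 8
dns_answers = list(islice(round_robin(BACKENDS), REQUESTS))
lb_answers = list(islice(health_checked(BACKENDS, DOWN), REQUESTS))

failed_dns = sum(1 for b in dns_answers if b in DOWN)
failed_lb = sum(1 for b in lb_answers if b in DOWN)

print(f"round robin DNS: {failed_dns} of {REQUESTS} requests hit the dead server")
print(f"health checked:  {failed_lb} of {REQUESTS} requests hit the dead server")
```

With round robin, a share of the requests keeps landing on the dead box until someone pulls it from DNS. That is the "keep hitting reload" experience.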
Want to know the best part? All of this costs about half as much as we’ve been paying every month to run a single server at our current hosting provider. That’s right.. one big server, with no redundancy, a broken serial console and backups that don’t even boot if we have an outage. I’m TERRIFIED of rebooting our server because the console doesn’t work, our backups don’t boot and tech support says ‘pay for OS support and buy more capacity’ any time I mention that their services don’t actually work.
Meanwhile, I’ve intentionally rebooted 1-2 servers on the new system in the middle of SQL imports and filesystem modifications without a hiccup. Everything just works. Better yet.. I can add systems to the cluster if we need the capacity or upgrade the existing servers to larger systems anytime I want. We can literally get 3x the redundancy and twice the capacity for the same price. Yet, I only need half the capacity we currently pay for.
We also have the ability to add auto-scaling. That means we’d boot servers that contact our SQL/file servers and just serve web pages. And we could do it magically as needed and only pay for it while they were needed. Fancy.
Currently, I think our backend cluster has the extra capacity to handle our web services as well. But, if we need it in the future, we can boot thin web servers on-demand and kill them overnight to avoid unnecessary charges. Talk about the icing on your chocolate covered doughnut.
Want to be like the cool kids?
Things to avoid: master/slave mysql architecture, nfs and other single points of failure, mod_php and rackspace.
Things to consider: percona for elastic master/master sql or mariadb for configurable master/master, glusterfs (not elastic yet, but with hackery, it seems almost there), nginx webserver (with caching!), lighttpd webserver, varnish caching web proxy, fast-cgi+php-fpm, google compute engine and amazon web services.
FYI, google is offering a $300/2 month credit on compute engine. Don’t be like me and sign up before heading to Baja – losing your credit!
UPDATE (8:30am, 5.5 hours after cutover completion):
Cutover went well this morning… there’s nothing like a 1am maintenance window to remind me why I don’t want to work for the telcos again.
However, I had to scale up our servers in the middle of watching a movie as they were starting to lag a bit. 20 minutes of rebooting servers, without even a blip of an outage, and we’ve scaled our capacity vertically 4-fold.
Sadly, this raises costs. Now, our hosting costs are closer to 3/5 of what they were prior to cutover. Still, a nice savings. Did you see the part where I took the fileservers and sql servers offline and replaced them each in under 5 minutes with bigger servers and couldn’t even tell anything was happening from the client side? That’s pretty fancy for a couple of home-brewed websites about camping in the woods and where to dump poop.
After looking at the costs of inter-zone bandwidth and just how much of it we were using, I got to thinking about how to lower that cost. A little filesystem hackery was enough to drop our inter-zone bandwidth to 1/3 or less and reduce the CPU overhead enough that we could scale servers back DOWN! Who’d think the lowly symlink would be capable of saving 2/3 of our server costs as well as 2/3 of our bandwidth costs? Crazy!
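The general shape of a symlink trick like this can be sketched as follows. This is an illustration under assumptions, not the exact hackery: it supposes a write-heavy directory (a cache, say) gets pointed at local disk so its files never land on the mirrored volume. All paths here are throwaway temp directories with made-up names.

```python
import os
import tempfile

# Throwaway directories standing in for the real thing (hypothetical layout).
root = tempfile.mkdtemp()
replicated = os.path.join(root, "replicated_site")  # stands in for the mirrored volume
local = os.path.join(root, "local_cache")           # stands in for plain local disk
os.makedirs(replicated)
os.makedirs(local)

# The site still sees a "cache" directory, but it is only a pointer to local disk.
link = os.path.join(replicated, "cache")
os.symlink(local, link)

# The application writes through the usual path...
with open(os.path.join(link, "page.html"), "w") as f:
    f.write("cached page")

# ...and the file actually lands on local disk, outside the mirrored tree.
stored_locally = os.path.exists(os.path.join(local, "page.html"))
print("cache is a symlink:", os.path.islink(link))
print("file stored locally:", stored_locally)
```

The trade-off is that anything behind the symlink is no longer mirrored, so it only makes sense for data each server can rebuild on its own.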
I’m learning that the real power in cloud computing is in getting the smallest building blocks you can and using them as efficiently as possible.