We have constructed a new cluster to attack the problem from two directions. First: every node in the cluster gets more resources than it has today. Disk I/O killed us as the block size increased, so we need to reduce the number of nodes fighting for disk access, and RAM will be increased for each node. These changes will increase our costs, but challenge times will go down and reliability will go way up. Second, and this is the momentous change: true fail-over for nodes.
In the new cluster, if a node isn’t able to be immediately and automatically fixed, it is rebuilt from a known good copy. If a server dies, knocking its nodes offline, the nodes are automatically brought back up on random servers across three countries and two continents. During testing, nodes that were knocked offline by a simulated host failure were back online elsewhere in under 10 minutes.
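The recovery path described above can be sketched roughly like this. Everything here is a minimal illustration of the idea, not our actual tooling: the `recover_node` helper, the host list, and the country codes are all invented for clarity.

```python
import random

# Hypothetical pool of healthy hosts spread across three countries
# and two continents (names and regions are illustrative only).
HOSTS = {
    "us-east-1": "US",
    "us-west-2": "US",
    "de-fra-1": "DE",
    "fi-hel-1": "FI",
}

def recover_node(node_id, failed_host, hosts=HOSTS):
    """Rebuild a node from its known-good copy on a random healthy host.

    Sketch of the fail-over flow: pick any host other than the one
    that died, restore the node's data there, and bring it online.
    """
    candidates = [h for h in hosts if h != failed_host]
    target = random.choice(candidates)
    # 1. Restore the node's data from the last known-good copy.
    # 2. Start the node on the chosen host.
    return {"node": node_id, "host": target, "status": "online"}
```

Because the target host is chosen at random from the surviving pool, a single dead server can't take the same nodes down twice in a row.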
The migration will begin slowly, in careful batches, so that no node experiences any more downtime than is absolutely necessary. For the time being, new orders will still be provisioned in the soon-to-be-legacy environment until everything is cut over and the last of the old environment is turned off.
Redundancy and automatic fail-over aren't the only things we need more of. It's become clear that email can't go unanswered, and Chainsaw needs more heads and hands, so I've begun looking for additional help. Everything that contributed to this outage needs an upgrade.