June 7, 2013
It's been one of those kinds of weeks.
We have a handful of database servers at the datacenter, split up by function. So, the ticketing system lives on one server, regular web hosting lives on another, etc. We also have one server that conked out over Easter weekend and was waiting to be put back into service.
Now, due to assumptions we made when we built ticketing -- namely, that a majority of our clients would be interline carriers that would want to merge their routes with others' for sale -- the system was built around a central database. Given that two of our three heaviest hitters are commuter-style operations, with the third being an intercity non-interliner, that assumption turned out to be both wrong and inconvenient. The data for each non-interline carrier was basically getting in everyone else's way.
So we had a plan -- set up the system to allow multiple databases and start carving out the non-interline carriers. We set things up and did our testing with a new client and everything worked nicely. We were good to go. Plans were made.
And postponed, due to business travel or holidays. But things were plugging right along, so no real emergency.
Then, one Saturday, the system got ridiculously slow. The CPU was pegged and things were timing out everywhere. We figured the database had hit a tipping point: too much traffic had come in, requests were timing out, and resources weren't being released in a timely manner. I changed some of the server code to streamline one place where we were seeing slowness (at the risk of hitting an edge case, though it'd be rare) and watched as things settled down over the next couple of hours.
So that worked out nicely. And it gave us another reason to separate the database into chunks -- if one database got slammed the others wouldn't take it in the shorts quite so bad.
We decided to do the switch the second weekend of June. Then, the Saturday before, things got slow again. This was a problem, because there was nothing left for me to streamline, and the slowness was really intermittent: one request would go right through, and repeating it would time out.
So we moved the switch up to the next night, very early Monday morning. We took an external drive to the datacenter -- if we're going to be shoving databases around we might as well take a backup -- and then sliced our three big clients off into their own databases. The server, let's call it DB1, now had four databases in it: SystemClientA, SystemClientB, SystemClientC and SystemEveryoneElse. We were in at 12:30 AM and out by 3:00. No problems. We'd delete the extraneous data in the morning when we were fresh, and do it in pieces to avoid hosing things.
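For the curious, the "multiple databases" change mostly boils down to picking a connection per carrier instead of assuming one central database. Here's a minimal sketch of that routing idea in Python -- the connection strings are made up for illustration, since the real system's details aren't in this post:

    # Hypothetical sketch only. Carved-out carriers get their own database;
    # everyone else falls through to the shared one.
    CARRIER_DATABASES = {
        "ClientA": "Server=DB1;Database=SystemClientA",
        "ClientB": "Server=DB1;Database=SystemClientB",
        "ClientC": "Server=DB1;Database=SystemClientC",
    }

    DEFAULT_DATABASE = "Server=DB1;Database=SystemEveryoneElse"

    def connection_string_for(carrier: str) -> str:
        """Pick the right database for a carrier, defaulting to the shared one."""
        return CARRIER_DATABASES.get(carrier, DEFAULT_DATABASE)

    # A carved-out client and a small carrier still on the shared database:
    print(connection_string_for("ClientA"))
    print(connection_string_for("SomeSmallCarrier"))

Once a lookup like that exists, moving a client to a different server is just a change to one entry -- which, as it turned out, we'd be leaning on later in the week.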
At work on Monday, things were even worse. OK, new plan. The boss brought the dead server (DB2 for our purposes) back to the office and started replacing drives.
(When it died over Easter, it did so by crashing while we were updating an index -- a heavy I/O operation -- and appeared to have corrupted the database file. So new drives were in order.)
Drives were replaced, the RAID 1-0 was rebuilt, and everything was installed over the course of the afternoon/evening. We broke for dinner and got to the datacenter at about 11:00 PM. We put the server back in the rack, booted it up, and headed out to the quiet area to start moving the three separate clients onto it.
This time, as we were moving things, we decided to purge the useless data before we let everybody back in. Tear down the indexes, run the delete, rebuild the indexes. Cinch. Could do it in my sleep.
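If you've never done this dance, it looks roughly like the sketch below. I'm assuming a SQL Server-ish setup reached through pyodbc, and the table, column, and index names are all hypothetical -- the real schema isn't in this post -- but the three steps are the same: drop the index, delete the other carriers' rows in pieces, rebuild.

    import pyodbc

    # Hypothetical DSN; point it at the per-client database being purged.
    conn = pyodbc.connect("DSN=SystemClientA")
    conn.autocommit = True
    cur = conn.cursor()

    # 1. Tear down the index so the delete isn't also maintaining it.
    cur.execute("DROP INDEX IX_Tickets_Carrier ON Tickets")

    # 2. Run the delete in pieces to avoid hosing things.
    while True:
        cur.execute("DELETE TOP (10000) FROM Tickets WHERE CarrierId <> ?",
                    "CLIENT_A")
        if cur.rowcount == 0:
            break

    # 3. Rebuild the index once the dead weight is gone.
    cur.execute("CREATE INDEX IX_Tickets_Carrier ON Tickets (CarrierId)")
    conn.close()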
While we were rebuilding the indexes on the last database, the server decided to reboot itself.
OK, DB2 can't be trusted, but we can't leave everything in DB1. So some other, non-ticketing-system servers were pressed into service: SystemClientA went to one server, SystemClientB to a second, and SystemClientC to a third. Everyone else stayed on DB1 because we'd run out of servers it would be acceptable to put them on.
I got home at 4:30 AM on Tuesday, which this time of year is about half an hour before the sky starts to get light. I rolled into the office at about 11, and things were running great. We actually got to take a little breather, by which I mean do all the work we hadn't done on Monday.
Talking with tech support, we determined that DB2 (the dead one from Easter) had bad RAM, which caused the reboot when the database service tickled the wrong bits, and that DB1 had one of the drives in its RAID go bad, so the slowness came from the array trying to rebuild onto the hot spare while the databases were constantly being updated. We ordered 24GB of high-end RAM for DB2 and had it next-dayed to us.
On Wednesday night, DB2 went back to the datacenter and we moved SystemEveryoneElse over to it. As an aside, I missed a Nats game for this. I'm mostly OK with that, because they got their asses handed to them.
Again it went quickly: In at midnight and out at 3. DB1 came back to have its drives and RAM replaced (they were the same parts of the same vintage that had failed on DB2, so call it preventive maintenance). Those parts were ordered Thursday and should arrive on Friday.
Which means that Friday night I'll be back at the datacenter, putting DB1 back into the rack. Then I'll bring back Clients A, B and C from their little diaspora and we should be in good shape. At that point I'll have spent more time in the datacenter's conference room than in my own bed for the past week, but who's counting?
Edit, 6:17 PM: The RAID is taking longer to set up than we thought, and the boss doesn't want to ruin the weekend. So it'll be Sunday night when I'm at the datacenter again (again, again), making it four times in eight days.