Monday, October 23, 2006

My Deployment

Unfortunately, the deployment Saturday night / Sunday morning didn’t go well.

There were two aspects to the deployment: some changes to the database, and the actual application code itself (the EAR file, for those of you who are familiar with J2EE). Our system depends on some other back-end systems, which are always taken offline for maintenance between 4:00 and 5:00 AM, so we had to finish our work before 4:00, or else we’d spend an hour sitting around waiting before we could get back to work. This isn’t usually an issue for us, because our deployments usually take only about half an hour; starting at 1:30, as planned, we should have been done by 2:00.

Here’s how it went down:

  • At 1:30, we took the system offline and began the database changes.
  • The first sign that this was not a typical deployment came when 20 or 30 minutes went by and the database changes still weren’t complete. They normally take only a few minutes.
  • A little investigation gave us the answer: one of our scripts was looping through information in some database tables, and there was far more data in production than in our testing environments. So the script, which took only a few minutes to run with 15,000 records, was taking over an hour to run with over 2 million records.
  • The database expert we had with us suggested a quick optimization we could apply to the script, which should help it run faster, so we halted execution, made the change, and started it over again.
  • It finally finished at about 3:45. Dangerously close to the 4:00 window!
  • We deployed the application code, and got ready to test.
  • The code finished deploying, and was up and running, at about 4:04. Rats.
  • We decided to reconvene at 5:00.
    • During this interval, I took advantage of the delay to go to the gas station down the road, and pick up some snacks.
    • I also did a search on YouTube, for videos having to do with “Twinkies”, and was disappointed that there weren’t more. I shared one mediocre one with James, who was kind enough to keep me company on MSN Messenger.
  • At 5:00 we reconvened, and tried to test, but the system still wasn’t working. We double-checked, and verified that the back-end systems were up and running, but the code still wasn’t working.
  • After a bit of investigation, it appeared that there was a problem with the code itself.
  • At 5:45, we made the call to “roll back” the changes. (In other words, to restore the database and the application to the state they were in before we began the deployment, reverting to the previous version of the application.) In other circumstances, we would have spent more time troubleshooting the problem, but we needed to have the system up and running by 7:00 in some state, with either the new version of the application or the old one, and we didn’t think we’d be able to troubleshoot the new version by then.
  • At about 6:30, we were back up and running with the previous version of the application, and I was able to go home.
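
For the curious, here is a rough sketch of the back-of-the-envelope check that would have flagged the slow database script ahead of time. It is only an illustration, not anything we actually ran: the function names are made up, the 2-minute test timing is an assumed stand-in for “a few minutes”, and it assumes run time grows in direct proportion to the number of records, which is a crude guess (the real script evidently didn’t scale exactly that way).

```python
def projected_minutes(test_rows, test_minutes, prod_rows):
    """Scale a measured test run time up to production size.

    Assumes run time grows linearly with row count, which is only
    a rough approximation for a script that loops over records.
    """
    if test_rows <= 0:
        raise ValueError("test_rows must be positive")
    return test_minutes * (prod_rows / test_rows)


def fits_window(start_minute, duration_minutes, cutoff_minute):
    """True if work starting at start_minute finishes by cutoff_minute.

    All times are minutes after midnight.
    """
    return start_minute + duration_minutes <= cutoff_minute


# Row counts come from the story above; the 2.0-minute test timing
# is an assumption made purely for illustration.
estimate = projected_minutes(test_rows=15_000, test_minutes=2.0, prod_rows=2_000_000)

START = 1 * 60 + 30   # 1:30 AM, when we took the system offline
CUTOFF = 4 * 60       # 4:00 AM, when the back-end systems go down

print(f"Projected script run time: {estimate:.0f} minutes")
print("Fits before 4:00?", fits_window(START, estimate, CUTOFF))
```

Even with generous assumptions, a projection like this lands well past the 4:00 cutoff, which is the kind of warning that is much nicer to get before 1:30 in the morning than after.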