For the KARL project, the development team is primarily involved in operations. Along with Six Feet Up as the hosting provider, we are responsible for many activities in “SaaS”. Bugs are reported to us, and we do the software updates, help monitor the site, and handle staging and testing of customer instances.
I really enjoy this aspect. It’s different from past involvement in open source projects, where involvement in the software is somewhat de-coupled from living with your mistakes. [wink] Stated differently, we have a direct interest in stability, performance, quality, and even in things like making KARL easy to monitor by Zenoss.
We periodically update the software. Which means restarting the app server. Which usually means downtime. Over time, we’ve whittled that down.
First, KARL restarts fast. Like, two seconds or so. Thus, the impact is minimal. Next, we use mod_wsgi, which lets us do “graceful” restarts in Apache: it finishes serving the current requests, then restarts the processes. Together, these make for very fast updates.
There’s one aspect that’s harder though. Sometimes our updates require “evolve” scripts to update data. For example, adding an index, or fixing a value that requires waking up lots of objects.
We used to do this live in a ZEO client, but when the evolve script takes more than a few minutes, it becomes prone to conflict errors with the running site. Which means shutting down the main site. Which sucks. (I’ve become quite obsessive about performance and uptime.)
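One pattern that sometimes helps with long evolve scripts is committing in small batches and retrying a batch when it conflicts, so each transaction is only open for a short time. The sketch below is not how KARL's evolve scripts work; it's a generic illustration. `ConflictError` here is a stand-in for the ZODB exception, the `commit` and `abort` callables stand in for whatever your transaction machinery provides, and `evolve_in_batches` is a made-up name. It also assumes `fix` is safe to run twice on the same object and that the running site can tolerate a half-evolved database.

```python
class ConflictError(Exception):
    """Stand-in for the ZODB conflict exception (assumption for this sketch)."""


def evolve_in_batches(items, fix, commit, abort, batch_size=100, max_retries=3):
    """Apply fix() to every item, committing after each batch.

    If commit() raises ConflictError, the batch is aborted and retried,
    up to max_retries times. fix() must be idempotent because a retried
    batch runs it again on the same items.
    """
    committed = 0
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                for item in batch:
                    fix(item)
                commit()
                committed += len(batch)
                break
            except ConflictError:
                abort()
                if attempt == max_retries:
                    # Give up on this batch and let the caller decide.
                    raise
    return committed
```

Smaller batches mean less work thrown away on each conflict, at the cost of more commits. Whether that trade is a win depends on where the script is actually bottlenecked.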
We have some ideas we think can mitigate this:
- SSD. Six Feet Up has the solid-state disks installed. Because of RAID and cabinet issues, the SSDs are going in the spare box in the rack. We’ll then move the site over. The hope is that evolve scripts get faster. If the evolve scripts are bottlenecked elsewhere (e.g. ZEO single-threading), then that’s a different issue.
- Read-only mode. Perhaps we could leave the site in read-only mode during the update, with a little banner informing the user. Preferably, we could switch the site into read-only mode without a restart.
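To make the read-only idea a bit more concrete, here is a rough sketch of a WSGI middleware that checks for a flag file on every request, so the mode can be flipped without a restart. None of this exists in KARL: `ReadOnlyMiddleware`, the flag file path, and the `karl.read_only` environ key are all made up for illustration. It also only blocks at the HTTP layer, so a GET that writes to the database would still get through.

```python
import os

# HTTP methods that are not expected to modify data.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class ReadOnlyMiddleware:
    """Reject write requests while a flag file exists (hypothetical sketch)."""

    def __init__(self, app, flag_path):
        self.app = app
        self.flag_path = flag_path

    def __call__(self, environ, start_response):
        # Checked per request, so creating or removing the file takes
        # effect immediately without restarting the app server.
        read_only = os.path.exists(self.flag_path)
        method = environ.get("REQUEST_METHOD", "GET")
        if read_only and method not in SAFE_METHODS:
            body = b"The site is read-only during maintenance. Please try again shortly.\n"
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
                ("Retry-After", "300"),
            ])
            return [body]
        # Hypothetical key the page templates could use to show a banner.
        environ["karl.read_only"] = read_only
        return self.app(environ, start_response)
```

Turning the mode on would then just be a matter of creating the flag file before the evolve script runs and deleting it afterwards.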
Any other ideas on minimizing downtime on such applications without major changes in architecture?
