The hard part of continuous deployment

Last week I posted about what sort of changes you might need to make to your deployment process to support updates multiple times a day. I said I would follow up with a description of the technical challenges involved in keeping your users from hating you if you actually did deploy that often.

The reasons your users will hate frequent deployments are:

  • Non-zero server downtime when a new build is released
  • Being forced to log out of their existing play session for a new build
  • Waiting for patches to download for every little change

I will describe solutions to each of these problems below.

Eliminating Lengthy Server Downtime

The database architecture in Pirates requires that every character in the database be loaded and processed whenever there is a schema change.  From the programmer’s point of view such changes are easy to make, so they occur frequently. Unfortunately this processing is why the Pirates servers take an hour to upgrade. I don’t know how common this kind of processing is in other games, but I know we weren’t the only ones who had to deal with it.

The need to process character data stems from the fact that Pirates stores persistent character data in a SQL database, but the actual data is in opaque blobs. The blobs contain just the field data for each object and no version information, so the whole system can only handle one version of each class at a time.

A better way to go would be to build your persistence layer so that it can handle any number of historical versions of player data. Add a schema version to each blob and register each new schema as it is encountered. When a server needs to read an object it can perform the schema upgrades on the fly and avoid the long, slow upgrade of every object at patch time. If necessary you could even have a process that crawls the database in the background, upgrading inactive characters so that they are ready to go the next time they log in.
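
To make that concrete, here is a rough sketch in Python (a real server would do this in its native language, and the blob format, field names, and upgrade functions here are invented for illustration):

    import json

    CURRENT_VERSION = 3

    def _v1_to_v2(data: dict) -> dict:
        """v1 -> v2: add a field that did not exist before, with a default."""
        return {**data, "bank_slots": 10}

    def _v2_to_v3(data: dict) -> dict:
        """v2 -> v3: rename the old 'coins' field to 'gold'."""
        data = dict(data)
        data["gold"] = data.pop("coins")
        return data

    # Each entry upgrades a character dict from version N to N + 1.
    SCHEMA_UPGRADES = {1: _v1_to_v2, 2: _v2_to_v3}

    def load_character(blob: bytes) -> dict:
        """Deserialize a stored blob and upgrade it to the current schema on the fly."""
        record = json.loads(blob)
        version = record.get("schema_version", 1)
        data = record["data"]
        while version < CURRENT_VERSION:
            data = SCHEMA_UPGRADES[version](data)
            version += 1
        return data

    def save_character(data: dict) -> bytes:
        """Blobs are always written tagged with the current schema version."""
        return json.dumps({"schema_version": CURRENT_VERSION, "data": data}).encode()

The background crawler is then just a loop that calls load_character and save_character on inactive rows in small batches.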

Supporting multiple concurrent versions

Database updates are the longest part of server downtime at patch time, but they aren’t the only one. For Pirates (and most MMOs) each world instance must be taken offline so that the new version can be installed on the server hardware and then launched. This adds some downtime for the upgrade but, most painfully, it also disconnects all active players. A better method would be to build your server clusters in such a way that they tolerate multiple simultaneous versions of the code.

It turns out that Guild Wars already supports this.  One of the comments from the last post describes a bit about how this works:

Once the new code was on all the servers in the datacenters, we flipped a switch that notified the live servers that any new ‘instances’ should be started using the latest version of the game code. At this point, any users running older builds of the client got told that they needed to update, and attempts to load into a new instance would be rejected until they updated.
To summarize how the live update worked on the server, I believe we atomically loaded the new game content onto the servers, and then loaded the newest version of the game code into the server processes alongside the older version, because each build was a single loadable DLL. That let us keep old instances alive alongside new instances on the same hardware, and isolated us from some of the hairy problems you’d get from multiple client versions running in the same world.

For players in the PvE parts of the game this method apparently worked quite well. However PvP players could only play with each other when everyone was running the same version of the game, so this caused problems with PvP events. It isn’t perfect, but this is a huge step in the right direction.

The Guild Wars approach worked for them, but it would be less suitable for Pirates, which has eight different types of processes making up its server clusters. However, with some changes to the Pirates Server Directory, multiple versions of those executables could easily co-exist on a single server machine. When a client connects to Pirates it queries the server directory process to find an IP address and port number to connect to. That query already takes the version number into account, and could be adapted to simply never return cluster servers from an older version.
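
To illustrate the idea, here is a toy version-aware directory in Python. The class and method names are mine, not the actual Pirates Server Directory, but the shape is the same: track which build each server is running, hand clients only matching endpoints, and stop handing out old builds once the new one is live:

    from __future__ import annotations
    from dataclasses import dataclass, field

    @dataclass
    class ClusterServer:
        build: str                       # e.g. "1.4.22"
        host: str
        port: int
        accepting_new_sessions: bool = True

    @dataclass
    class ServerDirectory:
        servers: list[ClusterServer] = field(default_factory=list)

        def register(self, server: ClusterServer) -> None:
            self.servers.append(server)

        def retire_build(self, build: str) -> None:
            """Stop routing new connections to an older build; sessions already
            in progress on those servers keep running until they drain."""
            for server in self.servers:
                if server.build == build:
                    server.accepting_new_sessions = False

        def find_endpoint(self, client_build: str) -> ClusterServer | None:
            """Return a server matching the client's build, or None so the
            client knows it has to patch before it can get in."""
            for server in self.servers:
                if server.build == client_build and server.accepting_new_sessions:
                    return server
            return None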

Another approach that would eliminate the need for any client downtime for many patches is to make those systems work more like the web. The next time I architect an MMO (at Divide by Zero Games, actually) I intend to move most of what the traditional game client does outside of the typical persistent client-server connection entirely. Systems like mission interaction, guild management, in-game mail, and the like don’t require the same level of responsiveness as moving and attacking. If those systems use non-persistent HTTP connections as their transport, the same protocols can be re-used to support web, mobile, or social network front ends to the same data. For chat, a standardized protocol (Jabber, maybe?) will let you use off-the-shelf servers and let chat move in and out of game easily. The more locked down these APIs are, the less likely you are to affect them with your small patches.
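
As a sketch of what I mean, here is a tiny in-game mail endpoint in Python using nothing but the standard library. The URL scheme and JSON shape are invented, and a real service would obviously need authentication, but the point is that the same endpoint could serve the game client, a web page, or a mobile app:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Toy in-memory mailbox store standing in for the real character database.
    MAILBOXES = {"blackbeard": [{"from": "annebonny", "subject": "Parley?"}]}

    class GameServicesHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # e.g. GET /mail/blackbeard returns that character's inbox as JSON.
            parts = self.path.strip("/").split("/")
            if len(parts) == 2 and parts[0] == "mail":
                body = json.dumps(MAILBOXES.get(parts[1], [])).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), GameServicesHandler).serve_forever()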

Eliminating patch time

So you have eliminated all server downtime and even allowed outdated clients to stay online and play after a new version of your game is deployed. If players have to download a small patch and then spend a minute or two applying it to your giant pack files, they are still going to be annoyed. Well, it seems that Guild Wars did a great job here too.

Almost all of the patching in Guild Wars is built right into the client and can be delayed until right before it is actually needed. When you download Guild Wars it is less than 1MB to start.  The first time you run the game it downloads everything you need to launch the game and get into character creation. After character creation it downloads enough to load the first town.  When you leave the first town and go out into the wilderness it loads what it needs for that zone before you leave the loading screen, and so on.  All of this data is stored on the user’s machine, so going back into town is fast.

The natural result of this is that if a user has outdated data when they reach the loading screen for a zone, tiny patches for the updated files are downloaded before they finish zoning. The game never makes the user wait to patch data for sections of the world they aren’t anywhere near. When the client is updated it downloads an up-to-date list of the files in the latest version and uses that to request updated data as necessary.
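
The shape of that scheme looks something like the Python below. The manifest format, patch URL, and zone layout are all made up, but the behaviour is the same: check the file list at zone load and fetch only what changed:

    import hashlib
    from pathlib import Path
    from urllib.request import urlopen

    PATCH_BASE = "http://patch.example.com/builds/latest"  # hypothetical patch server

    def local_checksum(path: Path) -> str:
        return hashlib.sha1(path.read_bytes()).hexdigest() if path.exists() else ""

    def patch_zone(zone: str, manifest: dict, data_dir: Path) -> None:
        """Download only the files for this zone whose checksums no longer match.

        `manifest` maps relative file paths to checksums and is refreshed from
        the patch server whenever a new build goes live.
        """
        for rel_path, checksum in manifest.items():
            if not rel_path.startswith(f"zones/{zone}/"):
                continue  # not needed yet; patch it when the player goes there
            target = data_dir / rel_path
            if local_checksum(target) == checksum:
                continue  # already up to date
            target.parent.mkdir(parents=True, exist_ok=True)
            with urlopen(f"{PATCH_BASE}/{rel_path}") as response:
                target.write_bytes(response.read())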

The next logical step once you have this partial patching system in place is to enable both servers and clients to load new data on the fly without shutting down. If you are doing it right, many of your new builds are going to be data-only and include no code changes. Small changes of that sort could easily be downloaded in the background and then switched on via a broadcast from the servers.
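
The client side of that broadcast might look something like this sketch; the message fields and method names are hypothetical:

    class DataRepository:
        def __init__(self, active_build: str) -> None:
            self.active_build = active_build
            self.downloaded_builds = {active_build}

        def background_download(self, build: str) -> None:
            # ...fetch the new data files while the player keeps playing...
            self.downloaded_builds.add(build)

        def on_server_broadcast(self, message: dict) -> None:
            build = message["build"]
            if message["type"] == "switch_data" and build in self.downloaded_builds:
                # Data-only patch: swap the active data set, no code change, no relog.
                self.active_build = build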

Fixing further technical problems

There are two remaining technical hurdles that are not directly visible to the players but still need to be solved if the latency between making a change and deploying it is to drop to under an hour. Both are strongly tied to the partial patching process described above: slow data file packing and slow uploads to the datacenter patch servers.

The packed data files on Pirates are rebuilt from scratch every time a build is run. This is an artifact of my hacking in pack file support over a long weekend, and it could be solved with incremental pack file updates. Building them from scratch involves opening tens of thousands of files and compressing them into 66 pack files totalling 6GB, a process that takes 80 minutes. Usually a much smaller number of files has actually changed, so an incremental build would reduce both the file opens and the data packing work itself.
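
Even something as simple as keeping an index of file timestamps between builds would capture most of the win. A toy version in Python, with the index format invented:

    import json
    from pathlib import Path

    def changed_since_last_build(source_dir: Path, index_path: Path) -> list:
        """Return the source files that changed since the index was last written."""
        old_index = json.loads(index_path.read_text()) if index_path.exists() else {}
        new_index, changed = {}, []
        for path in sorted(source_dir.rglob("*")):
            if not path.is_file():
                continue
            key = str(path.relative_to(source_dir))
            new_index[key] = path.stat().st_mtime_ns
            if old_index.get(key) != new_index[key]:
                changed.append(path)  # only these need repacking (and uploading)
        index_path.write_text(json.dumps(new_index))
        return changed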

That same incremental approach could be applied to the patch deployment process itself. Because of how Flying Lab’s process was arranged with SOE, each packed data file had to be uploaded again in its entirety if it had any changes. That slowed the transfer to SOE considerably. I am not intimately familiar with the SOE patch servers themselves, but I suspect a similar inefficiency existed on their end when a build was deployed. This could also be eliminated with more metadata about exactly what had changed, and you will need that information for the partial patching above to work anyway.

So is it worth all this trouble?

These changes represent a “never going to happen” amount of work for an existing game. While the work involved in eliminating these deployment issues is less for a game that is being built with them in mind, it still isn’t free. Is it worth the expense to allow your game to deploy in an hour instead of seven and a half hours?

I think it is worth it from a purely operational perspective, and here’s why: one of the first few monthly builds released for Pirates included a bug that caused players to lose a ship at random whenever they scuttled a ship. Their client would ask them to confirm that they wanted to scuttle whatever ship they clicked on, but when the request got to the server it would instead delete the first ship in their list. Though it was only a one-line code change to fix the bug, it took many hours to deploy the fix. In the meantime quite a few players had deleted their favorite ships and all the cargo in them. We knew about the bug on the morning of patch day, but couldn’t get a new build out for 8-10 hours, which put us into prime time. It caused a lot of bad blood that could have been avoided if we had been able to deploy the build faster. This is a particularly bad example, but this kind of thing happens all the time with live MMOs. (SWG had a bug go live where the command to launch fireworks would let players launch anything they liked into the air and then delete the object afterward. That included, but was not limited to, fireworks, monsters, buildings, and other players.)

There are plenty of other reasons to be able to patch very quickly, and I may go into them in a future post. I think it’s worth ensuring you are able to push new builds within an hour simply to be able to fix your major screw-ups as quickly as possible and save your players grief.

~Joe


6 Responses to “The hard part of continuous deployment”

  1. Matthew Weigel replied:

    “A better way to go would be to build your persistence layer so that it could handle any number of historical versions for player data.”

    Yep, Dungeon Runners did this, including the incremental crawler. The crawler process was implemented so early in the project, I actually had to revisit it around the beginning of last year: it was still trying to load EVERY CHARACTER into memory before doing any migration (which worked early on, and continued to work in dev, but started choking in production). I implemented a moving window of loaded characters, so a few thousand (or whatever) characters were in memory at any time, with new ones being loaded as completed migrations were saved to the DB.

    That kind of system is a LOT harder when you go for a regular relational representation in the database, something I’m still kind of contemplating for my current project. New (nullable or default value) columns aren’t too bad, but you better be sure to change the meaning of an existing column only under extreme duress.

    For server binary upgrade, NCsoft Operations had a classic (but workable) system: the root of the server install had ‘old,’ ‘live,’ and ‘new’ subdirectories. Install everything in ‘new’ while the server is up, take down the server and do database upgrades while deleting ‘old,’ renaming ‘live’ to ‘old’ and ‘new’ to ‘live’. It wasn’t as snazzy as versioned DLLs, but it avoided some of the craziness of DLLs too. This essentially removes server software upgrades from the downtime equation.

    Something like PotBS’ ServerDirectory is probably better, and also more webbish: connect to a server that’s still at the version your client is at, next time the client changes instance/zone. You could probably go a step further, and mark servers as needing to be upgraded one at a time, so that the Server Directory won’t send new clients to that server and when everyone is gone from it, you can upgrade it and have it reconnect with its new version.

    Dungeon Runners also had package files with incremental updates: that was actually implemented after the game was live, when we decided the slow start time of the client was unacceptable. I think the PlayNC Launcher also had some concept of patching existing files, so what we uploaded to the patch servers (and what clients downloaded) was essentially a binary diff too.

    I think for Dungeon Runners the real causes for downtime were bugs, database upgrades (in some cases, particularly because of the character data and other webbish initiatives), and Windows/hardware updates. Aside from that, of course, there’s the span of time between “build finished,” “build signed off,” and “build published.” We didn’t use the incremental patches for server packages, sign off generally took days (but there was no automated testing)… we solved a lot of the technical problems without actually making progress anywhere but server downtime.

    Also, somehow I missed your previous blog post on the subject… pretty cool that someone formerly at ArenaNet commented on it, their system engendered a combination of “WTF?” and “cool!” around NC Austin. :-)

  2. Joshua M. Kriegshauser replied:

    Very interesting and informative post. We have some similarities and some differences with Ultima Online and EverQuest II.

    With UO, there was no database backend; everything persistent was stored in flat files. We also had a hodgepodge of text and binary server-side data files. When I first started working on UO we’d actually sync CVS to the servers and build it on each and every server machine. Eventually the deployment process became more sane and we would push pre-built binaries. Since the flat-files were generally opaque to everything but the game processes, the update would never touch them. The slowest part about a publish was the servers coming up and reading all of the persistent world data (since everything in the world was persistent, not just characters).

    Interestingly, with UO we would push a new client live a week or two prior to a new server, so the client had to support two network protocol versions. Unfortunately the ‘old version’ support was almost never removed.

    When the client version changed, clients would of course have to patch, but client patches on UO were often fairly tiny by today’s standards.

    —-~~~~—-

    EverQuest II is quite different. We use an Oracle DB to store our persistent data (which centralizes around characters) and we keep our server-side data in packed binary files. Our patcher system allows new client and server data to be sync’d to their respective servers while the live game stays up and running. We almost never do DB updates that require large amounts of downtime. While our character data is a mixture of columns and blobs, the game is authoritative on versioning. We could potentially have very old character versions stored in the DB as they’re not upgraded until loaded. In practice this has rarely affected us negatively but has allowed us to perform very fast updates.

    When syncing is done, servers are taken down and a switch is flipped which really causes a directory rename on the servers, followed by a startup. The longest downtime portion of our patches is QA checking out the servers before they’re unlocked.

    In the case of most hotfixes, a client publish is more or less “optional”. By this I mean that the game will connect and run if you don’t patch the client because the network protocol version hasn’t changed. However, you’ll still get booted off when the servers go down.

    While the best user experience might be trying to maintain old-version uptime even after a patch has synced, it sounds like a logistical nightmare considering that players are going to have to disconnect and patch at some point anyways.

    I’m excited to see how the client streaming system for Free Realms will play out. Unfortunately, EQII’s client-side data was not designed for streaming and there are many, many interdependencies that any sort of streaming system would have to be aware of and anticipate.

    Always fun to chat about current and upcoming MMO tech. Looking forward to LOGIN 2009!

  3. Tim Burris replied:

    Actually at this point Pirates is no longer building pack files from scratch. We switched to a file format that supports incremental packing a milestone or two after Joe left.

    We could shave off about 5 minutes with better data organization: the first step of building the pack files is to copy all the unpacked client files to a temporary location to ensure that no server-only files get to clients. We sacrifice another 5 minutes by copying the current live pack files down, in order to start the incremental packing from a known point and give a minimal patch size. Analysis showed that a purely incremental approach bloated the patch size in certain circumstances, such as when multiple modifications to the same asset altered its size significantly. The actual packing operation takes 10 or 20 minutes depending on the extent of changes. Another 5 minutes is spent copying the full distribution to the central share.

    During the remaining hour, the (SOE-supplied) delta program generates patches, reports and sums their sizes, then throws them away. This is one of the organizational handoff inefficiencies Joe spoke of in the previous post. Obviously we could eliminate this hour or one of the patch deployment hours if we had control of the entire deployment pipeline.

    We keep the hour because a) by and large it happens during the nightly build and nobody cares whether the build finishes at 11pm or midnight, and b) it is extremely useful to know when the patch size jumps up significantly. We’ve caught quite a few art bugs this way.

  4. Rob Hale replied:

    I’ve often thought about the possibilities of moving all non-time-critical game elements out of the normal client/server communication and into secure HTTP connections. This is largely because of the inherent benefit that your economy can function without your game servers being live and that players in the game can communicate seamlessly with those not in the game.

    From a community point of view it makes a lot of sense, and I imagine it would take a lot of strain off the game servers, since the servers dealing with combat calculations, AI and all the “Hard” stuff won’t have to be concerned with a guild leader promoting members. It would even allow you to have multiple game servers that all feed into the same market servers or read out of the same Character Database.

  5. Kevin Gadd commented:

    Re: HTTP, one of the benefits I haven’t seen mentioned is that there’s a lot of extremely solid software out there for proxying/load-balancing HTTP requests. We’re using a bunch of off-the-shelf software with minor modifications where I work currently (I think most of it was developed by Danga), and for the most part it’s scaled tremendously well. Developing that sort of technology from scratch for custom protocols would be a fairly large task, I suspect.

  6. Bryant replied:

    And now that I’ve read part two: yes, that, exactly. Particularly enabling restarts without dumping players.

    A fair amount of Turbine’s non-game traffic is HTTP. It turned out to be a big win for both the obvious reasons and some unexpected ones. As Kevin notes, getting the benefit of load balancing technology is great, not just for balancing load. F5’s BigIP iRule technology turned out to be very handy for quickly developing all sorts of useful tools for handling traffic loads that would have been harder to build into the servers.
