
The hard part of continuous deployment

Last week I posted about the changes you might need to make to your deployment process to support updates multiple times a day. I said I would follow up with a description of the technical challenges involved in making your users not hate you if you actually did deploy multiple times a day.

The reasons your users will hate frequent deployments are:

  • Non-zero server downtime when a new build is released
  • Being forced to log out of their existing play session for a new build
  • Waiting for patches to download for every little change

I will describe solutions to each of these problems below.

Eliminating Lengthy Server Downtime

The database architecture in Pirates requires that every character in the database be loaded and processed whenever there is a schema change.  From the programmer’s point of view such changes are easy to make, so they occur frequently. Unfortunately this processing is why the Pirates servers take an hour to upgrade. I don’t know how common this kind of processing is in other games, but I know we weren’t the only ones who had to deal with it.

The need to process character data stems from the fact that Pirates stores persistent character data in a SQL database, but the actual data is in opaque blobs. The blobs themselves contain just the field data for each object and do not include any kind of version information. The whole system is only able to contain one version of each class at one time.

A better way to go would be to build your persistence layer so that it can handle any number of historical versions of player data. Add a schema version to each blob and register each new schema as it is encountered. When a server needs to read an object it can perform the schema upgrades on the fly and avoid the long, slow upgrade of every object. If necessary you could even have a process that crawls the database upgrading inactive characters in the background so that they are ready to go the next time those players log in.
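A rough sketch of the idea, assuming blobs are serialized as JSON and upgrade functions are registered per class and schema version (all of the names and fields here are hypothetical, not how Pirates actually stores data):

```python
import json

# Hypothetical registry of upgrade functions. Each entry converts a blob's
# field dict from version N to version N+1 for a given class.
UPGRADERS = {
    ("Character", 1): lambda fields: {**fields, "bank_slots": 0},   # v1 -> v2
    ("Character", 2): lambda fields: {**fields, "titles": []},      # v2 -> v3
}

CURRENT_VERSION = {"Character": 3}

def load_object(class_name, blob_bytes):
    """Deserialize a persisted blob and upgrade it to the latest schema on the fly."""
    record = json.loads(blob_bytes)          # {"version": N, "fields": {...}}
    version, fields = record["version"], record["fields"]
    while version < CURRENT_VERSION[class_name]:
        fields = UPGRADERS[(class_name, version)](fields)
        version += 1
    return fields

def save_object(class_name, fields):
    """Always write blobs tagged with the latest schema version."""
    return json.dumps({"version": CURRENT_VERSION[class_name], "fields": fields}).encode()
```

A background crawler would just call `load_object` and `save_object` on inactive characters; active ones get upgraded the moment they are read.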

Supporting multiple concurrent versions

Database updates are the longest section of server downtime at patch time, but they aren’t the only one.  For Pirates (and most MMOs) each world instance must be taken offline so that the new version can be installed on the server hardware and then launched. This adds some downtime for the upgrade, but most painfully, it also disconnects all active players. A better method would be to build your server clusters in such a way that they tolerate multiple simultaneous versions of the code.

It turns out that Guild Wars already supports this.  One of the comments from the last post describes a bit about how this works:

Once the new code was on all the servers in the datacenters, we flipped a switch that notified the live servers that any new ‘instances’ should be started using the latest version of the game code. At this point, any users running older builds of the client got told that they needed to update, and attempts to load into a new instance would be rejected until they updated.
To summarize how the live update worked on the server, I believe we atomically loaded the new game content onto the servers, and then loaded the newest version of the game code into the server processes alongside the older version, because each build was a single loadable DLL. That let us keep old instances alive alongside new instances on the same hardware, and isolated us from some of the hairy problems you’d get from multiple client versions running in the same world.

For players in the PvE parts of the game this method apparently worked quite well. However PvP players could only play with each other when everyone was running the same version of the game, so this caused problems with PvP events. It isn’t perfect, but this is a huge step in the right direction.

The Guild Wars approach worked for them, but it would be less suitable for Pirates, which has eight different types of processes making up its server clusters. However, with some changes to the Pirates Server Directory, multiple versions of those exes could easily co-exist on a single server machine. When a client connects to Pirates it queries the server directory process to find an IP address and port number to connect to. That query already takes the version number into account, and could be adapted to simply never return cluster servers from an older version.
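A minimal sketch of what that version-aware lookup might look like; the data structures and registration flow are invented for illustration, not the actual Server Directory protocol:

```python
from collections import namedtuple

# Hypothetical registration record for one cluster process.
Endpoint = namedtuple("Endpoint", "process_type version ip port")

class ServerDirectory:
    def __init__(self):
        self.endpoints = []

    def register(self, endpoint):
        self.endpoints.append(endpoint)

    def find(self, process_type, client_version):
        """Return only endpoints whose version matches the client exactly.

        Old and new cluster exes can run side by side on the same hardware;
        clients on the old build keep getting old-version endpoints until
        they patch."""
        return [e for e in self.endpoints
                if e.process_type == process_type and e.version == client_version]

# Usage: two versions of a zone server co-exist on one machine.
directory = ServerDirectory()
directory.register(Endpoint("zone", "1.14.0", "10.0.0.5", 7000))
directory.register(Endpoint("zone", "1.15.0", "10.0.0.5", 7001))
print(directory.find("zone", "1.14.0"))   # old clients still land somewhere
```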

Another approach that would eliminate the need for any client downtime for many patches is to make those systems work more like the web. The next time I architect an MMO (at Divide by Zero Games, actually) I intend to move most of what the traditional game client does outside of the typical persistent client-server connection entirely. Systems like mission interaction, guild management, in-game mail, and the like don't require the same level of responsiveness as moving and attacking. If those systems use non-persistent HTTP connections as their transport, the same protocols can be re-used to support web, mobile, or social network front ends to the same data. For chat, a standardized protocol (Jabber maybe?) will let you use off-the-shelf servers and let chat move in and out of game easily. The more locked down these APIs are, the less likely you are to affect them with your small patches.
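As a toy illustration of the idea (not anything Pirates or Guild Wars actually does), here is an in-game mail service exposed over plain HTTP using the Python standard library; the /mail endpoint, the JSON shape, and the mailbox data are all invented for the example:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory stand-in for whatever backing store a real mail service would use.
MAILBOXES = {"blackbeard": [{"from": "gm", "subject": "Welcome aboard"}]}

class MailHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /mail/blackbeard from the game client, a web page, or a phone app
        parts = self.path.strip("/").split("/")
        if len(parts) != 2 or parts[0] != "mail":
            self.send_error(404)
            return
        body = json.dumps(MAILBOXES.get(parts[1], [])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), MailHandler).serve_forever()
```

Because the game client is just another HTTP consumer here, deploying a new build of this service never touches the persistent game connection at all.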

Eliminating patch time

So you have eliminated all server downtime and even allowed outdated clients to stay online and play after a new version of your game is deployed. If players have to download a small patch and then spend a minute or two applying it to your giant pack files, they are still going to be annoyed. Well, it seems that Guild Wars did a great job here too.

Almost all of the patching in Guild Wars is built right into the client and can be delayed until right before it is actually needed. When you download Guild Wars it is less than 1MB to start.  The first time you run the game it downloads everything you need to launch the game and get into character creation. After character creation it downloads enough to load the first town.  When you leave the first town and go out into the wilderness it loads what it needs for that zone before you leave the loading screen, and so on.  All of this data is stored on the user’s machine, so going back into town is fast.

The natural result of this is that if a user has outdated data when they reach the loading screen for a zone, tiny patches for the updated files are downloaded before they finish zoning. The game never makes the user wait to patch data for sections of the world they aren't anywhere near. When the client is updated, it downloads an up-to-date list of the files in the latest version and uses that list to request updated data as necessary.
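A rough sketch of that per-zone check, assuming the client has a manifest of file hashes for the latest version and a way to fetch individual files (both are stand-ins for whatever the real patcher would provide):

```python
import hashlib
from pathlib import Path

def file_hash(path):
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()

def patch_zone(zone, manifest, local_dir, download):
    """Before the loading screen finishes, fetch only the files for this zone
    that are missing locally or whose hash no longer matches the manifest.

    `manifest[zone]` maps file name -> expected hash and `download` fetches
    one file; both are hypothetical hooks, not a real patcher API."""
    for name, expected in manifest[zone].items():
        local = Path(local_dir) / name
        if not local.exists() or file_hash(local) != expected:
            local.write_bytes(download(zone, name))   # tiny per-file patch

# Example manifest shape for one zone (hashes abbreviated for readability).
manifest = {"tortuga": {"terrain.dat": "9a0364b9...", "npcs.dat": "c1dfd96e..."}}
```

Files for zones the player never visits are simply never requested, which is why the initial download can stay tiny.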

The next logical step once you have this partial patching system in place is to enable both servers and clients to load new data on the fly without shutting down. If you are doing it right, many of your new builds are going to be data-only and include no code changes. Small changes of that sort could easily be downloaded in the background and then switched on via a broadcast from the servers.
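A sketch of how that switch-over might look on either a server or a client, with all names invented for illustration: keep both data sets loaded and flip between them when the broadcast arrives.

```python
class GameDataStore:
    """Holds multiple versions of data-only content so a new build can go
    live without a restart."""

    def __init__(self, loader):
        self.loader = loader          # callable: version -> dict of data tables
        self.active_version = None
        self.data = {}

    def preload(self, version):
        """Download/parse the new data in the background while play continues."""
        self.data[version] = self.loader(version)

    def on_switch_broadcast(self, version):
        """Handle the 'switch to data version X' message from the cluster."""
        if version not in self.data:
            self.preload(version)     # late joiner: load before switching
        self.active_version = version

    def lookup(self, table, key):
        return self.data[self.active_version][table][key]
```

Code changes still need the multi-version process trick described above, but data-only builds could flow through a path like this many times a day.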

Fixing further technical problems

There are two remaining technical hurdles that are not directly visible to the players but still need to be solved if the latency between making a change and deploying it is to drop to under an hour. These two are strongly tied to the partial patching process described above: slow data file packing, and slow uploads to the datacenter patch servers.

The packed data files on Pirates are rebuilt from scratch every time a build is run. This is an artifact of my hacking in pack file support over a long weekend, and could be solved with incremental pack file updates. Building them from scratch involves opening tens of thousands of files (6GB in total after compression) and compressing them into 66 pack files, which takes 80 minutes. Usually a much smaller number of files has actually changed, so an incremental build would reduce both the file opens and the data-packing work itself.
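One way to sketch that incremental build, assuming the pipeline keeps the source-file hashes from the previous build around; the `pack_map`, `repack`, and `state_file` names are placeholders for whatever the real tools would use:

```python
import hashlib, json
from pathlib import Path

def build_packs_incrementally(source_root, pack_map, state_file, repack):
    """Rebuild only the pack files whose source files actually changed.

    `pack_map` maps pack name -> list of source paths, `repack` rebuilds one
    pack, and `state_file` holds the hashes recorded by the previous build."""
    old = json.loads(Path(state_file).read_text()) if Path(state_file).exists() else {}
    new = {}
    for pack, sources in pack_map.items():
        hashes = {s: hashlib.sha1(Path(source_root, s).read_bytes()).hexdigest()
                  for s in sources}
        new[pack] = hashes
        if hashes != old.get(pack):
            repack(pack, sources)   # only packs with changed sources pay the cost
    Path(state_file).write_text(json.dumps(new))
```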

That same incremental process could be applied to the patch deployment process itself. Because of how Flying Lab's process was arranged with SOE, each packed data file had to be uploaded again in its entirety if it had any changes. That slowed the transfer to SOE considerably. I am not intimately familiar with the SOE patch servers themselves, but I suspect a similar inefficiency existed on their end when a build was deployed. This could also be eliminated with more metadata about exactly what had changed, and you will need that information for the partial patching above to work anyway.
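The deployment side could reuse the same metadata. A hypothetical sketch of diffing two build manifests so only changed files get uploaded:

```python
def diff_builds(old_manifest, new_manifest):
    """Return the files that must be uploaded (and the ones to retire) for a
    deployment. Both manifests map file name -> content hash, the same
    metadata the partial-patching client needs; everything else can stay on
    the patch servers untouched."""
    changed = [name for name, h in new_manifest.items() if old_manifest.get(name) != h]
    removed = [name for name in old_manifest if name not in new_manifest]
    return changed, removed

# e.g. ship only what differs from the build already deployed:
# changed, removed = diff_builds(load_manifest("previous"), load_manifest("current"))
# (load_manifest is a hypothetical helper for reading a stored manifest)
```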

So is it worth all this trouble?

These changes represent a “never going to happen” amount of work for an existing game. While the work involved in eliminating these deployment issues is less for a game that is being built with them in mind, it still isn't free. Is it worth the expense to be able to deploy your game in an hour instead of seven and a half hours?

I think it is worth it from a purely operational perspective, and here's why: One of the first few monthly builds released for Pirates included a bug that caused players to lose a ship at random whenever they scuttled a ship. Their client would ask them to confirm that they wanted to scuttle whatever ship they clicked on, but when the request got to the server it would instead delete the first ship in their list. Though it was only a one-line code change to fix the bug, it took many hours to deploy the fix. In the meantime quite a few players had deleted their favorite ships and all the cargo in them. We knew about the bug on the morning of patch day, but couldn't get a new build out for 8-10 hours, which put us into prime time. It caused a lot of bad blood that could have been avoided if we had been able to deploy the build faster. This is a particularly bad example, but this kind of thing happens all the time with live MMOs. (SWG had a bug go live where the command to launch fireworks would let players launch anything they liked into the air and then delete the object afterward. That included but was not limited to fireworks, monsters, buildings, and other players.)

There are plenty of other reasons to be able to patch very quickly, and I may go into them in a future post. I think it’s worth ensuring you are able to push new builds within an hour simply to be able to fix your major screw-ups as quickly as possible and save your players grief.

Continuous Deployment with Thick Clients

(Update: Part two is up now. It details the technical changes that are required to make this work.)

Earlier this week Eric Ries and Timothy Fitz posted about a technique called Continuous Deployment that they use at IMVU. Timothy describes it like this:

The high level of our process is dead simple: Continuously integrate (commit early and often). On commit automatically run all tests. If the tests pass deploy to the cluster. If the deploy succeeds, repeat.

Those posts seem to be mostly about deploying new code to web servers. I thought it would be interesting to consider what it would take to follow a similar process for an MMO with a big installed client. I will do that in two parts, with today's part being the deployment process itself. In a future post I will describe the architectural changes you would have to make so that frequent updates don't annoy your players enough that they riot and burn down your datacenter (or go play someone else's game instead).

As a point of comparison, here’s how patches are typically developed and deployed for Pirates of the Burning Sea:

  1. The team makes the changes. That takes three weeks for monthly milestone builds with a lot of the integration happening in the last week. For a hotfix it usually takes a few hours to get to the root cause of the problem and then half an hour to make the change, build, and test on the developer’s machine.
  2. A build is run on the affected branch. A full build takes an hour and twenty minutes; an incremental build can take ten minutes if no major headers changed. The form of the build that people on the dev team use is copied to a shared server at this point.
  3.  Some quick automated tests are run to fail seriously broken builds as early as possible. These take 10 minutes.
  4. The build is packed up for deployment. This takes 80 minutes. The build is now ready for manual testing (since the testers use the packed retail files to be as close to a customer install as possible.)
  5. Slow automated tests run. These take 30 minutes.
  6. (concurrent with 5 if necessary) The testers run a suite of manual tests called BVTs. These take about half an hour. These tests run on each nightly build even if it isn’t going to be deployed to keep people from losing much time to bad builds. Most people wait for the BVTs to finish before upgrading to the previous night’s build.
  7. For milestone builds the testers spend at least two weeks testing all the new stuff. Builds are sent to SOE for localization of new content at the same time.
  8. A release test pass is run with a larger set of tests. These take about three hours.
  9. The build is uploaded to SOE for client patching and to the datacenter for server upgrades. This takes a few minutes for a hotfix or 6-8 hours for a month’s worth of changes.
  10. At this point the build is pushed to the test server. Milestone builds sit there for a couple weeks. Hot fixes might stay for a few hours or even go straight to live in an emergency.
  11. SOE pushes the patch to their patch servers without activating the patch for download. That takes between one and eight hours depending on the size of the patch.
  12. At the pre-determined downtime, the servers are taken offline and SOE is told to activate the patch. This usually happens at 1am PST and doesn’t take any time. Customers can begin downloading the patch at this point.
  13. Flying Lab operations people upgrade all the servers in parallel. This takes up to three hours for milestone builds, but more like an hour for hotfixes. They bring the servers up locked so that they can be tested before people come in.
  14. The customer service team tests each server to make sure it’s pretty much working and then operations unlocks the servers.  That takes about 20 minutes.
  15. If the build was being deployed to the test server, after some days of testing steps 11-14 will happen again for the live servers.

This may seem like a long, slow process, but it actually works pretty well. Pirates has had 11 “monthly” patches since it launched, and they have gone well. Not many other MMOs are pushing content with that frequency. Some of the slow bits (like the upload to SOE and their push to their patchers) are the result of organizational handoffs that would take quite a bit of work to change. Flying Lab also spends a fair amount of time testing manually over and above the automated tests. Those manual tests have kept thousands of bugs from going live and as far as I'm concerned they are an excellent use of resources. I am not trying to bash Flying Lab in any way, just to provide a baseline for what an MMO deployment process looks like. I suspect many other games have a similar sequence of events going into each patch, and would love to hear confirmation or rebuttal from people who have experience on other games.

Assuming minimum time for a patch, these are the times we are left with (in hours and minutes):

Pirates Deployment Time

0:10 Run incremental build
0:10 Run quick automated tests
1:20 Pack the build
0:30 Run quick manual tests/slow automated tests
3:00 Run release pass smoke tests
0:05 Upload build for patching
1:00 Build is pushed to patchers
1:00 Servers are upgraded
0:20 A final quick test is made on each server
7:35 Total time to deploy


If it takes seven and a half hours to deploy a new build you obviously aren't going to get more than one of them out in an eight-hour work day. Let's forget for a moment that IMVU is able to do this in 15 minutes and pretend that our target is an hour. For now let's assume that we will spend 10 minutes on building, 10 minutes on automated testing, 30 minutes on manual testing, and the last 10 minutes actually deploying the build. We will also assume that this is the time to deploy to the test server and that it is slower than the actual push from test to live.

Without going into detail I am going to assume three architectural changes that make these targets much more likely. First, the way the game uses packed asset files will need to change to allow trivial incremental updates to every file in the game without any 80-minute pack process; I will assume packing becomes a minor part of deployment. Second, unlike Pirates, our theoretical continuously deployed MMO will not require every object in the database to be touched when a new build is deployed, so the hour spent upgrading databases on the servers is reduced to about five minutes of file copying. Third, almost all of the patching process moves into the client itself and out of the launcher, which eliminates both the transfer of bits to Southern California and the push of pre-built patches out to the patcher clusters around the world. I will go into each of these assumptions in my next post, but let's just pretend for now that they appear magically and reduce our time accordingly.

The amount of time spent on automated testing in this deployment process is 40 minutes. Fortunately this work is relatively easy to spread out over multiple machines to reduce the total time involved. Assuming a test farm of at least five machines we should be able to accomplish our ten minute goal.
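A rough sketch of that fan-out, with the host list and the remote-run helper invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical test farm; five machines gets 40 serial minutes under 10 minutes
# of wall-clock time if the tests split evenly.
TEST_MACHINES = ["test01", "test02", "test03", "test04", "test05"]

def shard(tests, n):
    """Round-robin the test list into n roughly equal buckets."""
    return [tests[i::n] for i in range(n)]

def run_all(tests, run_remote):
    """run_remote(host, tests) -> bool; the build passes only if every shard passes."""
    buckets = shard(tests, len(TEST_MACHINES))
    with ThreadPoolExecutor(max_workers=len(TEST_MACHINES)) as pool:
        results = pool.map(run_remote, TEST_MACHINES, buckets)
    return all(results)
```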

It is possible that we could have multiple testers working in parallel to perform the manual tests more quickly. There are 230 minutes of manual testing in this sequence, so hitting our budget would take roughly 11.5 manual testers working in parallel, assuming perfect division of labor. There are two things wrong with that statement. The first is that perfect division of labor among 12 people is impossible, so the real number is probably closer to 20. The second problem is that having a team of 12 on staff to do nothing but test new builds with tiny incremental changes in them is not worth nearly what it costs you in salaries. A much more likely outcome is that the amount of manual testing in the build process is drastically reduced. In addition, we can get our playerbase involved in the deployment testing if we swap things around a bit.

Trimmed Deployment Time

0:10 Run incremental build
0:10 Run automated tests on five machines
0:05 Upload only changed files to game servers and patch servers
0:05 Deploy build on test server
0:30 Manual smoke test on public test server
Players can test at the same time.
1:00 Total time to deploy

In fact, if those 30 minutes of manual testing are your bottleneck and you can keep the pipeline full, you can push a fresh build every 30 minutes, or 16 times a day. Forget entirely about pushing to live for a moment and consider what kind of impact that would have on your test server. Your team could focus on fixing issues that players on the test server find while those players are still online. Assuming a small change that you can make in half an hour, it would take only an hour from the start of the work on that fix to when it is visible to players. That pace is fast enough that it would be possible to run experiments with tuning values, prices of items, or even algorithms.

Of course, for any of this to work the entire organization needs to be arranged around responding to player feedback multiple times per day. The real advantage of a rapid deployment system is to make your change -> test -> respond loop faster. If your plans are set a month out and your developers don't have time to work on anything other than their assigned tasks for that entire time, there is not much point in pushing builds this quickly.

What do you think? Obviously actually implementing this is more work than moving numbers around in a table. :)

Augmented Reality Dinner at GDC

I would like to set up a dinner with other game developers interested in Augmented Reality while we're all down in San Francisco for GDC. If you would be interested in attending such a thing, please send me an email at joe (at) programmerjoe.com. I would like to get some idea of how many people might attend before I start seeking out an appropriate restaurant. (If you know somebody who is AR-inclined and coming down for GDC but doesn't read my blog, please encourage them to email me.)

2009

Last year went so well I figured I’d give it another shot.  Here’s what I think will happen in online games in 2009.

  1. This will not be a big year for MMO launches. In 2008 we had a WoW expansion pack, two games with a ton of buzz, and one pirate game that wasn’t quite as big as those other two. This year will include Champions Online and Free Realms, but none of the huge releases of previous years. Star Trek Online and Star Wars: The Old Republic have no announced dates yet, but they won’t be in 2009. *
  2. Champions Online, the first next-gen console MMO, will launch this year. It will ignite new interest in console MMOs despite having more than a few console-related issues. I absolutely loved City of Heroes, so I'm likely to play this one a lot.
  3. This will be a big year for announcing MMOs. We will likely see big announcements from Carbine, Red 5, 38 Studios, and Trion. We might even see that Fallout Online announcement from Zenimax everyone is expecting. None of these games will launch in 2009.
  4. Somebody will buy Turbine.  I bet it’s a big media company of some sort, and not a game publisher. 
  5. Just as the “hey everybody, let’s make an MMO” gravy train ended a few years ago, the “I know! MMOs for kids!” trend is also about to end.  A bunch of kids MMOs will come out in 2009 and none of them will approach Club Penguin’s numbers. It will be much harder to get this kind of game funded as a result.
  6. The console manufacturers are going to start talking about the next generation of consoles in a very preliminary way. Nobody will get dev kits in 2009, but some details will start to come out about what’s going to be in the new consoles.  (Answer: More cores – probably around 20, and big hard drives. Digital distribution of non-game media will be a big part of the next gen game consoles.)
  7. The economy will bottom out and start to rebound. We aren’t quite at the bottom yet, but the low point will come in 2009.
  8. Not much is going to happen on the Augmented Reality front in 2009. We might see a few simple apps using smartphone cameras and screens, but that will be about it, at least publicly.
  9. Microsoft will start making another MMO. They will cancel this MMO in 2010 when they discover it takes 5 years and tens of millions to make an MMO these days.
  10. One of the XNA community games will hit the big time. It will be a top-ten downloadable for a while. Its name will be said at the following GDC at least a billion times.
  11. A whole lot of web startups are going to fail in 2009. There have been a couple this fall, and quite a few more will go down in flames next year. I'm not talking about the “90%” of companies that always fail; I'm talking about companies you might have actually heard of. Everybody had enough runway to get through the holidays… the real test comes in the spring when they can't fund their next round.
  12. The fallout from Sony Online moving under the rest of Sony’s gaming businesses will hit in 2009. This probably means significant management changes and cancelled projects. I suspect that either DC Universe or The Agency will get the axe. It could just as easily be some unannounced project we’ve never heard of, though. 
  13. Warhammer Online will merge some servers together. They are already offering free transfers. This will make some people at FLS laugh out loud (see the second-to-last paragraph). The reason will be the same as the reason PotBS merged servers: it's hard to predict exactly how many subscribers you will have. Server merges are not the beginning of the end; they're just a sign of starting with too many servers. FWIW, the hundreds of unofficial WAR forums are full of people asking for merges.

* Obviously Star Trek Online thinks it will ship in 2009. They set their game in 2409, after all.  I think the game is going to slip into 2010.

How did I do with my 2008 predictions?

This is what I thought would happen a year ago.  Let’s see how I did.

[This is my second time writing this post.  For some reason WordPress or Firefox or somebody ate what I wrote on Thursday and was planning to publish today. *sigh*]

  1.  Pirates did, in fact, launch. We launched in the US and Europe in January, then in Australia in February, and finally in Russia in September. What I didn’t predict a year ago was that I wouldn’t be at FLS anymore.
  2. Warhammer slipped and then launched in the fall just as I (and everyone else) predicted. I was a bit off about the numbers, though. They seem to be doing well, but are definitely below the magic million. Lich King almost certainly put a dent in their numbers, though.  Personally, I enjoy Warhammer Online quite a bit and hope it keeps going well for them.
  3. Age of Conan shipped in the spring, and while the first 20 levels were quite good, everything else fell flat. Quite a few people who left PotBS to play AoC came back a month or two later. Apparently boobs and gore aren’t enough to carry an MMO all by themselves.
  4. NCsoft basically gutted their Austin studio, pushed out the Garriots, and moved power over everything outside Asia to the ArenaNet folks. That’s great news for Seattle, but not so great for all the people I know who were laid off. Wish I’d been wrong about this one.
  5. Maybe it’s just me, but it actually seems like game journalism is getting better. Maybe that’s because we’re mainstream enough for real reporters to cover game stories?  It’s arguably still in the “suck” category, but I don’t think I’ll be making this prediction for next year.
  6. At GDC08, Microsoft announced that community games would be available to people who weren’t members of the Creator’s Club. In July they announced more details. They also did it without requiring certification for every XNA game. Way to go, Microsoft!
  7. Bioware announced Star Wars: The Old Republic. A few people noticed, including the few remaining Star Wars: Galaxies players.
    1. Red 5 hasn’t announced anything.
    2. NC Orange County changed its name to Carbine (or maybe was always Carbine and I was clueless), but hasn’t announced anything.
    3. Space-Time was cut loose in the first round of cuts at NCsoft, announced themselves, spent 6 months looking for a publisher, then laid everybody off and started hiring Flash developers. 
    4. King’s Isle announced, beta’d, and launched Wizard 101. I hear it’s pretty good.
    5. The thing that Sean and Scott are working on at NC Austin was killed and never announced. Fortunately they both seem to have landed on their feet.
    6. 38 Studios hasn’t announced anything.
  8. Although lots of games have become fairly popular on Facebook and other social networks, none of them have blown the doors off. They're still more about grabbing eyeballs than revenue. As ad rates continue to plummet thanks to the Economopalypse and funding becomes harder to secure, that may not bode well for the social game world in 2009.
  9. Metaplace hasn’t launched exactly, but they’re in a beta where anybody can invite more testers. They have come pretty far during the year. Areae also renamed itself to the people-know-how-to-pronounce-and-spell-it Metaplace.
  10. World of Warcraft actually hit 11 million subscribers a couple months ago. Wrath of the Lich King will probably give them a boost this winter too. Is this the peak?  We won’t know for quite a while; Blizzard is never going to announce a number under 11 million. [Update: Blizzard just announced 11.5M as the post Lich King number.]
  11. It seems that Cheyenne Mountain is this year’s example. They reportedly raised their money from angels rather than venture capitalists, but the problem is still the same.  Trying to build more than one game at the same time with a brand new company is stupid.
  12. I was right! Valve didn't tell anyone they were working on an MMO in 2008! Whether they actually are or not is more of a mystery. They were recruiting at Austin GDC, for what it's worth.
  13. I don’t have a fully automated nanotech powered flying car, do you?
  14. (from the comments) Whirled launched. It got the Penny Arcade bump for Corpse Craft, and has quite a few people on there playing games. We’ll have to wait until Daniel shows us all the numbers at GDC to know how it’s really doing, but it seems pretty good from the outside.

That makes ten (1, 2, 3, 4, 5, 6, 7, 11, 13, 14) correct predictions, two (8, 9) incorrect predictions, and two that are impossible to determine (10, 12). I’m pretty happy with those results. Maybe next time I’ll put in fewer gimmes (1, 5, 11, 13).

The big MMO event from 2008 that I missed completely is Atari acquiring Cryptic. It makes a lot of sense for publishers other than EA to want to get into the MMO space, and acquiring an experienced developer (like Cryptic or Mythic) is the best way to go about it.  I wonder if we’ll see similar news next year from Turbine.