Archive for the ‘Engineering’ Category

Continuous Deployment with Thick Clients

(Update: Part two is up now. It details the technical changes that are required to make this work.)

Earlier this week Eric Ries and Timothy Fitz posted on a technique called Continuous Deployment that they use at IMVU.  Timothy describes it like this:

The high level of our process is dead simple: Continuously integrate (commit early and often). On commit automatically run all tests. If the tests pass deploy to the cluster. If the deploy succeeds, repeat.

Those posts seem to be mostly about deploying new code to web servers. I thought it would be interesting to consider what it would take to follow a similar process for an MMO with a big installed client. I will do that in two parts with today’s part being the deployment process itself.  In a future post I will describe the architectural changes you would have to make to not have frequent updates annoy your players enough that they riot and burn down your datacenter (or go play someone else’s game instead.)

As a point of comparison, here’s how patches are typically developed and deployed for Pirates of the Burning Sea:

  1. The team makes the changes. That takes three weeks for monthly milestone builds with a lot of the integration happening in the last week. For a hotfix it usually takes a few hours to get to the root cause of the problem and then half an hour to make the change, build, and test on the developer’s machine.
  2. A build is run on the affected branch. When that’s a full build it takes 1:20 hours. Incremental builds can be ten minutes if no major headers changed. The form of build that people on the dev team use is copied to a shared server at this point.
  3.  Some quick automated tests are run to fail seriously broken builds as early as possible. These take 10 minutes.
  4. The build is packed up for deployment. This takes 80 minutes. The build is now ready for manual testing (since the testers use the packed retail files to be as close to a customer install as possible.)
  5. Slow automated tests run. These take 30 minutes.
  6. (concurrent with 5 if necessary) The testers run a suite of manual tests called BVTs. These take about half an hour. These tests run on each nightly build even if it isn’t going to be deployed to keep people from losing much time to bad builds. Most people wait for the BVTs to finish before upgrading to the previous night’s build.
  7. For milestone builds the testers spend at least two weeks testing all the new stuff. Builds are sent to SOE for localization of new content at the same time.
  8. A release test pass is run with a larger set of tests. These take about three hours.
  9. The build is uploaded to SOE for client patching and to the datacenter for server upgrades. This takes a few minutes for a hotfix or 6-8 hours for a month’s worth of changes.
  10. At this point the build is pushed to the test server. Milestone builds sit there for a couple of weeks. Hotfixes might stay for a few hours or even go straight to live in an emergency.
  11. SOE pushes the patch to their patch servers without activating the patch for download. That takes between one and eight hours depending on the size of the patch.
  12. At the pre-determined downtime, the servers are taken offline and SOE is told to activate the patch. This usually happens at 1am PST and doesn’t take any time. Customers can begin downloading the patch at this point.
  13. Flying Lab operations people upgrade all the servers in parallel. This takes up to three hours for milestone builds, but more like an hour for hotfixes. They bring the servers up locked so that they can be tested before people come in.
  14. The customer service team tests each server to make sure it’s pretty much working and then operations unlocks the servers.  That takes about 20 minutes.
  15. If the build was being deployed to the test server, after some days of testing steps 11-14 will happen again for the live servers.

This may seem like a long, slow process but it actually works pretty well.  Pirates has had 11 “monthly” patches since it launched, and they go well.  Not many other MMOs are pushing content with that frequency.  Some of the slow bits (like the upload to SOE and their push to their patchers) are the result of organizational handoffs that would take quite a bit of work to change. Flying Lab also spends a fair amount of time testing manually over and above the automated tests. Those manual tests have kept thousands of bugs from going live and as far as I’m concerned they are an excellent use of resources. I am not trying to bash Flying Lab in any way, just provide a baseline for what an MMO deployment process looks like. I suspect many other games have a similar sequence of events going into each patch, and would love to hear confirmation or rebuttal from people who have experience on other games.

Assuming minimum time for a patch, these are the times we are left with (in hours and minutes):

Pirates Deployment Time

0:10 Run incremental build
0:10 Run quick automated tests
1:20 Pack the build
0:30 Run quick manual tests/slow automated tests
3:00 Run release pass smoke tests
0:05 Upload build for patching
1:00 Build is pushed to patchers
1:00 Servers are upgraded
0:20 A final quick test is made on each server
7:35 Total time to deploy
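
As a quick sanity check on the table above, here is a minimal Python sketch that adds up the step times. The step names and durations are copied from the table; the helper names (`total_time`, `format_hm`) are just illustrative.

```python
from datetime import timedelta

# Step durations copied from the table above, as (name, hours, minutes).
PIRATES_STEPS = [
    ("Run incremental build", 0, 10),
    ("Run quick automated tests", 0, 10),
    ("Pack the build", 1, 20),
    ("Run quick manual tests/slow automated tests", 0, 30),
    ("Run release pass smoke tests", 3, 0),
    ("Upload build for patching", 0, 5),
    ("Build is pushed to patchers", 1, 0),
    ("Servers are upgraded", 1, 0),
    ("A final quick test is made on each server", 0, 20),
]


def total_time(steps):
    """Add up the (hours, minutes) pairs and return a timedelta."""
    return sum((timedelta(hours=h, minutes=m) for _, h, m in steps), timedelta())


def format_hm(delta):
    """Format a timedelta as H:MM."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}:{minutes % 60:02d}"


print(format_hm(total_time(PIRATES_STEPS)))  # 7:35
```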


If it takes seven and a half hours to deploy a new build you obviously aren’t going to get more than one of them out in an 8 hour work day. Let’s forget for a moment that IMVU is able to do this in 15 minutes and pretend that our target is an hour. For now let’s assume that we will spend 10 minutes on building, 10 minutes on automated testing, 30 minutes on manual testing, and the last 10 minutes actually deploying the build. We will also assume that this is the time to deploy to the test server and that this is slower than the actual push from test to live.

Without going into detail I am going to assume three architectural changes to make these targets much more likely. First, the way that the game uses packed files of assets will need to be changed to allow trivial incremental changes on every file in the game without any 80 minute pack processes. I will assume that pack time can go away as a separate step and become a minor part of deployment.  The second assumption is that unlike Pirates, our theoretical continuously deployed MMO will not require every object in the database to be touched when a new build is deployed, so the hour spent upgrading databases on the servers is reduced to about five minutes of file copying. The third architectural change I will assume involves moving almost all of the patching process into the client itself and out of the launcher. This eliminates both the transfer of bits to Southern California and the push of pre-built patches out to the patcher clusters around the world. I will go into each of these assumptions in my next post, but let’s just pretend for now that they appear magically and reduce our time accordingly.

The amount of time spent on automated testing in this deployment process is 40 minutes. Fortunately this work is relatively easy to spread out over multiple machines to reduce the total time involved. Assuming a test farm of at least five machines we should be able to accomplish our ten minute goal.
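
To make the test farm idea concrete, here is a rough Python sketch of spreading test work across machines. It assumes, purely for illustration, that the 40 minutes of automated tests can be cut into eight independent 5-minute chunks; the real split would depend on how the suites are organized. The `schedule` function is a hypothetical helper that uses a simple greedy "longest chunk to the least loaded machine" rule.

```python
import heapq


def schedule(durations, machines):
    """Greedy longest-first assignment; returns per-machine lists of durations."""
    heap = [(0, i) for i in range(machines)]  # (minutes assigned so far, machine index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(machines)]
    for minutes in sorted(durations, reverse=True):
        load, index = heapq.heappop(heap)  # least loaded machine so far
        assignment[index].append(minutes)
        heapq.heappush(heap, (load + minutes, index))
    return assignment


# Hypothetical split: 40 minutes of automated tests as eight 5-minute chunks.
suites = [5] * 8
plan = schedule(suites, machines=5)
wall_clock = max(sum(machine) for machine in plan)
print(wall_clock)  # 10
```

With a perfectly even split five machines would finish in 8 minutes; with chunks this coarse, three machines get two chunks each and the run takes 10, which still meets the target.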

It is possible that we could have multiple testers working in parallel to perform the manual tests more quickly. There are 230 minutes spent on manual testing in this sequence, so we could expect to do that amount of work with 11.5 manual testers, assuming perfect division of labor. There are two things wrong with that statement. The first is that perfect division of labor among 12 people is impossible so the number is probably closer to 20.  The second problem is that having a team of 12 on staff to do nothing but test new builds with tiny incremental changes in them is not worth nearly what it costs you in salaries. A much more likely outcome is that the amount of manual testing in the build process is drastically reduced. In addition to that we can get our playerbase involved in the deployment testing if we swap things around a bit.

Trimmed Deployment Time

0:10 Run incremental build
0:10 Run automated tests on five machines
0:05 Upload only changed files to game servers and patch servers
0:05 Deploy build on test server
0:30 Manual smoke test on public test server (players can test at the same time)
1:00 Total time to deploy
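
The trimmed table can be checked the same way, and the same numbers show which stage limits how often builds can go out if the stages overlap. This is only a back-of-the-envelope Python sketch; the step times come from the table above, and the 8 hour work day is the one assumed earlier in this post.

```python
# Step durations in minutes, copied from the trimmed table above.
TRIMMED_STEPS = [
    ("Run incremental build", 10),
    ("Run automated tests on five machines", 10),
    ("Upload only changed files to game servers and patch servers", 5),
    ("Deploy build on test server", 5),
    ("Manual smoke test on public test server", 30),
]

WORK_DAY_MINUTES = 8 * 60  # the 8 hour work day assumed earlier in the post

# Latency: how long one change takes to get from build to players.
latency = sum(minutes for _, minutes in TRIMMED_STEPS)

# Throughput: with a full pipeline, the slowest stage sets the pace.
bottleneck_name, bottleneck = max(TRIMMED_STEPS, key=lambda step: step[1])
builds_per_day = WORK_DAY_MINUTES // bottleneck

print(latency, bottleneck_name, builds_per_day)
```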

In fact, if those 30 minutes of manual testing are your bottleneck and you can keep the pipeline full, you can push a fresh build every 30 minutes, or 16 times in an 8 hour work day. Forget entirely about pushing to live for a moment and consider what kind of impact that would have on your test server. Your team could focus on fixing issues that players on the test server find while those players are still online. Assuming a small change that you can make in half an hour, it would take only an hour from the start of the work on that fix to when it is visible to players. That pace is fast enough that it would be possible to run experiments with tuning values, prices of items, or even algorithms.

Of course for any of this to work the entire organization needs to be arranged around responding to player feedback multiple times per day. The real advantage of a rapid deployment system is to make your change -> test -> respond loop faster. If your plans are set a month out and your developers don’t have time to work on anything other than their assigned tasks for that entire time, there is not much point in pushing builds this quickly.

What do you think?  Obviously actually implementing this is more work than moving numbers around in a table. :)

Computer Clubs

I’m old. Well I’m not really that old in the grand scheme of things, I just feel that way when I hang around game developers.

I got my first real computer time in the fall of 1982 by hanging around after school and hacking some stuff in BASIC on the VIC-20 in the library.  I was in 5th grade at the time, and was by far the most computer-obsessed person I knew. That Christmas my parents bought me a TI-99/4A and a little black and white TV to hook it up to. Technically the computer was a present for “the family”, but in practice it didn’t really work out that way. I was obsessed with the TI, and wrote all sorts of little games and other programs on it.

A few years later I spent all my accumulated allowance and paper route money on a Commodore 64. The C64 was a big upgrade, and included such advanced features as a floppy drive and a 300 baud modem. It also had the advantage of having a manufacturer that was still in the PC business. (TI abandoned its home computer line shortly after we got ours.)  I spent quite a bit of time on the local BBSes, much to the delight of the other 4 people I shared a phone line with. Once I had a car I began participating in one of the staples of the personal computer revolution: the computer club.

The local Commodore users’ group met once a month in one of the classrooms at the University of Northern Colorado in Greeley. It was a group of 20-30 people, many of whom came from the university or worked at the local Hewlett-Packard site. Computer enthusiasts were pretty few and far between in those days, and this was one place where we all fit in. Just about everybody in that room was a geeky, sci-fi reading, D&D playing male. Everybody could program to one degree or another, and more than a few knew their way around a soldering iron. Despite all the other things they had in common it was those last two that brought this group together: everyone wanted to do cool stuff with computers.

I don’t know if that kind of community disappeared or if I just fell out of touch with it. There are millions of programmers these days, and they are usually specialized enough that they barely speak the same language let alone program in it. Being a “hardware guy” now means that you are comfortable plugging together prebuilt components and hunting down device drivers online. The inexorable march of progress has pretty much made the computer itself disappear as something people get excited about. Nobody cares enough about specific platforms these days to even have the sort of trash-talking arguments Commodore and Apple fans used to have with each other.

Does this sort of passionate niche club still exist? The Seattle Robotics Society might fall into that category. They spend their meetings talking about various components to build robots from and what sort of code to put on microcontrollers to make their robots do interesting things. The meetings feature lots of teenagers learning things about robots that they would never have any exposure to at school. There seems to be the same mix of Boeing engineers and college students that the computer clubs had.

What about others? Are there clubs for wearable computer enthusiasts? People who design programming languages? Quantum computing fans? Or are we nearing the end of the innovative period for computing and somewhere there are developing pockets of interest around nanotech or some other technology that doesn’t really exist yet?

It’s funny that I’m so nostalgic for something that was already going extinct by the time I got involved. My experience with the computer clubs was 10-15 years after the Homebrew Computer Club spawned Apple Computer and others. The people I met in the clubs were not entrepreneurs to be, they were more like fans and maybe the occasional shareware developer. It’s been twenty years, and I’ve never seen any of those names show up as leaders of industry.

What about you? Are any of you old enough to have belonged to a computer club?  :)

StackOverflow is amazing

A couple of weeks ago, Jeff Atwood and crew launched the public beta of Stack Overflow. Stack Overflow lets programmers ask questions and other programmers answer them. That’s it.  They just did it with a lot less suck than all the other programming community help sites: The ads are unobtrusive, there is no login requirement just to see an answer, answers are listed from best to worst instead of first to last, and anyone can edit a question or answer to make it better.

For instance, look at this question I asked about boost shared pointers. I have work-arounds for the problem in my code, but figured that there had to be a better way. Turns out that the boost experts on Stack Overflow knew exactly what I needed, and answered within a few hours.  Then some other people read the question, picked the best answer, and by voting it up made that answer appear prominently.  By the time I got back to check to see if my question had been answered, there was a clear winner. To make it even more prominent, I marked that answer as “accepted” and now it’s highlighted.

If you’re a programmer, I suggest you check it out. Next time you’re looking for the answer to a programming question, see if it’s been asked on Stack Overflow. If not, ask your question. I think you’ll be pleased with the results.

(Back in July I joined a company called Divide by Zero.  Now I’m singing the praises of a site called Stack Overflow.  Next thing you know I’ll be renaming my blog “Access violation”. :) )

ServerDir 2.0

As I am putting together the architecture for the new game we’re building at Divide by Zero, I am spending a fairly significant amount of time thinking about where the weak spots in the Pirates architecture were. The servers in Pirates worked out pretty well, but I think I can do better the second time around.  This is the first of N posts describing how I intend to evolve Server Architecture v1 into Server Architecture v2.

By far the biggest scaling problem Pirates ran into right at the start of open beta was the Server Directory (ServerDir) database. This was the direct result of incredible naiveté on my part about how much load a single database could handle. The original design of ServerDir called for every process in every cluster to connect to one shared database and to update its own status in that database every five seconds. When you multiply that update by all the instanced zones in the game (plus other miscellaneous servers) you find that the database needs to handle thousands of updates per second from tens of thousands of connections. It turns out that Microsoft SQL Server is not up to the task. (There’s also the little problem that the single shared ServerDir database was a single point of failure for the entire service.)
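
The arithmetic behind that load is easy to sketch in Python. The five second interval is from the original design; the process counts below are made-up round numbers, chosen only to show how "tens of thousands of connections" turns into "thousands of updates per second".

```python
UPDATE_INTERVAL_SECONDS = 5  # every process updated its status this often


def updates_per_second(process_count, interval=UPDATE_INTERVAL_SECONDS):
    """Average write rate if every process updates once per interval."""
    return process_count / interval


# Hypothetical process counts, not measured figures from Pirates.
for count in (10_000, 20_000, 50_000):
    print(count, "processes ->", updates_per_second(count), "updates/sec")
```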

Figure: Original ServerDir design (Pirates ServerDir on a single DB)

When a single ServerDir was obviously not going to work, we expanded the system slightly to split that single database into up to one database per cluster. This still put quite a bit of load onto the ServerDir DB, but there were now enough of them to allow SQL Server to keep up.  This is the setup that Pirates was using when I left Flying Lab in July of 2008.

Figure: Final ServerDir design (Pirates ServerDir with one DB per cluster)

Within a cluster the ServerDir database was used by a process called Big Brother to monitor the health of the cluster. Each physical server machine in the cluster has an instance of Big Brother running on it, and they automatically pick one of their number to be the primary Big Brother for the cluster. This process is responsible for deciding which other processes need to be launched, as well as clearing out the ServerDir entries for processes that have crashed. If you want to read more about the specifics of the ServerDir system, you can read all about it in Massively Multiplayer Game Development 2. I wrote an article on the Pirates architecture years before the game launched, and it really didn’t change too much.

Figure: Pirates ServerDir inside a cluster

ServerDir 2.0

There are several fundamental problems with the original ServerDir that I intend to fix with version 2.0. First is the reliance on a database as the point of synchronization. Databases are not built for this kind of transient data, so they handle it poorly.  The second problem is the way the Big Brothers communicate with each other via UDP (the dashed lines above indicate non-persistent or UDP connections.) This pointlessly complicated the protocol between Big Brothers by requiring them to compensate for dropped network packets. Another goal for the new ServerDir is actually driven by broader architectural changes I want to make, specifically that I want to promote “shard” from being an operations-level concept to one that is entirely in game design and UI.  That will require far more machines with far more processes per cluster, and ServerDir will need to cope. The fourth and final fix in the new ServerDir is that the old version of Big Brother actually does a pretty poor job of dealing with hung processes. We had some periods during Beta where we were getting some of those, and the operations staff had to deal with them by restarting clusters regularly and running scripts to kill all the zombies.  What follows is a sketch of my initial design for how to accomplish all this.

Figure: ServerDir v2.0

The biggest change here is that individual cluster processes no longer connect to ServerDir directly. Instead they open a persistent connection to their local Big Brother, and Big Brother updates ServerDir on their behalf. Part of this change is that the “every five seconds” updates never go into ServerDir at all.  ServerDir is notified of two events for processes: process started and process stopped. All of the “is this process hung” detection is now the job of each individual Big Brother. While a cluster process is up, it will send periodic updates to Big Brother, and if none arrive for too long a period of time, Big Brother will kill the process and clean up ServerDir.
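
Here is a rough Python sketch of that division of labor, not the actual design. Every name in it (`BigBrother`, `RecordingServerDir`, `on_heartbeat`, `sweep`, the 15 second timeout) is hypothetical. The point it tries to show is that heartbeats stay local to Big Brother while ServerDir only ever hears "started" and "stopped".

```python
import time


class RecordingServerDir:
    """Stand-in for the ServerDir service; it only records the two events."""

    def __init__(self):
        self.events = []

    def process_started(self, pid):
        self.events.append(("started", pid))

    def process_stopped(self, pid):
        self.events.append(("stopped", pid))


class BigBrother:
    def __init__(self, server_dir, timeout_seconds, kill, clock=time.monotonic):
        self.server_dir = server_dir
        self.timeout = timeout_seconds
        self.kill = kill          # callable that terminates a process by id
        self.clock = clock        # injectable so the sketch can be tested
        self.last_seen = {}       # process id -> time of last heartbeat

    def on_connect(self, pid):
        self.last_seen[pid] = self.clock()
        self.server_dir.process_started(pid)

    def on_heartbeat(self, pid):
        # Heartbeats stay local; nothing is sent to ServerDir.
        if pid in self.last_seen:
            self.last_seen[pid] = self.clock()

    def sweep(self):
        """Kill processes that have been silent too long and report them stopped."""
        now = self.clock()
        hung = [pid for pid, seen in self.last_seen.items()
                if now - seen > self.timeout]
        for pid in hung:
            self.kill(pid)
            del self.last_seen[pid]
            self.server_dir.process_stopped(pid)
        return hung


# Tiny demo with a fake clock: zone-2 never sends a heartbeat.
now = [0.0]
killed = []
directory = RecordingServerDir()
bb = BigBrother(directory, timeout_seconds=15, kill=killed.append,
                clock=lambda: now[0])
bb.on_connect("zone-1")
bb.on_connect("zone-2")
now[0] = 10.0
bb.on_heartbeat("zone-1")
now[0] = 20.0
hung = bb.sweep()
print(hung, directory.events)
```

In the demo only zone-2 is killed, and ServerDir sees three events in total no matter how many heartbeats zone-1 sends.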

Another significant change is that instead of the point of synchronization being a database, the point of synchronization is a web service. Whether there is a database (or multiple databases) backing up that web service is entirely invisible to the tools and to the cluster processes. Using a stateless API with no persistent connections also makes the task of scaling the ServerDir resource much easier. With load balancers and some reasonable architecture on the back end, single points of failure and scaling problems with ServerDir itself can be all but eliminated.

My next post will go into much greater detail on the new web service and how Big Brothers and operations tools interact with it. Once I’ve covered the new ServerDir plan I can get into my whacky new ideas for the game servers themselves.

What do you think? See any red flags in my high level sketch?

This is why I’m a programmer

Gustavo Duarte sums it up.