(Update: Part two is up now. It details the technical changes that are required to make this work.)
The high level of our process is dead simple: Continuously integrate (commit early and often). On commit automatically run all tests. If the tests pass deploy to the cluster. If the deploy succeeds, repeat.
Those posts seem to be mostly about deploying new code to web servers. I thought it would be interesting to consider what it would take to follow a similar process for an MMO with a big installed client. I will do that in two parts with today’s part being the deployment process itself. In a future post I will describe the architectural changes you would have to make to not have frequent updates annoy your players enough that they riot and burn down your datacenter (or go play someone else’s game instead.)
As a point of comparison, here’s how patches are typically developed and deployed for Pirates of the Burning Sea:
1. The team makes the changes. That takes three weeks for monthly milestone builds with a lot of the integration happening in the last week. For a hotfix it usually takes a few hours to get to the root cause of the problem and then half an hour to make the change, build, and test on the developer’s machine.
2. A build is run on the affected branch. When that’s a full build it takes 1 hour 20 minutes. Incremental builds can be ten minutes if no major headers changed. The form of build that people on the dev team use is copied to a shared server at this point.
3. Some quick automated tests are run to fail seriously broken builds as early as possible. These take 10 minutes.
4. The build is packed up for deployment. This takes 80 minutes. The build is now ready for manual testing (since the testers use the packed retail files to be as close to a customer install as possible.)
5. Slow automated tests run. These take 30 minutes.
6. (concurrent with step 5 if necessary) The testers run a suite of manual tests called BVTs. These take about half an hour. These tests run on each nightly build even if it isn’t going to be deployed to keep people from losing much time to bad builds. Most people wait for the BVTs to finish before upgrading to the previous night’s build.
7. For milestone builds the testers spend at least two weeks testing all the new stuff. Builds are sent to SOE for localization of new content at the same time.
8. A release test pass is run with a larger set of tests. These take about three hours.
9. The build is uploaded to SOE for client patching and to the datacenter for server upgrades. This takes a few minutes for a hotfix or 6-8 hours for a month’s worth of changes.
10. At this point the build is pushed to the test server. Milestone builds sit there for a couple weeks. Hotfixes might stay for a few hours or even go straight to live in an emergency.
11. SOE pushes the patch to their patch servers without activating the patch for download. That takes between one and eight hours depending on the size of the patch.
12. At the pre-determined downtime, the servers are taken offline and SOE is told to activate the patch. This usually happens at 1am PST and doesn’t take any time. Customers can begin downloading the patch at this point.
13. Flying Lab operations people upgrade all the servers in parallel. This takes up to three hours for milestone builds, but more like an hour for hotfixes. They bring the servers up locked so that they can be tested before people come in.
14. The customer service team tests each server to make sure it’s pretty much working and then operations unlocks the servers. That takes about 20 minutes.
15. If the build was being deployed to the test server, after some days of testing steps 11-14 will happen again for the live servers.
This may seem like a long, slow process but it actually works pretty well. Pirates has had 11 “monthly” patches since it launched, and they go well. Not many other MMOs are pushing content with that frequency. Some of the slow bits (like the upload to SOE and their push to their patchers) are the result of organizational handoffs that would take quite a bit of work to change. Flying Lab also spends a fair amount of time testing manually over and above the automated tests. Those manual tests have kept thousands of bugs from going live and as far as I’m concerned they are an excellent use of resources. I am not trying to bash Flying Lab in any way, just provide a baseline for what an MMO deployment process looks like. I suspect many other games have a similar sequence of events going into each patch, and would love to hear confirmation or rebuttal from people who have experience on other games.
Assuming minimum time for a patch, these are the times we are left with (in hours and minutes):
| Time | Step |
|---|---|
| 0:10 | Run incremental build |
| 0:10 | Run quick automated tests |
| 1:20 | Pack the build |
| 0:30 | Run quick manual tests/slow automated tests |
| 3:00 | Run release pass smoke tests |
| 0:05 | Upload build for patching |
| 1:00 | Build is pushed to patchers |
| 1:00 | Servers are upgraded |
| 0:20 | A final quick test is made on each server |
| 7:35 | Total time to deploy |
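To double-check the total, here is a small sketch that adds up the durations from the table above. The helper names `to_minutes` and `to_hmm` are just illustrative, not part of any tool we use.

```python
# Step durations (h:mm) copied from the table above.
STEPS = [
    ("Run incremental build", "0:10"),
    ("Run quick automated tests", "0:10"),
    ("Pack the build", "1:20"),
    ("Run quick manual tests/slow automated tests", "0:30"),
    ("Run release pass smoke tests", "3:00"),
    ("Upload build for patching", "0:05"),
    ("Build is pushed to patchers", "1:00"),
    ("Servers are upgraded", "1:00"),
    ("A final quick test is made on each server", "0:20"),
]


def to_minutes(hmm: str) -> int:
    """Convert an 'h:mm' string to a number of minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hmm(total_minutes: int) -> str:
    """Convert a number of minutes back to 'h:mm'."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


total = sum(to_minutes(duration) for _, duration in STEPS)
print(to_hmm(total))  # 7:35
```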
If it takes seven and a half hours to deploy a new build you obviously aren’t going to get more than one of them out in an 8 hour work day. Let’s forget for a moment that IMVU is able to do this in 15 minutes and pretend that our target is an hour. For now let’s assume that we will spend 10 minutes on building, 10 minutes on automated testing, 30 minutes on manual testing, and the last 10 minutes actually deploying the build. We will also assume that this is the time to deploy to the test server and that this is slower than the actual push from test to live.
Without going into detail I am going to assume three architectural changes to make these targets much more likely. First, the way that the game uses packed files of assets will need to be changed to allow trivial incremental changes on every file in the game without any 80-minute pack process. I will assume that pack time can shrink to a minor part of deployment. The second assumption is that unlike Pirates, our theoretical continuously deployed MMO will not require every object in the database to be touched when a new build is deployed so the hour spent upgrading databases on the servers is reduced to about five minutes of file copying. The third architectural change I will assume involves moving almost all of the patching process into the client itself and out of the launcher. This eliminates both the transfer of bits to Southern California and the push of pre-built patches out to the patcher clusters around the world. I will go into each of these assumptions in my next post, but let’s just pretend for now that they appear magically and reduce our time accordingly.
The amount of time spent on automated testing in this deployment process is 40 minutes. Fortunately this work is relatively easy to spread out over multiple machines to reduce the total time involved. Assuming a test farm of at least five machines we should be able to accomplish our ten minute goal.
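As a rough sketch of that arithmetic: the code below assumes the 40 minutes of automated tests split perfectly evenly across machines, which real test suites rarely do. The function names are made up for illustration.

```python
import math

# Quick tests (10 min) plus slow tests (30 min), from the process described above.
AUTOMATED_TEST_MINUTES = 10 + 30
GOAL_MINUTES = 10


def wall_clock_minutes(total_minutes: float, machines: int) -> float:
    """Elapsed time if the work divides perfectly evenly across machines."""
    if machines < 1:
        raise ValueError("need at least one machine")
    return total_minutes / machines


def machines_needed(total_minutes: float, goal_minutes: float) -> int:
    """Smallest machine count that meets the goal under a perfect split."""
    return math.ceil(total_minutes / goal_minutes)


print(wall_clock_minutes(AUTOMATED_TEST_MINUTES, 5))          # 8.0
print(machines_needed(AUTOMATED_TEST_MINUTES, GOAL_MINUTES))  # 4
```

Under a perfect split four machines would land exactly on ten minutes, so a fifth machine leaves a couple of minutes of headroom for uneven test durations.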
It is possible that we could have multiple testers working in parallel to perform the manual tests more quickly. There are 230 minutes spent on testing in this sequence, so we would need 11.5 manual testers to do that amount of work in the time available, assuming perfect division of labor. There are two things wrong with that statement. The first is that perfect division of labor among 12 people is impossible so the number is probably closer to 20. The second problem is that having a team of 12 on staff to do nothing but test new builds with tiny incremental changes in them is not worth nearly what it costs you in salaries. A much more likely outcome is that the amount of manual testing in the build process is drastically reduced. In addition to that we can get our playerbase involved in the deployment testing if we swap things around a bit.
| Time | Step |
|---|---|
| 0:10 | Run incremental build |
| 0:10 | Run automated tests on five machines |
| 0:05 | Upload only changed files to game servers and patch servers |
| 0:05 | Deploy build on test server |
| 0:30 | Manual smoke test on public test server. Players can test at the same time. |
| 1:00 | Total time to deploy |
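To make the shape of this revised process a bit more concrete, here is a minimal sketch of it as a sequence of steps that stops at the first failure. The `Step` class, `run_pipeline`, and the `always_ok` stub are all hypothetical; a real pipeline would call out to the build system, the test farm, and the deployment scripts instead.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    budget_minutes: int
    action: Callable[[], bool]  # returns True if the step succeeded


def run_pipeline(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    Returns (overall success, names of the steps that completed).
    """
    completed: List[str] = []
    for step in steps:
        if not step.action():
            return False, completed
        completed.append(step.name)
    return True, completed


def always_ok() -> bool:
    """Hypothetical stand-in for real build, test, and deploy work."""
    return True


# Step names and time budgets come from the table above.
PIPELINE = [
    Step("Run incremental build", 10, always_ok),
    Step("Run automated tests on five machines", 10, always_ok),
    Step("Upload only changed files to game servers and patch servers", 5, always_ok),
    Step("Deploy build on test server", 5, always_ok),
    Step("Manual smoke test on public test server", 30, always_ok),
]

total_budget = sum(step.budget_minutes for step in PIPELINE)
ok, completed = run_pipeline(PIPELINE)
print(total_budget, ok)  # 60 True
```

Stopping at the first failed step is what keeps a broken build from ever reaching the test server, which matters a lot more when you are pushing many builds a day.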