September 28, 2008


StackOverflow is amazing

Filed under: Engineering — Joe @ 8:21 am

A couple of weeks ago, Jeff Atwood and crew launched the public beta of Stack Overflow. Stack Overflow lets programmers ask questions and other programmers answer them. That’s it.  They just did it with a lot less suck than all the other programming community help sites: The ads are unobtrusive, there is no login requirement just to see an answer, answers are listed from best to worst instead of first to last, and anyone can edit a question or answer to make it better.

For instance, look at this question I asked about boost shared pointers. I have work-arounds for the problem in my code, but figured that there had to be a better way. Turns out that the boost experts on Stack Overflow knew exactly what I needed, and answered within a few hours.  Then some other people read the question, picked the best answer, and by voting it up made that answer appear prominently.  By the time I got back to check to see if my question had been answered, there was a clear winner. To make it even more prominent, I marked that answer as “accepted” and now it’s highlighted.

If you’re a programmer, I suggest you check it out. Next time you’re looking for the answer to a programming question, see if it’s been asked on Stack Overflow. If not, ask your question. I think you’ll be pleased with the results.

(Back in July I joined a company called Divide by Zero.  Now I’m singing the praises of a site called Stack Overflow.  Next thing you know I’ll be renaming my blog “Access violation”. :) )

September 18, 2008


Where does the money go?

Filed under: Production — Joe @ 6:23 am

Everybody knows that building an MMO involves a lot of people, takes a lot of time,  and costs a lot of money. Fewer people know how much money, how many people, and what all those people are doing for all that time.  I thought I would share a bit of my experience on Pirates and give you some idea what those big MMO budgets are spent on.  All of the data behind this post is up on Zoho Sheet; feel free to do whatever you want with it. (Also, Zoho is much cooler than the Google Docs spreadsheet.)

There are a few things I should mention up front: These are not the numbers from Pirates. We had a gradual ramp-up of people and project scope over the course of five years, and that’s definitely not the right way to do it. I also rearranged people to compensate for some of the staffing shortages we had on Pirates.

This “budget” also doesn’t include anything but people’s salaries. Most costs scale up with staff, including desktop hardware, office space, software, office server capacity, taxes, benefits, etc. You would need to add 40-50% to the dollar figures to take that into account.  I mostly care about the percentages, so I didn’t bother trying to include any of those items. The budget also doesn’t include any server, hosting, or bandwidth costs. Those can be pretty significant in the beta and live phases.

Most of the salaries were drawn from the 2007 Game Developer magazine salary survey. Those are:

Programmer $83,383
Artist $66,594
Designer $63,649
Producer $78,716
Tester $39,062

There were three functions I didn’t have any salary surveys to draw from, so I just made up some numbers. I assumed Community people make about the same as Designers and that Operations people make about the same as Programmers. I put Customer Service people down at $30k on average, which is much more than front-line CSRs and forum moderators, but less than supervisors and managers.

Community $63,649
Customer Service $30,000
Operations $83,383
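Because the same 40-50% overhead applies to everyone, it scales every dollar figure equally and leaves the percentages unchanged. If you do want rough fully-loaded costs, the arithmetic is just a multiplier on the salaries above. Here’s a minimal Python sketch; the 0.45 default is an assumed midpoint of that range, not a number from Pirates.

```python
# Average salaries from the lists above (the last three are the post's own estimates).
BASE_SALARY = {
    "Programmer": 83_383,
    "Artist": 66_594,
    "Designer": 63_649,
    "Producer": 78_716,
    "Tester": 39_062,
    "Community": 63_649,
    "Customer Service": 30_000,
    "Operations": 83_383,
}


def loaded_cost(role: str, overhead: float = 0.45) -> float:
    """Approximate annual cost of one person in `role`, including overhead.

    `overhead` is the fraction added for desktop hardware, office space,
    software, taxes, benefits, etc. The post suggests 0.40-0.50; the 0.45
    default is an assumed midpoint.
    """
    if overhead < 0:
        raise ValueError("overhead must be non-negative")
    return BASE_SALARY[role] * (1.0 + overhead)


# Print a low/high range per role using the 40% and 50% ends of the estimate.
for role in BASE_SALARY:
    low, high = loaded_cost(role, 0.40), loaded_cost(role, 0.50)
    print(f"{role:<17} ${low:>9,.0f} - ${high:>9,.0f}")
```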

I broke the project down into four phases: Pre-production, Production, Beta, and Post-Launch. Pre-production is a period where tools are being built and the game is being designed, but work has not yet started on much (if any) final content. Production is the Big Expensive Part, when tons of content people crank out all the final assets for the entire game. Beta is the period at the end of production where your game is exposed to external players in a significant way, and includes closed and open beta. Post-Launch is obviously the period after the game has launched.

I also divided the effort into three major areas: World, Systems, and Infrastructure. World is the construction of the game world, quests, dungeons, etc. Systems is the development of character classes or skills, character customization assets, UI, and game systems. Infrastructure is the less-glamorous stuff behind the scenes like core server code, operations tools, scheduling, and IT. I broke it down this way for the benefit of a future post that currently exists only in my head.
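Each chart below boils down to headcount × salary × time, summed per phase, area, or function and then divided by the total. The sketch below shows that calculation for functions (areas work the same way with a different key). The headcounts and phase lengths in PLAN are hypothetical placeholders, not the figures from the spreadsheet.

```python
from collections import defaultdict

# Salaries from the list above (Programmer, Artist, Designer only, for brevity).
SALARY = {"Programmer": 83_383, "Artist": 66_594, "Designer": 63_649}

# HYPOTHETICAL staffing plan: phase -> (months, {function: headcount}).
PLAN = {
    "Pre-production": (12, {"Programmer": 10, "Designer": 5, "Artist": 4}),
    "Production": (24, {"Programmer": 20, "Designer": 15, "Artist": 40}),
}


def cost_by_function(plan, salary):
    """Salary cost per function, summed over every phase in the plan."""
    totals = defaultdict(float)
    for months, staff in plan.values():
        for function, headcount in staff.items():
            totals[function] += headcount * salary[function] * months / 12
    return dict(totals)


def shares(totals):
    """Turn absolute costs into fractions of the grand total."""
    grand = sum(totals.values())
    return {name: cost / grand for name, cost in totals.items()}


for function, share in sorted(shares(cost_by_function(PLAN, SALARY)).items()):
    print(f"{function:<11} {share:6.1%}")
```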

Pre-production

The purpose of the pre-production phase is to figure out the answers to a bunch of questions before the team grows to its full size and things get really expensive. There is an emphasis on programmers and system designers in this phase, because they have the most questions to answer. As a result, programmers take up the biggest chunk of the pre-production budget: they are both numerous and expensive. System development is also a big focus in pre-production.

Pre-production (untitled chart) - http://sheet.zoho.com
Pre-production by Function - http://sheet.zoho.com

Production

This is where most of the money on the project will be spent, and most of that is spent on building the world:

Production by Area - http://sheet.zoho.com
Production by Function - http://sheet.zoho.com

Beta

During the beta phase other significant costs (such as operations and customer service) start to come in, but the content team is still going full bore on building out and bug fixing the world, so it doesn’t affect the numbers too much.

Beta by Area - http://sheet.zoho.com
Beta by Function - http://sheet.zoho.com

Post-Launch

The Post-Launch (Live) phase is included in these numbers primarily because it takes a while to get money out of your players, through the various intermediaries involved, and into your bank account. Even if you have 100k players on launch day, you won’t see revenue from them for a while. If the game isn’t at least self-sufficient beyond that, then you’re in trouble. The Live phase has fewer world artists, since you are no longer building the main world at that point, although this will vary depending on the game’s (paid or unpaid) expansion pack strategy.


Live by Area - http://sheet.zoho.com
Live by Function - http://sheet.zoho.com

Totals

Here are some graphs that break down the total amounts spent. See how World is the biggest chunk? That’s why everyone is talking about user-generated content.

Totals by Area - http://sheet.zoho.com
Totals by Function - http://sheet.zoho.com
Totals by Phase - http://sheet.zoho.com

How closely do these numbers reflect the budgets on MMOs you’ve seen? I’d love to hear your thoughts.

September 12, 2008


Austin GDC 2008

Filed under: Game Industry — Joe @ 10:06 pm

I’m heading down to Austin on Sunday along with half of Divide by Zero. This will be my first conference since leaving Flying Lab, so I imagine there will be a lot of funny looks. It will also be a chance to introduce all the DbZers to my conference friends.

I’m giving my Pirates Post-partum talk, which Jessica Mulligan picked as one of her Top Picks, so hopefully it’ll be pretty well attended. It looks like there are quite a few other great talks too… I’m looking forward to it.

Shoot me an email at joe on the programmerjoe.com domain if you’ll be down there and want to get together.  This is the first conference in a while that doesn’t have a bunch of pointless meetings with random middleware sales-people, so I will probably even have some time. :)

September 6, 2008


ServerDir 2.0

Filed under: Engineering — Joe @ 9:45 am

As I am putting together the architecture for the new game we’re building at Divide by Zero, I am spending a fairly significant amount of time thinking about where the weak spots in the Pirates architecture were. The servers in Pirates worked out pretty well, but I think I can do better the second time around.  This is the first of N posts describing how I intend to evolve Server Architecture v1 into Server Architecture v2.

By far the biggest scaling problem Pirates ran into right at the start of open beta was the Server Directory (ServerDir) database. This was the direct result of incredible naiveté on my part about how much load a single database could handle. The original design of ServerDir called for every process in every cluster to connect to one shared database and to update its own status in that database every five seconds. When you multiply that update by all the instanced zones in the game (plus other miscellaneous servers) you find that the database needs to handle thousands of updates per second from tens of thousands of connections. It turns out that Microsoft SQL Server is not up to the task. (There’s also the little problem that the single shared ServerDir database was a single point of failure for the entire service.)
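Here’s the back-of-the-envelope version of that load problem, assuming every process holds its own connection and writes one status row per interval. Only the five-second interval comes from the post; the 20,000-process figure is hypothetical. The second function shows why spreading the load across one database per cluster helped.

```python
UPDATE_INTERVAL_S = 5.0  # each process updates its ServerDir row every five seconds


def serverdir_write_rate(num_processes: int,
                         interval_s: float = UPDATE_INTERVAL_S) -> float:
    """Steady-state status writes per second against a single shared database."""
    return num_processes / interval_s


def per_database_rate(num_processes: int, num_databases: int,
                      interval_s: float = UPDATE_INTERVAL_S) -> float:
    """Writes per second per database when load is spread evenly across databases."""
    return serverdir_write_rate(num_processes, interval_s) / num_databases


# Hypothetical: 20,000 processes, each with its own database connection.
print(serverdir_write_rate(20_000))   # 4000.0 writes/s on one database
print(per_database_rate(20_000, 10))  # 400.0 writes/s each across 10 databases
```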

Figure: Original ServerDir design, with Pirates ServerDir on a single DB

When it became obvious that a single ServerDir was not going to work, we expanded the system slightly to split that single database into up to one database per cluster. This still put quite a bit of load on each ServerDir database, but there were now enough of them for SQL Server to keep up. This is the setup that Pirates was using when I left Flying Lab in July of 2008.

Figure: Final ServerDir design, with one DB per cluster

Within a cluster, the ServerDir database was used by a process called Big Brother to monitor the health of the cluster. Each physical server machine in the cluster has an instance of Big Brother running on it, and they automatically pick one of their number to be the primary Big Brother for the cluster. This process is responsible for deciding which other processes need to be launched, as well as clearing out the ServerDir entries for processes that have crashed. If you want more detail on the ServerDir system, it’s covered in Massively Multiplayer Game Development 2. I wrote an article on the Pirates architecture years before the game launched, and the design really didn’t change much.
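The post doesn’t say how the Big Brothers pick their primary, so the following is only an illustration of one simple approach: every instance tracks when it last heard from each peer, and the live instance with the lowest machine ID wins. BigBrother, machine_id, last_seen, and the 15-second timeout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BigBrother:
    """Hypothetical model of one per-machine Big Brother instance."""
    machine_id: int
    # Last time (in seconds) we heard from each peer, keyed by machine_id.
    last_seen: dict[int, float] = field(default_factory=dict)

    def heard_from(self, peer_id: int, now: float) -> None:
        self.last_seen[peer_id] = now

    def live_peers(self, now: float, timeout_s: float) -> set[int]:
        peers = {mid for mid, t in self.last_seen.items() if now - t <= timeout_s}
        peers.add(self.machine_id)  # an instance always considers itself alive
        return peers

    def primary(self, now: float, timeout_s: float = 15.0) -> int:
        """Assumed rule: the live instance with the lowest machine_id is primary."""
        return min(self.live_peers(now, timeout_s))

    def is_primary(self, now: float, timeout_s: float = 15.0) -> bool:
        return self.primary(now, timeout_s) == self.machine_id
```

Every instance computes the same answer as long as they all see the same set of live peers. A rule this simple can elect two primaries during a network partition, so a real implementation needs more care than this.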

Figure: ServerDir inside a cluster

ServerDir 2.0

There are four fundamental problems with the original ServerDir that I intend to fix in version 2.0. The first is the reliance on a database as the point of synchronization. Databases are not built for this kind of transient data, so they handle it poorly. The second problem is the way the Big Brothers communicate with each other via UDP (the dashed lines above indicate non-persistent or UDP connections). This pointlessly complicated the protocol between Big Brothers by requiring them to compensate for dropped network packets. The third goal for the new ServerDir is driven by broader architectural changes I want to make: specifically, I want to promote “shard” from an operations-level concept to one that lives entirely in game design and UI. That will require far more machines, with far more processes per cluster, and ServerDir will need to cope. The fourth and final fix is that the old version of Big Brother does a pretty poor job of dealing with hung processes. We had some of those during Beta, and the operations staff had to deal with them by restarting clusters regularly and running scripts to kill all the zombies. What follows is a sketch of my initial design for how to accomplish all this.

Figure: ServerDir v2.0

The biggest change here is that individual cluster processes no longer connect to ServerDir directly. Instead they open a persistent connection to their local Big Brother, and Big Brother updates ServerDir on their behalf. Part of this change is that the “every five seconds” updates never go into ServerDir at all. ServerDir is notified of only two events for processes: process started and process stopped. All of the “is this process hung” detection is now the job of each individual Big Brother. While a cluster process is up, it sends periodic updates to Big Brother, and if none arrive for too long, Big Brother kills the process and cleans up ServerDir.
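A minimal sketch of that hung-process detection, assuming each local process sends a heartbeat over its persistent connection to Big Brother. The class and callback names are hypothetical; kill and notify_serverdir stand in for whatever actually terminates a process and reports to ServerDir. Only the “started” and “stopped” events ever reach ServerDir.

```python
import time
from typing import Callable


class ProcessWatchdog:
    """Hypothetical sketch of Big Brother's hung-process detection."""

    def __init__(self, timeout_s: float,
                 kill: Callable[[int], None],
                 notify_serverdir: Callable[[str, int], None],
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.kill = kill
        self.notify = notify_serverdir
        self.clock = clock
        self.last_heartbeat: dict[int, float] = {}

    def process_started(self, pid: int) -> None:
        self.last_heartbeat[pid] = self.clock()
        self.notify("started", pid)

    def heartbeat(self, pid: int) -> None:
        # Periodic updates stay local to Big Brother; ServerDir never sees them.
        if pid in self.last_heartbeat:
            self.last_heartbeat[pid] = self.clock()

    def process_stopped(self, pid: int) -> None:
        # Forget the process and tell ServerDir, but only once per process.
        if self.last_heartbeat.pop(pid, None) is not None:
            self.notify("stopped", pid)

    def check(self) -> list[int]:
        """Kill every process whose last heartbeat is older than the timeout."""
        now = self.clock()
        hung = [pid for pid, t in self.last_heartbeat.items()
                if now - t > self.timeout_s]
        for pid in hung:
            self.kill(pid)
            self.process_stopped(pid)
        return hung
```

Passing in the clock makes the timeout logic easy to test with a fake time source; in practice Big Brother would call check() on a timer.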

Another significant change is that instead of the point of synchronization being a database, the point of synchronization is a web service. Whether there is a database (or multiple databases) backing up that web service is entirely invisible to the tools and to the cluster processes. Using a stateless API with no persistent connections also makes the task of scaling the ServerDir resource much easier. With load balancers and some reasonable architecture on the back end, single points of failure and scaling problems with ServerDir itself can be all but eliminated.
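The web service itself is the subject of the next post, so this is only a hedged guess at its shape: a stateless WSGI app in which PUT /processes/<id> records a process start, DELETE /processes/<id> records a stop, and GET /processes lists what’s running. The paths and the in-memory REGISTRY are assumptions; REGISTRY stands in for whatever backing store (one database or several) sits behind the service.

```python
import json

# Hypothetical backing store. Callers never see it, so it could be replaced
# by one or more databases without changing the API.
REGISTRY: dict[str, dict] = {}


def app(environ, start_response):
    method = environ["REQUEST_METHOD"]
    parts = [p for p in environ.get("PATH_INFO", "").split("/") if p]

    def reply(status, payload):
        body = json.dumps(payload).encode("utf-8")
        start_response(status, [("Content-Type", "application/json"),
                                ("Content-Length", str(len(body)))])
        return [body]

    if parts[:1] != ["processes"]:
        return reply("404 Not Found", {"error": "unknown resource"})

    if method == "GET" and len(parts) == 1:
        return reply("200 OK", list(REGISTRY.values()))

    if method == "PUT" and len(parts) == 2:
        # "Process started": idempotent, so a retried request is harmless.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        info = json.loads(environ["wsgi.input"].read(length) or b"{}")
        REGISTRY[parts[1]] = {**info, "id": parts[1]}
        return reply("200 OK", REGISTRY[parts[1]])

    if method == "DELETE" and len(parts) == 2:
        # "Process stopped": also idempotent.
        REGISTRY.pop(parts[1], None)
        return reply("200 OK", {"id": parts[1], "stopped": True})

    return reply("405 Method Not Allowed", {"error": "unsupported"})
```

For a quick local test you can serve it with wsgiref.simple_server.make_server("127.0.0.1", 8080, app).serve_forever(). Because each request is self-contained and the PUT and DELETE handlers are idempotent, any instance behind a load balancer could handle any request, provided the instances share a real backing store rather than this in-memory dictionary.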

My next post will go into much greater detail on the new web service and how Big Brothers and operations tools interact with it. Once I’ve covered the new ServerDir plan I can get into my wacky new ideas for the game servers themselves.

What do you think? See any red flags in my high level sketch?