Archive for the ‘Engineering’ Category

ServerDir 2.0

As I am putting together the architecture for the new game we’re building at Divide by Zero, I am spending a fairly significant amount of time thinking about where the weak spots in the Pirates architecture were. The servers in Pirates worked out pretty well, but I think I can do better the second time around.  This is the first of N posts describing how I intend to evolve Server Architecture v1 into Server Architecture v2.

By far the biggest scaling problem Pirates ran into right at the start of open beta was the Server Directory (ServerDir) database. This was the direct result of incredible naiveté on my part about how much load a single database could handle. The original design of ServerDir called for every process in every cluster to connect to one shared database and to update its own status in that database every five seconds. When you multiply that update by all the instanced zones in the game (plus other miscellaneous servers) you find that the database needs to handle thousands of updates per second from tens of thousands of connections. It turns out that Microsoft SQL Server is not up to the task. (There’s also the little problem that the single shared ServerDir database was a single point of failure for the entire service.)
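To make the load arithmetic concrete, here is a back-of-the-envelope sketch. The process count is a made-up figure chosen to fit "tens of thousands of connections", not a measured number from Pirates, and the function name is hypothetical.

```python
def updates_per_second(process_count: int, interval_seconds: float) -> float:
    """Steady-state write rate when every process updates once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return process_count / interval_seconds


# Hypothetical figure: 20,000 connected processes, each updating every 5 seconds.
print(updates_per_second(20_000, 5))  # 4000.0
```

Even with these rough numbers, the single database would be absorbing thousands of small writes every second before it served a single query.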

Figure: Original ServerDir design (Pirates ServerDir on a single DB)

When a single ServerDir was obviously not going to work, we expanded the system slightly to split that single database into up to one database per cluster. This still put quite a bit of load onto the ServerDir DB, but there were now enough of them to allow SQL Server to keep up.  This is the setup that Pirates was using when I left Flying Lab in July of 2008.

Figure: Final ServerDir design (Pirates ServerDir with one DB per cluster)

Within a cluster the ServerDir database was used by a process called Big Brother to monitor the health of the cluster. Each physical server machine in the cluster has an instance of Big Brother running on it, and they automatically pick one of their number to be the primary Big Brother for the cluster. This process is responsible for deciding which other processes need to be launched, as well as clearing out the ServerDir entries for processes that have crashed. If you want to read more about the specifics of the ServerDir system, you can read all about it in Massively Multiplayer Game Development 2. I wrote an article on the Pirates architecture years before the game launched, and it really didn’t change too much.

Figure: Pirates ServerDir inside a cluster

ServerDir 2.0

There are several fundamental problems with the original ServerDir that I intend to fix with version 2.0. First is the reliance on a database as the point of synchronization. Databases are not built for this kind of transient data, so they handle it poorly. The second problem is the way the Big Brothers communicate with each other via UDP. (The dashed lines above indicate non-persistent or UDP connections.) This pointlessly complicated the protocol between Big Brothers by requiring them to compensate for dropped network packets. Another goal for the new ServerDir is actually driven by broader architectural changes I want to make, specifically that I want to promote “shard” from being an operations-level concept to one that is entirely in game design and UI. That will require far more machines with far more processes per cluster, and ServerDir will need to cope. The fourth and final fix addresses the pretty poor job the old version of Big Brother does of dealing with hung processes. We had some periods during beta where we were getting some of those, and the operations staff had to deal with them by restarting clusters regularly and running scripts to kill all the zombies. What follows is a sketch of my initial design for how to accomplish all this.

Figure: ServerDir v2.0

The biggest change here is that individual cluster processes no longer connect to ServerDir directly. Instead they open a persistent connection to their local Big Brother, and Big Brother updates ServerDir on their behalf. Part of this change is that the “every five seconds” updates never go into ServerDir at all. ServerDir is notified of two events for processes: process started and process stopped. All of the “is this process hung” detection is now the job of each individual Big Brother. While a cluster process is up, it will send periodic updates to Big Brother, and if none arrive for too long a period of time, Big Brother will kill the process and clean up ServerDir.
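The hung-process detection described here could look roughly like the following. This is a minimal sketch, not the actual Big Brother code: the class layout, the callback names `notify_serverdir` and `kill_process`, and the 15-second timeout are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


class BigBrother:
    """Tracks local processes and reports only start/stop events upstream."""

    def __init__(self, notify_serverdir: Callable[[str, str], None],
                 kill_process: Callable[[str], None],
                 timeout_seconds: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._notify = notify_serverdir   # hypothetical: (event, process_id)
        self._kill = kill_process         # hypothetical: terminates a process
        self._timeout = timeout_seconds   # assumed value, not from the post
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def process_started(self, process_id: str) -> None:
        self._last_seen[process_id] = self._clock()
        self._notify("started", process_id)

    def heartbeat(self, process_id: str) -> None:
        # Periodic updates stay local; ServerDir never sees them.
        if process_id in self._last_seen:
            self._last_seen[process_id] = self._clock()

    def process_stopped(self, process_id: str) -> None:
        if self._last_seen.pop(process_id, None) is not None:
            self._notify("stopped", process_id)

    def reap_hung(self) -> List[str]:
        """Kill every process whose last update is older than the timeout."""
        now = self._clock()
        hung = [pid for pid, seen in self._last_seen.items()
                if now - seen > self._timeout]
        for pid in hung:
            self._kill(pid)
            self.process_stopped(pid)
        return hung
```

The point of the sketch is the split in traffic: `heartbeat` touches only local state, while ServerDir hears from Big Brother exactly twice per process lifetime.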

Another significant change is that instead of the point of synchronization being a database, the point of synchronization is a web service. Whether there is a database (or multiple databases) backing up that web service is entirely invisible to the tools and to the cluster processes. Using a stateless API with no persistent connections also makes the task of scaling the ServerDir resource much easier. With load balancers and some reasonable architecture on the back end, single points of failure and scaling problems with ServerDir itself can be all but eliminated.
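As a rough illustration of why statelessness helps, here is one way a request handler for such a service might be shaped. Everything in it is hypothetical: the `/processes/<id>` paths, the `cluster` and `host` fields, and the dictionary standing in for the backing store are assumptions, not the design of the real web service.

```python
from typing import Dict, Tuple

# Stand-in for whatever shared storage sits behind the service (an assumption).
Store = Dict[str, dict]


def handle_request(store: Store, method: str, path: str,
                   body: dict) -> Tuple[int, dict]:
    """Handle one request using only its arguments and the shared store."""
    prefix = "/processes/"
    if not path.startswith(prefix) or len(path) == len(prefix):
        return 404, {"error": "unknown path"}
    process_id = path[len(prefix):]

    if method == "PUT":      # process started; repeating it is harmless
        store[process_id] = {"cluster": body.get("cluster"),
                             "host": body.get("host")}
        return 200, {"id": process_id}
    if method == "DELETE":   # process stopped; also safe to repeat
        store.pop(process_id, None)
        return 200, {"id": process_id}
    if method == "GET":
        entry = store.get(process_id)
        if entry is None:
            return 404, {"error": "not found"}
        return 200, entry
    return 405, {"error": "method not allowed"}
```

Because the handler keeps nothing between calls, any replica behind a load balancer could serve any request, and a Big Brother that retries after a timeout does no harm.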

My next post will go into much greater detail on the new web service and how Big Brothers and operations tools interact with it. Once I’ve covered the new ServerDir plan I can get into my wacky new ideas for the game servers themselves.

What do you think? See any red flags in my high level sketch?

This is why I’m a programmer

Gustavo Duarte sums it up.

Five Kinds of Programmers

I recently had a conversation with one of the long-time programmers on Pirates that got me thinking about how I think about programmers. Over the course of my career I’ve run into several archetypes of professional programmers. I thought it might be interesting to formalize my thinking on the subject, and this is the result.

The Researcher

These programmers are more scientist than engineer. If your organization has a research lab, it is probably stocked with Researchers. Since academia is just one giant lab, it is almost entirely filled with Researchers.

The Researcher loves to find solutions to problems that are poorly understood. They are on the bleeding edge of their technological specialty. If there are no papers out there that explain how to do something they will write one.

One downside of the Researcher is that there are so many interesting problems out there that need solving that they have trouble actually finishing any solution before they move on to the next thing. When you can get these guys to check in some code it’s usually great, but it takes them far longer than it would take other kinds of programmers to actually implement anything. They are also the most likely archetype to suffer from Not Invented Here Syndrome.

The Explorer

Like the Researcher, the Explorer is unafraid of the poorly defined dark corners of technology. The key difference is that when the Explorer delves those depths it is to get things done, not for the joy of the exploration itself.

When you have a really thorny problem that you don’t know how to solve, this is the programmer you give it to. Explorers will dig into unfamiliar code-bases and problem domains with a shocking level of energy. These programmers are by far the quickest learners, and are a great resource for other programmers who are trying to come behind them into new territory.

The downside of Explorers is that their single-minded practicality can make their code a little sloppy. These programmers dedicate a lot more time to putting their current task behind them than they do to writing code they would want to maintain years down the road. This doesn’t mean that the code won’t work, but that if an extra #include or circular dependency will save an hour the Explorer is always tempted to cut that corner.

The Craftsman

The highest quality code in your code-base was probably written by a Craftsman. Your QA department loves Craftsmen. They value the quality of their work above all else.

When a new system just has to work, you give it to a Craftsman. They will do a great job coding it, and then test it until it is perfect. Craftsmen are absolutely the best programmers when it comes to handling exceptional conditions and corner cases. In my experience Craftsmen also excel at writing maintainable code because they know that they’re going to have to come back to it someday.

Unfortunately all that quality comes at a price. The Craftsmen on your team are the slowest programmers you have. When they estimate tasks they generate the most accurate estimates, but also the biggest. (Partly because they always include the debugging time that everyone else hopes won’t be necessary.) Their emphasis on quality and reliability also means that Craftsmen are terrified of unfamiliar parts of the code-base or poorly defined problems.

The Activist

You know that guy on your team who is pushing Test-Driven Development, is constantly refactoring code, and actually uses the names of design patterns? That guy is your Activist. They are the driving force for architectural and process improvements on your team.

Activists want the code quality in your project to be as high as it can be. They give tough code reviews, and even tougher design reviews, but that’s a good thing. Every time someone on the team listens to the Activist, they are improving as a programmer.

On the other hand, their ceaseless pursuit of perfect code hurts the productivity of the Activist. Quick hacks are physically painful to them, even when that is exactly what the situation calls for. Paradoxically, they also often introduce bugs with their refactoring that never would have come up otherwise. (On the plus side, the refactoring makes fixing those bugs far easier.)

The Workhorse

In their various ways, all of the programmers above are sacrificing some of their capacity to their particular quirks. Workhorse programmers don’t do that. They are in a single-minded pursuit of adding as much to the system as possible, and as a result end up owning vast chunks of the code-base.

If you were to count lines of code per programmer, the Workhorses would come out ahead. (That’s assuming you don’t count generated code from the Activists.) Sheer output is the domain of this kind of programmer. If you have a few great Workhorses on your team you will be able to do things that other teams only dream of.

The dark flip side of what a great Workhorse can accomplish is that a bad one will do absurd amounts of damage to your code-base. Workhorses don’t have any significant dedication to quality that allows them to avoid doing bad things. Sometimes they make up for this by having enough time to build the system two or three times in the time that a Craftsman would build it once, but that’s always painful. A single bad Workhorse can do enough damage to negate the positive effect of one or two other programmers.

What kind of programmer are you?

You will notice that none of these archetypes are particularly bad or particularly good. There can be good or bad programmers of any archetype. All the teams I’ve ever been on have had a mix of archetypes. For that matter, very few programmers could be assigned to just one archetype.

Personally, I think I’m mostly a Workhorse with a little bit of Activist and Explorer mixed in. I am put to shame by the ability of some of the programmers around me to suss out how to do some radical new thing. I’m not hard-core enough about process or code quality to keep up with the Activists on the team. The one way I compete is on quantity, and most of that code is fortunately good enough to not doom any projects I’ve been on up to this point.

What about you? Where would you fit in this taxonomy? Do you recognize any programmers you know?

Pirates Post-partum at ION

At ION I gave a talk on our development process for Pirates. Darius Kazemi has posted a transcript of the talk. It’s also up at the Vault Network. I wonder how much buzz it’s going to get.

I’m giving the same talk at AGDC this year, so if you missed me at ION you can catch it there.

Scaling on a Dime at ION

I spent the week at the ION game conference. The first of my two speaking parts was a panel on scaling your development effort.  Darius live-blogged the thing.