As I am putting together the architecture for the new game weâ€™re building at Divide by Zero, I am spending a fairly significant amount of time thinking about where the weak spots in the Pirates architecture were. The servers in Pirates worked out pretty well, but I think I can do better the second time around.Â This is the first of N posts describing how I intend to evolve Server Architecture v1 into Server Architecture v2.
By far the biggest scaling problem Pirates ran into right at the start of open beta was the Server Directory (ServerDir) database. This was the direct result of incredible naivetÃ© on my part about how much load a single database could handle. The original design of ServerDir called for every process in every cluster to connect to one shared database and to update its own status in that database every five seconds. When you multiply that update by all the instanced zones in the game (plus other miscellaneous servers) you find that the database needs to handle thousands of updates per second from tens of thousands of connections. It turns out that Microsoft SQL Server is not up to the task. (Thereâ€™s also the little problem that the single shared ServerDir database was a single point of failure for the entire service.)
Original ServerDir design
When a single ServerDir was obviously not going to work, we expanded the system slightly to split that single database into up to one database per cluster. This still put quite a bit of load onto the ServerDir DB, but there were now enough of them to allow SQL Server to keep up.Â This is the setup that Pirates was using when I left Flying Lab in July of 2008.
Final ServerDir design
Within a cluster the ServerDir database was used by a process called Big Brother to monitor the health of the cluster. Each physical server machine in the cluster has an instance of Big Brother running on it, and they automatically pick one of their number to be the primary Big Brother for the cluster. This process is responsible for deciding which other processes need to be launched, as well as clearing out the ServerDir entries for processes that have crashed. If you want to read more about the specifics of the ServerDir system, you can read all about it in Massively Multiplayer Game Development 2. I wrote an article on the Pirates architecture years before the game launched, and it really didnâ€™t change too much.
ServerDir Inside a Cluster
There are several fundamental problems with the original ServerDir that I intend to fix with version 2.0. First is the reliance on a database as the point of synchronization. Databases are not built for this kind of transient data, so they handle it poorly.Â The second problem is the way the Big Brothers communicate with each other via UDP (the dashed lines above indicate non-persistent or UDP connections.) This pointlessly complicated the protocol between Big Brothers by requiring them to compensate for dropped network packets. Another goal for the new ServerDir is actually driven by broader architectural changes I want to make, specifically that I want to promote â€œshardâ€ from being an operations-level concept to one that is entirely in game design and UI.Â That will require far more machines with far more processes per cluster, and ServerDir will need to cope. The fourth and final fix in the new ServerDir is that the old version of Big Brother actually does a pretty poor job of dealing with hung processes. We had some periods during Beta where we were getting some of those, and the operations staff had to deal with them by restarting clusters regularly and running scripts to kill all the zombies.Â What follows is a sketch of my initial design for how to accomplish all this.
The biggest change here is that individual cluster processes no longer connect to ServerDir directly. Instead they open a persistent connection to their local Big Brother, and Big Brother updates ServerDir on their behalf. Part of this change is that the â€œevery five secondsâ€ updates never go into ServerDir at all.Â ServerDir is notified of two events for processes: process started and process stopped. All of the â€œis this process hungâ€ detection is now the job of each individual Big Brother. While a cluster process is up, it will send period updates to Big Brother, and if none arrive for too long a period of time, Big Brother will kill the process and clean up ServerDir.
Another significant change is that instead of the point of synchronization being a database, the point of synchronization is a web service. Whether there is a database (or multiple databases) backing up that web service is entirely invisible to the tools and to the cluster processes. Using a stateless API with no persistent connections also makes the task of scaling the ServerDir resource much easier. With load balancers and some reasonable architecture on the back end, single points of failure and scaling problems with ServerDir itself can be all but eliminated.
My next post will go into much greater detail on the new web service and how BigBrothers and operations tools interact with it. Once I’ve covered the new ServerDir plan I can get into my whacky new ideas for the game servers themselves.
What do you think? See any red flags in my high level sketch?