November 24, 2007

How to make Microsoft SQL Server cry like a baby

Filed under: Day Job, Engineering — Joe @ 3:46 pm

Earlier this year we switched from MySQL to MS SQL Server. I don’t regret the switch at all; MS SQL Server has been far more stable than MySQL was, and has lots of whizzy new features. The MySQL client library was dropping connections under load and then crashing when it reconnected, which is what pushed us to switch in the first place. Well, it turns out that MS SQL Server has some scaling problems of its own. It doesn’t crash, but it does get so slow as to be non-functional. This guide will help you make your own installation of SQL Server whimper.

Our server boxes are 8-way 2.6GHz Xeons with 16GB RAM running Windows Server 2003 64-bit and SQL Server Enterprise Edition 64-bit. If your configuration is different, your mileage may vary.

Technique #1

We are using a system called the Flogger to record gameplay events into a database. To make this happen, all server processes connect to one central DB and call a stored procedure per event. This works fine when the number of processes is low, as in under 500. When the load on a world instance grows, the number of processes connecting to the flogger DB increases to 1200.

Exactly how long it takes seems to vary from a few hours to a few days, but after a while at this load SQL Server decides that it has had enough and stops accepting new connections. New processes starting up eventually time out and things generally start going badly on the servers. Once SQL Server starts timing out connections the only way we’ve found to get the database running again is to restart the SQL Server service. While it’s in this state the server is only using moderate server resources.

The way we’re working around this problem is to use files as a buffer between the server processes and the database. Every so often (depending on activity) each process will dump the events it wants to record out to file. Some time later (well under a second when there’s no load, but potentially longer on a well loaded cluster) another process that maintains a connection to the flog database reads the file, dumps it to the database, and then deletes the file. This eliminates the need for the game servers to connect to this database at all, so if it decides to go out to lunch the game is unaffected. It also makes the data collection more reliable by putting any backlog into one directory full of files instead of in memory on 1500 different processes spread across five server machines.
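A minimal sketch of that file-buffer idea, in Python for brevity. Everything named here is an assumption made for illustration: the function names, the `.flog` extension, the JSON encoding, and the `write_to_db` callback. The post does not describe the real implementation.

```python
import json
import os
import tempfile
import uuid


def dump_events(spool_dir, events):
    """Game-server side: write one batch of events to a spool file.

    The batch is written under a temporary name and then renamed, so the
    loader never picks up a half-written file.
    """
    os.makedirs(spool_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(events, f)
    final_path = os.path.join(spool_dir, uuid.uuid4().hex + ".flog")
    os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
    return final_path


def load_spool(spool_dir, write_to_db):
    """Loader side: push every finished spool file to the database.

    write_to_db is a hypothetical callable that stores a list of events and
    raises on failure. A file is deleted only after its events are stored,
    so a database outage just leaves a backlog of files on disk.
    """
    loaded = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".flog"):
            continue  # skip temp files that are still being written
        path = os.path.join(spool_dir, name)
        with open(path) as f:
            events = json.load(f)
        write_to_db(events)  # may raise; the file then stays for a retry
        os.remove(path)
        loaded += len(events)
    return loaded
```

Only the loader holds a database connection in this sketch. The game processes just call something like `dump_events` and move on, which is the property the workaround is after.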

Technique #2

We have another database exhibiting similar problems, though not quite as severely. Each process in a game cluster connects to a shared database called Serverdir and uses the DB to report its status back to operations tools and the “keep everything running” processes. This data is strictly temporary and probably doesn’t belong in a database at all, but Horrible Design Flaws That Are All My Fault aside, it’s just not that many queries and they’re all very simple selects and updates. This shouldn’t be a problem for server hardware as beefy as ours.

That argument doesn’t convince SQL Server, however. After a few days SQL Server pauses for a few minutes. The CPU goes to 0% and no queries return for the entire time it’s paused. Our code responds to that by closing things down because it can’t currently tell the difference between “Query takes over a minute” and “Crashed process.” At that point half the cluster shuts down.

We don’t have a great workaround for this one yet. We’ve been steadily reducing the load on the Serverdir database, but it doesn’t seem to take all that much load to make it happen. Our best bet is to make the code smarter and have it detect these situations. If it just sits tight for a few minutes everything will return to normal without needing to restart anything. Fortunately it only happens a couple of times a week, so while it’s something we definitely need to fix before launch, it isn’t impacting beta testers’ ability to play.
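One way the “sit tight” idea could look, sketched in Python. This is not our actual code: the `Watchdog` class, its method names, and the 60-second timeout are all hypothetical. The assumption is that the supervisor can probe the database separately from the per-process status reports, and that it refuses to declare anything dead while the database itself is not answering.

```python
import time


class Watchdog:
    """Tells 'the database paused' apart from 'a process crashed'."""

    def __init__(self, process_timeout=60.0, clock=time.monotonic):
        self.process_timeout = process_timeout
        self.clock = clock
        self.last_seen = {}           # process id -> time of last status report
        self.db_stalled_since = None  # set while the database is not answering

    def heartbeat(self, process_id):
        """Record that a process reported its status."""
        self.last_seen[process_id] = self.clock()

    def db_probe(self, ok):
        """Record the result of a cheap test query against the database."""
        now = self.clock()
        if ok:
            if self.db_stalled_since is not None:
                # The database came back: forgive the silence it caused by
                # shifting every deadline forward by the stall duration.
                stall = now - self.db_stalled_since
                for pid in self.last_seen:
                    self.last_seen[pid] += stall
                self.db_stalled_since = None
        elif self.db_stalled_since is None:
            self.db_stalled_since = now

    def dead_processes(self):
        """Processes that look crashed; always empty during a database stall."""
        if self.db_stalled_since is not None:
            return []  # sit tight until the database answers again
        now = self.clock()
        return [pid for pid, seen in self.last_seen.items()
                if now - seen > self.process_timeout]
```

The injectable `clock` is only there so the behavior can be exercised without waiting through a real stall.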

Making an MMO scale is a pain

None of the profiling tools we’re using at the SQL Server or OS levels are much help with either of these problems. Nothing tells us why SQL Server is refusing connections, or why it’s refusing to work on queries. Most database books and websites think that a slow query is one that takes longer than a minute or two, but in our world that’s a dead process and a disappointed customer.

We have made great strides in scalability since the first stress test, but no matter how many things you fix there is always one more waiting to bite you on the ass. *sigh*  We’ll get it figured out and apart from these DB troubles everything is staying up quite well at this point. We have 43 more days until the pre-order head start, so there’s still plenty of time to get through this round of problems. Then we break through into the infinite!

My fix for the flogger scale problem is now ready for a code review, so I’m going home to play Rock Band.

November 22, 2007

Not much of a mystery

Filed under: Off Topic — Joe @ 10:28 pm

Thanks to Amazon’s convenient “We ship games a day late” service our copy of Rock Band showed up yesterday.  We played for five or six hours yesterday. Far enough to get to but not yet play through the gig that kicks off the world tour.  In many venues in many cities across the country we’ve played the “crowd request mystery set.”  Every single one of those sets included Should I Stay or Should I Go by The Clash as one of the two songs.  Most of the time the other song was In Bloom by Nirvana.  I guess the mystery was why we kept playing those two songs over and over, not what the songs were going to be.

Of course if that and “drums are hard” are my only two complaints I guess that means I’m enjoying the game. :)

November 15, 2007

Now hiring for Operations

Filed under: Uncategorized — Joe @ 7:57 pm

We are looking to staff up some more in the operations department in preparation for our launch.  If you’ve always wanted to work on an MMO and are a whiz at IT, one of these two openings may be for you:

I’m not actually the one doing the hiring, so reply through the ads if you’re interested.

PotBS Stress Test this weekend

Filed under: Day Job — Joe @ 7:47 pm

We are running our second stress test this weekend, and so far it’s going quite well.  Fileplanet just opened it up to non-subscribers, so head on over and give the game a try!

November 11, 2007

Will Facebook bring back PBM games?

Filed under: Social Gaming, Game Design, Game Industry — Joe @ 5:55 pm

Earlier this year Facebook announced their new Facebook Platform, which lets developers build applications that users can add to their profiles and share with their friends. All these networks let you embed Flash into your page, but in Facebook’s case applications can take advantage of all the features of the network itself: news feeds, friend lists, profile details, etc. And Facebook happily allows you to run advertising or charge the users of your application, so you can monetize your users. Developers have created 7782 applications as of this writing.

Not to be outdone, Google announced a new API last week that is sort of the open-standard equivalent to the Facebook Platform. It’s called Open Social and a bunch of non-Facebook social networks and application providers (including MySpace… remember them?) signed on to support it. Network effects work like crazy on this kind of site, so it remains to be seen if Open Social can boost these other social networks, but to the application providers it doesn’t really matter. As long as both APIs support some of the same basic functionality, a developer might as well port their app to both standards.

Of course games are a common application that people write for the Facebook platform. The application tagging on Facebook is pretty crappy, but “gaming” accounts for 879 of those applications. The most common games are trivia games (which seem to exist for every NFL team), games where you “attack” other players and get a news item with the results, simple arcade games with leaderboards, and turn-based board games. Many games give you benefits in the game for inviting people to play, which helps to spread the games through the network very quickly.

The one thing that all these games have in common is that they’re incredibly shallow. That lets people get into them easily but it also keeps them from being particularly sticky. I haven’t seen any metrics on the subject, but it seems like most people tire of any given game within a few days or weeks and remove it from their profiles. The Vampires/Zombies/Werewolves/Slayers game is incredibly popular with more than 900,000 daily active users total, but even more people have moved on from the game to other things. An October 28 article on Free to Play reported that Food Fight had 36k active daily users. It now has less than 23k.

The way people use Facebook puts some serious restrictions on the type of game that can be integrated with it. While millions of people use Facebook every day, they don’t spend a huge amount of time there each day. Games that require all players to be online at the same time are at a serious disadvantage compared to games that work asynchronously. You might see FPS and RTS games on Facebook at some point, but they will never be as popular as “throw stuff at your friends” games simply because they have to be real-time to work.

One type of game seems to be entirely non-existent in the current crop of Facebook games: turn-based strategy games. There has always been a community of people playing these games, flying under the radar. Back before the web these were called Play By Mail, and Flying Buffalo sold many of them. These days they are more likely to be web-based daily turn or action-point based games. These games are perfectly suited to a platform like Facebook:

  1. They are asynchronous
  2. You can play them in minutes a day
  3. They are deep enough to retain players for months or years

The big question is whether or not someone can design a Play By Facebook game that is easy enough to get into to succeed. Most of the PBM and turn-based strategy games have been pretty intricate simulations of something or other and are generally not for the faint of heart. To succeed on Facebook a game needs to be something that a total novice can learn to play in minutes, because that’s all the time somebody’s friend is going to give the game before they move on to something else. Very few games can manage that while staying deep enough to keep players engaged long-term. There is an opportunity here for someone who can pull it off, though.

November 3, 2007

Scripting for Designers

Filed under: Engineering, Game Design — Joe @ 10:38 am

I started a kerfuffle on the subject of designers writing scripts. Since my original post was more about our experience with Lua than about scripting for designers I thought I would collect what I’ve already written in everyone else’s comment thread in one place.

Raph believes that designers should know how to write scripts. I agree completely. Games are more about algorithms than they are about art, sound, or databases, and knowing how to code at some level is going to help any system designer immensely. It will allow them to communicate with programmers more effectively, it will make their designs fit better within existing game or technical systems, and it will improve the quality of their designs overall.

Where I draw the line, however, is at actually shipping those designer-written scripts with the game. They are a fine prototyping mechanism, incredibly useful at creating gobs of data, and a brilliant simulation mechanism. Designer scripts are also often slower, more obtuse, and less maintainable than the equivalent script (or code) written by a professional programmer.

Does that mean I think designers have some mental deficiency that makes them write crappy code? Of course not. While there are some basic concepts of programming that require a certain talent to grok (pointers, branches, order of algorithms), by and large most scripting designers have that talent. What they lack is the experience required to write code that you can keep running for years on end. Programmers spend all day, every day on the subject of how to quickly write maintainable code that runs well. For designers, it’s at best a sideline. We put our programmers through a hard-core technical interview to try to determine if we want to put up with their code. Any designer who can pass that interview is welcome to write production code in my book.

A much better approach is to provide a rich mechanism for driving game logic with data and give designers reasonable tools to manipulate that data. That doesn’t mean designers are reduced to inputting tables of numbers. The data-driven systems we use in Pirates allow designers to add entire new game systems by combining existing building blocks. We also work closely with the designers to implement new blocks for them on a regular basis.
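To make “combining existing building blocks” concrete, here is a toy Python sketch. It is not the system we use in Pirates. The block names, the `SKILLS` table, and the ship fields are all invented for illustration. The assumption is that programmers write and register small blocks of behavior, and designers assemble new abilities purely as data.

```python
# Programmer-owned side: a registry of small, tested building blocks.
BLOCKS = {}


def block(name):
    """Decorator that registers a function as a named building block."""
    def register(fn):
        BLOCKS[name] = fn
        return fn
    return register


@block("damage")
def damage(target, amount):
    target["hp"] = max(0, target["hp"] - amount)


@block("slow")
def slow(target, factor):
    target["speed"] *= factor


# Designer-owned side: plain data, no code. Each step names a block and
# supplies its parameters. (Hypothetical skill, made up for this example.)
SKILLS = {
    "chain_shot": [
        {"block": "damage", "amount": 5},
        {"block": "slow", "factor": 0.5},
    ],
}


def apply_skill(name, target):
    """Run each step of a data-defined skill against the target."""
    for step in SKILLS[name]:
        params = {k: v for k, v in step.items() if k != "block"}
        BLOCKS[step["block"]](target, **params)
    return target
```

A designer adding a new skill only touches the `SKILLS` data. When the data can’t express something, the fix is a new registered block, which is the “implement new blocks for them” half of the arrangement.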

Damion mentioned that schedule constraints often lead to programmers changing their tune when it comes to designers writing scripts. Tight schedules are why we integrated Lua in the first place. I thought it would let us take advantage of the people in the office who were less overloaded to write some of the game. My current position on designer scripting is a direct result of that Lua integration.

One thing I discounted in the “let’s get some designers to write some scripts” approach was how valuable the designer’s time is. In most cases it’s easier to build a new system using our data-driven system than it would have been to implement the same system in Lua. When using data isn’t easier, a day or two of a programmer’s time can usually make it so. Our system design team is even more critically understaffed than our programming team, and by using data instead of code we can save them time.

Just about everyone has said, “It depends on your situation.” It certainly does. If you have a team of 5 and your lead designer is also your junior programmer, you would probably be well served to have that designer writing production code. In a more general case with more specialization among your staff, it’s a bad idea to plan on all your design hires having that level of programming ability. And if you reject all designers who don’t meet some minimum programming skill level you may find it hard to hire designers.

All in all, the Great Designer Script Debate of ’07 has been great. It’s nice to take a break from whining about how many users Second Life doesn’t have or how raid content in WoW is the best/worst thing to ever happen to MMOs. Who’s going to kick off the next kerfuffle?