How to make Microsoft SQL Server cry like a baby

Earlier this year we switched from MySQL to MS SQL Server. I don’t regret the switch at all; MS SQL Server has been far more stable than MySQL was, and has lots of whizzy new features. The MySQL client library was dropping connections under load and then crashing when it reconnected. That is what pushed us to switch in the first place. Well it turns out that MS SQL Server has some scaling problems of its own. It doesn’t crash, but it does get so slow as to be non-functional. This is a helpful guide that will help you make your own installation of SQL Server whimper.

Our server boxes are 8-way 2.6GHx Xeons with 16GB RAM running Windows Server 2003 64-bit and SQL Server Enterprise Edition 64-bit. If your configuration is different your mileage may vary.

Technique #1

We are using a system called the Flogger to record gameplay event into a database. To make this happen, all server processes connect to one central DB and call a stored procedure per event. This works fine when the number of processes is low, as in under 500. When the load on a world instance grows the number of processes connecting to the flogger DB increases to 1200.

Exactly how long seems to vary from a few hours to a few days, but after a while at this load SQL server decides that it has had enough and stops accepting new connections. New processes starting up time out eventually and things generally start going badly on the servers. Once SQL Server starts timing out connections the only way we’ve found to get the database running again is to restart the SQL Server service. While it’s in this state the server is only using moderate server resources.

The way we’re working around this problem is to use files as a buffer between the server processes and the database. Every so often (depending on activity) each process will dump the events it wants to record out to file. Some time later (well under a second when there’s no load, but potentially longer on a well loaded cluster) another process that maintains a connection to the flog database reads the file, dumps it to the database, and then deletes the file. This eliminates the need for the game servers to connect to this database at all, so if it decides to go out to lunch the game is unaffected. It also makes the data collection more reliable by putting any backlog into one directory full of files instead of in memory on 1500 different processes spread across five server machines.

Technique #2

We have another database exhibiting similar problems, though not quite as severely. Each process in a game cluster connects to a shared database called Serverdir and uses the DB to report its status back to operations tools and the “keep everything running” processes. This data is strictly temporary and probably doesn’t belong in a database all, but Horrible Design Flaws That Are All My Fault aside, it’s just not that many queries and they’re all very simple selects and updates. This shouldn’t be a problem for server hardware as beefy as ours.

That argument doesn’t convince SQL Server, however. After a few days SQL Server pauses for a few minutes. The CPU goes to 0% and no queries return for the entire time it’s paused. Our code responds to that by closing things down because it can’t currently tell the difference between “Query takes over a minute” and “Crashed process.” At that point half the cluster shuts down.

We don’t have a great workaround for this one yet. We’ve been steadily reducing the load on the Serverdir database, but it doesn’t seem to take all that much load to make it happen. Our best bet is to make the code smarter and have it detect these situations. If it just sits tight for a few minutes everything will return to normal without needing to restart anything. Fortunately it only happens a couple times a week so while it’s something we definitely need to fix before launch it isn’t impacting beta tester’s ability to play.

Making an MMO scale is a pain

None of the profiling tools we’re using at the SQL Server or OS levels are much help with either of these problems. Nothing tells us why SQL Server is refusing connections, or why it’s refusing to work on queries. Most database books and websites think that a slow query is one that takes longer than a minute or two, but in our world that’s a dead process and a disappointed customer.

We have made great strides in scalability since the first stress test, but no matter how many things you fix there is always one more waiting to bite you on the ass. *sigh*  We’ll get it figured out and apart from these DB troubles everything is staying up quite well at this point. We have 43 more days until the pre-order head start, so there’s still plenty of time to get through this round of problems. Then we break through into the infinite!

My fix for the flogger scale problem is now ready for a code review, so I’m going home to play Rock Band.

~Joe


10 Responses to “How to make Microsoft SQL Server cry like a baby”

  1. Jeff Ward replied on :

    Hey Joe!

    Was talking to DB administrator friend of mine about your problem and he had few suggestion for you.

    He said to make sure you’re caching SQL connections. Managing new connections, login, authentication and security is very expensive – your SQL box has better things to do then overhead. By making a new connection per query or request, you’re treating SQL as if it were IIS, which its not – you want to take advantage of SQL connection pools

    He does say that you CAN use IIS as a proxy to SQL Server, and that might help.

    As for your query problems, he says he’d have to look at it, but if you want he said I can pass along his email.

  2. Joe thought on :

    We’re definitely keeping the connections open and re-using them. Each server process makes its connections at startup, queries on them a bunch of times, and closes the connection at shutdown. That’s still a fair number of new connections because most zones only last 15 minutes. When there are 1500 of them that’s 100 new zones per minute or 1.6 per second. Still not nearly as many as there would be if we didn’t persist connections though. :)

  3. Steve wrote on :

    My friend Jeff Ward pointed me at your blog, and you should be able to get substantially better performance on that hardware then what you’ve described with SQL Server 2005.

    Regarding your first problem, are you seeing a substantial change in your SQL Server buffer pool size during the connection lock? What connection authentication mechanism are you using? Are you using SQL Server connection pooling?

    The second indicates that you’re definitely blocking on something. There are a couple of things that can cause this to happen. The first is a lock that goes way out of scope, which you should be able to track with the profiler or the lock management views. The second is if you’re not managing an autogrow tranlog file and the OS is slow at allocating new volume space. The last is a bad IO drivers, which were common on some of last year’s HP servers if that’s appropriate to you.

  4. Joe commented on :

    We still aren’t sure exactly what was going on with problem #1.

    Problem #2 we have a better handle on. Steve was exactly right: The transaction log was autogrowing and that process was blocking all queries for a full minute. It seems that there is some issue with the auto-truncation of that log, so it’s growing forever. It takes a while to grow a 50GB file. :)

    Hopefully the dump-to-file solution will resolve the flogger DB problem and both these will be behind us.

  5. sidereal thought on :

    Also, as general SQL Server advice, perfmon your Current Disk Queue Length (Logical Disk), % Processor Time (Processor), Transactions/sec (SQLServer:Databases), User Connections (SQLServer:General Statistics), and Average Wait Time (SQLServer:Locks) all the time. Generally, when something starts crapping the bed you’ll be able to narrow down what to look for by checking which counters go to hell.

    And a quick sp_who active followed by slightly less quick Profiler run tend to be helpful.

    Your transaction log should be getting truncated if it’s getting backed up as part of the maintenance plan. If it hasn’t happened in a while or for some other reason got really big, you need to shrink it (see dbcc shrinkfile).

  6. BigRed thought on :

    You have got your recovery model set to simple or are running regular log backups aren’t you?

    If you have chance, backup your database then shrink your transaction log as small as possible while there is no activity then resize it to a proper size to remove fragmentation of the virtual log files and set the autogrow figure to a sensible size.

  7. Joe replied on :

    We’re running regular backups and truncating the transaction logs when they run. We just had a problem with the backups that was preventing the truncation. They’re working fine now, though.

  8. The Professor commented on :

    I know this thread is retired by now but I have to add that I have never seen a more appropriate use of the term “crapping the bed” before in my life. Anyone who has seen SQL go berzerk on a process can completely relate to this and the mess that results.

  9. Paul commented on :

    It sounds like you guys are randomly trouble shooting your stress test here.

    First thing is find out if the problem is CPU, I/O, Network related, etc…

    What are your average disk queue lengths sitting at?
    high, you have a disk I/O bottle neck (controller, slow disk, bad filegroup layout, etc…)

    What are your CPU averages? Over 70%
    Bad index, etc…

    What type of locking hints are you using in your queries? Is something blocking/dead locking you? Try sp_who2 stored procedure and look at the DBCC INPUTBUFFER({spid}) here.

    You never mentioned anything about the disk setup. Usually a DB has issue with I/O before CPU depending upon the type of system it is. Now if I was talking about a web server… I would probalby be more concerned about CPU.

  10. Joe replied on :

    Well that was over three years ago, so I don’t really remember. I’m much more wise about optimizing SQL schemas these days. (And would never build things that way again in many ways.)

Leave a Reply