Lag sucks

One thing I’ve gained through the beta process for Pirates is a healthy contempt for the word “lag”. This word is used in many different ways that have basically nothing to do with each other, and every time I hear it I have to ask, “What do you mean?” Even people who know better often end up using it because they’re repeating what players are saying about their trouble. The problem is that lag is used to describe at least three totally different things:

  1. Latency – Most often this is demonstrated to the player by noticeable command lag (I click Fire and it takes 2 seconds to happen) or rubber banding (I run around a corner and it pops me back where I was a few seconds earlier.) The cause of this latency could be in the server, the network around the servers, the internet, the player’s local network, or even in the client. It just means data either isn’t moving quickly enough, or isn’t being processed in a timely manner.
  2. Poor Client Frame Rate – Regular old crappy client performance. This happens when we’re trying to draw too much for the hardware the client is running on to handle. It could also be caused by doing too many other things on the client CPU and slowing the frame rate down. Frame rate problems are very common on low-end hardware.
  3. Hitching – Inconsistent client frame rate, usually including occasional frames that are half a second or more in length. This is caused by processing something slow in the same thread that’s responsible for drawing. In my experience that is usually loading a file. Sometimes this is made worse by the hardware the client is running on, but if there’s a hitch on one machine it’s probably there on another to some extent. As an added bonus, every time you hitch your camera may also go all wonky.

All that these three things have in common is that they are all Serious Problems We Should Fix Before Launch. They differ in the way you diagnose them, by which programmers are likely to work on the problem, and by what kind of information you need to gather from the players who are experiencing the problem. Until you know what kind of lag you’re dealing with, you’re really working blind.

Lag as Latency is the most painful of the three to deal with.  Chances are you never see these problems on your office network, so they mostly turn up “in the wild.”  The problem is that the wild is really wild.  Because a player’s network hardware can contribute so significantly to network latency, you often end up asking for intimate details about the player’s network topology: traceroutes from them to your  data center, make and model of all of their network equipment, packet traces, and maybe even who their ISP is.  On multiple occasions we’ve even had to procure network equipment that matched what the users had to try to reproduce the problem.

The biggest network latency problems we’ve had on Pirates all had to do with a combination of Network Address Translation (NAT) and game data sent over UDP.  Just about everyone runs NAT these days, so these problems could hit anyone. While NAT does a great job of holding its automatic port forwarding open for TCP connections, UDP has no connection to track.  Every hardware vendor seems to have its own idea of how to set up that forwarding, how long to keep it open, and what traffic to demand from the application to extend that time. The specifics really deserve a post all their own, but we’ve seen network code that works fine on one NAT device not work at all on most of the others.  We’ve seen code that keeps the port forwarding alive indefinitely on 80% of hardware stop reliably after 10 minutes on the remaining 20%.  About a year after we fixed that, we found another piece of hardware, fortunately relatively uncommon, that stopped forwarding UDP packets after just a few minutes. This is a problem that just seems to never end, and I fully expect we will still be sorting out network trouble on some new piece of NAT hardware five years after launch.
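
A common way to keep a NAT device's UDP forwarding from expiring is to send a small keepalive datagram on a timer whenever the connection might otherwise go quiet. The sketch below illustrates that general technique; it is not the fix that went into Pirates. The function name `start_udp_keepalive`, the one-byte payload, and the 15 second interval are all assumptions, and as described above no single interval is guaranteed to satisfy every device.

```python
import socket
import threading


def start_udp_keepalive(sock, server_addr, interval_s=15.0, payload=b"\x00"):
    """Send `payload` to `server_addr` every `interval_s` seconds.

    Returns a threading.Event; call .set() on it to stop the keepalives.
    The interval and payload are illustrative guesses, not values known
    to work on any particular NAT device.
    """
    stop = threading.Event()

    def loop():
        # Event.wait() returns False on timeout and True once stop is set,
        # so this sends one datagram per interval until asked to stop.
        while not stop.wait(interval_s):
            try:
                sock.sendto(payload, server_addr)
            except OSError:
                break  # socket was closed; nothing left to keep alive

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

The server has to recognize and ignore the keepalive payload. Because the traffic a device demands to extend its forwarding varies by vendor, a client-to-server packet alone may not be enough on some hardware.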

Slow frame rates are on the opposite end of the spectrum.  Standard performance tools (like profilers) and tools provided by graphics hardware vendors (like NVPerfHud) do a great job on this kind of problem. Finding the cause of a poor frame rate is relatively easy as a result, and all you typically need from the player is a description of where they were and what they were looking at. A screen shot can often do the trick.

Actually fixing a slow frame rate can be a much bigger deal.  If you have to get the art team involved you are going to waste tens or hundreds of hours of somebody’s time redoing artwork.  Fortunately you can see most of these problems coming long before you’ve built all the assets. That’s why it’s so important to be testing on your min spec the whole way through development.

Hitching is a bit more difficult to track down than a steady state frame rate problem.  There is usually some event that causes it, like a new character coming on screen, or a new part of the environment loading.  We’ve also seen hitching from server updates of health information, Lua garbage collection, and external applications that had nothing to do with our game. The profiler does a poor job of collecting information over a time span as short as half a second, so it’s typically useless at finding hitches.  Call graph analysis can help sometimes, but it tends to suffer from a long sample period too.  Your best bet is to log all events that are going on in the game and try to correlate the hitching with a small number of events.  Then you can instrument the code around those events and find the culprit. It’s a little more difficult to figure out hitches that only happen in the wild, but often if they’re happening to one player they’re happening to all players, so running against (semi-)public test servers can demonstrate the problem.
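
The log-and-correlate approach can be sketched in a few lines. This assumes the client records each frame's start time and duration plus a timestamped name for every game event; the function name, the log layout, and the event names in the usage note are hypothetical. The half-second threshold comes from the definition of a hitch earlier in this post.

```python
from collections import Counter


def events_near_hitches(frames, events, hitch_ms=500.0):
    """Count which events occurred during hitch frames.

    frames: list of (start_time_s, duration_ms) tuples, one per frame.
    events: list of (time_s, name) tuples from the game's event log.
    Each event name is counted at most once per hitch frame.
    """
    counts = Counter()
    for start_s, duration_ms in frames:
        if duration_ms < hitch_ms:
            continue  # a normal frame, not a hitch
        end_s = start_s + duration_ms / 1000.0
        names = {name for t, name in events if start_s <= t <= end_s}
        counts.update(names)
    return counts
```

Calling `events_near_hitches(frames, events).most_common(3)` gives a short list of suspects, such as a hypothetical `load_character` event, to instrument next. An event that shows up in nearly every hitch frame and rarely elsewhere is the best candidate.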

Once you figure out what’s causing the hitch, fixing it is generally not that hard. If you can intentionally cause that event to happen hundreds of times a second while you run the profiler you’ll find out where the slow code is.  You may need to time-slice an algorithm, move some work to a background thread, or speed up the work itself.  After a couple of years of tracking every hitch down to the Lua garbage collector we eventually tossed Lua out on its ear and fixed that problem.
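
As an illustration of time-slicing, the sketch below wraps slow work in a generator that yields after each small unit, and a per-frame driver that stops once a time budget is spent. The names `run_sliced` and `load_assets`, and the 2 ms budget, are assumptions for the example rather than anything from the Pirates code base.

```python
import time


def run_sliced(work, budget_ms=2.0, clock=time.perf_counter):
    """Advance the generator `work` until this frame's budget is spent.

    Returns True when the work has finished, False if more remains.
    Call it once per frame with the same generator object.
    """
    deadline = clock() + budget_ms / 1000.0
    for _ in work:
        if clock() >= deadline:
            return False  # out of time; resume on the next frame
    return True


def load_assets(names, loaded):
    """Hypothetical slow job, split into one small unit of work per yield."""
    for name in names:
        loaded.append(name.upper())  # stand-in for the real per-item work
        yield
```

The main loop creates the generator once, for example `job = load_assets(names, loaded)`, and calls `run_sliced(job)` every frame until it returns True. This only helps when each unit of work is much shorter than a frame; a single blocking file read would still need a background thread.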

    And yet all of these problems with all of their myriad sources, diagnostics, and solutions are all just lag to the player. Almost every time you hear a report of lag you are going to ask the following diagnostic questions:

    • What does the in-game frame rate counter show when you see this?
    • What is your ping time when this happens?
    • Does the whole screen freeze? (In our case I usually ask if the ships are still rocking or if the ocean is still moving.  These are really obvious hitch indicators in Pirates.)
    • Is your character popping around?
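
A rough sketch of how answers to these four questions might map onto the three kinds of lag follows. The function name and both thresholds (20 frames per second, 250 ms ping) are assumptions chosen for illustration, not numbers from Pirates.

```python
def triage_lag(fps, ping_ms, screen_freezes, character_pops,
               min_fps=20.0, max_ping_ms=250.0):
    """Return the kinds of lag suggested by a player's answers.

    The thresholds are illustrative guesses; tune them for your own game.
    """
    kinds = []
    if screen_freezes:
        kinds.append("hitching")         # whole screen stops, then resumes
    if fps < min_fps:
        kinds.append("poor frame rate")  # consistently slow rendering
    if character_pops or ping_ms > max_ping_ms:
        kinds.append("latency")          # rubber banding or slow round trips
    return kinds or ["unknown"]
```

For example, `triage_lag(60, 800, False, True)` returns `["latency"]`. A report can match more than one kind, which is why the function returns a list.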

    The answers to these questions will help you pin down which lag your player has. I’ve also found that there’s a good chance one of your players will be fairly technically savvy and can help you track it down further.  In one case we had a player rearrange his network and hook up a laptop above his router in the network.  With his packet traces from above and below the router we were able to see exactly what was happening and fix the problem.  His name is forever immortalized next to the code fix (and we sent him a nice thank you gift.)

    That’s why lag sucks.  It confuses users, customer service, and programmers alike. It’s a pain to diagnose, and often a pain to fix. And you can never really fix it because no matter what you do someone is always going to report that they are still having lag.

    ~Joe


    3 Responses to “Lag sucks”

    1. Tim commented on :

      I am duly educated!

      The thing is, “Hiiiiiiiitching!!” and “Pooooor Fraaaame Raaaaaaate!!” just don’t cut it when you’re whining quite like “Laaaaaaag!!” does :^) Your mission Joe, is to come up with some new words!

    2. Joe said on :

      How about Hlag (for hitching), Clag (for client frame rate), and Nlag (for latency). :)

    3. jtd wrote on :

      I really have been enjoying reading your blog over the last few days. I am fascinated by the behind the scenes insight into what it is like being a programmer for a large scale game.

      I work as a programmer writing APIs and firmware for embedded environments (right now I am working on chips that go into cell phones) and performance is one of the things we spend months of time getting right. Of course in my case the hardware, and to some extent the environment it is used in, is static. So designing worst case tests and tracking down performance glitches is significantly easier. We also do not get bug reports in the form of random internet fans/potential customers.

      Then again at the end of the day you can point to a cool screen shot of a beach or a massive ship of the line and say “I helped create that”. It is a bit less impressive to explain to people that their Blackberry can save pictures faster because of some code I wrote that they will never see.

      I have been running in the open beta with an old pc at home (1.6 ghz sempteron, 1 gig of ram, crappy 256 agp video card, 5400 highly fragmented hard drive) and I have to say I have been very impressed with the overall performance of PotBS. I do experience a slight loading lag when I zone, but once that has passed the framerate is nice and steady. The only time I experience any kind of slow down is in the open sea when there are dozens of ships sailing around but with my setup I expect such things.

      I also have never had the game crash on me even once. I keep forgetting I am actually playing in a Beta.

      Keep up the great work, both on the job side and on the blog side. The open communication is a breath of fresh air in this market.

      Claaaaaag! I like it.
