If you search for “augmented reality” on Google, most of the hits will involve systems that analyze the output of a video stream in order to figure out what to draw in the overlay and where to draw it. Sometimes the what and the where are answered by the same marker (as in the endless YouTube AR clips). In the more interesting examples, the what comes from using the camera to figure out where the camera is pointed in more general terms and then drawing something positioned in some sort of known coordinate space (like PTAM or the recently announced MetaIO World). This latter approach is broadly termed visual odometry. It seems to be what most people think of when they refer to AR, and that is no surprise given how much academic AR research focuses on computer vision.
As Wikitude (and more recently Layar, Nearest Tube, and Wimbledon Seer) has shown us, there is another way. Making sense of a video stream is hard, particularly on a mobile device. Why not just use the non-camera sensors on that device (GPS receiver, tilt sensor, and compass) to provide the absolute position and orientation of the device and then look up nearby waypoints from some sort of database? This approach makes these applications more similar to map-based, location-aware apps (like Whrrl and Urban Spoon) than to those YouTube videos, but it’s not clear that users care.
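The core of that sensor-based approach fits in a few lines. Here is a minimal sketch (the function names, field of view, and screen width are my own illustrative assumptions, not any particular app's API): given the device's GPS position and compass heading, compute the bearing to a waypoint and map the difference onto a horizontal screen coordinate.

```python
import math

def bearing_to(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from the device (point 1) to a waypoint
    (point 2), in degrees clockwise from true north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360

def screen_x(device_heading, waypoint_bearing, fov_deg=60, screen_width=480):
    """Horizontal pixel position for a waypoint's label, or None if the
    waypoint falls outside the camera's (assumed) horizontal field of view."""
    # Signed angular offset in [-180, 180) between where the device points
    # and where the waypoint lies.
    offset = (waypoint_bearing - device_heading + 180) % 360 - 180
    if abs(offset) > fov_deg / 2:
        return None
    return int(screen_width * (0.5 + offset / fov_deg))
```

A waypoint dead ahead lands at the center of the screen; one off to the side slides toward the edge, and anything behind the user is culled. The vertical coordinate would come from the tilt sensor in the same way.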
Using sensors to determine position and orientation has key advantages. The first is that it works in more environments. While GPS often fails indoors, it works fine at night, at sea, and on most parts of the earth. Visual odometry has been shown to work relative to a start point (basically where you start up the tracking system) but not relative to an absolute coordinate system. GPS is also immune to nearby objects moving around. Real-world scenes are highly dynamic, and moving cars, furniture, and people can throw off vision-based systems. Tilt sensors built from accelerometers and gyros are quite good at returning stable, accurate pitch and roll values. Compasses are somewhat less reliable due to their susceptibility to nearby magnetic fields and large chunks of metal, but they can still give you a reasonable approximation of heading. Tilt sensors and compasses also work fine both indoors and out.
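To see why tilt sensing is the easy part, here is a sketch of how pitch and roll fall out of a static accelerometer reading. Gravity is the reference: when the device is still, the accelerometer measures which way "down" is, and two arctangents recover the tilt angles. (The axis convention below is an assumption for illustration; real devices vary.)

```python
import math

def pitch_roll(ax, ay, az):
    """Pitch and roll in degrees from a static 3-axis accelerometer reading,
    assuming x points right, y points forward, and z points up when the
    device lies flat, with readings in units of g."""
    pitch = math.degrees(math.atan2(ay, math.sqrt(ax * ax + az * az)))
    roll = math.degrees(math.atan2(-ax, az))
    return pitch, roll
```

Flat on a table, the reading is roughly (0, 0, 1) and both angles come out zero; tip the device nose-up and pitch grows accordingly. No equivalent trick exists for heading, which is why the compass, with all its magnetic-interference problems, is the weak link.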
On the other hand, vision-based tracking systems have advantages of their own. The biggest is accuracy. PTAM demonstration videos show accuracy down to a centimeter or less, and marker-based approaches do even better. Compare that to the two meters that represents the best possible accuracy of a GPS receiver. Those two orders of magnitude mean that GPS-based AR systems simply don’t work for objects that are less than ten meters away. The second advantage for vision-based systems is that there are many cases where it is impractical to know about all the objects in the user’s field of view. They aren’t there yet, but advanced computer vision techniques offer hope that one day a computer will be able to recognize any arbitrary object simply by looking at it. And until that day arrives, there are already readily indexed markers on most items in the form of UPC codes. GPS will never provide such a service, and even if every item in the world had an RFID tag, there is no way that every person would have access to the database into which those tags are indices.
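The ten-meter cutoff is just trigonometry. A quick back-of-the-envelope calculation (my own illustrative helper, not from any AR toolkit) shows the worst-case angular error an overlay suffers when the position estimate is off:

```python
import math

def overlay_error_deg(position_error_m, distance_m):
    """Worst-case angular misplacement of an overlay when the device's
    position estimate is off by position_error_m and the target object
    is distance_m away."""
    return math.degrees(math.atan2(position_error_m, distance_m))
```

With a 2 m GPS error and an object 10 m away, the overlay can be off by about 11 degrees, a large fraction of a phone camera's field of view, so the label visibly detaches from the object. At 100 m the same error is closer to one degree and barely noticeable.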
Despite those shortcomings, my belief is that pragmatism is going to result in GPS-based systems winning this fight. The fact is that today’s GPS-based solutions actually work in the general case, while vision-based systems have only worked well in very controlled demonstrations. If pioneering companies like SPRX Mobile and Mobilizy start to make money, then capital is going to start flowing into this industry. Most of the new entrants are going to follow the lead of the existing players and prefer GPS to computer vision. That will drive sensor-based approaches to improve faster than vision-based ones, which will encourage more investment, until eventually vision-based AR tracking systems are left in the dust. One of these improvements could be Galileo, which is expected to offer accuracy down to 2 cm. When vision researchers eventually solve the object recognition problem, those solutions will be integrated into already-existing AR platforms with sensor-based trackers.
What do you think? Do you see vision-based systems coming out on top? Will non-camera sensors be king? Or is a hybrid system the only way to go long-term?