Cameras vs. Sensors

If you search for “augmented reality” on Google, most of the hits will involve systems that analyze a video stream in order to figure out what to draw in the overlay and where to draw it. Sometimes the what and the where are answered by the same marker (as in the endless YouTube AR clips). In the more interesting examples, the what comes from using the camera to figure out where the camera is pointed in more general terms, and then drawing something positioned in a known coordinate space (as in PTAM or the recently announced MetaIO World). This latter approach is broadly termed visual odometry. It seems to be what most people think of when they refer to AR, which is no surprise given how much academic AR research focuses on computer vision.

As Wikitude (and more recently Layar, Nearest Tube, and Wimbledon Seer) has shown us, there is another way. Making sense of a video stream is hard, particularly on a mobile device. Why not just use the device's non-camera sensors (GPS receiver, tilt sensor, and compass) to provide its absolute position and orientation, and then look up nearby waypoints in a database? This approach makes these applications more similar to map-based location-aware apps (like Whrrl and Urban Spoon) than to those YouTube videos, but it's not clear that users care.
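The sensor-only pipeline described above can be sketched in a few lines: take the device's GPS fix and compass heading, compute the bearing to a waypoint from the database, and map the difference to a horizontal screen position for the overlay. This is a minimal illustration under stated assumptions, not code from any of the apps named here; the field-of-view and screen-width parameters are invented for the example.

```python
import math

def bearing_and_distance(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing (degrees clockwise from north) and
    haversine distance (meters) from the device at (lat1, lon1) to a
    waypoint at (lat2, lon2)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    bearing = math.degrees(math.atan2(y, x)) % 360.0
    dphi = phi2 - phi1
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2)
    distance = 6371000.0 * 2 * math.asin(math.sqrt(a))  # mean earth radius
    return bearing, distance

def screen_x(bearing, heading, h_fov_deg=60.0, screen_w=800):
    """Horizontal pixel position for the waypoint label, or None when the
    waypoint is outside the camera's (assumed) horizontal field of view."""
    delta = (bearing - heading + 180.0) % 360.0 - 180.0  # signed offset, -180..180
    if abs(delta) > h_fov_deg / 2:
        return None
    return int(screen_w / 2 + delta / h_fov_deg * screen_w)
```

A waypoint due east of a device facing east lands in the center of the screen; one 40° off to the side of a 60° lens is simply not drawn. The vertical position would come from the tilt sensor's pitch in the same way.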

Using sensors to determine position and orientation has key advantages. The first is that it works in more environments. While GPS often fails indoors, it works fine at night, at sea, and over most of the earth. Visual odometry has been shown to work relative to a start point (basically, where you start up the tracking system) but not relative to an absolute coordinate system. GPS is also immune to nearby objects moving around: real-world scenes are very dynamic, and moving cars, furniture, and people can throw off vision-based systems. Tilt sensors built from accelerometers and gyros are quite good at returning stable, accurate pitch and roll values. Compasses are somewhat less reliable because of their susceptibility to nearby magnetic fields and large chunks of metal, but they still give a reasonable approximation of heading. Tilt sensors and compasses also work fine both indoors and out.
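On the tilt side, pitch and roll can be estimated from a single accelerometer reading by comparing the measured axes against gravity. The sketch below assumes an idealized, noise-free sensor held still, and a common phone axis convention (x to the right, y up the screen, z out of the screen); real devices fuse in gyro data to cope with motion and noise.

```python
import math

def pitch_roll(ax, ay, az):
    """Estimate pitch and roll (degrees) from a static accelerometer
    reading (m/s^2). With the device flat and face-up, gravity shows up
    as az ~= 9.81 and both angles are zero."""
    pitch = math.degrees(math.atan2(-ax, math.hypot(ay, az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll
```

Standing the phone upright (gravity along y) reads as 90° of roll under this convention; the point is just that two of the three orientation angles fall out of cheap hardware with almost no computation, leaving only heading to the compass.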

On the other hand, vision-based tracking systems have advantages of their own. The biggest is accuracy. PTAM demonstration videos show accuracy down to a centimeter or less, and marker-based approaches do even better. Compare that to the roughly two meters that is the best a GPS receiver can do. Those two orders of magnitude mean that GPS-based AR systems simply don't work for objects less than ten meters away. The second advantage of vision-based systems is that there are many cases where it is impractical to know about every object in the user's field of view. They aren't there yet, but advanced computer vision techniques offer hope that one day a computer will be able to recognize an arbitrary object simply by looking at it. And until that day arrives, there are already readily indexed markers on most items in the form of UPC codes. GPS will never provide such a service, and even if every item in the world had an RFID tag, there is no way every person would have access to the database into which those tags are indices.

Despite those shortcomings, my belief is that pragmatism is going to result in GPS-based systems winning this fight. The fact is that today's GPS-based solutions actually work in the general case, while vision-based ones have only worked well in very controlled demonstrations. If pioneering companies like SPRX Mobile and Mobilizy start to make money, then capital is going to flow into this industry. Most of the new entrants will follow the lead of the existing players and prefer GPS to computer vision. That will drive sensor-based approaches to improve faster than vision-based ones, which will encourage more investment, until eventually vision-based AR tracking systems are left in the dust. One of those improvements could be Galileo, which is expected to offer accuracy down to 2cm. When vision researchers eventually solve the object recognition problem, those solutions will be integrated into the already existing AR platforms with sensor-based trackers.

What do you think? Do you see vision-based systems coming out on top? Will non-camera sensors be king? Or is a hybrid system the only way to go long-term?


9 Responses to “Cameras vs. Sensors”

  1. rouli commented on :

    It seems that GPS&Compass based AR is more mature, and imho, it augments reality much more than any marker-bound technique.
    On the other hand, there are some attractive vision-based AR-like applications, such as GetFugu or Nokia’s Point and Find, which limit themselves to identifying 2D images.
    I wonder why no one has created an application that merges the two techniques. It could be a killer tourist application – walk the streets of a European city using the GPS AR, then enter a museum and look at images via the computer vision AR. Neither is really what the academy would call AR, but it’s a step in the right direction.

  2. Noah Zerkin replied on :

    Hybrid. Taking the cue from our own perceptive abilities, I’d say that it makes sense to use as much data as one has at one’s disposal, within the limitations of one’s processing capabilities. Also, GPS is capable of more precise location fixing once you throw dGPS and eventually carrier-phase GPS into the mix. Feiner’s prototype system in the mid-90s featured dGPS. Perhaps with adoption of wearable AR systems we’ll see dGPS beacons domestically deployed more widely by the USCG, or deployed by somebody else with a bit more interest in the land-locked portions of the country. An ideal system might give accuracy weightings to all of the various elements of the equation based on environmental context. In urban areas, it might be possible to generate image-based “markers” from Google Street View, Google Earth, and MS Photosynth imagery to get a more precise fix than is possible with current consumer-accessible GPS. So the GPS and compass heading dramatically reduce the database set against which you need to compare your vision input; i.e., use GPS and magnetometer readings to set the context, and vision and/or local RF beacon triangulation for precision positioning where possible. In other words, I don’t really think it’s a “versus” situation at all, but one in which we’ll find more and more ways of using diverse and complementary data-sets to refine the gestalt. Sensors will only get cheaper and more accurate, embedded processors and GPUs faster, camera resolutions higher, served resources bigger, and mobile bandwidth broader. The key is in the knitting-together. It’s all about the convergence, methinks.

  3. Joe replied on :

    Hasn’t dGPS been more or less superseded by WAAS, at least in the US? From what I understand they use the same sort of approach, only with WAAS transmitting its correction signals through a couple of satellites. Carrier-phase GPS certainly has a lot of potential.

    local RF beacon triangulation

    Long-term, I think this is where it’s at. Like GPS they would work at night. Unlike GPS they could be sprinkled around indoors and anywhere else that it’s tough to get a GPS signal.

  4. serge wrote on :

    “When vision researchers eventually solve the object recognition problem …”

    Those problems were mostly solved long ago. The problem is how to squeeze those solutions into a mobile device.

  5. Joe wrote on :

    The problem of recognizing an arbitrary object in any orientation without restricting (reasonable) lighting conditions has been solved? I know there are plenty of examples of vision systems recognizing the set of specific objects they’ve been trained on, and some of those more restrictive systems are even on mobiles (like SnapTouch Explorer). I just haven’t seen an example of a general solution.

  6. Yulia Panina replied on :

    We are using exactly this approach in our Google ADC2 entry: lightweight pattern recognition combined with Android’s pitch and roll sensors, so far less computation is needed to calculate the extrinsic camera parameters.

  7. Alex Kasper said on :

    Arbitrary object recognition is far from being solved. Special cases in restricted environments have been solved, but I haven’t seen anything that’s close to being usable in an everyday environment.

  8. Thomas K Carpenter thought on :

    Hybrid of the two depending on the usage, mainly on the precision the project needs. If you’re just doing location-based information layers, then GPS is fine. If you’re getting down to the objects within those locations (like items in a grocery store), you’re going to need object recognition. As the technology improves, the usage will change too.

  9. Adam thought on :

    I don’t think it’s fair to say the GPS-based techniques will win since the applications are quite different. GPS-based techniques appear to be easier to implement so I agree that we will see more of these apps.

    In general, GPS-based AR is good for navigation, tours, and other large-scale applications, while vision-based AR is better for object recognition at smaller scales. General object recognition is still quite hard though, so it’s natural that there will be fewer of these apps out there. This doesn’t mean these apps wouldn’t be more useful if they were available.
