Cameras vs. Sensors

If you search for “augmented reality” in Google, most of the hits will involve systems that analyze the output of a video stream in order to figure out what to draw in the overlay and where to draw it. Sometimes the what and where are answered by the same marker (as in the endless YouTube AR clips). In the more interesting examples the camera is used to figure out where the camera itself is pointed in more general terms, and the overlay is drawn at positions in some sort of known coordinate space (as in PTAM or the recently announced MetaIO World). This latter approach is broadly termed visual odometry. It seems to be what most people think of when they refer to AR, and that is no surprise given how much academic AR research focuses on computer vision.

As Wikitude (and more recently Layar, Nearest Tube, and Wimbledon Seer) has shown us, there is another way. Making sense of a video stream is hard, particularly on a mobile device. Why not just use the non-camera sensors on that device (GPS receiver, tilt sensor, and compass) to provide the absolute position and orientation of the device and then look up nearby waypoints from some sort of database? This approach makes these applications more similar to map-based, location-aware apps (like Whrrl and Urban Spoon) than to those YouTube videos, but it’s not clear that users care.

Using sensors to determine position and orientation has key advantages. The first is that it works in more environments. While GPS often fails indoors, it works fine at night, at sea, and on most parts of the earth. Visual odometry has been shown to work relative to a start point — basically where you start up the tracking system — but not relative to an absolute coordinate system. GPS is also immune to nearby objects moving around. Real-world scenes are very dynamic, and moving cars, furniture, and people can throw off vision-based systems. Tilt sensors built from accelerometers and gyros are quite good at returning stable, accurate pitch and roll values. Compasses are somewhat less reliable due to their susceptibility to nearby magnetic fields and large chunks of metal, but they can still give you a reasonable approximation of heading. Tilt sensors and compasses also work fine both indoors and out.
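
To make that concrete, here is a minimal sketch of how a sensor-only tracker might fuse those readings into a single pose. Everything here is illustrative: the axis conventions, units, and field names are assumptions, not any particular device’s API.

    import math

    def pitch_and_roll(ax, ay, az):
        """Estimate pitch and roll (radians) from a 3-axis accelerometer
        reading taken while the device is roughly still, so gravity is the
        only acceleration being measured."""
        pitch = math.atan2(-ax, math.sqrt(ay * ay + az * az))
        roll = math.atan2(ay, az)
        return pitch, roll

    def device_pose(lat, lon, alt, ax, ay, az, magnetic_heading, declination):
        """Combine a GPS fix with tilt and compass readings into one absolute
        pose: where the device is and which way it is pointing."""
        pitch, roll = pitch_and_roll(ax, ay, az)
        true_heading = (magnetic_heading + declination) % 360.0
        return {
            "lat": lat, "lon": lon, "alt": alt,   # from the GPS receiver
            "pitch": pitch, "roll": roll,         # from the tilt sensor
            "heading": true_heading,              # from the compass
        }

Given that pose, drawing a waypoint is just a matter of computing its bearing and distance from the device and projecting that into screen space; no frame of video ever needs to be analyzed.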

On the other hand, vision-based tracking systems have advantages of their own. The biggest is accuracy. PTAM demonstration videos show accuracy down to a centimeter or less, and marker-based approaches do even better. Compare that to the two meters that represents the best possible accuracy of a GPS receiver. Those two orders of magnitude mean that GPS-based AR systems simply don’t work for objects that are less than ten meters away. The second advantage of vision-based systems is that there are many cases where it is impractical to know about all the objects in the user’s field of view. They aren’t there yet, but advanced computer vision techniques offer hope that one day a computer will be able to recognize any arbitrary object simply by looking at it. Until that day arrives, most items already carry a readily indexed marker in the form of a UPC code. GPS will never provide such a service, and even if every item in the world had an RFID tag there is no way that every person would have access to the database into which those tags are indices.

Despite those shortcomings, my belief is that pragmatism is going to result in GPS-based systems winning this fight. The fact is that today’s GPS-based solutions actually work in the general case, while vision-based systems have only worked well in very controlled demonstrations. If pioneering companies like SPRX Mobile and Mobilizy start to make money then capital is going to start flowing into this industry. Most of the new companies are going to follow the lead of the existing players and prefer GPS to computer vision. That will drive sensor-based approaches to improve faster than vision-based ones, which will encourage more investment, until vision-based AR tracking systems are left in the dust. One of those improvements could be Galileo, which is expected to offer positioning accuracy down to 2 cm. When vision researchers eventually solve the object recognition problem, those solutions will be integrated into the already existing AR platforms built on sensor-based trackers.

What do you think? Do you see vision-based systems coming out on top? Will non-camera sensors be king? Or is a hybrid system the only way to go long-term?

Microsoft gets it exactly wrong

The other day Craig Mundie, head of Microsoft Research, said “There will be a successor to the desktop; it will be the room.”

I think things are headed in exactly the opposite direction. Everything that has happened with computing and telephony in the last fifteen years has pointed away from engaging in these activities in a fixed location. Mobile computing is just the latest part of the overall trend, but it will be what finally eliminates the personal desktop computer once and for all.

Consider what effect the web had on the way people use computers. Back in the dark ages (a.k.a. the 1980s), if you wanted to do something on a computer you bought a disk with a piece of software, used that software to generate whatever data you needed, and stored that data on a second disk. If you were fortunate enough to be using software that was widely distributed (e.g. WordPerfect) you could take the data with you and access it on another computer. Unfortunately floppies were so small that only a dozen documents would fit on them, which meant that you generally had to make a conscious choice about which data to bring.

In the early days of the consumer web, email access was the same way. You could technically log into your POP3 account from anywhere, but it was really designed to be used from a single computer with all your contacts, old messages, etc. on it. The rise of Hotmail and other web-based email packages changed all of that. Suddenly you had access to your email from everywhere. Server-based email vendors like Microsoft eventually got on board, and now even corporate email is (or can be) web-based.

No-install software on the web still hasn’t turned out to be as all-encompassing as some people are predicting, which has given rise to another way for people to take their computing with them: the laptop. What percentage of people under thirty use a laptop as their primary personal computer? Two of my teenaged nephews got laptops for Christmas… I suspect they will never own their own desktops. As more applications become feasible on the web, netbooks are cropping up to replace laptops.

These days the only people still buying desktops are corporations (because they’re cheaper and frankly work better when you’re going to force the user to sit in the same place for 40 hours a week anyway) and gamers (because they’re more powerful). Gamers are already moving to laptops in a big way, and I expect the business world will follow within the next ten years.

Perhaps cloud-based services (like web email, in-browser productivity apps, etc.) would be well served by the computer-as-a-room, but web browsing isn’t as universal as it used to be. People tend to get uncomfortable when they don’t have their favorite browser, their set of plugins, and their bookmarks surrounding their web experience. It feels like even web browsing is a pretty personal experience.

What do you think? Do you see a place for these big computing experiences?

If Augmented Reality is the solution, what is the problem?

Although augmented vision is where my interest in the field started, I have really moved a bit beyond that now. These days when I say Augmented Reality I really mean wearable mobile computing with an interface that actually works.

With my new job at Valve came a new commute. I spend on the order of three hours a day riding on the bus, waiting for the bus, or walking to or from the bus stop. So far I have occupied my time with podcasts and paperbacks, but I am going to try to spend more time on my AR work starting today.  Thus, I am writing this while on my morning commute to Bellevue.

The mobile computing experience is poor for many reasons but they can be summed up as two things: Input and Output. Yes, that’s all. :)

On the input side the problem is that so much of what I want to do is driven by entering text. My phone (a Treo 650) has a great keyboard… for a phone. Even so, it really doesn’t compare to the experience of typing on a laptop or desktop keyboard. On my laptop, getting the text into the computer is not hindered by the mechanics of typing. I’m far from the fastest typist, but I can still type much faster than I can generate the words I want to type. Until a mobile platform offers the same level of comfort and speed it will never be suitable for writing anything longer than text messages and tweets. The idea of writing code via a phone keyboard is just absurd.

My laptop only works while I’m riding the bus or waiting someplace I can sit down. It is not an option at any stop where I wait while standing. That means I can’t use my laptop at a full half of the places where I wait for the bus. I’m not sure I have the ability to write while I’m actually walking without knocking over my fellow pedestrians (or getting hit by a car), but I certainly have a lot of downtime-while-standing when waiting for my transfer on the way home.

So the first problem I want to solve that I keep in my mental file labelled “Augmented Reality” is the ability to enter thousands of words of text comfortably while out and about.

The second problem is getting video output from the computer while out and about. My laptop does a good job for the part of the time when I can have it open. It’s hard to see the screen on a bright day, but fortunately we don’t get too many of those here in Seattle. The trouble is that I could use output from my computer in many more places than I can break out the laptop.

The screen on my Treo (or my iPod Touch) is a little better in some ways. It’s much more feasible to bring it out when I need to check a bus arrival time. Thanks to onebusaway.org and my phone I can get this information on demand. All I have to do is wake up my phone, open the web browser, hit one of the bookmarks I’ve saved for the bus stops I frequent, and wait 10–15 seconds while the page loads. If I’ve recently looked up that information for a stop I just need to wake up the phone and hit refresh to get updated information. It’s definitely better than looking at a clock and the never-very-accurate schedules posted at the stop itself.
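
The lookup itself is nothing more than an HTTP fetch. Here is a rough sketch of what the phone is doing on my behalf; the URL and the response fields are made up for illustration, not the actual onebusaway.org API.

    import json
    import urllib.request

    def arrivals_for_stop(stop_id):
        """Fetch upcoming arrivals for one stop.  The endpoint and the JSON
        shape here are hypothetical stand-ins for whatever the real service
        exposes."""
        url = "https://example.onebusaway.org/arrivals?stop=%s" % stop_id
        with urllib.request.urlopen(url, timeout=15) as response:
            data = json.load(response)
        # Assume the response looks like {"arrivals": [{"route": "550",
        # "minutes_away": 7}, ...]} and return it as (route, minutes) pairs.
        return [(a["route"], a["minutes_away"]) for a in data["arrivals"]]

The data is only one small request away; most of the time I spend goes to waking the phone, launching the browser, and loading a full web page.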

What I really want, however, is to just know this information. On my way home from work there are two questions I often ask:

  1. How far is the #550 from my stop? Do I need to run? – By the time I can see the bus it’s only about 40 feet from the stop, so having some notice that I’m going to miss it would really help.
  2. How long do I have before the #1, #2, #4, and #13 arrive at this stop? – Any of these buses will get me within walking distance of home; the only real difference is how far I have to walk. The #2 stops a half block from my house, so I prefer it, but if it is lagging behind the others by a wide enough margin it isn’t worth waiting for.

This kind of ambient awareness is where the augmented vision comes in. If it is within an hour of one of my usual riding times and I am at one of my usual bus stops, I want to see the current data from onebusaway.org for that stop. Big obvious columns of light that let me see the bus approaching from blocks away would be cool, but they probably don’t solve the problem as well as a 2D display on my personal HUD that I can glance at occasionally. “When will my bus arrive?” is just the most obvious question I want a constant answer to. Once that one is solved I imagine that many more will present themselves. (I also imagine that many of those will actually require some level of registration with the world.)
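
The trigger condition is simple enough to sketch. This assumes I have recorded my usual stops and boarding times somewhere; the stop names, coordinates, distances, and thresholds below are all invented for illustration.

    from datetime import datetime, timedelta
    from math import asin, cos, radians, sin, sqrt

    # Hypothetical personal data: stop id -> (lat, lon, usual boarding hour).
    USUAL_STOPS = {
        "bellevue-550": (47.615, -122.195, 17),   # illustrative coordinates
        "home-2":       (47.609, -122.336, 18),
    }

    def distance_m(lat1, lon1, lat2, lon2):
        """Great-circle distance in meters (haversine formula)."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 6371000 * 2 * asin(sqrt(a))

    def stops_to_show(lat, lon, now=None):
        """Return the usual stops worth putting on the HUD right now:
        nearby, and within an hour of when I normally ride."""
        now = now or datetime.now()
        hits = []
        for stop_id, (slat, slon, usual_hour) in USUAL_STOPS.items():
            close = distance_m(lat, lon, slat, slon) < 500   # within ~half a km
            usual_time = now.replace(hour=usual_hour, minute=0, second=0)
            if close and abs(now - usual_time) < timedelta(hours=1):
                hits.append(stop_id)
        return hits

Each stop that passes the test just gets its arrival data fetched and pinned to a corner of the display; nothing here needs the camera at all.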

Once I am wearing a head-mounted display I will probably use it for one more purpose. I would like to be able to block out the world in front of me once I am actually sitting on the bus. I am prone to visual distractions, and have a hard time focusing on much of anything when a bunch of people are around me. If I could occlude most of my field of view with whatever I’m working on, the distraction would be greatly reduced.

The problem that I am interested in solving is “Mobile computing sucks.” Location- and temporally-aware wearable computers with first-person displays are the solution to that problem.

Augmented Reality should be open

Over the past year I’ve spent a lot of time thinking about what piece of the augmented reality ecosystem would be the best to start a business around. I’m still not ready to take that jump, so in my case at least the answer is still “none yet”. However, in my exploring I keep coming up against a problem:

  1. The absolute most profitable place to be in augmented reality is the platform provider at the center of everything.
  2. The profit motives of that platform provider could set the development of AR back by about ten years.

A brief history of the web

Whether by design or happy accident, the core technologies behind the web (HTML and HTTP) are easy to implement and completely open. This meant that by the time Netscape came along, there were already browsers on the Macintosh (CERN’s and Mosaic), Windows (Mosaic), and X (CERN, Mosaic, Viola, etc.). There were also 200 active web servers, and port 80 accounted for more than 1% of the traffic on the NSF backbone.

That ecosystem meant that Netscape had to remain compatible with what already existed in order to succeed. Sure, they were selling licenses to their own software, which let them cash in on the shocking growth of the web, but the Netscape browser had to work just as well against pages served by HTTPD, IIS, Apache, and any other random web server anyone decided to write. The same thing was true from the other side. Netscape Now! buttons aside, website operators soon had to deal with at least two and possibly more different browsers, as well as various versions of each.

This made life interesting for web designers, but it was good for the web as a platform. The nature of the web meant that nobody had to convince somebody else to say “Yes” to get involved. There is no way that any one company (or any ten companies for that matter) could have even authorized, let alone managed, all of the initiatives that went on with the web between 1994 and 2000. There was just too much stuff happening.

The open nature of the web allowed the cost of innovation to be spread around to thousands of organizations around the world.  It also let anyone with enough cash to buy some hosting try out their big idea. Most of those ideas failed, of course, but when taken as a whole they succeeded beyond anyone’s wildest expectations.

I think that augmented reality has the potential to follow a growth curve with the same shape as the one the web followed. The web had very few institutional barriers standing in the way of its growth, and the AR ecosystem would do well to learn from that.

Open Augmented Reality

If the emerging augmented reality ecosystem wants to grow as quickly as the web it cannot include anyone who must say “Yes” to allow existing users to get a new capability. That implies a few things:

  1. Anyone can publish content into the system. There are no gatekeepers controlling the quality or appropriateness of published content.
  2. Clients from multiple vendors are able to view that content. Anyone who chooses to can write a new client that works with existing content.
  3. Servers from multiple vendors are able to respond to requests for data. Choosing server technology is primarily a decision for content providers to make and their choice is invisible to end users.
  4. The network itself is neutral to the data being transmitted across it. This means that mobile internet providers must not restrict content to a white-list of publishers they have partnerships with.
  5. There is no single central directory that all content (or every content provider) must be listed in to be available.

Note that this does not require that the software in question be open source. Open source software (in the form of Linux, HTTPD, Apache, Perl, PHP, and others) was instrumental in spreading the web far and wide. However, the personal computer revolution happened with little in the way of open source software and was just as rapid as the spread of the internet.

Open Standards

As VRML and many other standards over the years have taught us, developing a new standard from whole cloth is fraught with peril. It is even more difficult (as in the case of VRML) when there is not an existing standard that the new standard is intended to supplant. The AR community must avoid repeating the history of VRML. Fortunately there are existing standards that lend themselves well to the problems augmented reality developers are trying to solve.

The first of these is good old HTTP. As a transport protocol, HTTP fits the list above very well. The protocol is well understood, decentralized, and available in server or client library form for every platform. Minor new standards for querying location-specific data are already emerging.
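
To show how small the barrier to entry could be, here is a toy sketch of an AR content server speaking nothing but HTTP and JSON. The path, parameters, and data are all invented; the point is only that any client that can make a GET request can consume it, with no gatekeeper involved.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    # A toy in-memory content database of (lat, lon, name) placemarks.
    PLACEMARKS = [
        (47.6097, -122.3331, "Westlake Center"),
        (47.6205, -122.3493, "Space Needle"),
    ]

    class PlacemarkHandler(BaseHTTPRequestHandler):
        """Answer GET /nearby?lat=..&lon=.. with nearby placemarks as JSON."""

        def do_GET(self):
            parsed = urlparse(self.path)
            if parsed.path != "/nearby":
                self.send_error(404)
                return
            query = parse_qs(parsed.query)
            lat, lon = float(query["lat"][0]), float(query["lon"][0])
            # Crude nearness test; a real server would use a proper geo index.
            hits = [{"name": name, "lat": plat, "lon": plon}
                    for plat, plon, name in PLACEMARKS
                    if abs(plat - lat) < 0.05 and abs(plon - lon) < 0.05]
            body = json.dumps(hits).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), PlacemarkHandler).serve_forever()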

The second current standard that augmented reality developers can adopt and bend to their will is KML. KML is the file format that Google Earth uses to represent geocoded information. It has support for points, lines, and shapes. KML is an open standard and is supported by many GIS packages in addition to Google Maps and Google Earth. Google has open-sourced its own KML parsing library, so there is a place to start there too.
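
For a sense of scale, here is a minimal, made-up KML placemark and the handful of lines it takes to pull the point back out. The stop name and coordinates are illustrative only.

    import xml.etree.ElementTree as ET

    # A minimal, made-up KML document containing a single point of interest.
    KML = """<?xml version="1.0" encoding="UTF-8"?>
    <kml xmlns="http://www.opengis.net/kml/2.2">
      <Placemark>
        <name>Bellevue Transit Center, Bay 1</name>
        <description>Route 550 toward Seattle</description>
        <Point><coordinates>-122.1956,47.6154,0</coordinates></Point>
      </Placemark>
    </kml>"""

    NS = {"kml": "http://www.opengis.net/kml/2.2"}

    def placemark_points(kml_text):
        """Yield (name, lon, lat) for every Point placemark in a KML document."""
        root = ET.fromstring(kml_text)
        for pm in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
            name = pm.findtext("kml:name", default="", namespaces=NS)
            coords = pm.findtext("kml:Point/kml:coordinates", namespaces=NS)
            if coords:
                lon, lat = map(float, coords.strip().split(",")[:2])
                yield name, lon, lat

    for name, lon, lat in placemark_points(KML):
        print(name, lat, lon)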

Any augmented reality client that supports attaching web browser views (and therefore URLs) to locations can also take advantage of most other existing web standards for whatever happens to be in those browsers.

Is this how things are actually going?

So far I have seen very little discussion of how different augmented reality systems will work together. In large part that is the point of this post. But then there are also very few AR systems that exist outside of laboratories, so we could just be in the bad old proprietary hypertext system days of the late ’80s.

So far the AR systems that seem to be designed for lots of different kinds of data (Layar and Seer) have not announced any way for third parties to publish data for their clients. My Twitter exchanges with Raimo at SPRXMobile make me think that Layar is at least thinking about it. Hopefully they will turn out to be as open as I’ve outlined above.

How important do you think open AR standards are? Can an AR solution succeed without them?

Going to the Show

Have you ever seen Bull Durham? If not, watch this clip:

 
Over the past eleven years I have worked with many great people on many great projects.  While there has been plenty for those teams to be proud of, I have never worked on a hit.  I have never worked on a game with a marketing budget to speak of. I have never worked on a game with a 90+ Metacritic rating. I have never worked somewhere that could really afford to push a game back just to make sure it was right before it came out. In other words, I have never been to The Majors.

Well that is all about to change:  tomorrow is my first day at Valve Software as a programmer on Team Fortress 2.

I am excited to be going to Valve for many reasons.  I love their games. I am hugely impressed by their consistently high level of quality. I am excited to work with a whole new pile of very smart people. I am excited to learn how their freaky “we don’t have managers, or pure designers, or even job descriptions, really” development process works. I love their dedication to playtesting and really taking playtest feedback to heart.  This is a big opportunity to learn from people who have built some of the best games out there.

My career is about to take a big step forward. I am excited, nervous, and more than a little intimidated. I feel like I’m finally going to the show.