2D vs. 3D in a Gesture Recognition World

If you’re using or planning to use one of Omek’s motion tracking products, you may have wondered why our technology relies exclusively on 3D sensors. Why not use 2D color cameras?

2D RGB cameras are now a standard component in nearly every laptop and all-in-one computer, and they have a lower price point than 3D sensors. But while they are well suited to capturing images and video for recording or display, they have limitations when applied to the development of gesture-based systems.

At Omek, our goal is to enable the creation of natural and intuitive applications that incorporate motion tracking and gesture recognition. 3D sensors offer tangible advantages to reach this objective. For example, they can perform under a wide array of environmental and lighting conditions. 3D cameras also have the ability to detect fine, nuanced movements. Finally, with the cost of these sensors falling, it’s likely that consumers will start seeing these cameras appear in their personal computing devices as early as 2013.

The rest of this post will highlight a few of the key challenges that arise when we attempt to translate “human vision” into “computer vision” and how 3D sensors are better equipped to handle these issues.

RGB vs. Depth Images
Let’s start off by defining how a depth sensor differs from its 2D counterpart.

An RGB camera provides a 2D image that depicts the color of each point within its field of view. A depth sensor instead reports the distance from the sensor to each point within its field of view; in other words, it adds data on the “z” coordinate of every point the camera sees.

Here’s an example of the same image, rendered both as a color image (on the left) and as a depth image (on the right). In the depth image, lighter shades represent points closer to the camera, and darker shades represent points that are farther from the camera.
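
To make this concrete, here is a minimal sketch of how such a rendering can be produced. It assumes (as is common for consumer depth sensors, though details vary by device) that depth arrives as a NumPy array of distances in millimeters, with 0 marking pixels where the sensor returned no reading:

```python
import numpy as np

def depth_to_grayscale(depth_mm: np.ndarray) -> np.ndarray:
    """Render a depth map as an 8-bit grayscale image in which
    closer points are lighter and farther points are darker.

    Assumes depth_mm is a 2D array of distances in millimeters,
    with 0 marking pixels that returned no reading.
    """
    valid = depth_mm > 0
    if not valid.any():
        return np.zeros(depth_mm.shape, dtype=np.uint8)

    near, far = depth_mm[valid].min(), depth_mm[valid].max()
    # Normalize so the nearest point maps to 255 (white) and the
    # farthest to 0 (black), i.e. lighter = closer.
    scale = (far - depth_mm.astype(np.float32)) / max(far - near, 1)
    gray = (scale * 255).astype(np.uint8)
    gray[~valid] = 0  # show missing readings as black
    return gray
```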

A depth image can be used to construct a 3D model of its subject. In the case of a gesture-enabled application, this will usually be a 3D model of a hand or of a full person, which the user then employs to control the application. This 3D model is necessary both for 3D display and for gesture recognition.
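
How does a depth image turn into 3D geometry? Each pixel can be back-projected into a 3D point with the standard pinhole camera model. The sketch below illustrates the idea; the intrinsics fx, fy, cx, cy are hypothetical placeholder values for a 640x480 sensor, not the parameters of any particular camera:

```python
import numpy as np

def depth_to_points(depth_mm: np.ndarray,
                    fx: float = 580.0, fy: float = 580.0,
                    cx: float = 320.0, cy: float = 240.0) -> np.ndarray:
    """Back-project a depth image into a 3D point cloud using the
    pinhole camera model:

        X = (u - cx) * Z / fx,   Y = (v - cy) * Z / fy,   Z = depth

    fx, fy, cx, cy are sensor intrinsics; the defaults here are
    placeholders, not values for a specific camera.
    """
    v, u = np.indices(depth_mm.shape)      # pixel row/column grids
    z = depth_mm.astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)  # shape (H, W, 3)
    return points[depth_mm > 0]            # keep only valid readings
```

The resulting point cloud is the raw material that a tracking pipeline fits its hand or body model to.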

Translating Human Vision into Computer Vision
Seeing, and comprehending what we see, comes so naturally to humans that we often overlook what a subtle and sophisticated task it is. Translating “human vision” into “computer vision” is correspondingly complex.

However, 3D depth sensors are much better equipped than their 2D counterparts to overcome some of the inherent challenges of computer vision. We’ll highlight a few of them here.

Separating the “Object” from its Background
In the context of computer vision, we will be talking about “objects” – i.e., whatever it is we are tracking.  In the case of Grasp, it is hands and fingers, while with Beckon we are tracking a full human skeleton.

The first task a computer vision-based application performs in order to interpret objects is to differentiate between the object itself and the “background”. If an application relies solely on a 2D color image in order to find an object’s contour, it can be “confused” if there are similar colors in both the object and the background.

In the example above, you can see that in the color image, the man’s white sweater merges into the white background, making it difficult to separate his upper body from the background. Using the depth image on the right, however, the separation becomes trivial.
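
To show just how trivial, here is a minimal sketch, our own illustration rather than Omek’s actual segmentation code, that separates a user from the background with nothing more than a distance threshold:

```python
import numpy as np

def segment_foreground(depth_mm: np.ndarray,
                       max_range_mm: int = 1500) -> np.ndarray:
    """Return a boolean mask of the pixels closer than max_range_mm.

    A plain distance cut like this separates a white sweater from a
    white wall effortlessly, something no color-based rule can do.
    """
    valid = depth_mm > 0          # drop pixels with no depth reading
    return valid & (depth_mm < max_range_mm)
```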

Blocking
A related problem, known in computer vision as occlusion, arises when one object, or part of an object, blocks another from view. On the right you can see an example, where the person’s arm moves in front of his body. Similar colors and textures can make it hard to separate the two objects in a color image; with depth data, however, the software can handle the challenge much more easily.
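
As a toy illustration of why depth helps here (and not a description of how Grasp or Beckon actually work), the sketch below splits a foreground mask into a near layer and a far layer by looking for a gap in the depth values, roughly how an arm held in front of the torso could be peeled apart:

```python
import numpy as np

def split_depth_layers(depth_mm: np.ndarray, fg_mask: np.ndarray,
                       gap_mm: int = 100):
    """Split a foreground mask into a near layer (e.g. an arm in
    front of the body) and a far layer (the torso behind it).

    This naive split assumes the two parts are separated by at least
    gap_mm along the camera axis; a real tracker would fit a full
    body model instead.
    """
    depths = np.sort(depth_mm[fg_mask])
    if depths.size < 2:
        return fg_mask, np.zeros_like(fg_mask)

    # Find the largest jump between consecutive depth values; if it
    # exceeds gap_mm, treat it as the boundary between two layers.
    jumps = np.diff(depths)
    i = int(np.argmax(jumps))
    if jumps[i] < gap_mm:
        return fg_mask, np.zeros_like(fg_mask)   # one layer only

    cut = (depths[i] + depths[i + 1]) / 2.0
    near = fg_mask & (depth_mm < cut)
    far = fg_mask & (depth_mm >= cut)
    return near, far
```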

Differences in Lighting
Different light intensities and angles can radically change the appearance of an object, as demonstrated in the example below – it’s the same face, photographed under different lighting conditions.

This can make it difficult to separate objects from their background, or can keep an application from recognizing the same object under different lighting conditions. It can also cause parts of the object to be missed altogether, as with the left side of the person’s face in the left-hand image.

Collecting depth data would also be difficult under such conditions unless the sensor had an independent light source. For this reason, most 3D sensors come with their own light source, usually infrared.

Interpreting Size
Interpreting the size of an object from a 2D image alone is nearly impossible: a small object close to the camera can appear identical to a large object far from the camera.

Since a depth image provides a precise distance value for each point on the object, the object’s real-world size can be computed directly from the depth image. Size interpretation is important for differentiating between users and tracking each user continuously. For instance, a child who is close to the sensor cannot be confused with an adult who is far from the sensor.
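
The computation follows directly from the pinhole camera model: real size is proportional to pixel size times distance. A small sketch, with a hypothetical focal length standing in for real sensor intrinsics:

```python
def real_width_mm(pixel_width: float, depth_mm: float,
                  focal_length_px: float = 580.0) -> float:
    """Estimate real-world width from width in pixels and distance
    from the camera (pinhole model): width = pixels * depth / focal.

    focal_length_px is sensor-specific; 580 px is a placeholder.
    """
    return pixel_width * depth_mm / focal_length_px

# Two heads that are both 100 pixels wide, one at 1 m and one at 3 m,
# differ threefold in real size: enough to tell a child close to the
# sensor from an adult far away.
head_near = real_width_mm(100, 1000)   # ~172 mm
head_far = real_width_mm(100, 3000)    # ~517 mm
```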

Handling Object Orientation and Motion
While an application is tracking a non-stationary object, the object may change both its position and its orientation in relation to the camera, sometimes abruptly. The application must still identify the object as the same one that was tracked before the change.

This is especially true for human bodies, which may change posture as well as position and orientation.  By matching a theoretical 3D model (for example, a skeleton of a hand) to real-time 3D position data, an application can maintain a continuous representation of a moving object.
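
As a greatly simplified stand-in for that model matching, the sketch below keeps object identities stable across frames by pairing each 3D centroid from the previous frame with the nearest centroid in the current one; a production tracker fits a full 3D skeleton model to the depth data instead:

```python
import numpy as np

def associate(prev_centroids: np.ndarray, new_centroids: np.ndarray,
              max_jump_mm: float = 300.0) -> list:
    """Greedily pair tracked objects across frames by the distance
    between their 3D centroids (both arrays are N x 3, in mm).

    Returns (prev_index, new_index) pairs; an object whose centroid
    jumps farther than max_jump_mm is treated as lost.
    """
    pairs, used = [], set()
    for i, p in enumerate(prev_centroids):
        dists = np.linalg.norm(new_centroids - p, axis=1)
        for j in np.argsort(dists):        # nearest candidates first
            if int(j) not in used and dists[j] < max_jump_mm:
                pairs.append((i, int(j)))
                used.add(int(j))
                break
    return pairs
```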

Summary
Although 2D color cameras are sufficient for some types of visual input, at Omek we believe that 3D sensors have significant advantages in the domain of motion tracking and gesture recognition. Our products are built on 3D depth sensors because we’re convinced they enable superior tracking, and with it, natural and engaging user interfaces and applications.
