|
We consider any image to be a collection of a finite number of discrete features. This is a novel approach to images - until now they were always thought of as continuous. Of course, when we talk
about a discrete set of elements, we don't mean pixels. Pixels represent discreteness which has nothing to do with the content
of the image; it simply is a property of the display or image acquisition device. If the same object is photographed by two digital cameras with different resolutions, different orientations in space, etc., we get two very different pixel-based descriptions. It would be useless to try to compare them pixel by pixel - the pixel values are not going to match. Yet to the human eye it is
immediately
obvious that the two photographs have the same content. Why? Because, we believe, our brain records an image not as an array of pixels, but as a discrete set of features - like the tip of the nose, a fork on a branch, an end of a stick, etc. The existence of these is not affected by the resolution of the camera or its orientation.
If we can fully describe an image as a discrete collection of features, we can easily solve the image recognition problem. Unlike pixel values, features have to match. If they do, we declare that the two images have the
same content. Of course, if the first image contains a single face and the second one is a group photo with the same face and some others, only some of the features in the two descriptions are going to match. Still, something has
to match.
Is it possible to determine if two images have the same content without describing them as discrete sets of features? Of course. A purely continuous approach relies on the fact that if two images have the same
content, they are going to match pixel for pixel after one of them is translated, rotated, scaled, distorted and have its brightness and contrast adjusted in a certain specific way. Of course, you don't know in advance the
transformation (if any) that is going to make these two images identical. Finding this transformation is one of the tasks of the image recognition algorithm. A considerable number of such algorithms have been developed, mostly
based on the Neural Networks approach.
There is one problem, however, inherent in any such algorithm. Let's suppose we want to develop a Web search engine based on image content. The user submits, let's say, a picture of a
face. The engine has to find all the images on the Web containing this same face. The problem is that a meaningful image contains hundreds of thousands of pixel values, and there are millions of images on the Web. No matter how
efficient the image comparison algorithm is, it has to perform at least a few operations per pixel (realistically, more than just a few), so we are talking about many trillions of operations - every time
somebody requests a search.
Let us compare this with text-based search engines. There are also millions of text strings, but each string is a small discrete and ordered set of characters. If the first character does not
match (and roughly speaking, in 25 cases out of 26 it does not), the whole string does not match. Thus on average it takes just a little more than one operation to compare two text strings. This is very manageable even with
millions and hundreds of millions of text strings to compare to. Image-based Web search can only become a reality if we create for images a situation similar to the one that exists for text strings: that is, we convert pixel-based
image descriptions into the feature-based descriptions mentioned above. The latter will be small and discrete, easy to compare. Of course, this conversion will require accessing all the images on the Web in their regular
pixel-based form and running a complicated algorithm on them - and this, as mentioned above, is slow. However, this needs to be done only once, not every time somebody wants to do a search.
The same is true in any situation where an image has to be checked against a large database of images. Dealing with all these images on the pixel level every time a search is needed is horrendously inefficient.
How do
we know that it is possible to fully describe an image as a discrete collection of features? That a function can generally be described by a finite set of parameters is not the sort of a statement that you can encounter in your
calculus textbook. In fact, you will encounter exactly the opposite statement there. And yet in our everyday lives we always describe images as finite sets of features. We perceive the letter 'S' as having two bends, while to a
mathematician it is a curve - an infinite set of points - whos curvature changes continuously along it. We can verbally describe the shape of an unknown object by enumerating its bumps, dents, and other such features. It will only
require a finite number of words and it will be useful when trying to describe this object to another person who has never seen it. Is it possible to create a mathematical algorithm that converts pixel-based description into such a
description enumerating features? We were able to make a considerable progress towards this goal in our Flo Control Project. |