Computer vision: recognizing clothes in a photo with a mobile application


Not long ago we decided to build a project that searches various online stores for clothes based on a photo. The idea is simple: the user uploads an image, selects the region of interest (a t-shirt, trousers, etc.), optionally specifies additional parameters (gender, size, etc.), and the system searches our catalogs for similar clothes, sorting the results by their degree of similarity to the original.

Not that the idea is original, but nobody has implemented it well. For example, one project has been on the market for several years, yet the relevance of its search is very low: matching is done mostly by the dominant color of the image. It can find a red dress, but not a dress of a particular cut or with a particular print. Notably, that project's audience is not growing, and in our opinion the low relevance of its search is the reason. After all, you can already filter a shop's catalog by the color of an item, so such a search is no different from an ordinary search by color.

In 2013 another project was launched. Its search is somewhat better than the previous one's: the emphasis is still on color, plus a few options picked manually from a special catalog (a short dress, a long dress, a medium-length dress). Having run into the difficulties of visual search, this project pivoted toward social networks, where fashion-conscious users can share their "looks", turning from "a Shazam for clothes" into "an Instagram for fashionistas".

Despite the existing projects in this area, the demand for search by picture, which is very real today, is clearly not satisfied. Our answer is a mobile application (as SnapFashion and Asap54 have done), which matches the trends of the e-commerce market: by various forecasts, the share of mobile sales in the USA may grow from 11% in 2013 to 25-50% by 2017. Such rapid growth of mobile commerce foreshadows a corresponding rise in the popularity of all kinds of shopping-assistant applications. Shops, it seems, will invest in developing and promoting such applications and will actively cooperate with them.

Having analyzed the competitors, we decided to explore this area ourselves, and so the Sarafan project was born.

We wanted a bright corporate style from the start, and worked through a number of options:


In the end, the brightly colored style was approved.


We started with an iOS client (for iPhone). The design is based on ink smears, the application talks to a REST service, and the main screen offers a choice: take a picture or pick one from the gallery.


This turned out to be the simplest part of the whole project. On the backend front, things were far less rosy. Here is the story of our search: what we tried and where we ended up.


Visual search

We tested several approaches, but none of them produced highly relevant search results. In this article we describe what we tried, what worked, and how it behaved on different data. We hope this experience will be useful to readers.

So, the main problem of this kind of search is the so-called semantic gap: the difference between how a human and a machine judge the similarity of images (in our case, images of clothes). For example, suppose a person wants to find a black t-shirt with short sleeves:


A person will easily point to the second image in the list below. The machine would most likely select image number 3, the women's t-shirt, which, to be fair, has a very similar shape and an almost identical color distribution.


A person expects the search results to contain items of the same type (a t-shirt, an undershirt, jeans...), the same style, and the same color distribution (color, texture, or print). In practice, satisfying all three conditions at once is hard.

Let's begin with the simplest condition: similar color. To compare the colors of two images, the method of color histograms is most often used. The idea is as follows: the whole set of colors is divided into disjoint subsets that together cover it completely; for each image, a histogram is built recording the share of each color subset in the image's palette; and histograms are compared using some notion of distance between them. There are many ways to form the color subsets; in our case it made sense to derive them from our own image catalog. However, even this simple comparison requires two conditions to hold:

— catalog pictures must contain a single item on an easily separable background;

— in the user's photo, we must separate the background from the clothing region of interest.
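As a sketch of the histogram scheme just described (the choice of 8 bins per channel and the L1 distance here are illustrative, not what we shipped):

```python
import numpy as np

def colour_histogram(img, bins=8):
    """Joint RGB histogram: each channel quantised into `bins` levels,
    normalised so the histogram stores the *share* of each colour subset."""
    q = (img.astype(int) // (256 // bins)).reshape(-1, 3)
    idx = (q[:, 0] * bins + q[:, 1]) * bins + q[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def histogram_distance(h1, h2):
    """L1 distance between colour distributions: 0 = identical, 2 = disjoint."""
    return float(np.abs(h1 - h2).sum())

# Two reddish swatches should be closer to each other than to a blue one.
red1 = np.full((64, 64, 3), (200, 10, 10), dtype=np.uint8)
red2 = np.full((64, 64, 3), (220, 20, 20), dtype=np.uint8)
blue = np.full((64, 64, 3), (10, 10, 200), dtype=np.uint8)

d_similar = histogram_distance(colour_histogram(red1), colour_histogram(red2))
d_different = histogram_distance(colour_histogram(red1), colour_histogram(blue))
print(d_similar < d_different)  # True
```

Here the "color subsets" are just a uniform quantization of RGB; deriving them from the catalog, as mentioned above, would replace the fixed grid with data-driven bins.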

In practice the first condition was never satisfied; we will describe our attempts to work around it a bit later. The second condition is easier to meet, because the region is selected with the user's active participation, and there are quite effective background removal algorithms, for example GrabCut. We started from the assumption that the region of interest lies closer to the center of the area the user circles than to its border, and that the background around it is fairly uniform in color. Using GrabCut plus a few heuristics, we obtained an algorithm that works correctly in most cases.

Now a few words about extracting the region of interest from catalog images. The first idea that comes to mind is to segment the image by color, for example with the watershed algorithm.

However, the catalog image of a red skirt may come in several variants:


If in the first two cases segmenting the necessary region is easy enough, in the third we would also pick up the jacket. And this method fails outright in harder cases, for example:


It should be noted that the problem of image segmentation is far from solved: no method can extract the necessary region as a single fragment the way a human does:


Instead, the image is split into superpixels; the normalized cuts (n-cuts) and turbopixels algorithms are worth mentioning here.


Then some combination of superpixels is used. For example, the problem of finding and localizing an object is reduced to finding the combination of superpixels belonging to the object, rather than finding its bounding rectangle.
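Neither n-cuts nor turbopixels ships with stock OpenCV, so as an easy-to-run stand-in here is an over-segmentation with SLIC from scikit-image, which likewise produces superpixels:

```python
import numpy as np
from skimage.segmentation import slic  # SLIC stands in for n-cuts/turbopixels here

# A red patch on a dark background.
img = np.zeros((60, 60, 3), dtype=float)
img[15:45, 15:45] = (0.8, 0.1, 0.1)

# Over-segment into roughly 20 superpixels; `labels` assigns an integer
# superpixel id to every pixel. Object search then picks the subset of ids
# that together cover the item.
labels = slic(img, n_segments=20, compactness=10, start_label=0)
print(labels.shape, labels.max() + 1)  # label map size, number of superpixels
```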


So, the problem of labeling images reduces to finding the combination of superpixels that corresponds to an item of a given type, and that is a machine learning problem. The idea is as follows: take a set of manually labeled images, train a classifier on them, classify various regions of the segmented image, and finally take the region with the maximum response as the one we are interested in. But first we must decide how to compare regions, since a simple color comparison clearly won't do; we need to compare shape, or some overall impression of the scene. At the time, the gist descriptor seemed suitable for this.

The gist descriptor is essentially a histogram of the distribution of edges in the image: the image is divided by a grid into equal cells; in each cell, the distribution of edges of different orientations and scales is computed and discretized; and the resulting n-dimensional vectors can then be compared.

We built a training set and manually labeled images of about 10 classes. Unfortunately, even under cross-validation, no choice of algorithm parameters pushed classification accuracy above 50%. Partly this is because a shirt, judged by its edge distribution, differs little from a jacket; partly because the training set was too small (gist is usually applied to search in very large image collections); and perhaps partly because the descriptor simply does not fit this task at all.

Another way to compare images is to compare local features. The idea: detect salient points in each image (local features), describe the neighborhoods of those points somehow, and count how many features of the two images match. We used SIFT as the descriptor. But matching local features also gave poor results, mostly because the method is designed to match images of the same scene shot from different angles.

Thus, we failed to label the catalog images. Searching over unlabeled images with the methods described above sometimes returned roughly similar results, but in most cases the results had nothing in common with the query from a human point of view.

Once it became clear that we couldn't label the catalog, we tried to build a classifier for the user's images instead, i.e. to automatically determine the type of item the user wants to find (t-shirt, jeans, etc.). The main problem was the lack of a training set. Catalog images would not do: first, they were unlabeled, and second, they showed items from a rather limited set of viewpoints, with no guarantee that users' photos would match them. To obtain a more complete set of viewpoints for an item, we filmed a person wearing it on video, then cut the item out of each frame and built the training set from the frames. For this to work, the item had to be high-contrast and easily separable from the background.


Unfortunately, we abandoned this approach quickly once we realized how much video would have to be shot and processed to cover every possible style of clothing.

Computer vision is a very broad field, and (for now) we have not reached the highly relevant search we want. We don't intend to sidestep the problem by piling on auxiliary features; we will keep working toward a real search tool. We would be glad to hear any advice and comments.

