
Starting from scratch, how to embed computer vision techniques into your project #2

Blog: Smile - Le blog des consultants

Part 2 — Training data set

Ok, let’s do it

A lot of Screwdrivers

First, we need a significant number of images of screwdrivers (and hammers, paintbrushes, etc.). And to teach our student correctly, we need to show it where the tools are in each picture.

To do so, we must draw a bounding box around each object of interest on each image and assign it an object label. This bounding box and label information is stored as metadata for each image file (the “annotation”). This task is done with a dedicated tool (an “annotation tool”).

As we are talking about a significant number of images, this process is highly time-consuming. We found that accuracy became sufficient from about 1000 images per object category when using transfer learning on the selected models. This means we need at least (1000 x number of things to recognize) images.

Retrieving the images

To retrieve real-life images that contain the objects, two kinds of data sources are obvious: common data sets and public search engines.

Common data sets are handy as they come with images that are already annotated. We only need to select the kinds of objects we want. And that’s where we reach the limits of these data sets: they do not contain all the objects we are interested in, and when they do, the number of available images can be too small.

On the other hand, public search engines, like Google, contain a massive number of images. It is easy to find the objects we want, in the quantity we want. But image quality is not guaranteed (content, format, etc.), and the images are not annotated.


So it is easier to start with the common data sets, to check if the objects we want are available or not, before crawling the public search engines. We did that with these sources :

Each of these databases has its own data format and constraints. So the first task is to build some scripts to extract the information we need, collect the relevant images, and then merge all the metadata into one single database, using a common format.

A quick check on the website of each data source can also help to find out whether the objects we want exist in that source: in our case, the Coco data source was eliminated quickly as it does not contain any of the objects we are looking for. So we focused on the three remaining sources and finally got these numbers :

As a common output format, we chose the Coco annotation format. Here is a good article explaining in detail how to use it.
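To make the idea of a common format more concrete, here is a minimal sketch of how per-source records could be merged into one Coco-style dictionary. The input record layout (`file_name`, `width`, `height`, `boxes`) and the function name `to_coco` are assumptions made for this illustration; they are not taken from the scripts in the repository mentioned below.

```python
import json


def to_coco(records, category_names):
    """Merge simple per-image records into one Coco-style dictionary.

    Each record is assumed (hypothetical layout) to look like:
    {"file_name": str, "width": int, "height": int,
     "boxes": [{"label": str, "x": float, "y": float, "w": float, "h": float}]}
    """
    categories = [{"id": i + 1, "name": name} for i, name in enumerate(category_names)]
    category_id = {c["name"]: c["id"] for c in categories}

    images, annotations = [], []
    for image_id, record in enumerate(records, start=1):
        images.append({
            "id": image_id,
            "file_name": record["file_name"],
            "width": record["width"],
            "height": record["height"],
        })
        for box in record["boxes"]:
            if box["label"] not in category_id:
                continue  # ignore objects we are not interested in
            annotations.append({
                "id": len(annotations) + 1,
                "image_id": image_id,
                "category_id": category_id[box["label"]],
                # Coco stores boxes as [x, y, width, height] in pixels
                "bbox": [box["x"], box["y"], box["w"], box["h"]],
                "area": box["w"] * box["h"],
                "iscrowd": 0,
            })
    return {"images": images, "annotations": annotations, "categories": categories}


# Made-up sample record, only to show the shape of the result
sample_records = [{
    "file_name": "toolbox.jpg", "width": 640, "height": 480,
    "boxes": [
        {"label": "hammer", "x": 10, "y": 20, "w": 100, "h": 50},
        {"label": "screwdriver", "x": 200, "y": 40, "w": 30, "h": 150},
        {"label": "cat", "x": 0, "y": 0, "w": 5, "h": 5},
    ],
}]
coco = to_coco(sample_records, ["hammer", "screwdriver"])
print(json.dumps(coco["annotations"], indent=2))
```

Whatever the source format, the scripts only need to produce this intermediate record shape; the merge into the final data set then stays the same.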

The scripts we used to collect the meta-data, images and to build the final dataset are available here :

GitHub – smileinnovation/visual-search-dataset

As a conclusion for this step: we could start a learning process with the first version of the training data set. Still, as the quantity of the available images is small, we cannot expect any good accuracy from a trained model. We need more photos.

Second round: getting more images

To get more images, we now have to focus on public search engines. A simple tool can help us crawl several search engines with the same query: searx. You can run this meta-search engine locally (e.g., in a Docker container) and extract a uniform search result (a JSON file) from it. Then, with a simple script written in your favorite language, you can download the images from the URLs included in this result.
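As an illustration, such a download script could look like the sketch below. It assumes a searx instance reachable at `http://localhost:8888` with JSON output enabled, and it assumes that image results expose their URL in an `img_src` field; the helper names and the file naming scheme are hypothetical. Check these points against your own instance before relying on it.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

SEARX_URL = "http://localhost:8888/search"  # assumption: local searx instance


def build_query_url(query, page=1, base_url=SEARX_URL):
    """Build the search URL asking searx for image results as JSON."""
    params = {"q": query, "categories": "images", "format": "json", "pageno": page}
    return base_url + "?" + urllib.parse.urlencode(params)


def extract_image_urls(search_result):
    """Collect the distinct image URLs from a parsed searx JSON result."""
    urls = []
    for item in search_result.get("results", []):
        url = item.get("img_src")  # assumed field name for image results
        if url and url not in urls:
            urls.append(url)
    return urls


def download_images(query, target_dir, page=1):
    """Run one search and save every reachable image into target_dir."""
    with urllib.request.urlopen(build_query_url(query, page), timeout=30) as response:
        result = json.load(response)

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(extract_image_urls(result)):
        # The .jpg extension is a simplification; real files may be PNG, WebP, etc.
        destination = target / f"{query}_{page}_{index}.jpg"
        try:
            urllib.request.urlretrieve(url, destination)
            saved.append(destination)
        except (OSError, ValueError):
            continue  # skip broken, unreachable, or malformed links
    return saved


# Usage, once the searx container is running:
# download_images("screwdriver", "images/screwdriver", page=1)
```

Downloaded files still need a quick check (format, size, actual content), since nothing guarantees that a search result really shows the object we asked for.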

The outcome: by targeting Google Images and Bing Images and searching for the objects of this use case, we grabbed the following numbers of images:

driller: 1764
hammer: 1425
paintbrush: 546
screwdriver: 2156
wheelbarrow: 2737

If we add the images previously extracted from the common data sets, we now have at least 1000 images of paintbrushes and at least 2000 images for each of the other objects. We say “at least” because each image can contain more than one object.

By doing so, we can quickly grow the number of images in our data set. But these additional images are not annotated: this means we will need extra work to produce the object annotations required to include these images in the training data set.

Annotation job

Annotation is a critical step for machine learning, as the quality of the annotations directly impacts learning accuracy: incorrect annotations can decrease the model accuracy.

And as this job is time-consuming, it is essential to control and streamline it: you need a tool to standardize this human activity.

Many annotation tools are available, but to choose one wisely we also need to consider the human workload involved in this task.

We have more than 8600 images to review. With a minimum of 10 seconds per annotation task, that adds up to roughly 24 hours of continuous human work at the very least.

This means we have to schedule this workload and find people capable of doing this task precisely, to avoid incorrect annotations.
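The workload estimate can be reproduced from the image counts collected from the search engines. This is only a back-of-the-envelope sketch: the 10 seconds per image is the minimum duration quoted above, and the variable names are ours.

```python
# Images collected per object from the search engines (figures from this article)
images_per_object = {
    "driller": 1764,
    "hammer": 1425,
    "paintbrush": 546,
    "screwdriver": 2156,
    "wheelbarrow": 2737,
}
SECONDS_PER_IMAGE = 10  # minimum annotation time per image

total_images = sum(images_per_object.values())
total_hours = total_images * SECONDS_PER_IMAGE / 3600

print(f"{total_images} images, about {total_hours:.1f} hours of continuous work")
# prints "8628 images, about 24.0 hours of continuous work"
```

Images containing several objects take longer than the minimum, so the real figure can only be higher.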

So here we have two possibilities :

Regarding the annotation tooling, we also have to consider the annotation format produced by the tool: it must be easy for other software and scripts to consume when running the training.

Our choice went to AWS SageMaker GroundTruth for this annotation job :

We can note that GroundTruth also allows you to pick up a 3rd party vendor in the AWS Marketplace, specialized in machine learning annotation. So you are free to choose within this list a preferred vendor. This option is essential when :

To set up a GroundTruth labeling job, you need to go through a few steps :

GroundTruth labeling job setup

GroundTruth labeling job pricing is based on the number of images included in the job. The price per image decreases as the volume grows.

If you want to use a public workforce, Mechanical Turk pricing is based on the number of objects to annotate. If there are three hammers in an image, this counts for three objects.

You can also ask for several reviews per object to increase annotation accuracy, but this of course increases the labor cost. By default, this setting is 5, meaning that five distinct workers perform the same task on each dataset object. Mechanical Turk pricing starts at $0.036 per object (for an 8-second job), but you can increase the budget for more complex tasks that require more time.

So yes, we chose to give Mechanical Turk a try, with a minimal budget. Workers had to go through this interface to annotate each image (drawing a bounding box for each object and applying the correct label).
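To get a feel for how these two parameters drive the labor cost, here is a small sketch that follows the pricing rule as described above: price per object multiplied by the number of workers per object. The function name and the example volume are hypothetical, and the result excludes the GroundTruth per-image fee, whose rates are not given here.

```python
def workforce_cost(num_objects, price_per_object=0.036, workers_per_object=5):
    """Estimate the public workforce cost in dollars.

    Assumes each object is annotated by `workers_per_object` distinct workers,
    each paid `price_per_object`. Platform fees are not included.
    """
    return num_objects * price_per_object * workers_per_object


# Hypothetical volume of 1000 objects, with 5 workers and then 1 worker per object
print(f"${workforce_cost(1000):.2f}")
print(f"${workforce_cost(1000, workers_per_object=1):.2f}")
```

Lowering the number of workers per object is the most direct way to reduce the bill, at the price of less cross-checking between annotators.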

AWS GroundTruth labeling tool

It took about ten days for the Mechanical Turk workforce to complete the review of all images. Here is a sample of the annotation result.

GroundTruth annotations job results

And finally, we have a data set ready to be used for training :

Total images: 11669
Total bounding boxes: 21363
Bounding boxes per object:
- driller: 3236
- hammer: 3217
- paintbrush: 2413
- screwdriver: 8591
- wheelbarrow: 3906

All annotations are available, in GroundTruth format, in our S3 bucket as an outcome of the labeling job.
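For readers who want to reproduce this kind of statistic, the sketch below counts bounding boxes per label from a GroundTruth output manifest (one JSON object per line). It assumes the usual layout of an object detection job, where boxes sit under the label attribute name and the class names under `<label attribute>-metadata`; the attribute name `tools-labeling` and the sample line are made up for the example.

```python
import json
from collections import Counter


def count_boxes(manifest_lines, label_attribute):
    """Count bounding boxes per class name in a GroundTruth output manifest."""
    counts = Counter()
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        labels = entry.get(label_attribute)
        metadata = entry.get(f"{label_attribute}-metadata", {})
        if labels is None:
            continue  # image without a result for this labeling job
        class_map = metadata.get("class-map", {})
        for box in labels.get("annotations", []):
            class_id = str(box["class_id"])
            # Fall back to the numeric id if the class name is missing
            counts[class_map.get(class_id, class_id)] += 1
    return counts


# Made-up manifest line, only to show the expected structure
sample_manifest = [
    json.dumps({
        "source-ref": "s3://example-bucket/images/0001.jpg",
        "tools-labeling": {
            "image_size": [{"width": 640, "height": 480, "depth": 3}],
            "annotations": [
                {"class_id": 3, "left": 10, "top": 20, "width": 40, "height": 200},
                {"class_id": 3, "left": 80, "top": 25, "width": 35, "height": 190},
                {"class_id": 1, "left": 300, "top": 100, "width": 150, "height": 90},
            ],
        },
        "tools-labeling-metadata": {
            "class-map": {"1": "hammer", "3": "screwdriver"},
            "type": "groundtruth/object-detection",
        },
    })
]

print(count_boxes(sample_manifest, "tools-labeling"))
```

With a real job, the lines would come from the output manifest file written to the S3 bucket, read with a plain `open()` after download.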

The screwdriver count is much higher than the others because many screwdriver images contain several objects, as in this sample image :

A lot of screwdrivers

This could be a problem, as this object is over-represented compared to the others: it creates a bias in our training data, since the model will be trained on far more screwdrivers than on any other object.
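The imbalance can be quantified directly from the bounding box counts listed above. The snippet below is a simple sketch that prints the share of each object and its ratio to the least represented one; it does not correct anything, it only measures.

```python
# Bounding boxes per object in the final data set (figures from this article)
boxes_per_object = {
    "driller": 3236,
    "hammer": 3217,
    "paintbrush": 2413,
    "screwdriver": 8591,
    "wheelbarrow": 3906,
}

total_boxes = sum(boxes_per_object.values())
smallest = min(boxes_per_object.values())
shares = {name: count / total_boxes for name, count in boxes_per_object.items()}

# Most represented objects first
for name, count in sorted(boxes_per_object.items(), key=lambda item: -item[1]):
    ratio = count / smallest
    print(f"{name:12s} {count:5d} boxes  {shares[name]:6.1%} of total  x{ratio:.2f}")
```

Screwdrivers account for roughly 40% of all boxes, about 3.6 times as many as paintbrushes, while the four other objects stay within a much narrower range.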

We will have to fine-tune this data set before starting the training, and we will investigate that in our next chapter.

Coming next :

Smile is the proud editor of ElasticSuite, a great Magento open-source extension, with more than a million downloads on Github and trusted by more than 1500 top retailers worldwide. It’s the leading solution for intelligent search and merchandising on Magento.

As part of this product road map, we have to test and experiment with new features.

That’s all, folks!
Did you enjoy it? If so, don’t hesitate to 👏 our article or subscribe to our Innovation watch newsletter! You can follow Smile on Facebook, Twitter, and YouTube.

Starting from scratch, how to embed computer vision techniques into your project #2 was originally published in Smile Innovation on Medium, where people are continuing the conversation by highlighting and responding to this story.
