Mobile Monitoring Solutions


Podcast: Emmanuel Ameisen, Head of AI at Insight, on Building a Semantic Search System for Images

MMS Founder

Article originally posted on InfoQ. Visit InfoQ

On this week’s podcast, Wes Reisz talks to Emmanuel Ameisen, head of AI for Insight Data Science, about building a semantic search system for images using convolutional neural networks and word embeddings, how you can build on the work done by companies like Google, and where the gaps are that require training your own models. The podcast wraps up with a discussion of how you get something like this into production.

Key Takeaways

  • A common use case is the ability to search for similar things – finding another pair of sunglasses like these, finding a cat that looks like this picture, or even a tool like Google’s Smart Reply – all of which can be considered broadly the domain of semantic search.
  • For image classification you generally want a convolutional neural network. To generate embeddings, you typically take a model pre-trained on a public data set like ImageNet, run it up to the penultimate layer, and store the values of the activations.
  • From here the idea is to mix image embeddings with word embeddings. An embedding, whether for a word or an image, is just a vector that represents a thing. There are many approaches to getting vectors for words, but the one that started it all is word2vec.
  • For both image embeddings and word embeddings you can typically use pre-trained models, meaning that you only need to train the final step of bringing the two models together.
  • Before deploying to production it is important to validate the model against biases such as sexism, typically by bringing in outside people to carry out a thorough audit.




Show Notes

Tell us about what Insight is?

  • 2:10 Insight started in 2012 and it is a free fellowship for both academics and engineers to transition to roles in data.
  • 2:20 It started with data science, and expanded to data engineering, health data and more recently artificial intelligence which is a program that I lead.
  • 2:30 We now have even more programmes, like data project management and DevOps; the idea is that they are free seven-week programmes for qualified people to come through.
  • 2:40 They build strong pieces of software and show these to companies that are looking to hire.

Do the attendees have some background already?

  • 3:05 It would be very ambitious to teach all of data science in seven weeks.
  • 3:10 What happens is that people from both industry and universities in the US, who have a strong background in software engineering, statistics, machine learning and so on, come to the programme.
  • 3:25 They come to Insight and put something together that they can demo in seven weeks.

Tell me about your blog post on how to create an image service from scratch?

  • 4:10 The blog post “Building an image service from scratch” was written to answer some standard questions.
  • 4:15 A lot of companies are interested in both searching for things in text and images.
  • 4:35 This is an idea that many different people have, and they propose different solutions to it.
  • 4:50 There is a lot of similarity with semantic search.
  • 5:05 Google’s smart reply is similar; you get a suggestion of what to reply.
  • 5:15 There are many different applications, and it wasn’t clear which methods should handle them, so I decided to write about it.

So semantic search is being able to search for pictures of roads?

  • 5:35 It’s going slightly beyond just matching words.
  • 5:45 You might get similar images, like those of paths, because the concepts are similar.
  • 5:50 If you just search on words or keywords, then you might not get those back.
  • 5:55 The question is how you get a system to make those associations.

How do you start reasoning about a problem like this?

  • 6:05 This is a relatively new area.
  • 6:10 There’s a lot of traditional ways you could go about it.
  • 6:20 You could say that you wanted to index roads, and if you could detect lines in images that were parallel, then maybe you had a road.
  • 6:30 It turns out that these tend to be brittle and don’t work well in practice.
  • 6:30 What has ended up being a method that works is using deep learning classification techniques.
  • 6:40 What you hear about deep learning is that it learns representations – so instead of you having to come up with a collection of rules to describe a road, it can learn it for you.
  • 7:00 That makes the problem much simpler, because you just need to find the right data.

What are some of the characteristics that work well for a convolutional network versus a rules engine?

  • 7:35 The way I think about it is to try to write down a set of steps – if you can, then you might not need machine learning.
  • 7:50 If you want to find images with a blue square then you’d be crazy to use a neural net to search for them.
  • 8:00 There are more complex things – like what a chair or a cat is – can you find a set of rules to identify them?
  • 8:20 It is easy to find counter examples to things that break a rules based system.
  • 8:30 If you can find a simple set of rules, then go for it – or at least start with that.
  • 8:35 For example, if you’re trying to detect trees, then a very good rule is counting the number of green pixels.
  • 8:50 In cities, there aren’t many trees – so if you count the number of green pixels, you’re doing a pretty good job.
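That heuristic can be sketched in a few lines. A hypothetical baseline (the function name and threshold are illustrative, not from the podcast) that flags an image as likely containing trees when enough of its pixels are green-dominant:

```python
def looks_like_trees(pixels, threshold=0.3):
    """Rules-based baseline: guess 'tree' when the fraction of
    green-dominant pixels exceeds a threshold.

    `pixels` is a flat list of (r, g, b) tuples in the 0-255 range.
    """
    if not pixels:
        return False
    green = sum(1 for r, g, b in pixels if g > r and g > b)
    return green / len(pixels) >= threshold

forest = [(34, 139, 34)] * 80 + [(120, 120, 120)] * 20  # mostly green
street = [(120, 120, 120)] * 95 + [(34, 139, 34)] * 5   # mostly grey
print(looks_like_trees(forest))  # True
print(looks_like_trees(street))  # False
```

As the podcast notes, such a rule is brittle – a green car would fool it – but it makes a cheap starting point to beat.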

How do you choose between RNN or CNN or deep learning?

  • 9:20 A lot of it is having seen these problems before and seeing what works for each.
  • 9:25 For images, you generally want to use a CNN: they’re built for images, and their main property is that they are composed of a bunch of filters that are invariant to translation.
  • 9:40 If they learn to detect a tree at the top left of an image, then they’ll also be able to find one at the bottom right of the image.
  • 9:50 That’s a useful property for image processing in general – you wouldn’t want to need a set of images with all objects in all potential positions.
  • 10:05 They also have multiple layers, so they have the ability to keep going higher in abstraction.
  • 10:15 RNNs are suited to data that comes in multiple steps, so they might not apply immediately to images.
  • 10:30 If you want to process images, CNNs are the way to do that.

What’s the next layer after classification?

  • 11:00 Taking a step back: with most of those CNN models, you’ll do transfer learning – you’ll take a pre-existing trained model and then start from there.
  • 11:15 In the data set, you might have 1000 different classes of cats, dogs, planes – and the way these classes are represented is as an array of 1000 numbers.
  • 11:30 The way that the system learns is that it tries to put a high value at the right index.
  • 11:40 If the model guesses one type of cat, but the wrong one, then the penalty given is the same as if it had classified a table as a human.
  • 11:45 If you were asked to identify a cat, but you got the wrong breed of cat that wouldn’t be a big issue, but if you said it was a chair I would be much more worried.
  • 11:55 The idea is that if you just take a model that does classification, it hasn’t been told these differences.
  • 12:05 What ends up happening is that to do that classification efficiently, the model has to compress the information in that image.
  • 12:15 Right before the layer that tells you what kind of object it is, there is a dense layer which has almost all of that information.
  • 12:25 It turns out that if you take the prior layer and use those numbers to find images with similar numbers, then they will be similar images.
  • 12:45 A lot of similarity searches are built using the approach of taking the penultimate layer of a CNN to find similarities.
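A minimal sketch of that embedding extraction in Keras: cut a stock VGG16 at its penultimate dense layer (`fc2`) and use its activations as the image vector. This is a sketch, not the blog post’s exact code; `weights=None` keeps it download-free, whereas real use would pass `weights='imagenet'`:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Build the classifier and cut it at the penultimate dense layer;
# its activations serve as the image embedding.
base = VGG16(weights=None)  # use weights='imagenet' in practice
embedder = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

image = np.random.rand(1, 224, 224, 3).astype("float32")
embedding = embedder.predict(image, verbose=0)
print(embedding.shape)  # (1, 4096)
```

Storing these vectors for every image in the catalogue is what makes the later similarity search possible.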

Can you explain what happens in that situation?

  • 13:00 What CNNs end up doing is taking an image, and the network has learned a set of weights, and those weights are used to multiply the pixel values a bunch of times.
  • 13:10 You then have functions applied to it between layers, and by the time you get to the end of the layers then you have a set of values.
  • 13:25 The layer right before – in the case of VGG16, the prior layer is of size 4096 – is the one from which you’ll do the last step of classification.
  • 13:45 That layer has all the information that the final layer will use for classification.
  • 13:50 Almost nothing has been thrown away at this point – if you take an image and put it through the network, you’ll end up with 4096 numbers.
  • 14:00 Images with similar content will have similar numbers in this layer.
  • 14:10 It’s an approach which will take you pretty far.

The blog post took you into word embeddings?

  • 14:25 I explored an approach from a paper titled ‘DeViSE’ – the idea is to mix word embeddings and images.
  • 14:35 An embedding is just a vector that represents a thing – the last set of numbers in the image example is an image embedding.
  • 14:55 There are many approaches to getting word embeddings; the approach that started it all is word2vec, in 2013.
  • 15:00 The idea was: for each word in the English language (based on crawling news and Wikipedia), can we create a vector of size 300 that represents the semantic meaning of the word.
  • 15:20 Numbers will be similar to each other, male terms will be similar to each other, and female terms will be similar to each other.
  • 15:35 Countries will cluster around certain areas, capital cities of those clusters will be in a similar area, and so on.
  • 15:50 Each row of the matrix will be a word, and for each row, you’ll have a vector of size 300.
  • 16:10 Each word will be represented by a set of 300 numbers.
  • 16:15 You can download a set of pre-trained word vectors for any word you could need.

So you don’t have to learn the vectors yourself?

  • 16:35 It makes it much faster, because you download this big binary, and you have an interface which allows you to get the vector for any word that appears in Wikipedia.
  • 16:55 A lot of the time you’d like to know if those words are similar, so you can detect similar phrases.
  • 17:10 By default you might not know these words are similar, but they will have similar vectors.
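The “similar vectors” check is plain vector arithmetic: cosine similarity. A toy sketch with made-up three-dimensional vectors standing in for real 300-dimensional pre-trained ones (the words and values are illustrative):

```python
import math

# Toy stand-ins for pre-trained 300-dimensional word vectors;
# in practice these come from a downloaded word2vec or GloVe binary.
vectors = {
    "cat":    [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "road":   [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product of the two unit-normalised vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Semantically related words end up with more similar vectors.
print(cosine(vectors["cat"], vectors["kitten"]) > cosine(vectors["cat"], vectors["road"]))  # True
```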

So after a CNN you use the word embeddings to find similarities?

  • 17:55 So your image detection produces a vector of size 4096 which describes the image, and the same goes for words, with a vector of 300 numbers.
  • 18:10 If you want to search for more images or text, you can use those vectors to search for nearby images or words.
  • 18:15 The issue is how to bring both together.
  • 18:30 What you do is to first get them to the same dimension, and secondly ground one in the other.
  • 18:50 You need to do something so that images of cats have a similar vector to the word cat.
  • 19:00 You have to start training your own model on your own data here, instead of using pre-trained models.
  • 19:15 If you want to build a search engine where you want to search for images based on a description, or to have a description of an image, you need to train a model.
  • 19:25 What we do is take the image, and then train that to predict the word vector instead of a classification model.
  • 19:50 The idea is you predict a word vector from the image – so you train the network to predict one from the other.
  • 20:25 It generalises well to other things.
  • 20:30 If you were a sunglasses manufacturer, you would do that process with a bunch of images of sunglasses, and then you’d get a really powerful engine for that.
  • 21:10 The tricky part is going from the business problem to something that is more engineering.
  • 21:35 There are many other ways you could think of doing this – such as taking the text and feeding it to a model that outputs a set of images.
  • 21:50 The reason I didn’t do that was that Google published a paper about how they tried to do that for smart reply and it didn’t work at all.
  • 22:25 If you’re training vector models, you can do a lot of this before the user types a query, and then store the results – after which it’s just database access.
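As a hedged numpy sketch of that joint step (the dimensions, data, and plain squared-error objective are illustrative simplifications of the DeViSE-style training described above): learn a projection W so that image embeddings land near their matching word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 8-dim "image embeddings" and 3-dim "word vectors",
# instead of the real 4096 and 300 dimensions.
n, d_img, d_word = 50, 8, 3
images = rng.normal(size=(n, d_img))
true_map = rng.normal(size=(d_img, d_word))
words = images @ true_map            # each image's target word vector

# Gradient descent on squared error: learn W with images @ W ~= words.
W = np.zeros((d_img, d_word))
lr = 0.1
for _ in range(2000):
    residual = images @ W - words
    W -= lr * (images.T @ residual) / n  # gradient step (up to a constant factor)

loss = np.mean((images @ W - words) ** 2)
print(loss < 1e-3)  # True: images now project into word-vector space
```

With a trained projection in place, a text query’s word vector can be compared directly against the projected image embeddings.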

How do you decide where to draw the line for off-line training and processing?

  • 23:00 First, you need to think about the performance that you need – can you architect a way out of that problem?
  • 23:15 If you’re building a product page, and you want to suggest similar products, you could design a site where it shows one, and if you click on it, good.
  • 23:25 If you redesign your site so it shows five at a time, the problem becomes five times easier.
  • 23:30 The first question depends quite a lot on what you can get away with.
  • 23:35 The tradeoff is usually pretty clear – to get the best results you’ll do everything at query time and build the most complex model and it will take fifteen seconds to run.
  • 23:50 Often you can get away without doing that – in the Google papers about smart reply – they went from having a model that analysed the whole email to a word embedding that suggested it faster.
  • 24:10 The word embedding reply was much faster but only dropped 0.1% engagement.
  • 24:20 My bias is going to a simple solution, and don’t build it until you need it.

What are some of the things you now have to think about to move it to production?

  • 24:55 As an added step before deploying – you need to perform testing on it.
  • 25:10 You want to test for unintended bias, where you’re recommending horrible things.
  • 25:25 The best way is to have others test it, because they will be able to find these issues much faster than you will.
  • 25:40 To put it in production – it depends on your model.
  • 25:45 In the example of images and word embeddings, you can pre-compute the embeddings – and as long as those models stay the same, those results can stay in your database.
  • 26:00 There are still a few things you have to think of.
  • 26:10 A forward pass on a CNN takes a few tenths of a second on a GPU, or 2-5 seconds on a CPU.
  • 26:15 If users are going to search images you don’t necessarily want them to be waiting for that time.
  • 26:20 You might want to have a GPU on standby that will give sub-second response time.
  • 26:30 If your users are going to type in words then it’s usually pretty fast.

Are there different ways of handling the model the GPU or CPU?

  • 26:55 Most deep learning frameworks will let you use the same code for either.
  • 27:05 The issue comes with deployment – if you need GPUs in production then you’re going to need to provision them.
  • 27:10 If that becomes a need then they are 100 times more expensive than CPU only machines.
  • 27:15 In general you want to pre-compute as much stuff as you can because it will either save GPU cycles or eliminate the need for GPUs entirely.
  • 27:25 The last thing is, even when you have pre-computed these embeddings, you have the last problem which we have glossed over.
  • 27:35 When you type in the search bar, you’ll get an embedding, which is a 300-sized vector.
  • 27:45 You then have to search for the nearest vectors but if you have 10 million items it is an expensive operation.
  • 27:50 You aren’t going to do a pairwise comparison on each of the 10 million items.
  • 28:00 There’s a lot of work being done by teams (explained in the blog post) about how to efficiently query for vectors that are close to other vectors.
  • 28:15 There are approximate methods which are fast, but won’t always give you all the items that are the closest.
  • 28:20 There are exact methods that involve spatial indices.
  • 28:30 That’s one of the things that people often gloss over.
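The exact, brute-force version of that lookup is easy to sketch in numpy; approximate-nearest-neighbour libraries exist precisely because this full pairwise scan stops scaling at millions of items:

```python
import numpy as np

rng = np.random.default_rng(1)

# 10,000 stored 300-dimensional embeddings, L2-normalised so that a
# dot product equals cosine similarity.
index = rng.normal(size=(10_000, 300))
index /= np.linalg.norm(index, axis=1, keepdims=True)

def top_k(query, k=5):
    """Exact search: compare the query against every stored vector."""
    q = query / np.linalg.norm(query)
    sims = index @ q                      # one pairwise pass, O(n * d)
    best = np.argpartition(-sims, k)[:k]  # unordered top-k candidates
    return best[np.argsort(-sims[best])]  # sorted by similarity

# A slightly perturbed copy of item 42 finds item 42 first.
query = index[42] + 0.01 * rng.normal(size=300)
print(top_k(query)[0])  # 42
```

Approximate methods trade the guarantee of this scan for sub-linear query time, which is usually the right trade at search-engine scale.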

How do you choose between the different tools that are out there?

  • 29:05 It’s very likely that whatever my answer is, it will be outdated in three months because it’s a fast moving field.
  • 29:15 What’s interesting is that for most projects, when you’re doing engineering and not research, the thing to do is to start with Keras.
  • 29:30 What I have found is that I’m able to go much faster with Keras abstractions, and then dive deeper.
  • 29:55 The fastest way to move in an ML project is getting to a full pipeline while cutting as many corners as possible, and, when done, looking at what might require diving into TensorFlow or PyTorch.

How do you debug things?

  • 30:35 What I found to keep my sanity is to start by taking an existing codebase and make it work without changing anything.
  • 31:00 Take something that exists, make it run, then start moving one piece at a time and checking your assumptions.
  • 31:10 Another crucial trick: when you do deep learning, you don’t have time to try things on your full data set, because each experiment means doing expensive forward passes.
  • 31:40 Restrict your data set to one image and one word – that way, you can test that your pipeline is connected properly and debug the easy things.

How can you debug the results or avoid bias?

  • 32:05 The first part is debugging the software; the second part is seeing whether you are getting the results you’re expecting.
  • 32:20 You may find your loss function is decreasing during training, but when you try a new image it gives inappropriate results.
  • 32:35 The other issue is one of fairness.
  • 32:40 The hard thing is that you don’t have test coverage for that context – you can’t think of all the positive and negative things that the model might do.
  • 33:10 I don’t think there’s a well accepted answer other than looking at the data all the time – and including other images in your search.
  • 33:30 We all have our individual experiences, and so we have different blind spots.
  • 33:45 At a high level, I would encourage every practitioner to confront the actual data – the thing that distinguishes this from traditional programming is the data.

Emmanuel is talking at QCon San Francisco 2018 in November.


About QCon

QCon is a practitioner-driven conference designed for technical team leads, architects, and project managers who influence software innovation in their teams. QCon takes place 7 times per year in London, New York, San Francisco, São Paulo, Beijing & Shanghai. QCon San Francisco is at its 12th Edition and will take place Nov 5-9, 2018. 140+ expert practitioner speakers, 1300+ attendees and 18 tracks will cover topics driving the evolution of software development today. Visit to get more details.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS feed, and they are available via SoundCloud and iTunes.  From this page you also have access to our recorded show notes.  They all have clickable links that will take you directly to that part of the audio.

