This article originally appeared on Hacker Noon.

When it comes to data science, testing your far-fetched ideas in production can be daunting. This is especially hard when you’re in a fast-paced environment such as Flipp, where ideas are always brewing and new projects are always starting. In exploring Flipp’s approach to testing new ideas, we’ll look at how we build simple models to solve some hard problems.

Let’s use an example from the Flipp search team:

Pinpointing the Problem

The fundamental goal of a search engine is to satisfy a user’s information need. At Flipp, the user’s information need is to find the product deals that match their intent. So the real challenge is to understand what the user intends to find when they enter a search query.

The classic search engine uses keywords entered by the user’s query to retrieve results. These classical systems rank relevant results based on keyword statistics. This can be a problem for Flipp when trying to build a world-class search engine. For example:

Words often have multiple meanings, like how the word “nuts” can refer to the food item or hardware fasteners. Secondly, product deals often use the keywords of another product in their description, like when cereals contain “nuts.”

The solution we proposed to address this search issue requires two phases. First, you have to learn to predict a user’s intent when they search. Then you use this insight to build a production system.

Making a Simple Model

One particularly interesting event to our team is the search hit. This is when a user clicks a search result after they run a query. The user is essentially telling us exactly what item they are looking for.

Every single item in a Flipp flyer gets tagged with a category. We have over 5000 categories to choose from! So when a user clicks a search result, the user is able to implicitly tell us what category they are looking for. If we collect enough of these events for each query, we can eventually determine which category a user is most likely to choose.

For example, if a user searches for “nuts”, the probability a user will click a certain class can be easily identified (50 per cent of the time, a user will click an item tagged as “Nuts & Seeds”).

With this very simple model, we are able to predict the intent of a search query. There are some obvious flaws (such as the inability to predict queries Flipp has never seen) with such a simplified basic system, but we’ll address those concerns a bit later.

Exploiting “Simple” Ideas

The next step is to build a system that can then use this model to make some real-time predictions. This is a classic case of top-k retrieval. At Flipp, search works in two stages:

  1. Documents are retrieved using a search engine. Flipp uses Apache Solr as our main search engine.
  2. The top-k result are reranked based on an external model.

In this implementation, we decided to approach the problem at stage 2. The system diagram below shows search working in two phases. First, the flyer items are retrieved. The data is then passed to the second system, which first identifies the item category for all the retrieved items and also the query-based model as described above.

It then looks at additional signals in the model, such as item popularity. These signals are then combined in a linear model. Each item is given a new ranking weight based on the linear model. The final list is then re-sorted based on the new computed weights.

The easiest way to compute each item’s weight is to quantify each signal into a numeric value. Then we take a weighted sum of each numeric value to compute the individual item’s score which determines the order of the results the end user will see. In this case, we can look at the signal representing the query’s intent and the item’s category. We’ll call this the category curation score, which uses this formula:

Where θ1 is the value of one type of feature, θ2 is the value of another feature. w1 is the weight of the first feature and w2 is the weight of the second feature

Since each item has a category, we can look up the probability the user will click that category for their query. For example, say a bag of peanuts is labelled as “Nuts & Seeds.” If the query is “nuts”, the probability a user will click on that category is 50 percent. We then say the category curation score for that item, given the query, is 0.5. We then compute the final score of the item alongside some of the other signals. Simple, right?

To validate the success of the ranking model, we conducted an A/B test. In this case, the A/B test allowed us to tune variations to the model in order to see a significant lift in our metrics. The image below is an example of what a user would see with (on the left) and without (on the right) the simple model. Clearly, the right one is better at putting useful results at the top of list. Without the model, you would get items like cereal, guitars and chocolate chips at the top of the list instead of actual nuts.

Flipp’s Data Philosophy

As mentioned above, the model we chose is almost… too simple. A better way to predict the category of a query would be to use a more state of the art model such as Naive Bayes or even a neural net.

However, a more advanced technique requires time to build and also brings more complexity when it comes to putting the idea into production. For example, even training a language model is a lot more challenging, especially compared to doing a simple lookup of historic data.

It also would be a lot more difficult to build a production system that does real-time prediction with complex machine learning models… when none of the infrastructure is in place. As with any experimental project, there is always room for failure. The cost of failure with a complex solution is much higher than the cost of a simple solution. (In other words, why spend three months instead of one week to validate an idea?)

By thinking holistically, we are able to optimize the tradeoff between risk and reward. Simple models allow for low risk and quick validation of ideas. In this case, the predictive model (really just the probability distribution) was successful in improving the search results. With stakeholder buy-in achieved, the business is able to invest in more complex models.

In this case, we could look at better ways to predict queries such as using neural nets, as mentioned above. We could also utilize more advanced modelling methods, such as learning to rank to build a better method of sorting search results.