Building Better Prediction Engines

William McNamara • December 21, 2020

Sometimes, you just gotta build it yourself.

I was working for a B2B growth sales organization with the familiar problem of having too many leads to work and not enough time. The challenge was to think of ways that the sales organization might better target their limited time toward those prospects most likely to convert into business.


The conceptual model is simple, and has been used to predict discrete events for centuries. A subject (in this case the prospect) has a given set of characteristics X that yields a probability Y of the event occurring (in this case a sale). This prediction can be expressed as a binary ("I think this event will happen"), categorically ("I am very confident this event will happen"), or as a probability itself ("I am 95% confident this event will happen"). Even children intuitively use this model in their heads when assessing whether or not to ask their parents for a cookie before dinner.


The difficulty is that eventually you go from your parents' kitchen to a multi-million-dollar company with operationally complex partnerships. What was previously a handful of considerations is now hundreds if not thousands of variables in X. Thank god for statistics!


Iteration #1: Third-Party Software


Like any good developer given a new task, I first looked to see if anyone else had already done this in a way I could meaningfully reuse. Fortunately for me, the company was already using an enterprise platform with predictive functionality. All I had to do was tell it which column to predict and presto, it would use all the other data in that table to make predictions!


Okay. Is that...too easy?


It's not uncommon for applications such as these to hide the details of their methodology in the interest of preventing people from copying their IP. But the thing is, if you're going to have a secret methodology...it has to work. In this case the predictions output by the standard model were accurate only 45% of the time. This means the sales team would have better luck randomly guessing which prospects will convert than using this model. This performance can likely be attributed to the fact that a model needs to be trained against the specific data model it will serve; otherwise it makes assumptions about what is or isn't important that may not be correct. So, given my closer understanding of the data, I decided to build a custom model to see if custom could outperform standard.


Iteration #2: Machine Learning


I decided to go with Python because all we're really doing is pulling, analyzing, and pushing data with an API. Python is a great language for data analysis, and especially great if you're just getting into programming. My plan was to build a classifier, a type of algorithm that groups a population into sub-populations (in this case will convert/will not convert) and attaches a continuous probability to its prediction, using the following form.
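In its simplest terms (my shorthand rather than formal notation), the form is:

P(Y = 1 | X) = f(x1, x2, ..., xn)

where Y = 1 is a conversion, x1 through xn are the prospect's characteristics, and f is the fitted classifier that maps those characteristics to a probability between 0 and 1.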


After loading and cleaning the data, which any data engineer knows is going to be 75% of the job, I divided the dataset into 80% of historical records to train my model and 20% of historical records to validate the model's success. The reason the same data shouldn't be used for both is that it can lead to overfitting, meaning the model would be very good at predicting against the sample but not against new data generally. For training I used a Gradient Boosting Decision Tree from scikit-learn's ensemble module. Gradient boosting is a method that combines many weaker models and improves over successive stages.
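Here's a minimal sketch of that setup (the file and column names are hypothetical, and I'm assuming the data has already been flattened into a single table):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical file and column names; the real dataset had far more features.
df = pd.read_csv("prospects.csv")
X = df.drop(columns=["converted"])
y = df["converted"]

# Hold back 20% of historical records to validate the model's success.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = GradientBoostingClassifier()  # defaults first; tuning comes later
clf.fit(X_train, y_train)
print(f"Validation accuracy: {clf.score(X_test, y_test):.2%}")
```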


The number of weak models used is controlled as a hyper-parameter, as are the size and depth of each one. After fitting the model to the training data, we can apply it to the validation data. I started from the model's defaults and made minor adjustments to the hyper-parameters to optimize performance. It became apparent that a higher learning rate for each weak model had the biggest impact. Through manual adjustment I was able to increase the model's accuracy from 80% to 88%. But this method was slow, and I knew a faster way.
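In practice that looked something like this; the values below are illustrative, not the settings I actually landed on:

```python
# Nudge one hyper-parameter at a time, refit, and re-score on the
# validation set. These values are hypothetical examples.
clf = GradientBoostingClassifier(
    learning_rate=0.2,  # the default is 0.1; raising it helped most
    n_estimators=100,   # how many weak models (boosting stages) to combine
    max_depth=3,        # how deep each individual weak model can grow
)
clf.fit(X_train, y_train)
print(f"Validation accuracy: {clf.score(X_test, y_test):.2%}")
```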


Iteration #3: Machine Learning w/ Grid Search Optimization


Scikit-learn's model selection module has another great tool, which compares different configurations across the hyper-parameter space and picks the one with the best performance. Automated tuning, what a concept! The tradeoff is that you don't want to give it too broad a hyper-parameter space to search, or it can take a LONG time to run. In the interest of saving time, I had it search in steps of 10%, up to +/- 30% in either direction from the default hyper-parameters.
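With scikit-learn's GridSearchCV, that search looks roughly like this (a sketch building on the classifier above; I'm showing two hyper-parameters for brevity):

```python
from sklearn.model_selection import GridSearchCV

# Steps of 10% up to +/-30% around the defaults (0.1 and 100).
factors = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
param_grid = {
    "learning_rate": [round(0.1 * f, 2) for f in factors],
    "n_estimators": [int(100 * f) for f in factors],
}

search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    scoring="accuracy",
    cv=5,       # 5-fold cross-validation within the training data
    n_jobs=-1,  # use every core; 49 candidate models add up fast
)
search.fit(X_train, y_train)
print(search.best_params_)
print(f"Validation accuracy: {search.score(X_test, y_test):.2%}")
```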


When I applied the grid-search-optimized hyper-parameters to the model, I got a staggering result of 92% accuracy! Revisiting the original premise, this model is able to correctly predict whether a prospect will convert into business 92% of the time, compared to the 45% we started with using the third-party model. Still, I wanted to see where the remaining 8% was coming from so I could have an idea of how the model might be improved in the future. The errors were mostly false negatives, which means the model is slightly pessimistic, underestimating the conversion potential of some prospects.

This can be visualized with a Receiver Operating Characteristic (ROC) curve. ROC curves typically plot the true positive rate (TPR) on the Y axis and the false positive rate (FPR) on the X axis, which means the top left corner of the plot is the "ideal" point: an FPR of zero and a TPR of one. That ideal is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better. The "steepness" of ROC curves is also important, since the goal is to maximize the TPR while minimizing the FPR.
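Plotting one takes only a few lines with scikit-learn and matplotlib (again a sketch, reusing the fitted search object from above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Probability scores for the positive class ("will convert").
y_scores = search.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_scores)

plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.legend()
plt.show()
```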

Conclusion


In conclusion, I was happy to provide a model that helps target the sales organization's limited time, and proud that my custom model was able to perform 2x as well as a standard third-party model. In the three months following the implementation of my classifier, there was a 15% increase in the organization's lead conversion metric, representing millions of dollars in potential new business. Still, there are some areas for improvement. False negatives potentially represent missed business if sales reps don't follow through, so I fully intend to revisit this at a later date and see what we can do about that last 8%.
