Blog

How to Migrate an Algorithm to an ML Model


Previously, we wrote about the benefits of using machine learning (ML) to replace costly algorithms and adapt more fluidly to changing data. In this blog, we give you a step-by-step primer on how to migrate an algorithm to an ML model.


By starting with an existing algorithm, we bypass the need to define a business objective as gating criteria. At the end of this blog, we discuss what to do if no legacy algorithm exists. In that case, the importance of having a business objective and acceptance criteria cannot be overstated.

Start With Data

ML models need lots of data, and the cost of ML projects is driven by data availability. If you already have an algorithm, then you should be able to produce lots of data very quickly. Create a test driver that sends exhaustive amounts of input data to the algorithm and capture the results. This becomes the “supervised training data” for your new ML model.
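
As an illustration, here is a minimal sketch of such a test driver in Python. The `legacy_algorithm` function, its inputs, and the output file name are all hypothetical stand-ins for your real algorithm and data:

```python
import csv
import itertools

def legacy_algorithm(amount, region):
    """Hypothetical stand-in for your existing production algorithm."""
    return "flag" if amount > 500 and region == "EU" else "ok"

# Sweep an exhaustive grid of inputs through the algorithm and capture
# each (input, output) pair as a supervised training record.
amounts = range(0, 1001, 50)
regions = ["US", "EU", "APAC"]

with open("training_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["amount", "region", "label"])
    for amount, region in itertools.product(amounts, regions):
        writer.writerow([amount, region, legacy_algorithm(amount, region)])
```

The algorithm's own outputs become the labels, so no human labeling effort is required.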

Train Your ML Model

Training an ML model nowadays is surprisingly simple. You don’t have to write any complex code, you don’t need to run a compiler or a builder, and you don’t need to hire a data scientist. There are a variety of low-code and no-code options for implementing ML today, both commercial and open-source. Here is a resource that describes some of the models that are commonly available. Alternatively, your cloud provider offers APIs with pre-trained models for a variety of use cases and plenty of training and documentation on how to use them.
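
If you prefer a code-first route, open-source libraries keep training almost as simple. Here is a minimal sketch using scikit-learn (one common open-source option), with toy features and labels standing in for your algorithm-generated records:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy records standing in for the data captured from the algorithm:
# each row is (amount, is_eu); the labels came from the legacy algorithm.
X = [[100, 0], [600, 1], [700, 1], [200, 1], [900, 0], [550, 1]]
y = ["ok", "flag", "flag", "ok", "ok", "flag"]

# Two lines of code to train a model -- no custom training loop needed.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

print(model.predict([[650, 1]]))
```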

Validate the Model

How do you know if it worked? You typically hold back 20 percent of your training data to use as a “validation set.” After you train the model with the other 80 percent of the data, you send the “validation set” through the newly trained model as a “quiz” to see how well it performs. You already have the correct answers, so you can automatically check whether the newly trained model got them right. Use these results to calculate the accuracy. If the model has decent accuracy, then you have a starting point for replacing your algorithm.
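
Using scikit-learn for illustration, the 80/20 split and the accuracy check look like this (the records here are toy stand-ins for your algorithm-generated data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy records standing in for the algorithm-generated training data.
X = [[i] for i in range(100)]
y = ["flag" if i >= 50 else "ok" for i in range(100)]

# Hold back 20 percent of the data as the validation "quiz".
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Grade the quiz: compare the model's answers to the known-correct labels.
accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {accuracy:.2f}")
```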

The point here is that you don’t have to spend thousands of hours of QA time validating that the ML model works. You don’t need a QA person designing black-box, glass-box, edge-case, boundary-case, or happy-path test cases. Just evaluating the “Data In — Data Out” behavior of the model is sufficient.

Deploying the ML Model

Initially, you will run the newly trained model in parallel with your production algorithm. You will also audit some subset of the results to determine the “correct” answer. (You can audit manually. Or, if your algorithm is good enough, you can use the algorithm output directly as the “correct” answer.) Any of your audited data – especially the records where the model failed to find the correct answer – should be saved as additional “supervised training data” for the next time you train a model.
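
Here is a sketch of that parallel run. Both functions are hypothetical stand-ins, and for simplicity the algorithm's output is treated as the “correct” answer; disagreements are queued as future training data:

```python
def legacy_algorithm(x):
    """Hypothetical production algorithm (treated as the correct answer)."""
    return "flag" if x >= 50 else "ok"

def ml_model_predict(x):
    """Hypothetical newly trained model, slightly wrong near the boundary."""
    return "flag" if x >= 52 else "ok"

# Run both in parallel over production inputs, audit the results, and
# save the model's failures as additional supervised training data.
new_training_rows = []
for x in range(100):
    expected = legacy_algorithm(x)
    predicted = ml_model_predict(x)
    if predicted != expected:
        new_training_rows.append((x, expected))

print(f"{len(new_training_rows)} records queued for the next retraining")
```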

Wait – what next time? Yes, you are going to want to train another model. ML is an iterative process, not a “one-and-done.” Initially you will be auditing new data and training new models every day, or multiple times per day. And each time you train a new model, you will have more training data, which represents more use cases. So your model should get smarter and more accurate over time. Once you reach your target accuracy, then you can decrease your frequency of auditing and retraining to once per week, or once per month; just enough to maintain the desired accuracy. But for most use cases, you will never reach a point where you can just stop training new models, because you will always need to account for new data values or new patterns.

Note: Training models is not necessarily deterministic. (That is, two models trained with the same dataset will not necessarily yield identical results.) So you always need to validate your new model’s accuracy, even if it was trained with the same (or better) data as before.

Eventually, when your model starts to outperform your algorithm (in terms of accuracy), then you can retire the algorithm, and just run the production data through the ML model.

Special Cases
Special Case 1: What if none of the standard ML models comes close to the needed accuracy?

One thing to look for is whether you have bifurcated data. That is, perhaps you really have two types (or n types) of data records, all being used as input to your single algorithm. And perhaps the first thing the algorithm does is identify the “type” of record, and then it treats the different record types accordingly. In this case, split the training data into two sets, and train two separate ML models (or “n” separate ML models) to do the work. (For extra credit, you can also train a model to do the record-type classification as well.)
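
A sketch of the split, with toy record types and toy rules as assumptions:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy bifurcated data: type-0 records follow one rule, type-1 records
# follow another, yet both currently feed the same single algorithm.
records = [(t, x) for t in (0, 1) for x in range(20)]
labels = ["flag" if (t == 0 and x > 10) or (t == 1 and x < 5) else "ok"
          for t, x in records]

# Split the training data by record type and train one model per type.
models = {}
for rtype in (0, 1):
    X = [[x] for t, x in records if t == rtype]
    y = [lab for (t, x), lab in zip(records, labels) if t == rtype]
    models[rtype] = DecisionTreeClassifier(random_state=0).fit(X, y)

def predict(rtype, x):
    """Route each record to the model trained for its type."""
    return models[rtype].predict([[x]])[0]
```

Each specialized model then only has to learn one coherent rule instead of a mixture.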

Special Case 2: What happens if the business rules change?

In this case, you will need to get some training data that represents the new business logic. You can do this in a few different ways:

  • Write a little algorithm to generate the data automatically.
    • Alternatively, there are companies that specialize in synthetic training data (including YData, Hazy, Datomize, and Tonic).
  • Generate the data manually via a spreadsheet application, such as Excel.
  • If possible, request the data from the business unit that instituted the change. They have often already worked through many examples as part of their implementation and roll-out.
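
The first option – a little generator algorithm – can be sketched as follows; the business rule and field names here are hypothetical:

```python
import csv
import random

def new_business_rule(amount, customer_tier):
    """Hypothetical encoding of the changed business logic."""
    # New rule: gold-tier customers are exempt below 1000.
    if customer_tier == "gold" and amount < 1000:
        return "ok"
    return "flag" if amount > 500 else "ok"

# Generate labeled records that reflect the new logic.
random.seed(0)
with open("new_rule_training_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["amount", "customer_tier", "label"])
    for _ in range(500):
        amount = random.randint(0, 2000)
        tier = random.choice(["standard", "gold"])
        writer.writerow([amount, tier, new_business_rule(amount, tier)])
```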

Once you have the new training data, you can add it to the original training data and train a single model to handle both data sets. Or, you can add an additional ML model to handle just the data records that require the new business logic. (See Special Case #1 above.)

Note: If the data from the new business logic is too different from the original data, then there is a risk that the new data will cause problems for existing reports and views. In this case, you may want to think about versioning your data, and publishing a restatement of the data that includes the new business logic.

Special Case 3: What if I don’t already have an algorithm to get started?

What if no one has figured out an algorithm that works? Maybe the problem is too complex, or there are too many possible variations? In this case, write an algorithm that works maybe 50% of the time. You can start with some basic business rules drawn from your intuition and a Subject Matter Expert. Use that algorithm to generate some initial (rough) training data. Use that training data to train an initial ML model. Then, audit the results of the ML model and capture the audited records as additional training data. It may take a few weeks of intense manual auditing, but over time the ML model becomes smarter, and the auditing process becomes easier. The auditing can be done by business people, interns, or temps – it does not have to be done by developers, who may be overloaded or too expensive.
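
The bootstrap loop might look like this, with the rough rule and the audited answers both as hypothetical stand-ins:

```python
from sklearn.tree import DecisionTreeClassifier

def rough_algorithm(x):
    """Rough SME rule: right only part of the time."""
    return "flag" if x > 50 else "ok"

def audited_label(x):
    """The answer a human auditor would give (unknown to the rough rule)."""
    return "flag" if x > 25 else "ok"

X = [[x] for x in range(100)]

# Round 1: train an initial model on the rough algorithm's noisy labels.
model = DecisionTreeClassifier(random_state=0).fit(
    X, [rough_algorithm(x) for x in range(100)])

# Round 2: audit the records, then retrain on the corrected labels.
model = DecisionTreeClassifier(random_state=0).fit(
    X, [audited_label(x) for x in range(100)])

accuracy = sum(model.predict([[x]])[0] == audited_label(x)
               for x in range(100)) / 100
print(f"accuracy after one audit round: {accuracy:.2f}")
```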

CoStrategix helps companies solve business problems with digital and data solutions. Data science is a journey. We help companies build their data maturity from building a data platform infrastructure, to developing machine learning (ML) and artificial intelligence (AI) solutions, to DataOps as a service. If you are struggling with how to apply data science to your business challenge, feel free to reach out to us.