Author: William Vorhies
Summary: Are you wondering about moving up to Automated Machine Learning (AML)? Here are some considerations to help guide you.
Are you wondering about moving up to Automated Machine Learning (AML)? Or perhaps you’ve already made the decision but are wondering about the capabilities of individual platforms, their strengths and limitations and how to choose. Here are some considerations to help guide you.
What’s Your Motivation?
This is intended to be a little broader than the business case and requirements. Chances are your motives fall into one or more of the following buckets, and they can certainly overlap.
- Efficiency
So far the greatest motivation behind AML adoption has come from companies that are already deploying large numbers of ML models. If you’re creating and managing dozens or even hundreds of models, as is frequently the case in insurance, banking, and ecommerce, then the ability to create more models and keep them refreshed is an obvious issue.
Cost savings are a top motivation, since fewer data scientists can now do the work of many. Speed, that is, time to benefit, is also greatly enhanced, especially in the model refresh-and-deploy cycle.
- Broader Participation
Be aware that many of the up-and-coming AML platforms differentiate themselves based on audience. Those that appeal to your existing data science team offer easier and more complete access to choices in data prep, feature selection, model selection, and model tuning with their hyperparameters.
The larger emerging camp seeks to make the process much easier for less experienced modelers. On the one hand, these can be your first-year data science hires, who will rely more on the automated features than perhaps the more experienced team members. On the other hand, there are platforms so completely automated that they encourage LOB Managers, analysts, and other citizen data scientists to participate directly in model building.
Having more people directly participating in model building can seem like a very desirable objective. Be sure, however, that you have sufficient controls to prevent putting models into operation that haven’t been fully vetted by your experienced data scientists. It’s still possible for the operator of a fully automated tool to create a model that’s not sufficiently accurate, won’t generalize, or worse, predicts exactly the wrong thing.
- Just Getting Started
If you’re just getting started on your digital journey and don’t yet have a dedicated data scientist or two, you might be tempted to sign up for an AML platform and give your LOB Managers and analysts enough training to get started. Don’t go there.
As in the last section, it’s still possible for an inexperienced modeler to create a model that will leave you worse off than having no model at all. You’re going to need some quality control before you turn new models loose on your customers or processes.
How Much Does Accuracy Count?
In machine learning there is always a practical tradeoff between model accuracy and time to develop. Your data scientists will no doubt be happy to keep delivering incremental gains in model accuracy for days or weeks.
Still, it’s important to understand the tradeoff between model accuracy and revenue or margin. It’s not unusual for small gains in accuracy to create proportionately much larger gains in campaign results.
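For illustration only, here is a back-of-the-envelope sketch with entirely hypothetical campaign numbers (contact cost, margin, and response rates are made up) showing how a small lift in model precision can produce a much larger relative lift in campaign profit.

```python
# Hypothetical numbers for illustration only.
mail_cost_per_contact = 0.50    # cost to contact one prospect
margin_per_conversion = 40.00   # gross margin from one converted prospect
contacts = 100_000              # campaign size

def campaign_profit(precision):
    """Profit when `precision` of contacted prospects actually convert."""
    conversions = contacts * precision
    return conversions * margin_per_conversion - contacts * mail_cost_per_contact

base, improved = campaign_profit(0.020), campaign_profit(0.022)  # +0.2 point precision
print(f"baseline profit: ${base:,.0f}")      # $30,000
print(f"improved profit: ${improved:,.0f}")  # $38,000
print(f"relative gain:   {100 * (improved - base) / base:.0f}%")  # ~27%
```

Under these made-up assumptions, a tenth-of-a-percent-scale improvement in precision moves campaign profit by roughly a quarter, which is why the accuracy question deserves more than a shrug.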
Your data science team lead no doubt understands this and has already put some controls in place. The real issue is whether the automated output of the AML platform meets your minimum requirements.
Determining this will require some benchmarking during the selection process so that you have side-by-side comparisons. Nearly all AML platforms run multiple algorithms, and even ensembles of algorithms, in competition with one another to select the winners.
Accuracy within the AML may be less than optimal if the number of candidate algorithms is restricted to just a few. It’s just as likely, however, that any shortfall in accuracy occurred in the automated data prep, cleansing, feature engineering, or feature selection. You’ll need experienced members of your data science team to help you evaluate this issue.
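If your team wants a starting point for that benchmarking, a minimal sketch along the following lines (scikit-learn, with a synthetic dataset standing in for your own) scores a few candidate algorithms with the same folds and metric, so the comparison against the AML’s output is apples-to-apples.

```python
# A minimal benchmarking sketch; dataset and candidate models are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Score each candidate with the same folds and the same metric so results
# line up directly against whatever the AML platform produces.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:22s} AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```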
Basic Feature Set
At this stage in market maturity, any AML you consider should offer all of the following automated capabilities:
Data Blending: The combination of data from different sources into a single file. This still requires the operator to specify things like inner or outer joins of data sets. The most advanced platforms may also be able to detect whether fields from two different sources with the same name (e.g. ‘sales’) have the same meaning. At this point, however, it’s best either to have really robust data governance (and not many do) or to have modelers sufficiently intimate with the data that they can detect this sort of mismatch.
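For orientation, here is a minimal blending sketch in pandas; the tables, join key, and column names are hypothetical, and the join-type decision is exactly the part the operator still has to make.

```python
# A minimal data-blending sketch; tables and keys are hypothetical stand-ins.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "sales": ["east", "west", "east"],   # here 'sales' means sales region
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "sales": [120.0, 80.0, 200.0],       # here 'sales' means order value
})

# The operator still decides the join semantics: a left join keeps every
# customer even with no transactions; an inner join would drop customer 2.
blended = customers.merge(
    transactions, on="customer_id", how="left",
    suffixes=("_cust", "_txn"),  # same-named columns like 'sales' get disambiguated
)
print(blended)
```

Note that the suffixes only flag the naming collision; deciding whether the two ‘sales’ columns actually mean the same thing is still a human (or governance) job.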
Data Prep and Cleansing: In this category is automated correction of data in incompatible formats (dates, values with embedded commas, etc.). Most AML platforms do a good job at this. Cleansing is more complex. It involves, for example, identifying outliers and deciding how they are to be treated, correcting badly skewed distributions, converting categoricals into independent features, or even compressing data ranges (typically to -1 to 1) as required by some specific types of algorithms like neural nets.
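As a rough illustration of what those steps look like when done by hand, here is a sketch in pandas and scikit-learn; the columns, thresholds, and transforms are hypothetical examples of each step named above, not any platform’s actual behavior.

```python
# A hand-rolled cleansing sketch; columns and thresholds are hypothetical.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "signup_date": ["2018-03-01", "2018-03-15", "not a date"],
    "income": ["52,000", "61,500", "1,250,000"],   # embedded commas, one outlier
    "region": ["east", "west", "east"],
    "age": [34, 51, 29],
})

# Correct incompatible formats: dates and values with embedded commas.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["income"] = pd.to_numeric(df["income"].str.replace(",", ""))

# Treat outliers by clipping to chosen percentiles, then fix skew with a log.
low, high = df["income"].quantile([0.05, 0.95])
df["income_log"] = np.log1p(df["income"].clip(low, high))

# Convert categoricals into independent (dummy) features.
df = pd.get_dummies(df, columns=["region"], drop_first=True)

# Compress numeric ranges to -1..1 for algorithms such as neural nets.
df[["income_log", "age"]] = MinMaxScaler(feature_range=(-1, 1)).fit_transform(
    df[["income_log", "age"]])
```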
Feature Engineering: In concept feature engineering is simple: for example, converting related variables into ratios (e.g. debt to income) or dates into the number of days since other events occurred (age of the account, days since last purchase, etc.). In automated form this frequently requires the AML to create all possible combinations of these artificial features without regard for whether they are logical, and then let the algorithms figure out which are predictive (typically only a small fraction). Depending on how this is handled in the AML, it can add a very large amount of compute overhead, so you’ll want to examine whether this step creates any unforeseen requirements in time or compute cost.
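The difference between hand-crafted features and exhaustively generated ones can be made concrete with a short sketch; the columns and the as-of date below are hypothetical.

```python
# A minimal feature-engineering sketch; columns and dates are hypothetical.
from itertools import combinations
import pandas as pd

df = pd.DataFrame({
    "debt": [12_000, 3_500], "income": [60_000, 42_000],
    "last_purchase": pd.to_datetime(["2019-01-15", "2019-03-02"]),
    "account_opened": pd.to_datetime(["2015-06-01", "2018-11-20"]),
})
as_of = pd.Timestamp("2019-04-01")

# Hand-crafted features a data scientist would reason about directly.
df["debt_to_income"] = df["debt"] / df["income"]
df["days_since_purchase"] = (as_of - df["last_purchase"]).dt.days
df["account_age_days"] = (as_of - df["account_opened"]).dt.days

# What an AML typically does instead: generate every pairwise ratio, logical
# or not, and let downstream feature selection keep the few that turn out to
# be predictive. This exhaustive generation is where compute overhead grows.
numeric = ["debt", "income", "days_since_purchase", "account_age_days"]
for a, b in combinations(numeric, 2):
    df[f"{a}_per_{b}"] = df[a] / df[b]
```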
Feature Selection and Modeling: These are traditionally thought of separately, but I’ve combined them here because AML platforms often do. In traditional modeling, feature selection can be a separate step that precedes model creation to make the modeling process more accurate and efficient. However, it’s also possible to have the models consider all possible features and automatically eliminate those which are least predictive.
Automated modeling typically involves running parallel contests on the data with different algorithms. During the contests the AML should also be varying the hyperparameters of the different models to attempt to achieve an optimum result. How feature selection, modeling, and hyperparameter tuning are handled by the platform will require your detailed attention during trials.
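For a feel of what is happening under the hood, here is a hand-rolled sketch of combined feature selection and hyperparameter search in scikit-learn; a real AML platform runs this kind of search across many algorithms in parallel, and the dataset and parameter ranges below are stand-ins.

```python
# A minimal sketch of feature selection plus hyperparameter tuning in one search.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=5_000, n_features=50, n_informative=8,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),    # drop the weakest features
    ("model", RandomForestClassifier(random_state=0)),
])

# The search varies both how many features survive and the model's
# hyperparameters, scoring each combination by cross-validation.
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "select__k": randint(5, 30),
        "model__n_estimators": randint(100, 500),
        "model__max_depth": randint(3, 12),
    },
    n_iter=20, cv=5, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```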
Model Deployment: Your AML should be able to automatically generate production code in your choice of language compatible with your operating systems (typically Python, C++, Java, or other popular production languages).
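The generated code differs by platform; purely for orientation, the sketch below shows the kind of scoring artifact a Python deployment can reduce to, hand-rolled with a stand-in model and a hypothetical file name rather than any platform’s actual output.

```python
# A hand-rolled deployment sketch; the model and file name are hypothetical.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)  # stand-in for the AML's winner

# Persist the winning model so production scores data the way training did.
joblib.dump(model, "churn_model_v1.joblib")

def score(records):
    """Load the frozen model and return predicted probabilities for new records."""
    frozen = joblib.load("churn_model_v1.joblib")
    return frozen.predict_proba(records)[:, 1]

print(score(X[:5]))
```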
Model Management and Refresh: The first time you deploy a model in your operating systems you will need to define exactly where it goes. Thereafter a complete AML should be able to monitor the model, determine when a refresh is appropriate, and, with minimum human intervention, refresh the model and automatically redeploy it. There are human quality-control verifications in this process, but once the model has been developed, refresh and redeploy should require only a small fraction of the original development labor.
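As a rough sketch of that monitor-and-refresh logic, assuming a simple accuracy-decay trigger, hypothetical file names, and a threshold your team would set, here is the shape of what a platform automates (and wraps in its own approval steps):

```python
# A minimal monitor-and-refresh sketch; threshold and file names are hypothetical.
import joblib
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

REFRESH_IF_AUC_DROPS_BELOW = 0.72  # agreed with the data science team

def check_and_refresh(recent_X, recent_y, train_X, train_y):
    """Score the live model on recent labeled data and refresh it if it has decayed."""
    live = joblib.load("churn_model_v1.joblib")
    current_auc = roc_auc_score(recent_y, live.predict_proba(recent_X)[:, 1])

    if current_auc >= REFRESH_IF_AUC_DROPS_BELOW:
        return f"OK: AUC {current_auc:.3f}, no refresh needed"

    # Retrain on fresh data; in a full AML the rerun covers the whole pipeline,
    # and a human still signs off before redeployment.
    refreshed = clone(live).fit(train_X, train_y)
    joblib.dump(refreshed, "churn_model_v2.joblib")
    return f"Refreshed: AUC had fallen to {current_auc:.3f}"
```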
Some Advanced Considerations
Automation of the Entire Process: In a fully automated system, particularly one focused on maintaining and refreshing existing models, it’s important that the entire process can be programmatically defined. In this way the entire workflow, from data capture through deployment and all the customized steps in between, can be captured and repeated, making the end-to-end process truly automated.
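One concrete way to picture “programmatically defined” is a single pipeline object that encodes every step, so the identical process can be replayed on next quarter’s data; the sketch below uses scikit-learn with hypothetical column names and is only one of many ways platforms express this.

```python
# A minimal sketch of capturing the whole modeling process as one object.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

prep = ColumnTransformer([
    ("numeric", Pipeline([("impute", SimpleImputer()),
                          ("scale", StandardScaler())]),
     ["income", "age", "days_since_purchase"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

end_to_end = Pipeline([
    ("prep", prep),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", GradientBoostingClassifier(random_state=0)),
])

# Because every step is declared in code, the identical process can be
# replayed on new data with end_to_end.fit(new_X, new_y) and then redeployed.
```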
Data Types: Depending on your business you may have a variety of data inputs with special needs, including unstructured or semi-structured text, image data, or streaming data. A few AML platforms can handle these more advanced requirements, and some already have the ability to create deep learning CNN and RNN models, though this type of modeling is not yet common in business.
Prepackaged Automation Libraries: During initial model development your data science team will have identified specific steps in the process that need particular attention. These might include data prep, feature selection, or hyperparameter optimization. Ideally your AML platform will include libraries or APIs of callable solutions that can shortcut data scientist labor on these tasks.
Training Data Requirements: Some algorithms that might be considered during the competition for best model may be particularly data hungry. You will want to understand the tradeoff between including these algorithm types and the availability or cost of acquiring sufficient training data.
On-Premise Solution: Some AML platforms, being particularly compute intensive (as many are), are optimized for SaaS cloud delivery. If your business requires an on-prem or private-cloud solution for data security, you’ll need to identify the cost and complexity of this option.
While AML platforms are positioned around their simplicity, there are many factors to consider before jumping in. You’ll want help from your data science pros in selecting the right one.
Other articles by Bill Vorhies
About the author: Bill is Contributing Editor for Data Science Central. Bill is also President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist since 2001. His articles have been read more than 2 million times.
He can be reached at:
Bill@DataScienceCentral.com or Bill@Data-Magnum.com