Treat any IT project leveraging machine learning as a data science project. This is key to project success in our data-driven world. Since the data is used to train the system initially and will be analyzed on an ongoing basis, it must be both reliable and representative.
Machine Learning Requires Big Data to Get Smarter
Regardless of what machine learning technologies are employed, they almost all require access to large quantities of data.
Every example of applied machine learning, from autonomous cars to facial recognition, first requires acquiring an adequate sample set and spending time training the machine learning algorithms. Deep learning is even more data-hungry than traditional machine learning algorithms.
For machine learning to benefit any business process task, the system has to be trained to work effectively, and the nature of the training data is critical to its precision and reliability. There are relatively few cases where a solution can be pre-trained on data that is not specific to your organization. These cases typically involve the curation of an immense set of data. The millions of miles driven by Google’s autonomous cars all over the country are one example. Even these immense data sets will not always be reliable. The most common and effective way to use ML is to train the system on your own data.
Rethinking Good Data
Since input data is essential, let’s talk about what you need in order to ensure your data science-driven ML project is a success: reliable and representative data. Reliable means that the data has to be accurate. Sample data is not just the data itself; it must also be accompanied by what the data is or means, referred to as “ground truth” data.
When implementing machine learning to predict hospitalization using claims data, for example, each claim must be accompanied by whether or not the patient was hospitalized. In mortgage document classification, the sample set of documents needs to be accompanied by the real document class for each sample document. It can take a lot of quality controls to ensure that sample data is as accurate as possible. A cottage industry has sprung up to provide access to high-quality ground truth data. It’s easy enough to contract out to create a data set for your organization if you don’t already have one.
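One common quality control is to have several reviewers label each sample independently and accept a label only when enough of them agree. The following is a minimal Python sketch of that idea, not a prescribed method: the `LabeledSample` class, the `resolve_ground_truth` function, the 0.6 agreement threshold, and the document IDs and classes are all hypothetical, chosen only to illustrate the mortgage document example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LabeledSample:
    """A sample plus the labels assigned by independent reviewers (hypothetical structure)."""
    sample_id: str
    labels: list = field(default_factory=list)


def resolve_ground_truth(sample, min_agreement=0.6):
    """Return (majority_label, accepted) for one sample.

    A label is accepted only if the share of reviewers who chose it
    reaches min_agreement (an assumed threshold, not a standard).
    """
    if not sample.labels:
        return None, False  # no ground truth at all: cannot be used for training
    label, votes = Counter(sample.labels).most_common(1)[0]
    return label, votes / len(sample.labels) >= min_agreement


# Illustrative data only
samples = [
    LabeledSample("doc-001", ["appraisal", "appraisal", "appraisal"]),
    LabeledSample("doc-002", ["deed", "appraisal", "deed"]),
    LabeledSample("doc-003", ["deed", "appraisal", "note"]),
    LabeledSample("doc-004", []),
]

accepted = {}
needs_review = []
for sample in samples:
    label, ok = resolve_ground_truth(sample)
    if ok:
        accepted[sample.sample_id] = label
    else:
        needs_review.append(sample.sample_id)

print(accepted)
print(needs_review)
```

Samples that land in `needs_review` would go back to a human for adjudication rather than into the training set.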
Geek’s Guide to Data
Representative data means that the sample must reflect a significant amount (or all) of the production data. Achieving a representative set usually relies upon statistical sampling, in which the gathering of samples and the size of the sample set are driven by the size of your overall population of data over a period of time. The sample must represent the data population with a desired level of probability, expressed as a confidence level and a margin of error (a confidence interval). The reason why data must be representative is to prevent the machine learning algorithms from making inferences that introduce bias. It also guards against “overfitting,” where the software learns the quirks of a narrow portion of the data and then performs poorly on the rest. Both are bad for the organization since they could lead to decisions being made that are incorrect or even harmful.
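To make the link between population size, confidence level, and sample size concrete, here is a minimal Python sketch. It assumes simple random sampling and uses the widely taught Cochran formula with a finite population correction; the function name `required_sample_size` and its default values are illustrative assumptions, and a real project may need stratified sampling or a different estimator.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence=0.95, margin_of_error=0.05, proportion=0.5):
    """Estimate how many samples to draw from a finite population.

    Assumes simple random sampling. proportion=0.5 is the most
    conservative choice (it maximizes the required sample size).
    """
    # Two-sided z-score for the chosen confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Cochran's formula for an effectively infinite population
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Illustrative populations: documents or claims received over a period
print(required_sample_size(10_000))
print(required_sample_size(1_000_000))
print(required_sample_size(10_000, confidence=0.99, margin_of_error=0.02))
```

Note how slowly the required sample grows with the population: tightening the margin of error or raising the confidence level has a far larger effect than the population size itself.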
Data’s Job Never Done: Ongoing Data Science
Once trained, the machine learning algorithms deal effectively with some situations, but not all. We are in a constantly changing environment. As a result, our data changes, too. If your CRM system uses ML to analyze your pipeline data to determine which leads are most likely to convert to sales, for example, sales of a hot new product may not fit into your existing model. A higher percentage of leads may convert much faster than for your existing products. However, no one wants their entire pipeline conversion model to be skewed by this new data. Systems have to include ways to curate new data that can be used to update models without degrading the conversion predictions for other, mature products.
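One simple way to picture this kind of curation is to keep conversion statistics per product, so that new data only updates its own segment. The Python sketch below is a deliberately simplified stand-in for a real predictive model: the `SegmentedConversionModel` class, the product names, and the lead counts are all hypothetical.

```python
from collections import defaultdict


class SegmentedConversionModel:
    """Tracks lead conversion rates separately for each product (illustrative only)."""

    def __init__(self):
        self._won = defaultdict(int)
        self._total = defaultdict(int)

    def update(self, leads):
        """leads is an iterable of (product, converted) pairs."""
        for product, converted in leads:
            self._total[product] += 1
            self._won[product] += int(converted)

    def rate(self, product):
        """Conversion rate for one product, or None if it has no data yet."""
        total = self._total.get(product, 0)
        return self._won.get(product, 0) / total if total else None

    def pooled_rate(self):
        """Single rate across all products, shown only for contrast."""
        total = sum(self._total.values())
        return sum(self._won.values()) / total if total else None


model = SegmentedConversionModel()
# Mature product: 2 of 10 leads converted
model.update([("legacy_product", True)] * 2 + [("legacy_product", False)] * 8)
legacy_before = model.rate("legacy_product")

# Hot new product: 7 of 10 leads converted
model.update([("new_product", True)] * 7 + [("new_product", False)] * 3)

print(legacy_before, model.rate("legacy_product"))  # mature segment is unchanged
print(model.rate("new_product"))
print(model.pooled_rate())  # a single pooled figure would misstate both products
```

The design choice here is isolation: the new product's data is still captured and usable, but it cannot shift the estimate for the mature product.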
Machine Learning Futures Fueled by Data
Machine learning has proven to deliver significant benefits in specialized areas. This will undoubtedly continue as it progresses towards general business problems. While delivering seemingly magical results, machine learning is not magic. You must put in the effort, as there are few examples of out-of-the-box ML platforms. Unless your business is very generic and produces and uses data in exactly the same way as the data a model was trained on, these systems need to be trained on and tailored to your data.
The key to benefiting from this burgeoning world of AI is to acknowledge the significant role that good data science principles play within any ML-based project and to ensure that these principles are a cornerstone of your efforts.
Greg Council is Vice President of Marketing and Product Management at Parascript. Greg has over 20 years of experience in solution development and marketing within the information management market. This includes search, content management and data capture for both on-premise solutions and SaaS.