Predictive analyses using big data can produce strategic insights that improve an organization's effectiveness, efficiency, precision, and value. Business decisions based on predictive analysis rest on the assumption that the underlying data is good, the analysis is competent, and the data modeling and statistical techniques are accurate. An organization can have confidence in data-driven decisions, and in the strategies that follow from them, only if those decisions are based on sound data modeling practices.

If your organization is embarking on a predictive modeling project, be aware of five areas of data modeling that are prone to error and oversight. The first two concern project preparation: they demand substantial time and collaboration, which makes them especially tempting places to cut corners. The last three concern model-building techniques, the core areas where skilled analysts must get things right to produce good models.

  1. Not Understanding the Business Use Case

In the data analysis workflow, the first step is to define the business objective precisely in order to determine what needs to be analyzed. Without a thorough understanding of the business context and needs, it will be difficult to build a model that generates relevant insights or actionable recommendations, even if the model is technically sound. Analysts risk skipping this step because of time pressure, overconfidence in previous models, or poor internal communication. It is critical to take the time to research, collaborate, and interact with the relevant business domain experts to learn which relationships in the data are meaningful, which are not, and which variables are worth modeling. With a full understanding of the business case, an analyst can best determine what data might be predictive, choose accurate and precise metrics, and match the data to the correct dependent variable. Collaboration with business stakeholders also works best as an ongoing, iterative process rather than a one-time exercise.

A simple example of understanding the business use case involves foundations and fundraising. A model built to predict future giving will look very different for donors with a giving history than for people who have never given before. An analyst should therefore work with fundraising experts to precisely define the larger target population and the desired outcomes, so that the model scores results accurately for that population.

  2. Lack of Data Integrity

The primary goal of data collection for analysis is data integrity: without enough trustworthy data, even a top-notch model is useless. Analysts should take the time to ensure that data is complete, correct, and timely. Common data collection errors include not having enough data, using data unrelated to the business case, failing to remove anomalies, failing to identify duplicate records, failing to handle missing values, and scoring a model on data whose attributes differ from those used to build it.
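
A few of these checks are easy to automate before any modeling begins. The sketch below assumes a pandas DataFrame loaded from a hypothetical donations.csv; the column names are illustrative, not from a real schema:

```python
import pandas as pd

# Hypothetical dataset; file and column names are illustrative assumptions.
donations = pd.read_csv("donations.csv", parse_dates=["gift_date"])

# Completeness: count missing values per column.
print(donations.isna().sum())

# Duplicates: flag rows repeating a combination that should be unique.
dupes = donations[donations.duplicated(subset=["donor_id", "gift_date"], keep=False)]
print(f"{len(dupes)} potentially duplicated rows")

# Anomalies: simple range check on a numeric attribute.
outliers = donations[donations["gift_amount"] <= 0]
print(f"{len(outliers)} non-positive gift amounts to investigate")

# Timeliness: confirm the data covers the period the model will score.
print(donations["gift_date"].min(), donations["gift_date"].max())
```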

When analysts are in a hurry to start building a model, they may barely engage in data exploration. They might miss small "data silos," such as spreadsheets and memos. They might also assume that business staff know what data is needed, when in fact business users rarely know which data is predictive. Analysts should therefore communicate with them frequently to seek out the appropriate data. Time spent upfront validating data values and ensuring completeness, consistency, and reliability saves a great deal of time later in the modeling process.

  3. Too Many or Too Few Variables

Generally speaking, an analyst should consider as many variables and observations as possible. An analysis with too few variables may be limited, missing key predictors and hidden variables, so it is important to consider every variable unique to the business case. On the other hand, failing to scope the project around what is achievable can lead to too much data, duplication, and too many factors. Whatever scope an analyst chooses should have the potential to add value above the cost of developing the model.
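
One lightweight way to trim a bloated variable list is to drop near-constant columns and flag highly correlated pairs as likely duplicates. The sketch below is one possible approach, assuming scikit-learn and a numeric pandas feature matrix; the thresholds are illustrative choices, not fixed rules:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def prune_features(X: pd.DataFrame, corr_cutoff: float = 0.95) -> pd.DataFrame:
    # Drop near-constant columns, which carry little predictive signal.
    vt = VarianceThreshold(threshold=1e-4)
    vt.fit(X)
    X = X.loc[:, vt.get_support()]

    # Flag one column from each highly correlated pair as a duplicate candidate.
    corr = X.corr().abs()
    cols = corr.columns
    drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > corr_cutoff:
                drop.add(cols[j])
    return X.drop(columns=sorted(drop))
```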

A project will often have its own built-in noise, which makes a realistic scope even more important. In AI domains such as image recognition, every algorithm carries a margin of error. If an analyst is predicting who will purchase a certain product, the model's margin of error may not matter much; if they are identifying something health related, like cancer, the margin of error becomes extremely important. An analyst must understand this challenge, create a strategy for handling such factors, and have a way to test that strategy.

  4. Imbalance of Bias and Variance

Achieving a balance between bias and variance is key to building a good predictive model. Analysts must balance simplicity against complexity by managing the trade-off between bias, which results from oversimplifying the relationships between data points, and variance, which results from excess complexity. Analysts often intuitively strive to remove bias from their models while ignoring the consequences of too much variance; a model that is too complex and flexible may have such high variance that it cannot be trusted. The goal is to manage bias and variance together so that the total error stays as low as possible. One solution is to train and test the model so as to minimize the expected mean squared error, choosing a model with both low variance and low bias.
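
In symbolic form, the expected mean squared error of a prediction decomposes into squared bias, variance, and irreducible noise, so driving one term down at the expense of the other can still raise the total:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2
$$

Here $\sigma^2$ is the noise inherent in the data; only the bias and variance terms are under the analyst's control.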

  5. Model Testing and Evaluation Errors

Training and testing a model is necessary both to build confidence and to encourage later experimentation. Because this can be a challenging process, analysts may cut corners or skip checking model accuracy altogether. Sometimes analysts use all of the data to build a model and keep no test sample aside to validate it. Instead, data should be split into two subsets: a larger one for training the model and a smaller, random one for testing it. The trained model should then be run against the random test data to understand how it performs.
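
As a minimal sketch of that split-train-validate workflow, assuming scikit-learn and one of its bundled datasets (the classifier choice is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out a random 25% test sample before any training happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)  # train only on the larger subset

# Validate on data the model has never seen.
print(f"held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```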

A common error is to draw the test data from a different distribution than the one used for the training set. The most thorough and effective way to test a model is through cross-validation, which uses each data point for both training and testing. But even here mistakes can be made, for example, by training a time series model on data from a later time period than the data collected for the validation set. This makes no sense, because an organization would not have future information when trying to predict a real-time result.
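
A minimal sketch of cross-validation on time-ordered data, using scikit-learn's TimeSeriesSplit so that each validation fold is strictly later than the folds used for training (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))  # synthetic, time-ordered features
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=300)

# Each split trains only on earlier observations and validates on later ones,
# so the model never "sees the future" during training.
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(Ridge(), X, y, cv=tscv, scoring="neg_mean_squared_error")
print("per-fold MSE:", -scores.round(3))
```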

Another common error is using the wrong evaluation metric for the business case. An analyst might habitually use the model's accuracy as the metric, when what the business actually needs is a measure tied to business outcomes, which requires a more subjective or customized metric. Customizing a metric takes time and can require iteration, so many analysts are tempted to avoid it.
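
As an illustrative sketch, scikit-learn's make_scorer can wrap a customized metric; the cost weights below are hypothetical assumptions, standing in for whatever the business case dictates:

```python
from sklearn.metrics import confusion_matrix, make_scorer

def business_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    # Hypothetical cost weights: a missed positive hurts 5x more than a false alarm.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn * fn_cost + fp * fp_cost

# greater_is_better=False tells scikit-learn to minimize this cost.
cost_scorer = make_scorer(business_cost, greater_is_better=False)
# Usable anywhere a scorer is accepted, e.g.
# cross_val_score(model, X, y, scoring=cost_scorer)
```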

Conclusion

Mining data to build predictive models provides exciting business benefits and invaluable insights when done correctly. Analysts need to keep in mind that businesses invest substantial resources in developing models and implementing the recommendations they generate; a sub-par model, if implemented, could cause an organization serious setbacks. Knowing how to build models with the latest techniques and algorithms is not enough. Analysts should collaborate with business stakeholders, track their experiments, identify and learn from modeling mistakes, and avoid those mistakes in future iterations and projects.