Not Enough Data To Do Machine Learning? Think Again
We have all reacted like this at some point when looking at the data available before a data science project:
I wish I could develop statistical models and AI, but I lack data :(
One of the most important issues that companies and Data Scientists need to overcome is the lack of data. This is a serious problem, as it is almost impossible to deliver serious analytics or statistical modelling capabilities that enable companies to solve their problems. This is especially true now that Deep Learning lets companies untangle complex and non-linear data patterns in order to solve problems that are hard to crack with other techniques.
Most blogs and courses will explain that machine learning needs a lot of data, but what does this mean? Also, how come we often lack data?
What is a ‘lack of data’?
It is impossible to give an exact minimum amount of data required, but several parameters can help you estimate it:
- Type of problem: To simplify a lot, text, image and video problems usually require more data. Once again, many other factors come into play when estimating.
- Number of categories to be predicted: What is the expected output of your model? In Data Science we call this the ‘label’ (the dependent variable in statistics). The fewer categories, the better. For example, if you plan on building a sentiment analysis model but do not have enough data to support 7 levels of satisfaction, you should consider using 2 categories instead (see the first sketch after these lists).
- Expected model performance: If your goal is a decent model that generates a moderate impact, you might be fine with relatively little data. However, if you plan on getting a product into production, you will need much more.
- The overall complexity of the model: A general rule of thumb is that a more complex problem requires more observations.
Here are some examples of complexities:
- Many variables (or features) are needed to generate your model. If this is the case, you will need more data. A rule of thumb in Deep Learning is that you need at least 10x as many observations as features. This is a minimum; in practice it usually has to be far more.
- There are many relationships between your features (some features are correlated, for example). Having highly correlated features (positively or negatively) is usually not a good idea. The second sketch after this list shows a quick check for this and for the previous rule of thumb.
- The label is not easily explained by the features (you see no obvious patterns). Although this might be discouraging, trying different techniques along with more data might still help you achieve great results.
- You have imbalanced data. If you try to predict anomalies, the anomalies themselves will make up only a tiny share of your observations, so you might need a lot of data overall. More importantly, you have to adjust your data science techniques accordingly.
- You have several edge cases to take into consideration. Even worse than dealing with anomalies is dealing with multiple types of anomalies!
- Not everything is categorized. If you want to, say, categorize support ticket topics using tickets from the last 5 years, you need every one of those tickets to already have a category assigned.
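As a quick illustration of the point about reducing the number of categories, here is a minimal pandas sketch, assuming a 1-to-7 satisfaction score; the column name, the threshold and the data are made up for the example:

```python
import pandas as pd

# Hypothetical survey answers with a 1-7 satisfaction score
df = pd.DataFrame({"satisfaction": [1, 3, 4, 6, 7, 2, 5]})

# Collapse 7 fine-grained levels into 2 broad classes:
# 5 and above counts as 'satisfied', everything else as 'unsatisfied'
df["label"] = (df["satisfaction"] >= 5).map({True: "satisfied", False: "unsatisfied"})

print(df["label"].value_counts())
```

With only two classes, each class gets far more examples, which is often the difference between a usable model and an unusable one.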
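The observations-per-feature rule of thumb and the correlation issue are also easy to sanity-check. Below is a rough sketch, assuming your features live in a numeric pandas DataFrame; the 10x figure is just the rule of thumb above and the 0.9 threshold is an arbitrary choice:

```python
import pandas as pd

def quick_checks(df: pd.DataFrame, corr_threshold: float = 0.9) -> None:
    """Rough pre-modelling sanity checks; not a substitute for real feature analysis."""
    n_obs, n_features = df.shape
    print(f"{n_obs} observations for {n_features} features "
          f"(~{n_obs / n_features:.0f} per feature; rule of thumb: at least 10)")

    # Flag feature pairs that are strongly correlated, positively or negatively
    corr = df.corr(numeric_only=True).abs()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= corr_threshold:
                print(f"Highly correlated pair: {a} / {b} (|r| = {corr.loc[a, b]:.2f})")
```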
Can it be worse?
Of course it can. On top of the data quantity issue, there are classic, well-known pitfalls companies usually fall into. In other words, even the data you do have might not be as ‘usable’ as you may think.
Classic pitfalls:
- Your data is inconsistent: you might have exceptions and errors in your data. It is important to exclude data that will not help your model, as keeping it will only make your model more complex.
- You have many empty values in your data. Classic example: you added a new field to a form 2 months ago. If you need to retrieve a year of data, that variable will be blank for most of it. You do not have to systematically ignore this field just because it is new! We will cover that.
- You have duplicates in your data. That’s a problem: the same customer fills out a form using several different email addresses. Both of the last two pitfalls are cheap to detect; see the sketch after this list.
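A minimal pandas sketch of those two checks; the DataFrame and its columns are invented for the example:

```python
import pandas as pd

# Hypothetical form submissions: the same customer used two different emails,
# and 'new_field' was only added to the form recently
df = pd.DataFrame({
    "customer":  ["Alice", "Bob", "Alice", "Carol"],
    "email":     ["a@x.com", "b@y.com", "alice@y.com", "c@z.com"],
    "new_field": [None, "yes", None, "no"],
})

# How many values are missing per column?
print(df.isna().sum())

# Exact duplicate rows are easy to flag...
print("exact duplicate rows:", df.duplicated().sum())

# ...but the 'same customer, different email' case needs a key other than email
print("duplicate customers:", df.duplicated(subset="customer").sum())
```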
What can you do about this?
First of all, please try to get as much data as possible by developing your external and internal tools with data collection in mind.
Then, let’s not panic and instead explore options to make the most out of the data available! In the next article, we will investigate these solutions in more detail.
Methods to overcome lack of data
We will discuss how to resolve most of these issues in the next article. At a high level, here are the techniques I will present (with a couple of tiny preview sketches after the list):
- Data simulation. We will see how to create new data that looks very similar to the observed data. Even if this does not provide new information, it can make machine learning algorithms usable in the first place!
- Dealing with imbalanced data. If you need to analyze anomalies but lack observations of them, there are ways to adjust your analysis so that those anomalies are taken more into consideration.
- Semi-supervised learning. This technique lets you build solid statistical models even when not all observations are categorized (as in the support ticket topics example above).
- Data imputation. This is a family of techniques for replacing missing data (as in the example above of a recently added field) with predicted values.
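As a tiny preview of the data simulation idea, here is a deliberately naive sketch: fit a normal distribution to an observed numeric variable and draw new synthetic values from it. The variable and its values are made up, and real simulations put far more care into choosing the distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed values of some numeric variable (hypothetical)
observed = np.array([12.1, 9.8, 11.3, 10.5, 13.0, 9.2, 10.9])

# Fit a simple parametric model (here: a normal distribution)...
mu, sigma = observed.mean(), observed.std()

# ...and draw synthetic observations that 'look like' the real ones
simulated = rng.normal(mu, sigma, size=100)
print(simulated[:5])
```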
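And an equally minimal sketch of the imbalanced-data and imputation ideas with scikit-learn; the arrays are toy data, and the next article covers the proper techniques:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Toy feature matrix with missing values, and a label where class 1 is rare
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan],
              [5.0, 6.0], [1.5, 2.5], [4.5, 5.5]])
y = np.array([0, 0, 0, 0, 0, 1])

# Data imputation: replace each missing value with the column median
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Imbalanced data: weight the rare class more heavily during training
model = LogisticRegression(class_weight="balanced").fit(X_imputed, y)
print(model.predict(X_imputed))
```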