This is the third in a series of posts on the types of bias that can affect AI systems. In the previous post we talked about bias in the algorithms themselves. With this post the series pivots to bias in the AI system’s training data.
If you’ve ever watched election coverage, you already have some understanding of sample bias. When opponents argue about each other’s election predictions, the discussion invariably comes down to how well each side sampled the universe of voters. Choose a sample that does not represent the electorate, and your predictions will probably miss the mark.

Sample bias in machine learning is no different. In AI domains like computer vision, natural language processing and entity resolution, it isn’t possible, or even necessary, to expose an algorithm to its entire universe in order to train it. A well-chosen sample of that universe is adequate, more affordable and more practical.
But then comes the challenge facing every experimental researcher: how to choose a sample that accurately captures and represents the larger group.
Imagine a facial recognition project involving photographs. If the universe of photographs the production algorithm will be exposed to has...
- everything from very high to very low resolution photos
- photos taken from every conceivable angle
- faces captured head-on, in profile and at every angle in between
- photos taken under a variety of lighting conditions and light sources
...and the photos used to train the algorithm have...
- uniformly high resolution
- fixed poses
- a single, full-on facial angle
- uniform lighting
...the algorithm will in no way be prepared to do its job. This is an example of sample bias.
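One practical way to catch a mismatch like this, assuming each photo can be tagged with basic metadata, is to compare how an attribute is distributed in the training sample versus a sample of production photos. The sketch below is illustrative only: the column name resolution_tier and the toy data are assumptions, not details from a real project.

```python
# Illustrative audit of training vs. production photo attributes.
# Assumes hypothetical metadata tables with a "resolution_tier" column.
import pandas as pd

def attribute_mismatch(train_meta: pd.DataFrame,
                       prod_meta: pd.DataFrame,
                       attribute: str) -> pd.DataFrame:
    """Compare the share of each attribute value in training vs. production."""
    train_share = train_meta[attribute].value_counts(normalize=True)
    prod_share = prod_meta[attribute].value_counts(normalize=True)
    report = pd.DataFrame({"train_share": train_share,
                           "production_share": prod_share}).fillna(0.0)
    report["gap"] = report["train_share"] - report["production_share"]
    return report.sort_values("gap", ascending=False)

# Toy example: the training photos are uniformly high resolution, while
# production photos span the full range described above.
train_meta = pd.DataFrame({"resolution_tier": ["high"] * 100})
prod_meta = pd.DataFrame({"resolution_tier": ["high"] * 40 +
                                             ["medium"] * 35 +
                                             ["low"] * 25})
print(attribute_mismatch(train_meta, prod_meta, "resolution_tier"))
```

Large gaps in the report are a warning sign that the training sample does not represent the universe the production system will face.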
In another illustration, one often shared among data scientists, an ML algorithm needed to learn to distinguish huskies from wolves. As it happened, the photos of huskies it was shown were shot predominantly against grassy backgrounds, while the photos of wolves overwhelmingly had snowy backgrounds.
By the end of the training period, the algorithm had seemingly mastered identifying each animal. However, as one reviewer noted, “It didn’t learn the differences between dogs and wolves, but instead learned that wolves were on snow in their pictures and dogs were on grass. It learned to differentiate the two animals by looking at snow and grass. Obviously, the network learned incorrectly. What if the dog was on snow and the wolf was on grass? Then, it would be wrong.”
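A simple probe for this kind of shortcut learning, assuming you can assemble even a small counter-example set (huskies photographed on snow, wolves on grass), is to measure how accuracy changes on those images. The model and variable names in this sketch are placeholders, not part of the original project.

```python
# Hypothetical check for background "shortcut" learning: if accuracy collapses
# on counter-example photos, the model is likely keying on the background.
from typing import Callable, Sequence

def counterexample_accuracy(predict: Callable[[object], str],
                            images: Sequence[object],
                            labels: Sequence[str]) -> float:
    """Accuracy of a (hypothetical) predict() function on a set of photos."""
    correct = sum(predict(img) == lbl for img, lbl in zip(images, labels))
    return correct / len(labels)

# Usage sketch -- model, normal_images and swapped_bg_images are placeholders:
# acc_normal  = counterexample_accuracy(model.predict_one, normal_images, normal_labels)
# acc_swapped = counterexample_accuracy(model.predict_one, swapped_bg_images, swapped_bg_labels)
# A large drop from acc_normal to acc_swapped suggests the background, not the
# animal, is driving the predictions.
```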
Sample bias is a serious problem, but it can be mitigated. Social science researchers, for example, receive intensive training in sample definition and selection. Those techniques can be brought into ML projects either by training data science teams in them or by adding social scientists to those teams.
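One of those techniques is stratified sampling: instead of grabbing whatever photos are convenient, you sample within each stratum (lighting condition, pose, resolution tier and so on) so the training set mirrors the mix found in production. A minimal sketch, assuming a hypothetical metadata table with a lighting column:

```python
# Minimal stratified-sampling sketch; the "lighting" column and the 50/30/20
# mix are assumptions made up for illustration.
import pandas as pd

# Toy catalogue of candidate training photos.
photos = pd.DataFrame({
    "photo_id": range(1000),
    "lighting": ["daylight"] * 500 + ["indoor"] * 300 + ["low_light"] * 200,
})

# Draw 10% of the photos from *each* lighting stratum, so the training sample
# preserves the overall mix rather than whatever a convenience sample happens
# to contain.
train_sample = (photos
                .groupby("lighting")
                .sample(frac=0.10, random_state=42))

print(train_sample["lighting"].value_counts())  # ~50 daylight, 30 indoor, 20 low_light
```

The same idea extends to any attribute you can measure about the production universe.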
In the next post we’ll talk about another form of training data bias.
If you can’t possibly wait for the next installment to learn about all 4 types of AI bias, you can get the full white paper now.