Let’s tie a bow on this thing. To review, we’ve talked about the model, sampling, and prejudice. The final type of bias we will discuss is the most fundamental. The other posts on data bias assumed that the data, with or without biased content, was accurately captured. This final post is about distortion that stems from how the data is collected or created.
If the images your algorithm is learning from were shot with a lens with a color filter, all of those images will be identically skewed. If the ruler used to determine the dimensions of data elements was ¼ inch short of a foot, all of the dimensions will be wrong by an identical proportion.
This is measurement bias, also known as systematic value distortion.
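To make the ruler example concrete, here is a minimal Python sketch. It assumes the ruler is physically 11.75 inches long but marked as a full 12, so every reading is inflated by the same factor. The helper name `measure` and the sample lengths are made up for illustration.

```python
# Assumption: the ruler is physically 11.75 in long but its markings claim 12 in.
TRUE_RULER_LENGTH_IN = 11.75
NOMINAL_RULER_LENGTH_IN = 12.0

# Every reading is scaled by the same factor, regardless of the object's size.
SCALE = NOMINAL_RULER_LENGTH_IN / TRUE_RULER_LENGTH_IN


def measure(true_length_in: float) -> float:
    """Return the length the short ruler would report (hypothetical helper)."""
    return true_length_in * SCALE


true_lengths = [6.0, 24.0, 120.0]  # made-up object sizes, in inches
for true_length in true_lengths:
    reported = measure(true_length)
    relative_error = (reported - true_length) / true_length
    print(f"true={true_length:7.2f}  reported={reported:7.2f}  error={relative_error:.2%}")
```

Under these assumptions the relative error is the same for every object, roughly 2.1 percent. Collecting more data with the same ruler does not average this kind of bias away, because every measurement is off in the same direction.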
Leading questions can introduce measurement bias as well. Ask “How much will prices go up next year?” and every response will assume that prices are going up, regardless of what actually happens to pricing. If instead you ask, “How will pricing be affected next year?” you’re likely to get more accurate responses.
As with sample bias, there are established techniques for detecting and mitigating measurement bias. It’s good practice to compare the outputs of different measuring devices, for example. Survey design has well-understood practices for avoiding systematic distortion. And it’s essential to train labeling and annotation workers before they are put to work on real data.
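As one way to picture the device comparison mentioned above, here is a rough Python sketch. It assumes two devices have measured the same set of objects and looks at the paired differences between their readings. The readings and the function name `paired_offset` are invented for illustration, not taken from any real dataset or library.

```python
from statistics import mean, stdev

# Hypothetical readings of the same five objects from two devices (made-up data).
device_a = [10.02, 20.01, 29.98, 40.03, 50.00]
device_b = [10.51, 20.49, 30.52, 40.50, 50.48]


def paired_offset(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return the mean and sample standard deviation of the differences b - a."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length series with at least two readings")
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs), stdev(diffs)


offset, spread = paired_offset(device_a, device_b)
print(f"mean difference: {offset:.3f}, spread: {spread:.3f}")

# A mean difference that is large relative to its spread hints at a systematic
# offset between the devices. It does not say which device is wrong.
```

A check like this only shows that the two devices disagree in a consistent way. Deciding which one is distorted still takes a trusted reference measurement.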