Data Is Not Neutral

I like data-driven processes. You’ve seen me drum on about CRS and Full-Waveform Inversion. Sometimes we like to pretend that data is pure. Sure, it’s noisy and full of kinks, but if we process it just right, we extract the truth.

Turns out that is a little bit wrong. The data collection process is often biased. I have put numbers on it in the article introducing bias for geoscientists. Yet we often don’t put numbers on those uncertainties. Sometimes we can’t.

Since deep learning is data-driven, we have to think about bias. There are some very unfortunate examples. One branch of machine learning is natural language processing (NLP). One awesome way to explore your data space uses analogies. Matt Hall has a funny example for geoscience.

A very large data set from the NLP community has a troubling analogy in it. When you ask “Man : Doctor :: Woman : ?” it will give back “nurse”. I’d say this shows a certain bias in the underlying body of text.
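Here is a minimal sketch of how such an analogy query is commonly answered, assuming the vector-offset approach on word embeddings. The three-dimensional vectors below are hand-made for illustration and come from no real data set; the bias is deliberately baked in to show how it surfaces in the answer. The names `EMBEDDINGS`, `cosine` and `analogy` are my own.

```python
import math

# Hypothetical toy embeddings, hand-made for illustration only.
# Axes (by construction): gender, medical profession, seniority.
EMBEDDINGS = {
    "man": (1.0, 0.0, 0.0),
    "woman": (-1.0, 0.0, 0.0),
    "doctor": (0.6, 1.0, 0.5),   # leans towards "man" in this toy corpus
    "nurse": (-0.6, 1.0, 0.4),   # leans towards "woman" in this toy corpus
    "rock": (0.0, -1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings):
    """Answer 'a : b :: c : ?' by finding the word closest to b - a + c."""
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


answer = analogy("man", "doctor", "woman", EMBEDDINGS)
print(answer)
```

The arithmetic itself is neutral. The answer only reflects whatever associations the training text put into the vectors.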

Disclaimer: I will write about racist bias here; I am describing a phenomenon.
            English is my second language, so if I make a blunder in my wording, please let me know.
            I am writing this with the best of intentions, but definitely tell me if I make a mistake.

Sometimes those biases are well hidden. In its earlier days, HP had the problem that its face-tracking software did not detect people of color. Some motion detectors had the same problem. This goes back to calibration curves from the first color films, whose parameters were intentionally fitted to optimize for “fair skin”. Targeting white customers and inheriting these calibrations has cost HP dearly.

Google still keeps the detection of gorillas disabled in its object detection algorithm, because the algorithm confused people of color with gorillas.

The Link to Geoscience

This is easy to answer: I don’t know. However, a catastrophic social misclassification and the resulting backlash will probably cost a software vendor about as much as a dry well costs us. I am painting a glum picture here, but some misclassifications may harm people in our industry, just like an over-pressured zone that was not identified.

We bias data starting with the acquisition. Acquisition geometry and tools leave their imprint on our data from the start. How do we make our algorithms robust against the acquisition footprint of seismic grid lines? There is always a human component in data collection. How do we deal with missing data? Some data will be acquired by staff who also know what makes processing easier. Often that is not the case.

Unsupervised and semi-supervised machine learning are great topics of research, but most of the great successes have been made in supervised machine learning. That means we have to trust the labels or account for mislabeled data. In the deep learning community this is known as “label noise”, and it is one reason why most scores will not reach 100% accuracy.
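To make the effect of label noise on scores concrete, here is a small simulation sketch. It assumes a binary problem, a hypothetical “perfect” model that always predicts the true class, and labels that are flipped at random with a fixed probability. The function name `measured_accuracy` and the 10% noise rate are illustrative choices, not values from a real data set.

```python
import random


def measured_accuracy(n_samples, noise_rate, seed=0):
    """Score a perfect model against labels corrupted by random flips."""
    rng = random.Random(seed)
    true_labels = [rng.randint(0, 1) for _ in range(n_samples)]
    # Flip each label with probability `noise_rate` to mimic labeling mistakes.
    noisy_labels = [
        1 - y if rng.random() < noise_rate else y for y in true_labels
    ]
    # The hypothetical perfect model predicts the true label every time.
    predictions = true_labels
    hits = sum(p == y for p, y in zip(predictions, noisy_labels))
    return hits / n_samples


acc = measured_accuracy(100_000, noise_rate=0.1)
print(f"accuracy against noisy labels: {acc:.3f}")
```

Even a flawless model scores only about one minus the noise rate here, because it is graded against labels that are themselves wrong.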

Culture Neutral — or is it? // CC-BY Karl-Ludwig Poggemann

We often hear that geologic interpretation is an art. That is true: you never interpret all the features you see, only those you deem relevant. Therefore, expert opinion naturally influences the interpretation. Expert opinions always mean bias.

Here’s a problem you see a lot in the field, and a point many professors have tried to drive home in classes I attended. You stand in front of an outcrop and they ask what you see. This is the data collection phase. Often students will go into the interpretation phase right away. Instead of describing the data points they see, many will directly describe the depositional or tectonic environment. Although professors mention this in the field, I have not seen many students who “get” this point. Biasing the data collection right away may render the entire field trip useless, because the data is skewed toward the narrative the field scientist preferred.

We see that geoscience often deals with biased data and also has to deal with biased interpretation. It is up to us researchers to build models that are robust to these biases.

Jesper Dramsch

... is a geophysicist by heart. He works at the intersection of machine learning and geoscience. He is the founder of The Way of the Geophysicist and a deep learning enthusiast. Writing mostly about computational geoscience and interesting bits and pieces relevant to post-grad life.