Culture Neutral - CC-BY Karl-Ludwig Poggemann

Data Is Not Neutral

 

I like data-driven processes. You’ve seen me drum on about CRS and Full-Waveform Inversion. Sometimes we like to pretend that data is pure. Well, it’s noisy and full of kinks, but if we process it just right, we extract the truth.

Turns out that is a little bit wrong. The data collection process is often biased. I have put numbers on it in the article introducing bias for geoscientists. Yet we often don’t put numbers on those uncertainties. Sometimes we can’t.

Since deep learning is data-driven, we have to think about bias. There are some very unfortunate examples. One aspect of machine learning is natural language processing (NLP). One awesome application for exploring your data space uses analogies. Matt Hall has a funny example for geoscience.

A very large data set from the NLP community has a troubling analogy in it. When you ask “Man : Doctor :: Woman : ?” it will give back “nurse”. I’d say this shows a certain bias in the text corpus.
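To make the analogy query concrete, here is a minimal sketch. It assumes the analogy is answered with word vectors and vector arithmetic (the answer to “a : b :: c : ?” is the word closest to b - a + c), which is how such queries are commonly posed. The three-dimensional vectors below are hand-made and purely hypothetical, with a gender lean deliberately baked into “doctor” and “nurse”; they are not taken from any real data set.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# Assumed axes: [gender lean, medical, seniority]
TOY_VECTORS = {
    "man":     [1.0, 0.0, 0.0],
    "woman":   [-1.0, 0.0, 0.0],
    "doctor":  [0.6, 1.0, 0.5],   # a biased corpus pulls "doctor" toward "man"
    "nurse":   [-0.6, 1.0, 0.3],  # ... and "nurse" toward "woman"
    "surgeon": [0.6, 1.0, 0.8],
    "rock":    [0.0, -1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Answer 'a : b :: c : ?' with the word closest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The query words themselves are excluded from the candidates.
    scores = {word: cosine(target, vec)
              for word, vec in vectors.items()
              if word not in (a, b, c)}
    return max(scores, key=scores.get)


answer = analogy("man", "doctor", "woman", TOY_VECTORS)
print(answer)
```

With these made-up vectors the query returns “nurse”, simply because the gender lean was put into the vectors to begin with. The arithmetic is neutral; it just reports whatever structure the text left in the embedding.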

Sometimes those biases are well hidden. HP in its earlier days had a problem where people of color were not identified by its face-tracking software. Some motion detectors had the same problem. This goes back to calibration curves from the first color films, which were intentionally fitted to optimize for “fair skin”. Targeting white customers and inheriting these calibrations has cost HP dearly.

Google still keeps the detection of gorillas in its object detection algorithm disabled, because it confused people of color with gorillas.

The Link to Geoscience

This is easy to answer: I don’t know. However, a catastrophic social misclassification and the backlash that follows will probably cost a software vendor about as much as a dry well. I am painting a glum picture here, but some misclassifications may harm people in our industry, just like an over-pressured zone that was not identified.

We bias data starting with the acquisition. Acquisition geometry and tools leave an imprint on our data from the start. How do we make our algorithms robust against the seismic gridline acquisition footprint? There is always a human component in data collection. How do we deal with missing data? Some data will be acquired by staff who also know what makes processing easier. Often that is not the case.

Unsupervised and semi-supervised machine learning are great topics of research, but most of the big successes have come from supervised machine learning. That means we have to trust the labels or account for mislabeled data. In the deep learning community this is known as “label noise”, and it is one reason why most scores will not reach 100% accuracy.
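A small simulation shows why mislabeled data caps the score. This is only a sketch under stated assumptions: binary labels, a hypothetical 10% of them flipped at random, and an imaginary classifier that always predicts the true class. The function names and the noise rate are made up for illustration.

```python
import random


def flip_labels(labels, noise_rate, rng):
    """Flip each binary label (0 or 1) with probability noise_rate."""
    return [1 - y if rng.random() < noise_rate else y for y in labels]


def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Ground truth that nobody gets to see in practice.
true_labels = [rng.randint(0, 1) for _ in range(10_000)]

# What the interpreter wrote down: 10% of labels are wrong (assumed rate).
noisy_labels = flip_labels(true_labels, noise_rate=0.1, rng=rng)

# An imaginary classifier that always predicts the true class.
perfect_predictions = list(true_labels)

score_vs_truth = accuracy(perfect_predictions, true_labels)
score_vs_noisy = accuracy(perfect_predictions, noisy_labels)
print(f"accuracy against true labels:  {score_vs_truth:.3f}")
print(f"accuracy against noisy labels: {score_vs_noisy:.3f}")
```

Even this flawless classifier scores only around 90% when it is graded against the noisy labels, because the grading itself is wrong about one time in ten. A model that scores higher than that on such labels has most likely started to learn the labeling mistakes.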


Culture Neutral — or is it? // CC-BY Karl-Ludwig Poggemann

We often hear that geologic interpretation is an art. That is true: you never interpret all the features you see, only those you deem relevant. Expert opinion therefore naturally influences the interpretation, and expert opinion always means bias.

Here’s a problem you see a lot in the field, and a point many professors tried to drive home in classes I took. You stand in front of an outcrop and they ask what you see. This is the data collection phase. Often students will go into the interpretation phase right away. Instead of describing the data points they see, many will directly describe the depositional or tectonic environment. Although professors mention this in the field, I have not seen many students who “get” this point. Biasing the data collection right away may render the entire field trip useless, as the data is too biased and drives whichever narrative the field scientist preferred.

We see that geoscience often deals with biased data and has to deal with biased interpretation. It is up to us researchers to build models that are robust to these biases.

 

... is a geophysicist by heart. He works at the intersection of machine learning and geoscience. He is the founder of The Way of the Geophysicist and a deep learning enthusiast. Writing mostly about computational geoscience and interesting bits and pieces relevant to post-grad life.
Posted in Easy Reading.