I like data-driven processes. You’ve seen me drum on about CRS and Full-Waveform Inversion. Sometimes we like to pretend that data is pure. Sure, it’s noisy and full of kinks, but if we process it just right, we extract the truth.
Turns out that is a little bit wrong. The data collection process is often biased. I have put numbers on it in the article introducing bias for Geoscientists. Yet we often don’t put numbers on those uncertainties. Sometimes we can’t.
Since deep learning is data-driven, we have to think about bias. There are some very unfortunate examples. One aspect of machine learning is natural language processing (NLP). One awesome NLP application for exploring your data space uses analogies. Matt Hall has a funny example for geoscience:
Using word vectors to make analogies is fun!
small : smaller :: large : larger
rock : geologist :: animal : biologist
geologist : rain :: doctor : pain
geologist : hammer :: geophysicist : curtain
geology : computer :: geophysics : computers
science : fun :: programming : funny
— Matt Hall (@kwinkunks) February 10, 2018
A very large data set from the NLP community has a troubling analogy in it. When you ask “Man : Doctor :: Woman”, it will give back “nurse”. I’d say this shows a certain bias in the underlying text corpus.
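To make the mechanics concrete, here is a minimal sketch of how such an analogy query is usually answered with word vectors: take the offset between two words, add it to a third, and look for the nearest remaining word. The two-dimensional vectors below are made up by hand for illustration (they are not taken from any real embedding), and the helper names `cosine` and `analogy` are my own. The point is that the "biased" answer is already baked into the vectors the model learned from the text.

```python
import numpy as np

# Hand-made toy embeddings (an assumption for illustration only).
# Axis 0 loosely encodes "gender", axis 1 loosely encodes "medical job".
# Note that "doctor" leans toward the "man" side: that is the bias in the data.
vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "doctor": np.array([0.8, 1.0]),
    "nurse": np.array([-0.8, 1.0]),
    "geologist": np.array([0.1, -1.0]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c, vectors):
    """Answer 'a : b :: c : ?' by nearest neighbour to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Exclude the query words themselves, as is common practice.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))


answer = analogy("man", "doctor", "woman", vectors)
print(answer)
```

The arithmetic itself is neutral. It only reports the geometry that the training text imprinted on the vectors.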
Disclaimer: I will write about racist bias here. I am describing a phenomenon, not endorsing it.
English is my second language, so if I make a blunder in formulation please let me know.
I am writing this with the best of intentions, but definitely tell me if I make a mistake.
Sometimes those biases are well hidden. HP in its earlier days had a problem where people of color were not identified by its face-tracking software. Some motion detectors had the same problem. This goes back to calibration curves from the first color films, which intentionally fit the parameters to optimize for “fair skin”. Targeting white customers and inheriting these calibrations has cost HP dearly.
Google still keeps the detection of gorillas in their object detection algorithm disabled, as it mistook people of color for gorillas.
Google Photos, y'all fucked up. My friend's not a gorilla. pic.twitter.com/SMkMCsNVX4
— jackyalciné's just about 67.9% into the IndieWeb. (@jackyalcine) June 29, 2015
The Link to Geoscience
This is easy to answer: I don’t know. However, a catastrophic social mis-classification by a software vendor, and the social backlash that follows, will probably cost about as much as a dry well. I am painting a glum picture here, but some mis-classifications may harm people in our industry, just like an over-pressured zone that was not identified.
We bias data starting with the acquisition. The acquisition geometry and tools leave their imprint on our data from the very start. How do we make our algorithm robust against the seismic gridline acquisition footprint? There is always a human component in data collection. How do we deal with missing data? Some data will be acquired by staff who also know what makes processing easier. Often that is not the case.
Although unsupervised and semi-supervised machine learning are great topics of research, most great successes have been made in supervised machine learning. That means we have to trust the labels or account for mislabeled data. In the deep learning community this is known as “label noise”, and it is one reason why most scores will not reach 100% accuracy.
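A small simulation illustrates why label noise caps the score. This is only a sketch under stated assumptions: a binary problem, a made-up 10% rate of randomly flipped labels, and a hypothetical "perfect" model that always predicts the true class. Even that model cannot score 100% when it is graded against the noisy labels.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100_000
noise_rate = 0.1  # assumed fraction of mislabeled samples

# The (normally unknown) true classes.
true_labels = rng.integers(0, 2, size=n_samples)

# Flip a random 10% of the labels to mimic interpreter mistakes.
flipped = rng.random(n_samples) < noise_rate
noisy_labels = np.where(flipped, 1 - true_labels, true_labels)

# A hypothetical perfect model predicts the true class every time.
predictions = true_labels

# Scored against the noisy labels, accuracy lands near 1 - noise_rate.
accuracy = float(np.mean(predictions == noisy_labels))
print(f"accuracy against noisy labels: {accuracy:.3f}")
```

In practice the noise is rarely this uniform. Interpreters tend to disagree on the hard cases, so the real effect can be worse than random flips suggest.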
We often hear that geologic interpretation is an art. That is true: you never interpret all the features you see, only those you deem relevant. Therefore, an expert opinion naturally influences the interpretation. Expert opinions always mean bias.
Here’s a problem you see a lot in the field, and a point many professors have tried to drive home in classes I was in. You stand in front of an outcrop and they ask what you see. It’s the data collection phase. Often students will go into the interpretation phase right away. Instead of describing the data points they see, many will directly describe the depositional or tectonic environment. Although professors will mention this in the field, I have not seen many students who “get” this point. Biasing the data collection right away may render the entire field trip useless, as the data is too biased and drives whatever narrative the field scientist preferred.
We see that geoscience often deals with biased data and also has to deal with biased interpretation. It’s up to us researchers to build models that are robust to these biases.