Getting started with the SEG Machine Learning contest

The Society for Exploration Geophysicists (SEG) started a contest to predict facies from wireline logs via machine learning.

Get your Buzzword Bingo cards ready, we’re about to dive deep. The October 2016 issue of The Leading Edge had a geophysics tutorial of a special kind. Brendon Hall explains how we can use machine learning to predict facies from borehole data. The paper is Open Access, so you can read it for free, without institutional access. The tutorials usually branch out to give you hands-on material to try for yourself. This time they turned it into a contest, so everyone can try out their own approach. As I have not worked much with borehole data, machine learning or Jupyter, I’d like to take you along on the journey of how I dug my way into the topic.

The Basic Tools

Jupyter

Hanging around (virtually) people like Matt Hall, Evan Bianco and Matteo Nicoli will inevitably expose you to the magical words “Jupyter Notebook”. You hear how it revolutionized their Python experience and makes sharing and developing code so much easier. To me it felt like this huge thing I would have to sink an incredible amount of time into before I understood it. Turns out, it isn’t. I installed Anaconda, which comes with a Jupyter installation. I started Jupyter with one click and that was it. I was a little disappointed at first, as a Jupyter notebook is basically glorified documentation of your code. You program in a “Jupyter notebook”, which has the file extension “.ipynb”, where you can mix your programming language of choice (from Python, Julia, C++, C#, Lua and Spark to esoteric languages like Brainfuck – see the list of kernels) and text formatted in Markdown.

Once you get into the notebooks and start tweaking your code, you will realize that it is not just glorified documentation. It’s truly interactive programming. Need to take a quick look at your dataframe? One line of code, click run and you see it nicely formatted. Want a 6-by-6 cross-correlation plot? Four lines, run, and it’s right there in your notebook, not even in another window. Change of colorbar? Sure thing. You only have to change that part of your notebook and run it; the prior parts of the notebook do not need to be rerun. It’s beautiful.
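To give you an idea, here is a minimal sketch of that kind of interactivity, assuming the contest data sits in a CSV file (the filename here is hypothetical, and you need pandas and seaborn installed):

```python
# A minimal sketch of the interactivity described above, assuming the contest
# data sits in a CSV file -- the filename is hypothetical.
import pandas as pd
import seaborn as sns

df = pd.read_csv("training_data.csv")  # hypothetical filename

# One line to peek at the dataframe, nicely formatted in the notebook output:
df.head()

# A few lines for a pairwise cross-plot of the numeric columns
# (sampling rows keeps the plot quick; drop .sample() for small data):
sns.pairplot(df.select_dtypes("number").sample(500, random_state=0))
```

Each of these can live in its own cell, so you can rerun just the plot without touching the data loading above it.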

Github

Github is a social platform built around the git version-control software. I have written extensively about it here. It’s getting better by the day, so go try it out! It has helped me so many times when I have “improken” my code.

For the contest, you “fork” the repository (repo) and code away.

TL;DR: Get Anaconda and Github

The Contest

In the paper, Brendon uses a Support Vector Machine (SVM), a machine learning algorithm that tries to find the boundaries that best separate the labelled classes in the data. Our data isn’t cleanly separable in its original dimensions, so instead of working in our usual three dimensions, SVMs map the data into a higher-dimensional space (a hyperspace) where a separating boundary is easier to find. It’s an interesting choice to start off the challenge: an SVM is intuitively understandable, which suits a geophysics tutorial, but most would expect it to be outperformed by other ML algorithms on this problem.
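For orientation, here is a minimal sketch of fitting an SVM classifier with scikit-learn. This is not the tutorial’s exact code; the filename and the “Facies” label column are assumptions based on the dataset description:

```python
# A minimal SVM sketch with scikit-learn -- not the tutorial's exact code.
# The filename and the "Facies" label column are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("training_data.csv")                    # hypothetical filename
X = df.drop(columns=["Facies"]).select_dtypes("number")  # numeric log features
y = df["Facies"]                                         # facies labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVMs are sensitive to feature scales, so standardize before fitting.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```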

The data set is available for the contest and should only be used in this context, as the licensing is not quite clear. But we get to play around with real data and maybe even win something nice; I count that as a win. Contributors are asked to attach a license to their contribution, so others know whether they can build on it to push the challenge further. (So if you license your effort under the GNU GPL, for example, you will have to be mentioned in the winning solution if it uses your code.)

You can find the paper here and the accompanying github repo here.

You should definitely take a look at Bohling and Dubois (2003) and Dubois et al. (2007); I did not fully understand the dataset until I was explicitly pointed to Dubois et al. (2007), where the NM_M and RELPOS features are explained in more detail.

Machine Learning

The task at hand is what is called a “classification problem”. We are trying to match a pre-defined label to a bunch of data. Since we already have some data with labels, we can do something called “supervised learning”. That means our algorithm will learn the relationship between the data and its labels and subsequently use the resulting model to predict labels for unlabeled data. Before we move on, take a look at this ingenious visualization of Machine Learning using R and d3.js – R2D3.

You’re back? Great. You should have a grasp on the concept of “training your model”, “decision trees” and “overfitting”. Those are the most important concepts we need to start out on our journey.
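To make the overfitting idea concrete, here is a small, contest-agnostic sketch: a decision tree that is allowed to grow deeper keeps getting better on the training data while the held-out score stops improving. The synthetic data and settings are purely illustrative:

```python
# Contest-agnostic illustration of overfitting: deeper decision trees keep
# improving on the training data while the held-out score stops improving.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Purely synthetic, illustrative data -- not the facies data set.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 5, 10, None):  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```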

Why would we even want to participate anymore?

If you’re still with me, you’re probably a bloody beginner as well. We have a hard time grasping the extent of Neural Networks and Boosted Trees (two common ML algorithms). However, it is commonly said that the best results often come from something called “feature engineering”.

A feature, in the context of machine learning, is a single measurable property of a sample; the feature vector is the set of such values belonging to one label. Machine learning algorithms are only as good as the data we feed them. Geographical data is the most intuitive example of this. Training a model on raw latitude-longitude coordinates relies on the hope that the algorithm can somehow decode the complex relationship between two points from those coordinates alone. Calculating the distance between two points, however, gives the algorithm a very tangible measure to work with. In the case of our data, every label corresponds only to a single data vector that does not take the surrounding samples into account. Facies changes carry a lot of information that is lost this way. Therefore, creating additional variables that encode information about those changes might improve our algorithm immensely. In fact, the leading contribution as of now has been able to generate additional information by applying signal processing to the initial data set.
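As a hedged illustration of that idea (column names like “Well Name”, “Depth” and “GR” are assumptions based on typical wireline data, and the filename is hypothetical), one could add per-well change features like this:

```python
# A hedged feature-engineering sketch: per well, add the local change (gradient)
# and a smoothed version of a log so the model can "see" what happens around
# each sample. Column names and the filename are assumptions.
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical filename

def add_change_features(well, col="GR"):
    well = well.sort_values("Depth")
    well[col + "_gradient"] = well[col].diff().fillna(0)  # change from sample above
    well[col + "_smooth"] = well[col].rolling(5, center=True,
                                              min_periods=1).mean()  # local average
    return well

df = df.groupby("Well Name", group_keys=False).apply(add_change_features)
```

Grouping by well matters here: a gradient computed across the boundary between two wells would mix unrelated measurements.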

The best algorithm

In the beginning I wanted some more information to find the best algorithm for the job. At that point I did not yet know about the importance of feature engineering. The most-used algorithms in this challenge are (a quick comparison sketch follows the list):

  • SVM, as in the original tutorial
  • Random Forest, which is a smart play on averaging many decision trees
  • Boosted Trees, which build decision trees one after another, each trying to correct the errors of the previous ones
  • Ensemble Methods, which combine several algorithms and blend or select among their predictions
  • (C/D) Neural Networks, that is, convolutional or deep neural networks
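To get a feel for how these compare, here is a rough sketch using scikit-learn’s cross-validation on illustrative synthetic data; the settings are not tuned for the contest and the relative ranking on the real facies data may well differ:

```python
# A rough comparison of several of the algorithms above via cross-validation.
# The data is synthetic and the settings are untuned -- purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Boosted Trees": GradientBoostingClassifier(random_state=0),
}
# A simple ensemble that lets the three models above vote on each prediction.
models["Ensemble (voting)"] = VotingClassifier(
    [(name, model) for name, model in models.items()])

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```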

The last one may be something I would leave to the pros, as it is quite a challenge to train NNs on a limited data set like ours. The company Kaggle is known for hosting ML challenges for other companies. Expedia and Netflix have hosted challenges with immense monetary incentives. Most articles about “winning Kaggle competitions” point out that an ensemble method will most likely outperform other methods.

If you’d like to get started on these problems using Python, I highly recommend these introductory articles:

If you really want to dive deep, take a look here, but this will probably leave you overwhelmed and without a contribution in the end (like me).
However, I can recommend this Google Video series on Machine Learning, which will give you the necessary basics in a hands-on way.

This should give you some basics to participate in the challenge. Of course, next you should take a look at all the amazing work of the other teams to get inspired. I hope I will see your name on the leaderboard soon!

Thanks to Chris Hamer and Lukas Mosser for giving me pointers in the right direction, and to Matt Hall, Brendon Hall and Matteo Nicoli for giving me helpful advice directly regarding the challenge.
