An introduction to NLP classification techniques in Python for the 2017 NIPS competition track

I demonstrate some basic techniques for machine learning and data analysis with NLP, along with the theory behind them, in the context of MSKCC's competition on predicting the class of a genetic mutation from clinical text data.

I decided recently that I wanted to try my hand at a Kaggle competition to improve my skills in Python data-wrangling. The current 2017 NIPS competition track challenge, on 'Classifying Clinically Actionable Genetic Mutations', seemed an appropriate starting point. The data were provided in the form of clinical writeups as well as an associated Gene and Variant for each entry, along with a hidden Class. The goal was to predict the probability of each of the 9 classes for every entry, optimising for multiclass log-loss.

Machine learning in all contexts is about extracting signal from noise, so while I haven't worked much with natural language processing before, it's still a great venue to brush up on some of the fundamentals of data analysis. I also hope to give some insight into the information-theoretic principles behind the algorithms so commonly used in these kinds of Kaggle competitions. Today, I want to give an overview of some basic text preprocessing, feature extraction and classification methods using common Python libraries, and the results that can be achieved with this kind of classical machine learning.

N.B. I include some brief code snippets but the full scripts can be found on my GitHub.

Getting started with Scikit-Learn

The first step is always to convert the data from the supplied CSV format into something more useful, which in most cases means Pandas dataframes (Pandas is a library built on top of NumPy, with extra tools for common data analysis workflows). Once we have the data in this form, we can run a quick baseline test, using the relative frequency of each class in the training data as the predicted probability for every test entry.

freq = np.zeros(9)
for i in range(TRAIN_N):
    freq[int(train[i]['class'])-1] += 1
for i in range(9):
    freq[i] /= TRAIN_N

pred = np.zeros((TEST_N, 9))
for i in range(TEST_N):
    pred[i] = freq
write_results(pred, 'pred/baseline_frequency')
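
To see what score this baseline should earn, here is a minimal sketch of the multiclass log-loss in plain Python. The toy labels are hypothetical stand-ins for the real Class column, and the function name is my own. It shows that predicting the class frequencies for every entry gives a log-loss equal to the entropy of the class distribution, which is the information-theoretic floor for any model that ignores the text.

```python
import math
from collections import Counter

def multiclass_log_loss(labels, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class (labels are 1-based)."""
    total = 0.0
    for label, row in zip(labels, probs):
        p = min(max(row[label - 1], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(labels)

# Hypothetical toy labels standing in for the real 'Class' column.
labels = [1, 1, 1, 1, 2, 2, 3, 3]
counts = Counter(labels)
freq = [counts[c] / len(labels) for c in (1, 2, 3)]

# Predict the same frequency vector for every entry, as in the baseline above.
baseline_loss = multiclass_log_loss(labels, [freq] * len(labels))

# Entropy (in nats) of the class distribution.
entropy = -sum(p * math.log(p) for p in freq)
```

Any classifier worth keeping should score below this number on held-out data.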

To start looking into the classification ability of the text, we then run a naive Bayes classifier using bag-of-words and TF-IDF, which is made easy with Scikit-Learn, a versatile Python library that is useful for 'gluing' various feature transforms and classifiers together.

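from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

# The arguments to fit/predict were lost from the original snippet;
# the 'Text' column name below is an assumption.
text_clf = text_clf.fit(train['Text'], train['Class'])
test_predicted = text_clf.predict(test['Text'])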
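
To make the TF-IDF step less of a black box, here is a rough sketch of the weighting in plain Python. The three-document corpus is hypothetical. The smoothed formula idf = ln((1 + n) / (1 + df)) + 1 matches Scikit-Learn's default, although the library additionally normalises each document vector to unit length, which I leave out here.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real input would be the clinical text column.
docs = [
    "mutation in kinase domain",
    "kinase activity increased",
    "mutation is benign",
]

tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents does each word occur?
df = Counter(word for doc in tokenised for word in set(doc))

def tfidf(doc):
    """Raw term counts weighted by smoothed inverse document frequency."""
    counts = Counter(doc)
    return {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w, c in counts.items()}

weights = tfidf(tokenised[0])
```

Words that appear in fewer documents ('domain') end up with a larger weight than words that appear in many ('mutation'), which is the whole point: rare words carry more information about which document we are looking at.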

Text preprocessing

Until this point we have been using the supplied data as-is, with no modification to the long blocks of rather verbose text. To achieve some real results we need to clean up the supplied text and consider feature extraction. The usual methods in NLP applications are removing stopwords and punctuation, standardising case, and stemming and/or lemmatisation. Lemmatisation achieves the same goal as stemming (reducing related words to a common base form) but does so in a more advanced, language-aware way: a stemmer strips suffixes by rule, while a lemmatiser uses vocabulary and grammatical information to return a dictionary form. For this preprocessing we use a combination of the libraries NLTK and spaCy: NLTK includes a stopword list for removing unnecessary words, and spaCy has a built-in lemmatiser which is generally considered to perform better than the venerable Porter stemmer implemented in NLTK.

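import re
from nltk.corpus import stopwords

STOPS = set(stopwords.words("english"))  # build the stopword set once

def preprocess(raw_text):
    # Replace anything that is not a letter or digit with a space
    no_punc = re.sub("[^a-zA-Z0-9]", " ", raw_text)
    words = no_punc.lower().split()
    meaningful_words = " ".join(w for w in words if w not in STOPS)
    # nlp is a spaCy pipeline assumed to be loaded elsewhere with spacy.load
    doc = nlp(meaningful_words)
    return " ".join(w.lemma_ for w in doc)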

Feature extraction

Once we've cleaned up the text, let's add some features. This is where domain knowledge comes into play, and I admit to having little to no knowledge of genetics. That said, there are a lot of datasets out there which can be used to augment your own knowledge of specialist topics. We're going to stick to the provided data here and work with more generally applicable techniques, with the caveat that this will result in poorer performance than if we were to use more specific data augmentation. Here's an example of some additional features being added to the training data:

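# The second half of this expression was lost from the original snippet;
# the 'Variation' column name is an assumption based on the Kaggle data.
gene_var_list = [x.lower() for x in list(train['Gene'].unique()) +
                 list(train['Variation'].unique())]
for x in gene_var_list:
    train['count_'+x] = train['clean_text'].map(lambda y: y.count(x))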
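
One thing to watch with y.count(x) is that it counts substrings, so a short gene name will also match inside longer, unrelated words. A possible refinement, sketched here with a hypothetical helper name and a made-up sentence, is to count whole-word matches with a regular expression instead. I have not checked whether this changes the competition score.

```python
import re

def count_token(text, token):
    """Count whole-word occurrences of `token`, ignoring matches inside longer words."""
    pattern = r"\b" + re.escape(token) + r"\b"
    return len(re.findall(pattern, text))

# Hypothetical cleaned text containing the gene name 'ret'.
text = "ret fusion detected in the retina; ret was secreted"

substring_hits = text.count("ret")          # also matches 'retina' and 'secreted'
whole_word_hits = count_token(text, "ret")  # only the two standalone mentions
```

re.escape matters here because variant names can contain characters such as '*' that would otherwise be read as regex syntax.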

The next thing we need to do is split our training data into training and validation sets. It is essential never to train with your test data, and in this case we cannot because the test labels are hidden by Kaggle. However, we can divide up the provided labelled training data and set aside a small validation set which we use to check the progress of our training during development. The value of a validation set lies in preventing over-training: in general, the more we train against our training data and the more features we use, the better the performance will be on that data, but this does not necessarily carry over to the test data. Assuming our training and test data are randomly sampled from the same source, we can reserve the validation data purely for evaluating our progress and never use it to fit our parameters. This helps to prevent over-fitting because we can stop training once the validation loss reaches a minimum.

train_f, valid_f, train_l, valid_l = train_test_split(
        train_union, train['Class'], test_size=0.18)
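
The "stop once the validation loss reaches a minimum" rule can be sketched in a few lines. The function name and the loss values below are hypothetical; the patience parameter plays the same role as the early_stopping_rounds option used with XGBoost later on.

```python
def best_round(valid_losses, patience=3):
    """Return the index of the best validation loss, stopping once
    `patience` consecutive rounds fail to improve on it."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: give up
    return best_idx

# Hypothetical validation losses: improving at first, then over-fitting.
losses = [1.90, 1.42, 1.15, 1.08, 1.11, 1.13, 1.20, 1.02]
stop_at = best_round(losses, patience=3)
```

Note the trade-off: with a patience of 3 we stop before ever seeing the late improvement in the final round, so a small patience saves time at the risk of stopping in a local minimum.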

Performing the classification

After cleaning up our text and adding it as a new column to our train and test dataframes, let's introduce the first main machine learning algorithm we will be using: a linear classifier trained with stochastic gradient descent, implemented in SKLearn as the SGDClassifier. With its default hinge loss this is a linear support vector machine (SVM), which seeks the separating hyperplane with the widest margin, that is, the greatest distance to the nearest training points (the support vectors). Because we need class probabilities for the log-loss metric, the code below sets loss='log', which makes the model a logistic regression fitted in the same way. In the general case where the feature vectors are not linearly separable, an SVM can transform the data using a kernel, typically a radial basis (Gaussian) function, which implicitly maps the data into a space with enough dimensions to separate any set of distinct points. SGDClassifier itself is linear, which is often sufficient for high-dimensional sparse text features. Stochastic gradient descent finds the parameters by iteration: rather than computing the gradient of the loss over the whole training set, it updates the weights using the gradient from one example at a time, making several passes over the data. We use FeatureUnion from SKLearn to combine different features (as mentioned above) before performing the classification.

combined_features = FeatureUnion(transformer_list)
train_union = combined_features.fit_transform(train_no_label)
test_union = combined_features.transform(test)

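# n_iter and loss='log' are the 2017-era names; newer scikit-learn releases
# use max_iter and loss='log_loss' instead.
clf = SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
    eta0=0.0, fit_intercept=True, l1_ratio=0.15,
    learning_rate='optimal', loss='log', n_iter=8, n_jobs=-1,
    penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0)

clf = clf.fit(train_union, train['Class'])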
train_predicted = clf.predict_proba(train_union)
test_predicted = clf.predict_proba(test_union)
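
To show what the SGD fit is doing under the hood, here is a simplified sketch of a binary logistic regression trained one example at a time, with an L2 penalty alpha. It is not SGDClassifier's actual implementation: I use a fixed learning rate eta rather than the library's 'optimal' schedule, only two classes, and hypothetical toy data.

```python
import math
import random

def sgd_logistic(X, y, alpha=1e-4, eta=0.1, epochs=50, seed=42):
    """Binary logistic regression fitted one example at a time.
    alpha is the L2 penalty strength, eta a fixed learning rate."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each pass
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y[i]  # gradient of the log loss with respect to z
            w = [wj - eta * (err * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= eta * err
    return w, b

def predict_proba(w, b, x):
    """Probability of class 1 for a single feature vector."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-feature toy data: class 1 has the larger first feature.
X = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9], [1.0, 0.1], [0.9, 0.0], [0.8, 0.2]]
y = [0, 0, 0, 1, 1, 1]
w, b = sgd_logistic(X, y)
```

Each update only looks at a single example, which is why this approach scales to large sparse feature matrices where a full-batch gradient would be expensive.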

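Manual hyperparameter optimisation is one of the most time-consuming and menial tasks in machine learning, so we will speed it up using cross-validation tools. SKLearn comes with GridSearchCV, which fits the classifier for every combination of the values we list (here the L2 regularisation strength alpha and the number of SGD passes n_iter), scores each combination by cross-validation, and keeps the best. By default it ranks classifiers by accuracy; passing scoring='neg_log_loss' would rank them by the competition metric instead.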

clf = SGDClassifier(loss='log', penalty='l2', alpha=5e-5, random_state=42, n_jobs=-1)
parameters = {'alpha': (1.3e-4, 1.25e-4, 1.2e-4, 1.15e-4, 1.1e-4, 1.0e-4),
              'n_iter': (5, 6, 7, 8, 9), }

gs_clf = GridSearchCV(clf, parameters, cv=10, n_jobs=-1)
gs_clf = gs_clf.fit(train_union, train['Class'])
train_predicted = gs_clf.predict_proba(train_union)
test_predicted = gs_clf.predict_proba(test_union)
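
The idea behind a cross-validated grid search fits in a short sketch. Everything here is a stand-in: the helper names are mine, the folds are simple contiguous blocks (for classifiers, GridSearchCV uses stratified folds by default), and the "model" is just a mean shrunk towards zero by a penalty alpha, chosen so the example stays tiny.

```python
from itertools import product

def k_fold(n, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, valid
        start += size

def grid_search(fit_score, grid, n, k=5):
    """Return the parameter combination with the lowest mean validation loss."""
    names = sorted(grid)
    best_params, best_loss = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = sum(fit_score(tr, va, **params) for tr, va in k_fold(n, k)) / k
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# Hypothetical toy targets for the stand-in model.
y = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.0]

def fit_score(train_idx, valid_idx, alpha):
    """Fit the shrunk mean on the training fold, return MSE on the validation fold."""
    mean = sum(y[i] for i in train_idx) / (len(train_idx) + alpha)
    return sum((y[i] - mean) ** 2 for i in valid_idx) / len(valid_idx)

best_params, best_loss = grid_search(fit_score, {"alpha": [0.0, 1.0, 10.0]}, n=len(y))
```

The cost grows as the product of the grid sizes times the number of folds, which is why the grid in the snippet above is kept to a narrow range around values already known to work.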

Improving results with boosting

We now move on to the algorithm which helps achieve the highest scores in many Kaggle data science tasks: gradient boosting. Boosting is an ensemble method (not to be confused with stochastic gradient descent, despite the shared word) and is effectively an extension of the traditional CART model (classification and regression trees) used in statistical analysis for many years. It works by building many shallow, greedily grown decision trees as weak learners, adding them one at a time so that each new tree is fitted to the gradient of the loss with respect to the current ensemble's predictions; the final model is the weighted sum of all the trees. XGBoost is the most widely used implementation for Python. It implements gradient tree boosting, a successor to earlier boosting algorithms such as AdaBoost, and runs well on multithreaded systems. (An alternative is Microsoft's LightGBM library, which reports similar results to XGBoost at lower computational cost.) Other techniques which could be applied to challenges like this include the embedding methods Word2Vec and Doc2Vec (both implemented in Gensim).

watchlist = [(xgb.DMatrix(train_f, label=train_l), 'train'),
          (xgb.DMatrix(valid_f, label=valid_l), 'valid')]
xgtrain = xgb.DMatrix(train_f, label=train_l)
clf = xgb.train(param, xgtrain, num_rounds, watchlist,
             verbose_eval=50, early_stopping_rounds=60)

test_preds = clf.predict(xgb.DMatrix(test_union))
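
To illustrate the principle, here is a bare-bones sketch of gradient boosting for squared error, using one-split decision stumps on a single feature. It is far simpler than XGBoost, which grows deeper trees, uses second-order gradient information and adds regularisation; the function names and toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the threshold split on a 1-D feature that best fits the residuals."""
    best = None
    for t in sorted(set(x))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def boost(x, y, rounds=20, lr=0.3):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps, history = [], []
    for _ in range(rounds):
        # Residuals are the negative gradient of 0.5 * (y - pred)^2
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return (lambda xi: base + lr * sum(s(xi) for s in stumps)), history

# Hypothetical 1-D toy data with a step and a slope.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 3.5, 3.9]
model, history = boost(x, y)
```

The history list records the training error after each round. It can only go down on the training data, which is exactly why the watchlist and early stopping on a validation set in the XGBoost call above are needed.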

Finally, we write the predictions to a file in the format Kaggle expects for submission:

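def write_results(probs, filename='pred/temp'):
    # The loop body was lost from the original snippet; writing the row
    # index as the ID is an assumption.
    with open(filename, 'w') as f:
        f.write('ID,class1,class2,class3,class4,class5,'
                'class6,class7,class8,class9\n')
        for i in range(len(probs)):
            f.write(str(i) + ',' + ','.join(str(j) for j in probs[i]) + '\n')

write_results(test_preds, 'pred/predictions.txt')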


Much of the knowledge in this post comes from MIT's 6.036 class on Machine Learning, which was (perhaps unsurprisingly) the most popular class at the institute in the semester I took it. Much more still comes from the huge number of resources out there in the form of Gist walkthroughs, Kaggle kernels and more. A reminder that you can see all my code for this project in my GitHub repo.

This is only a very basic introduction to these methods, and it is possible to achieve significantly better results than I did by integrating more specialised domain knowledge (and, to some extent, training with more powerful hardware). But it's amazing how well we can do in the classification realm using off-the-shelf libraries and generalisable methods these days, even on text that is almost unintelligible to most humans.