Getting Started with NLP: From Classical GloVe–LSTM to BERT for Disaster Tweet Analysis

Hi! I got promoted to a new position at my company to handle NLP and NLU problems, after previously spending my time in the computer vision department. I am not entirely new to this area, since I was exposed to RNNs in a university deep learning course. I had simply fallen behind on the research in this field, and I have been trying to catch up recently.

Yes, I love Kaggle!

So I started to compete in one of the NLP starter competitions, Real or Not? NLP with Disaster Tweets. This competition aims to identify whether a tweet is about a real disaster or not. For example, “LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE” doesn’t mean it is about Ragnarök, doomsday, or our sun turning into a red giant. Tweets are very expressive, and that is simply how Twitter works.

Before proceeding, here are my Kaggle notebooks that accompany this article, since I will only explain the essential parts.

  1. Using LSTM
  2. Using BERT
Photo by MORAN on Unsplash

I started by trying out some notebooks to learn how to do preprocessing. Text cleaning is important because we want to build word vectors later.

Creating Corpus

A corpus is a collection of words used for analysis. I use NLTK to create the corpus.
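A minimal sketch of what building the corpus can look like, assuming the tweets are available as a list of strings. The notebook uses NLTK for this; here a plain regex stands in for NLTK's tokenizer so the example runs without any downloads. `create_corpus` is a hypothetical helper name, and the second tweet is made up for illustration.

```python
import re
from collections import Counter

def create_corpus(tweets):
    """Return one list of lowercase word tokens per tweet."""
    # Stand-in for an NLTK tokenizer: keep runs of letters, digits and apostrophes.
    return [re.findall(r"[a-z0-9']+", tweet.lower()) for tweet in tweets]

tweets = [
    "LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE",
    "There is a fire in the forest near my town",  # made-up example
]

corpus = create_corpus(tweets)

# Word frequencies across the whole corpus, handy for a quick look at the data.
word_counts = Counter(word for tokens in corpus for word in tokens)

print(corpus[0])
print(word_counts.most_common(3))
```

Keeping the corpus as a list of token lists makes it easy to feed into the tokenization and embedding steps later.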

Text Cleansing

I just use basic regex for this one to fix sentences. Regex is basically a Swiss army knife for NLP (lowkey, you could get surprisingly far in NLP with regex alone). I wrote my notebook’s preprocessing based on this reference.

  • Removing URLs. I know URLs could be worth keeping for other tasks, but for simplicity, let's get rid of them.
  • Removing HTML tags. I think these are a normal byproduct of APIs.
  • Removing emojis. I know everyone likes them, but sadly I have to let them go here.
  • Removing punctuation.
  • Spell checking. Try installing pyspellchecker.
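The first four steps above can be sketched with the standard library alone. This is an illustration under my own assumptions, not the notebook's exact code: the patterns are simplified, the emoji ranges are not exhaustive, and `clean_tweet` is a hypothetical helper name. Spell checking is left out because it needs the separate pyspellchecker package.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<.*?>")
# Rough emoji ranges (assumption: covers common emojis, not all of them).
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Strip URLs, HTML tags, emojis and punctuation from one tweet."""
    text = URL_RE.sub("", text)          # URLs first, while "://" is still intact
    text = HTML_RE.sub("", text)         # then tags such as <b>...</b>
    text = EMOJI_RE.sub("", text)
    text = text.translate(PUNCT_TABLE)   # punctuation last
    return " ".join(text.split())        # collapse leftover whitespace

print(clean_tweet("<b>Fire!</b> see https://t.co/abc 🔥 now..."))
```

The order matters: removing punctuation before URLs would break the URL pattern, since the slashes and dots it relies on would already be gone.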

After that, I apply these cleaning steps to every tweet.

This step took a lot of time to run, but we won't know whether something helps until we try it.

That is it. Now, some modeling!

GloVe Vectorization

GloVe stands for Global Vectors for Word Representation and was developed by the Stanford NLP group. According to their project page, GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

How GloVe works. Image from https://nlp.stanford.edu/projects/glove/

To create an embedding dictionary, I use the pre-trained GloVe vectors available as a Kaggle dataset. This is how I create the embedding dictionary from GloVe.
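As a rough sketch, assuming the GloVe vectors come in the usual text format (one word per line, followed by its vector components separated by spaces), the dictionary and an embedding matrix can be built as below. The tiny 3-dimensional vectors are made up so the example runs without the real file, and `load_glove` and `build_embedding_matrix` are hypothetical helper names.

```python
import io

def load_glove(lines):
    """Parse GloVe text format into {word: [float, ...]}."""
    embedding_dict = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, values = parts[0], parts[1:]
        embedding_dict[word] = [float(v) for v in values]
    return embedding_dict

def build_embedding_matrix(word_index, embedding_dict, dim):
    """One row per word index; row 0 is reserved for padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = embedding_dict.get(word)
        if vector is not None:
            matrix[idx] = vector  # words missing from GloVe stay all zeros
    return matrix

# Made-up vectors standing in for the real GloVe file.
fake_file = io.StringIO("fire 0.1 0.2 0.3\nsky -0.4 0.5 0.0\n")
embedding_dict = load_glove(fake_file)

word_index = {"fire": 1, "ablaze": 2}  # "ablaze" is not in the fake file
embedding_matrix = build_embedding_matrix(word_index, embedding_dict, dim=3)
print(embedding_matrix)
```

With the real file you would pass an open file handle instead of the `StringIO` object, and convert the matrix to an array before handing it to an embedding layer.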

Tokenization

Tokenization is the process of splitting our corpus into small pieces (tokens) and mapping each one to an integer index, because the model can’t read words directly. I use a maximum length of 50 words, and if a sentence has fewer than 50 words, the rest is padded with zeros.
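To make the idea concrete, here is a small standard-library sketch of indexing and padding. It is not the notebook's tokenizer code: the helper names are hypothetical, padding at the end of the sequence is my assumption, and so is cutting off sentences longer than the maximum length.

```python
from collections import Counter

MAX_LEN = 50

def build_word_index(corpus):
    """Map each word to an integer; 0 is reserved for padding."""
    counts = Counter(word for tokens in corpus for word in tokens)
    # More frequent words get smaller indices.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(tokens, word_index, max_len=MAX_LEN):
    """Turn tokens into indices, then pad with zeros up to max_len."""
    ids = [word_index[w] for w in tokens if w in word_index]
    ids = ids[:max_len]                      # assumption: truncate long inputs
    return ids + [0] * (max_len - len(ids))  # assumption: pad at the end

corpus = [["the", "sky", "was", "ablaze"], ["the", "fire"]]
word_index = build_word_index(corpus)
padded = encode_and_pad(corpus[0], word_index)

print(len(padded), padded[:6])
```

Every tweet ends up as a fixed-length list of 50 integers, which is the shape an embedding layer expects.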

Building the Embedding–LSTM Model and Predicting on the Test Set

LSTM is a well-known type of RNN used in NLP, and it handles sentences reliably because its memory cell keeps track of earlier words. This is what the LSTM with GloVe looks like.

Then I train the model and test it on the submission dataset. This yields a best accuracy of 0.8.


I was not quite satisfied with that, so I looked for another notebook that is straightforward and easy to combine with the previous one, but uses a transformer.

Using BERT

Image from https://medium.com/brandlitic/5-key-takeaways-about-googles-bert-update-1a9850d42734

Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2018) is a neural network-based technique for natural language processing (NLP) pre-training. Google uses BERT in its search engine to better understand queries. Pre-trained BERT models are accessible through Hugging Face, and they are suitable for tweets.

If you want to learn more about BERT, you can read the documentation. I also learned the theory behind BERT from this article.

To use this model, we first need to encode the tweets into three input tensors: tokens, masks, and segments. The positional embedding is added later, inside the model. I just copied the previous notebook and replaced the embedding and LSTM parts.

Image from https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a.

I borrowed some of the model-building code from this notebook. I still use Keras to build the model, but be aware that many transformer-related models are built in PyTorch, which is worth trying.

  • Tokens are basically the pieces a sentence is broken down into.
  • Masks mark which positions hold real tokens and which are only padding.
  • Segment embeddings distinguish between paired input sequences.
  • Positional embeddings encode the order of the tokens in the sequence.

We will apply this encoding to the train and test datasets.
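The three inputs can be illustrated with a toy example. This sketch assumes the tweet is already split into tokens and uses a tiny made-up vocabulary, whereas a real BERT tokenizer uses a WordPiece vocabulary with tens of thousands of entries. The mask is taken here to be the padding mask (1 for real tokens, 0 for padding), as in the usual BERT input encoding, and `bert_encode` is a hypothetical helper name.

```python
def bert_encode(tokens, vocab, max_len=16):
    """Build token IDs, padding mask and segment IDs for one single-sentence input."""
    tokens = tokens[: max_len - 2]  # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)

    unk_id = vocab["[UNK]"]
    token_ids = [vocab.get(t, unk_id) for t in sequence] + [vocab["[PAD]"]] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len  # 1 = real token, 0 = padding
    segment_ids = [0] * max_len                 # a single sentence: all segment 0
    return token_ids, mask, segment_ids

# Made-up vocabulary for illustration only.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "sky": 4, "ablaze": 5}

token_ids, mask, segment_ids = bert_encode(["sky", "was", "ablaze"], vocab, max_len=8)
print(token_ids)
print(mask)
print(segment_ids)
```

Since each tweet is a single sequence rather than a sentence pair, the segment IDs are all zeros. The positional information does not appear here because the model adds it internally.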

And this is the vanilla BERT model,

Note that this model is not tuned yet. The pre-trained BERT model takes less time to run than the previous model and gives a jump in accuracy :)


Edit: I improved it a bit, and the accuracy is currently at 0.845.

Conclusion

NLP and NLU are interesting and emerging AI disciplines. There are many applications, from Twitter sentiment analysis to an advanced cyberpunk self-decision-making government (which would need a collaboration with RL). With these two simple approaches, we have seen that a model can learn to distinguish metaphorical expressions from real news. BERT is a breakthrough, but it is not the best. There are many BERT variants, not to mention GPT, which aims at general intelligence. I hope this article helps you get into NLP.

References

  1. https://medium.com/brandlitic/5-key-takeaways-about-googles-bert-update-1a9850d42734
  2. https://nlp.stanford.edu/projects/glove/
  3. https://arxiv.org/abs/1810.04805
  4. https://github.com/google-research/bert
  5. https://www.kaggle.com/shahules/basic-eda-cleaning-and-glove/notebook
  6. https://www.kaggle.com/massinissaguendoul/nlp-disaster-tweet

About the Author

Salman is a cyberpunk enthusiast who loves coffee, lo-fi music, extragalactic astrophysics, and self-driving cars. He is working on Asis and Mata (AI personal assistant products) as an applied researcher in natural language understanding and reinforcement learning-driven self-learning agents. Previously, he was a computer vision applied researcher at the same parent company, Zapps AI; a research assistant at the Department of Astronomy, Institut Teknologi Bandung; and a robotics computer vision engineer at Dago Hoogeshool. He is part of the Nvidia developer program, Jakarta Machine Learning, and Google Developer Group Jakarta.

He dreams of establishing a self-running cyberpunk company whose employees are AI agents.

He majored in Astronomy and Astrophysics at Institut Teknologi Bandung, attended a physics summer school at Princeton University, and took part in the Machine Learning Summer School (MLSS) 2020, where he was proud to be taught by Max Welling’s AMLab and researchers from DeepMind.

Check his profile on LinkedIn.
