Cloud Ace Blog

Oct 21, 2019 | General

by Shivam Kohli

Developer

Preprocessing text for NLP

Introduction

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Computers can work with numbers easily, but understanding words has always been a challenge. NLP has a vast number of applications, such as chatbots and voice assistants (Google Home, Alexa, etc.), sentiment analysis, spam detection, and spell checking, so it is crucial to be able to use it well. To analyse the gathered text or to build a prediction model on it, we first need to preprocess the text so that later computation becomes easier. Data preprocessing is the cleaning of raw data into a more general format that is ready for analysis.

The goal of this blog is to understand the pre-processing of text for NLP. The steps performed in data pre-processing are:

  1. Lower case
  2. Tokenization
  3. Stop Words
  4. Contractions
  5. Alphanumeric characters
  6. Stemming and Lemmatization

The code for this blog is available at the URL below.

https://gist.github.com/kohlishivam/986ec33a3b1d8c12926e7bbb91957122


Lower Case

Lowercasing is usually the first step performed during pre-processing. The underlying principle is consistency: the same word should always be represented the same way, in both the input and the output. One advantage is simpler searching. If the word we are looking for appears in the text in capitalized form, a case-sensitive search misses it unless every variant is checked. A single word can also be present in several variants, such as Apple, APPLE, and apple, which would otherwise be treated as different words. Converting the text to lower case removes this sparsity.

One important thing to note is that capitalization sometimes carries meaning and has to be preserved, for example to tell “US” (the country) apart from “us” (the pronoun). We should avoid lowercasing in such cases.
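
As a minimal sketch of this step, the snippet below lowercases a few variants of the same word with Python's built-in `str.lower()`. The sample words are only illustrative, and `str.casefold()` is shown as an optional, more aggressive alternative for caseless matching.

```python
# Variants of the same word collapse into a single form after lowercasing.
words = ["Apple", "APPLE", "apple"]
lowered = [w.lower() for w in words]
unique_forms = set(lowered)
print(unique_forms)  # {'apple'}

# str.casefold() goes further than lower() for some non-English characters,
# for example it maps the German "ß" to "ss".
folded = "Straße".casefold()
print(folded)  # strasse
```

If capitalization matters for the task (names, acronyms), this step can simply be skipped or applied only to selected tokens.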


Tokenization

Tokens are smaller parts of a larger thing. Tokenization is therefore the process of breaking a large piece of something (in our case, text) into smaller segments known as tokens. For example, the word “apple” can be split into the tokens “a”, “p”, “p”, “l”, “e”. Joined back together, the tokens carry the same information as the original text. Sentences can be broken down into words, and words can be broken down into characters.

There are three types of tokenization:

  1. Character Tokenization: The process in which we split each word of the text into its individual characters, like “apple” into “a”, “p”, “p”, “l”, “e”.
  2. Word Tokenization: The process in which we split the text using the space character. This has the drawback that it also splits multi-word expressions like “New Delhi” into “New” and “Delhi”.
  3. Sentence Tokenization: The process in which we split the text using punctuation like “?”, “.”, etc.
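
The three types above can be sketched with nothing more than Python's standard library. This is a simplified illustration, not a production tokenizer: the sample sentence is made up, and the sentence splitter is a naive regular expression that would break on abbreviations such as “Dr.”.

```python
import re

text = "I live in New Delhi. Do you like apples? I do."

# Character tokenization: every character of a word becomes a token.
char_tokens = list("apple")

# Word tokenization on whitespace. "New Delhi" is split into two tokens,
# and punctuation stays attached to the neighbouring word.
word_tokens = text.split()

# Sentence tokenization: split at whitespace that follows ".", "?" or "!".
sentence_tokens = re.split(r"(?<=[.?!])\s+", text)

print(char_tokens)      # ['a', 'p', 'p', 'l', 'e']
print(word_tokens)
print(sentence_tokens)  # ['I live in New Delhi.', 'Do you like apples?', 'I do.']
```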


Contractions

Contractions are shortened forms of words or word groups, written with an apostrophe in place of the omitted letters. For example: don’t, I’ll, can’t, etc. Since we aim to standardize our text, it makes sense to expand these contractions, i.e. don’t to “do not”, can’t to “cannot”, and I’ll to “I will”.
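
One simple way to expand contractions is a lookup table. The sketch below assumes a deliberately tiny, hand-written mapping (`CONTRACTIONS` and `expand_contractions` are hypothetical names used only for illustration); a real mapping would need many more entries. It also lowercases the text first, so it fits after the lower case step.

```python
import re

# Hypothetical, deliberately small mapping for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'll": "i will"}

# One pattern that matches any key of the mapping as a whole word.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")

def expand_contractions(text):
    # Normalise the typographic apostrophe to a plain one and lowercase,
    # so that the text matches the keys of the mapping.
    text = text.replace("’", "'").lower()
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)], text)

expanded = expand_contractions("I’ll go, but don’t say I can’t.")
print(expanded)  # i will go, but do not say i cannot.
```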


Stop Words

In natural language processing, the words that occur most commonly in text, such as a, an, and the, are termed stop words. We filter out such words before processing the text further. Removing them usually does not change the overall meaning of a sentence, because they contribute very little to it. The advantage is that we do not spend computation on words that carry little information, and the smaller text reduces space and time requirements.
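
A minimal sketch of stop word removal is a filter against a set. The set below is an assumption made for this example: it contains a, an, and the from the text plus a few other common words, and is far smaller than the lists shipped with NLP libraries.

```python
# Illustrative stop word set; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "on"}

def remove_stop_words(tokens):
    # Compare in lower case so "The" and "the" are both removed.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat is on the mat".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['cat', 'mat']
```

Using a set rather than a list keeps each membership check at roughly constant time, which matters on large texts.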

Alphanumeric Characters

Alphanumeric characters are the 26 alphabetic characters from a to z (in lower or upper case) and the 10 numerals from 0 to 9.
Any other character, such as +, <, [, or %, is a non-alphanumeric character. Such characters are typically removed in this step, since they rarely add meaning to the text.
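
In pre-processing, this distinction is commonly used to strip non-alphanumeric characters from the text. The sketch below does this with a regular expression; the function name and sample string are made up for illustration, and it assumes English text, since the character class would also strip accented letters.

```python
import re

def remove_non_alphanumeric(text):
    # Replace everything except a-z, A-Z, 0-9 and whitespace with a space,
    # then collapse runs of whitespace into single spaces.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return " ".join(cleaned.split())

cleaned = remove_non_alphanumeric("Price: 50% off [today] + <free> shipping!")
print(cleaned)  # Price 50 off today free shipping
```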


Stemming and Lemmatization

There can be multiple forms of a single word. For example, play, played, playing, etc. all map to the same word, play. They share the same root and the same core meaning. The process of converting such words into a single normalised form is known as stemming or lemmatization. This step is useful in many areas, such as tagging systems, SEO, indexing, information retrieval, and search.

Stemming is the process of reducing inflected words to their root form, mapping a group of words to the same stem even if the stem itself is not a valid word in the language. This is one of its drawbacks: the result can be a word that does not actually exist, like daily stemmed into dai. We also have to avoid over-stemming, so that the actual meaning of the word is preserved. Stems are formed by removing affixes (prefixes and suffixes) from the word.

For English, commonly used stemmers include LancasterStemmer, PorterStemmer, and SnowballStemmer, as they are named in the NLTK library.

Lemmatization is the process of converting a word into its canonical (dictionary) form, the lemma. It requires additional linguistic knowledge, such as a part-of-speech tagger, which allows it to resolve cases that stemming cannot (like resolving is and are to “be”).
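
To make the difference concrete, here is a toy sketch of both ideas. It is not the Porter, Lancaster, or Snowball algorithm: `naive_stem` just strips a few hard-coded suffixes, and `naive_lemmatize` looks words up in a tiny hand-written table (both names and the table are assumptions for this example). Real stemmers apply many more rules, and real lemmatizers rely on a dictionary and part-of-speech information.

```python
# Toy suffix list, checked in order; NOT a real stemming algorithm.
SUFFIXES = ("ing", "ed", "ly", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table standing in for a real dictionary.
LEMMAS = {"is": "be", "are": "be", "played": "play", "playing": "play"}

def naive_lemmatize(word):
    word = word.lower()
    # Fall back to the word itself when it is not in the table.
    return LEMMAS.get(word, word)

stems = [naive_stem(w) for w in ["play", "played", "playing", "daily"]]
lemmas = [naive_lemmatize(w) for w in ["is", "are", "playing"]]
print(stems)   # ['play', 'play', 'play', 'dai']
print(lemmas)  # ['be', 'be', 'play']
```

Even this toy version shows the trade-off: blind suffix stripping turns daily into the non-word dai and cannot relate is and are, while the lookup approach gives valid words but only for entries it knows about.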

Conclusion

I hope you have learned the basics of pre-processing text for performing NLP tasks.

Happy Hacking!
