Natural Language Processing Introduction

NLP, or Natural Language Processing, is one of the hottest areas of AI research today. This post introduces what NLP is, the most frequently used techniques, and some applications.

Why is text analysis important?

  • According to industry estimates, more than 80% of the data generated is in an unstructured format (Text, image, audio, video, etc.) and growing exponentially.
  • Text data is most common and covers more than 50% of the unstructured data.
  • 40% of business executives complain that they have too much unstructured text data and cannot interpret it.

What is NLP?

Natural Language Processing (NLP) is the term for making machines understand and interpret human language.

Since machines and algorithms cannot understand text and characters directly, it is important to convert these text data into a machine-understandable format.

Some applications

  • Sentiment analysis: gauging customers’ emotions toward products offered by the business.
  • Text classification: complaint classification, email classification, e-commerce product classification, document categorization, etc.
  • Topic modeling: extracting the main topics from a group of documents.
  • Resume shortlisting and job-description matching using similarity methods.
  • Chatbots, Q&A, and voice assistants like Siri or Alexa.
  • Language detection and machine translation using neural networks.
  • Text summarization using graph-based and other advanced techniques.
  • Information/document retrieval systems, for example, search engines.
  • Spell checking.
  • Speech-to-text and text-to-speech.
  • Text generation/predicting the next sequence of words using deep learning algorithms.

NLP End-to-End Pipeline and Life Cycle


For instance, consider customer sentiment analysis and prediction for a product, brand, or service.

  • Define the problem: understand the customer sentiment across the products.
  • Understand the problem’s depth and breadth: why are we doing this, and what is the business impact?
  • Data requirements brainstorming: Have a brainstorming activity to list out all possible data points.
    • All the reviews from customers on e-commerce platforms like Amazon.
    • Emails from customers.
    • Warranty claim forms.
    • Survey data.
    • Call center conversations using speech to text.
    • Feedback forms.
    • Social media data like Twitter, Facebook, and LinkedIn.
  • Data collection: We can use web scraping and Twitter APIs.
  • Text preprocessing: All techniques are explained below.
  • Text to features: All techniques are explained below.
  • Machine Learning/Deep Learning: Use conventional algorithms to handle supervised and unsupervised learning to achieve goals like text classification, text generation, etc. 
  • Insights and deployment: There is absolutely no use for building NLP solutions without proper insights to connect the dots between model/analysis output and the business.

Data Collection

The data can be obtained from different sources. Here is a list of the most common sources/formats where we can find data:

  • SQL and NoSQL
  • Cloud Storage
    • PDFs
    • Word files
    • Plain text files
  • Web Scraping (Reading HTML)
    • Keep one eye on regulation
  • APIs
    • Twitter
    • Facebook (and Instagram)
    • Public databases
    • API marketplaces
    • Government data

Text preprocessing for NLP

In this stage, we clean and standardize the text data. One of the key tools for this task is the regular expression, which identifies sequences of characters by defining a search pattern. With regular expressions we can identify emails, links, names, and other patterns in a text. Generally, we apply the following preprocessing techniques:
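As a quick sketch of regex-based extraction (the pattern and the sample text below are illustrative, not from the post):

```python
import re

# Toy text containing an email address and a link.
text = "Contact us at support@example.com or visit https://example.com/docs today!"

# Find email-like and URL-like substrings via search patterns.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
links = re.findall(r"https?://\S+", text)

print(emails)  # ['support@example.com']
print(links)   # ['https://example.com/docs']
```

The same approach extends to stripping punctuation, HTML tags, or any other pattern you can describe.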

  • Lowercasing (Text data to lowercase)
    • Note: capital letters can help with Named Entity Recognition, so lowercase with care.
  • Punctuation removal
    • Helpful for sentence-level analysis and topic modelling.
  • Stop words removal
    • They are ubiquitous words that carry no meaning or less meaning compared to other keywords. We can focus on the important keywords instead.
    • For applications like search engines, stop words introduce noise into the results.
  • Text standardization
    • Most text comes in the form of customer reviews, blogs, or tweets, where there is a high chance of people using short forms and abbreviations to represent the same meaning.
    • Standardizing these helps us understand and resolve the semantics (the branch of linguistics and logic concerned with meaning) of the text.
    • In the standardization phase, we cover:
      • Spelling correction: depending on the source of the text, there is a chance of people using short forms and making typos. Correcting them reduces multiple variants of words that represent the same meaning.
      • Tokenization: It refers to splitting text into minimal meaningful units. There are two types:
        • Sentence tokenizer
        • Word tokenizer
      • Stemming: the process of extracting a root word. For instance, fish, fishes, and fishing are all stemmed to fish. Stemming works by slicing off the end (or sometimes the beginning) of the word, using a list of common suffixes such as -ing, -ed, and -es. This slicing succeeds on most occasions, but not always.
      • Lemmatization: the process of extracting a root word by considering the vocabulary. For instance, “good,” “better,” and “best” are all lemmatized to good. The part of speech of the word is determined during lemmatization. Lemmatization is an extension of stemming that gives better results: it relies on linguistic analysis of each word and requires detailed dictionaries the algorithm can look through to link each form back to its lemma.
(Figure: stemming examples; stemming vs lemmatization)
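The suffix-slicing idea can be sketched in a few lines of Python. This toy stemmer is illustrative only: real stemmers such as NLTK's PorterStemmer apply carefully ordered rule sets, and lemmatizers additionally consult a dictionary.

```python
# Toy suffix-stripping stemmer: slice off common suffixes, keeping a minimal stem.
SUFFIXES = ["ing", "es", "ed", "s"]

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip if at least 3 characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["fishing", "fishes", "fished", "fish"]])
# → ['fish', 'fish', 'fish', 'fish']
```

As the post notes, this slicing succeeds on most occasions but not always (e.g., a rule-free stripper would mangle irregular forms like “better”), which is exactly the gap lemmatization fills.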

Quick recap: So far, we have gathered, cleaned, and preprocessed the data. Now, we need to translate the texts into a machine-understandable format. That is why we use the techniques of text to features. This is the foundation of Natural Language Processing.

Once again, the procedure of converting raw text data into a machine-readable format (numbers) is called feature engineering of text data, or embedding.

Machine learning and deep learning algorithms’ performance and accuracy fundamentally depend on the type of feature engineering technique used.


Text to features

There are two basic techniques to achieve the task of translating the information into a numerical representation:

  • Frequency-based embedding of features
  • The prediction-based embedding or word embedding

One-Hot Encoding

The most traditional and simple method. It converts each word into a feature (column) and codes one or zero for the presence of that particular word. The number of features equals the number of unique tokens (the vocabulary) in the whole corpus.

(+) Pros:

  • Simplicity – Easy to implement.

(-) Cons:

  • Can be memory inefficient – It leads to a sparse matrix (expensive training).
  • No notion of word similarity – All word vectors are orthogonal, so no semantics are captured.
  • It does not take the frequency of word occurrence into consideration: if a particular word appears multiple times, that information is lost.
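A minimal one-hot encoding sketch over a toy two-document corpus (the corpus and names are illustrative):

```python
corpus = ["I love NLP", "I love deep learning"]

# Vocabulary: every unique token in the corpus becomes a feature/column.
vocab = sorted({word for doc in corpus for word in doc.lower().split()})

def one_hot(doc):
    """Code 1/0 for the presence of each vocabulary word in the document."""
    words = set(doc.lower().split())
    return [1 if v in words else 0 for v in vocab]

print(vocab)                  # ['deep', 'i', 'learning', 'love', 'nlp']
print(one_hot("I love NLP"))  # [0, 1, 0, 1, 1]
```

Note how the vector length equals the vocabulary size: with a realistic corpus, this grows into the sparse matrix mentioned in the cons above.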


Count Vectorizer

  • Similar to one-hot encoding. The only difference is that, instead of checking whether a particular word is present or not, it counts how many times each word appears in the document.

(+) Pros:

  • Simplicity.

(-) Cons:

  • Can be memory inefficient.
  • Each word is treated as an independent feature. It does not consider the previous and next words, which may be needed for the complete meaning.
    • For instance, consider the phrase “not bad.” If it is split into individual words, the positive meaning (“good”) is lost.
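The counting step can be sketched with Python's standard library (toy corpus, illustrative names):

```python
from collections import Counter

# One toy document; in practice each document gets its own count vector.
corpus = ["I love NLP and I love ML"]

tokens = corpus[0].lower().split()
vocab = sorted(set(tokens))       # feature columns, one per unique word
counts = Counter(tokens)          # word -> occurrence count
vector = [counts[v] for v in vocab]

print(vocab)   # ['and', 'i', 'love', 'ml', 'nlp']
print(vector)  # [1, 2, 2, 1, 1]
```

Unlike one-hot encoding, "i" and "love" now carry the value 2, preserving frequency information.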


N-grams

  • An n-gram is a contiguous sequence of n letters or words.
  • N-grams are formed in such a way that the previous and next words are captured.
    • Unigrams are the single words present in the sentence.
    • Bigrams are combinations of 2 adjacent words, and so on.

(+) Pros:

  • N-grams reduce the loss of potential information, because many words only make sense once they are put together.

(-) Cons:

  • Can be memory inefficient: the feature space grows quickly with n.
  • It cannot interpret combinations unseen in the training data.
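A minimal n-gram generator, assuming whitespace tokenization (illustrative sketch):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie is not bad".split()
print(ngrams(tokens, 2))
# ['this movie', 'movie is', 'is not', 'not bad']
```

Note that the bigram "not bad" survives as a single feature, which is exactly the information a unigram model loses.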

Co-occurrence matrix

  • It is like a count vectorizer, except that it counts the occurrences of words appearing together (within a context window) instead of individual words.
  • Example: if “I,” “love,” and “NLP” appear together twice, the corresponding matrix cells hold 2, while pairs that appear together only once hold 1.

(+) Pros:

  • Statistic-based: it captures which words tend to appear together.

(-) Cons:

  • Can be memory inefficient: the vocabulary can become very large and cause memory/computation issues.
  • Expensive re-calculations: it is typically factorized with matrix decomposition, the matrix dimensions change when the dictionary changes, and the whole decomposition must be re-calculated when we add a word.
  • It is very sensitive to word-frequency imbalance, so we often have to preprocess the documents to remove stop words and normalize the other words (through lemmatization or stemming).
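A toy co-occurrence counter, assuming a symmetric context window of one token (illustrative sketch; a real implementation would assemble the counts into a vocabulary-by-vocabulary matrix):

```python
from collections import defaultdict

def cooccurrence(docs, window=1):
    """Count how often pairs of words appear within `window` tokens of each other."""
    counts = defaultdict(int)
    for doc in docs:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence(["I love NLP", "I love ML"])
print(counts[("i", "love")])  # 2 — "I" and "love" co-occur in both documents
```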

Hash vectorizer

  • This method overcomes the issue of having a large vocabulary: instead of storing the tokens as strings, it applies the hashing trick to encode them as numerical indexes into a fixed-size vector, which can then be used for any supervised/unsupervised task.

(+) Pros:

  • It is memory efficient: no vocabulary needs to be stored.

(-) Cons:

  • It only works one way: once vectorized, the original features (tokens) cannot be retrieved (loss of information).
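A sketch of the hashing trick; real implementations such as scikit-learn's HashingVectorizer use MurmurHash, while this toy version uses CRC32 purely as a stable stand-in:

```python
import zlib

def hash_vectorize(doc, n_features=8):
    """Map each token to a slot of a fixed-size vector via a hash function."""
    vec = [0] * n_features
    for token in doc.lower().split():
        # Stable hash -> index in [0, n_features); no vocabulary is stored.
        idx = zlib.crc32(token.encode("utf-8")) % n_features
        vec[idx] += 1
    return vec

vec = hash_vectorize("I love NLP and I love ML")
print(len(vec), sum(vec))  # 8 7 — all 7 tokens landed in one of the 8 slots
```

Because only indexes survive, you cannot map a slot back to its word, and distinct words may collide into the same slot: that is the one-way information loss noted above.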

Term Frequency-Inverse Document Frequency (TF-IDF)

  • Disadvantage of the above methods: if a particular word appears in all the corpus documents, it will achieve higher importance, which is not good for our analysis.
  • The whole idea of TF-IDF is to reflect how important a word is to a document in a collection, down-weighting words that appear frequently in all the documents.
  • Term Frequency (TF): the ratio of the count of a word in a document to the document’s length.
    • It captures the importance of the word irrespective of the length of the document.
  • Inverse Document Frequency (IDF):
    • The IDF of a word is the log of the ratio of the total number of documents to the number of documents in which that word is present.
    • It measures the rareness of a term.
    • If a word appears in almost all documents, it is of little use, since it does not help to classify documents or in information retrieval.
  • TF-IDF helps make predictions and information retrieval relevant.
  • Example: if “the” appears in all 3 documents, it does not add much value, and hence its weight is lower than those of all the other tokens.

(+) Pros:

  • It reflects how important a word is to a document in a collection, down-weighting words that appear frequently in all the documents.

(-) Cons:

  • It calculates document similarity directly in the word-count space, which may be slow for large vocabularies.
  • It assumes that the counts of different words provide independent evidence of similarity.
  • It makes no use of semantic similarities between words.
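A plain TF-IDF sketch without smoothing (production libraries such as scikit-learn add smoothing terms, so their exact values differ):

```python
import math

def tf_idf(docs):
    """Plain TF-IDF: tf = count / doc length; idf = log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    vectors = []
    for doc in tokenized:
        tf = {w: doc.count(w) / len(doc) for w in set(doc)}
        vectors.append([tf.get(w, 0.0) * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

vocab, vectors = tf_idf(["the cat sat", "the dog ran", "the cat ran"])
# "the" appears in every document, so its idf = log(3/3) = 0 in every vector.
i = vocab.index("the")
print([v[i] for v in vectors])  # [0.0, 0.0, 0.0]
```

With plain IDF a word present in every document is zeroed out entirely, which is the strongest form of the down-weighting described above.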

The above techniques have strong limitations when it comes to homonyms, hyponyms and hypernyms, lack of context understanding, qualifiers, irony, etc. Some of these drawbacks are addressed in the next set of techniques for converting text to features.


Prediction-based embedding or Word Embedding

Word embeddings are dense vector representations of words in lower dimensional space.

Its main goal is to capture a type of relationship between words.

  • Morphological.
  • Semantic.
  • Contextual.
  • Syntactic (syntax: the rules, principles, and processes that govern sentence structure in a language, usually including word order).

Even though all previous methods solve most of the problems, once we get into more complicated problems where we want to capture the semantic relation between the words, these methods fail to perform. Below are the challenges:

  • All these techniques fail to capture the context and meaning of words: they depend only on the appearance or frequency of words. To capture semantic relations, we need to look at which words frequently appear close to each other.
  • For instance:
    • I am eating an apple.
    • I am using an apple.
  • If you observe the above example, “apple” takes on a different meaning depending on the adjacent words, “eating” and “using.”

For a problem like document classification, we need to create a representation for words that captures their meaning, semantic relationships, and the different types of contexts they are used in.

  • Word embedding is a feature-learning technique where words from the vocabulary are mapped to vectors of real numbers capturing the contextual hierarchy [1].
    • It enables storing contextual information in a low-dimensional vector: words that occur in similar contexts tend to have similar meanings.
  • Word embeddings are prediction-based: they use neural networks to train a model, and the learned weights serve as the vector representations.
    • Word2vec: Google’s deep learning framework to train word embeddings. It uses all the words of the whole corpus and predicts the nearby words, creating a vector for every word present in the corpus so that context is captured. It outperforms the earlier methodologies on word similarity and word analogy tasks. There are mainly 2 types of word2vec:
      • Skip-gram
        • It is used to predict the probabilities of the surrounding context words given a target word.
        • Each sentence will generate a target word and context, which are the words nearby. The number of words to be considered around the target variable is called the window size (Need to be selected based on data and the resources at your disposal – The larger the window size, the higher the computing power).
      • Continuous Bag of Words (CBOW)
        • It predicts the probability of the target word given the surrounding context words.
      • Training these models requires a huge amount of computing power, so in practice we can use Google’s pre-trained model, which was trained on over 100 billion words.
    • Two versions of Word2Vec were proposed: Continuous Bag-of-words (CBOW) and skip-gram (SG). The first one predicts the central word based on a window of words surrounding it, and the second one does the exact opposite: it predicts the context based on the central word. These architectures contain only one hidden layer, which leads to more efficient training.
  • Implementing fastText:
    • It is another deep learning framework developed by Facebook to capture context and meaning.
    • fastText is an improved version of word2vec. Word2vec considers whole words when building the representation, while fastText also takes each character n-gram into account while computing the representation of a word.
    • Because fastText builds representations at the character level, it can produce vectors even for words not seen in training. With word2vec, by contrast, we do not get a vector for Out-Of-Vocabulary (OOV) words.
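The (target, context) pair generation that skip-gram trains on can be sketched as follows (tokenization and window size are illustrative; the neural-network training itself is omitted):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as used by skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Every word within the window (except the target itself) is a context word.
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = skipgram_pairs("i am eating an apple".split(), window=1)
print(pairs[:3])  # [('i', 'am'), ('am', 'i'), ('am', 'eating')]
```

CBOW would use the same windows with the roles reversed, grouping the context words as input and the target word as the prediction. A larger window yields more pairs, which is why window size trades context coverage against computing power.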
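fastText's subword decomposition can be sketched as follows (n = 3 and the boundary markers < and > follow the fastText convention; this is an illustrative sketch, not the library's implementation):

```python
def char_ngrams(word, n=3):
    """Decompose a word into character n-grams, with < and > marking word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("fishing"))
# ['<fi', 'fis', 'ish', 'shi', 'hin', 'ing', 'ng>']
```

An unseen word like "fishers" shares several of these n-grams with "fishing," which is how fastText assembles a usable vector for OOV words.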

In summary, the main pre-trained embedding models are:

  • Word2vec (pre-trained deep learning Google framework, about 100 billion words):
    • Skip-gram
    • Continuous Bag of Words (CBOW)
  • fastText (pre-trained deep learning Facebook framework):
    • Character n-gram based, so it handles Out-Of-Vocabulary (OOV) words.
  • ELMo (pre-trained):
    • Contextual: the representation of each word depends on the entire context (including word order) in which it is used.
    • Deep: the word representations combine all layers of a deep pre-trained neural network, based on an LSTM architecture.
    • Character-based: this allows the network to use morphological clues to form robust representations even for out-of-vocabulary tokens unseen in training.
  • BERT (pre-trained):
    • Same idea as ELMo, but the input is subwords, which works better for Out-Of-Vocabulary (OOV) words.
    • It uses Transformers (an attention-based model with positional encoding) and takes word order into account.
  • XLM (pre-trained):
    • Enhances BERT into a cross-lingual language model.

Relatedly, sentence embeddings are similar to word embeddings, but instead of individual words they encode a whole sentence into a single vector representation.

Final remarks

There is huge interest in the potential of Natural Language Processing (NLP) techniques to solve daily business problems. This post attempts to sketch a road map of what NLP is and how it has evolved with recent developments.


[1] Kulkarni, A. and Shivananda, A., Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning Using Python. Springer, 2019.
