🧠 Natural Language Processing (NLP) — Complete Notes


Table of Contents

  1. Introduction to NLP
  2. Applications of NLP
  3. Challenges in NLP
  4. Linguistics in NLP
  5. Text Preprocessing
  6. N-Grams
  7. Vectorization
  8. Word Embeddings
  9. Cosine Similarity
  10. Part-of-Speech Tagging & NER
  11. NLP Libraries & Tools
  12. Deep Learning for NLP
  13. RNN
  14. LSTM
  15. GRU
  16. Bidirectional RNN
  17. CNN for NLP

1. Introduction to NLP

Natural Language Processing (NLP) is a field at the intersection of linguistics, computer science, and artificial intelligence. It focuses on enabling machines to understand, interpret, and generate human language through specific algorithms and models.

NLP bridges the gap between human communication and computer understanding.

Key Goals:

  • Enable machines to read and understand text
  • Allow computers to generate meaningful human language
  • Facilitate seamless human-computer interaction

2. Applications of NLP

| Application | Description |
|---|---|
| Voice Assistants | Siri, Alexa, and Google Assistant understand spoken commands |
| Chatbots | Evolved from basic Q&A to intent-aware systems using LLMs |
| Machine Translation | Google Translate: real-time cross-language communication |
| Sentiment Analysis | Classifying text as positive, negative, or neutral (e.g., product reviews) |
| Text Summarization | Condensing long documents into key points |
| Search Engines | Query understanding and document retrieval |
| Named Entity Recognition | Identifying persons, organizations, locations in text |
| Question Answering | Systems that answer questions from a knowledge base |

3. Challenges in NLP

3.1 Ambiguity in Language

Human language is context-dependent. The same phrase can carry different meanings:

  • "I saw the man with the telescope" — who has the telescope?

3.2 Nuances and Variations

  • Idioms & Colloquialisms: "It's raining cats and dogs" doesn't mean animals are falling
  • Sarcasm & Humor: Hard for machines to detect without context
  • Slang: Constantly evolving and domain-specific

3.3 Data Quality and Diversity

  • NLP models require large, high-quality datasets
  • Biased or incomplete data → skewed model behavior
  • Must cover diverse dialects, languages, and contexts

3.4 Named Entity Recognition (NER) Challenges

  • Uncommon names, emerging organizations
  • Context-dependent entity disambiguation

3.5 Computational Challenges

  • Converting text → numerical data requires sophisticated techniques (embeddings)
  • Processing large corpora is computationally expensive

3.6 Integration with ML Models

  • Deep understanding of both NLP and machine learning is required
  • Steep learning curve for practitioners

4. Linguistics in NLP

Linguistics provides the structural foundation that NLP systems build upon:

| Linguistic Field | Role in NLP |
|---|---|
| Phonetics & Phonology | Sound patterns → Speech recognition |
| Morphology | Word structure → Stemming, Lemmatization |
| Syntax | Sentence structure → Parsing, Grammar checks |
| Semantics | Word/sentence meaning → Word sense disambiguation |
| Pragmatics | Context of language use → Dialogue systems |

Example: The word "home" can mean a house, a hometown, or a sense of belonging — semantics handles this variation.


5. Text Preprocessing

Preprocessing is a critical pipeline step that cleans and normalizes raw text before analysis.

Typical NLP Preprocessing Pipeline:

Raw Text → Case Folding → Special Char Removal → Tokenization → Stop Word Removal → Stemming/Lemmatization

5.1 Case Folding

Definition: Converting all characters to lowercase to ensure uniform representation.

Why it matters: "Apple" and "apple" should be treated as the same word in most contexts.

Methods in Python:

text = "Hello World"

# Method 1: lower() - standard lowercase
print(text.lower())       # "hello world"

# Method 2: casefold() - more aggressive, handles Unicode
print(text.casefold())    # "hello world" (better for non-ASCII characters)

Caution: Case folding can lose meaning for proper nouns ("US" vs "us") and abbreviations. Be careful in tasks like machine translation.
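
To make the difference between the two methods concrete, here is a small sketch. The German word "Straße" is an illustrative input chosen for this example, not one taken from the notes. lower() leaves "ß" unchanged, while casefold() expands it to "ss", so only casefold() treats the two spellings as equal.

```python
# Compare lower() and casefold() on a non-ASCII word (illustrative input).
word_a = "Straße"
word_b = "STRASSE"

lowered_match = word_a.lower() == word_b.lower()            # "straße" vs "strasse"
casefolded_match = word_a.casefold() == word_b.casefold()   # "strasse" vs "strasse"

print(lowered_match)     # False
print(casefolded_match)  # True
```

For plain ASCII text the two methods give the same result, so the choice only matters for multilingual data.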


5.2 Special Character Removal

Purpose: Remove symbols like @, #, $, !, %, URLs, and punctuation that add noise to data.

Method 1: Using re (Regular Expressions)

import re

text = "Hello! Are you coming to the #party @ 8pm? Check out www.example.com!"
clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print(clean_text)
# Output: Hello Are you coming to the party  8pm Check out wwwexamplecom
# Note: deleting "@" leaves a double space, and the URL collapses into "wwwexamplecom".

Method 2: Using SpaCy

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Here's an example: @user #hashtag https://example.com!"
doc = nlp(text)

clean_tokens = [token.text for token in doc if token.is_alpha]
clean_text = " ".join(clean_tokens)
print(clean_text)
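# Only purely alphabetic tokens survive: punctuation, the clitic "'s", and the URL are dropped.
# Whether "user" and "hashtag" survive depends on how the tokenizer splits "@user" and "#hashtag".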

Method 3: Using NLTK

import nltk
from nltk.tokenize import RegexpTokenizer

text = "Good morning! Let's meet at 5:00 pm @ the café."
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
clean_text = " ".join(tokens)
print(clean_text)
# Output: Good morning Let s meet at 5 00 pm the café
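
The regex example above turns the URL into "wwwexamplecom". One possible way around this, sketched below with only the standard library, is to drop URLs first, then strip symbols, then collapse leftover whitespace. The function name clean_text and the URL pattern are assumptions made for this sketch; the pattern is deliberately simple and will not catch every URL form.

```python
import re

def clean_text(text):
    # 1. Drop URLs first, so they are not mangled into strings like "wwwexamplecom".
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)
    # 2. Replace everything except ASCII letters, digits, and whitespace with a space.
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # 3. Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Hello! Are you coming to the #party @ 8pm? Check out www.example.com!"))
# Hello Are you coming to the party 8pm Check out
```

Because step 2 substitutes a space rather than an empty string, "Let's" becomes "Let s" instead of "Lets"; which behavior is preferable depends on the task.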

5.3 Stop Words Removal

Stop words are common, low-information words like "the", "is", "in", "and" that are typically filtered out to reduce noise.

Why remove them?

  • Improves signal-to-noise ratio
  • Reduces feature space
  • Helps models focus on meaningful content

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')       # newer NLTK releases may also need 'punkt_tab'
nltk.download('stopwords')

# Build the stop-word set once; looking words up in a set is fast
stop_words = set(stopwords.words('english'))

sample_sentence = "This is a sample sentence showing off the stop words filtration."
tokens = word_tokenize(sample_sentence)
filtered = [word for word in tokens if word.lower() not in stop_words]

print("Original:", sample_sentence)
print("Filtered:", " ".join(filtered))
# Filtered: sample sentence showing stop words filtration .

Note: Stop word lists are language-specific. NLTK supports multiple languages.


5.4 Tokenization

Tokenization is the process of breaking text into smaller units called tokens (words, subwords, or sentences).

Types of Tokenization

| Type | Description | Example |
|---|---|---|
| Word Tokenization | Split text into individual words | "Good morning!" → ["Good", "morning", "!"] |
| Sentence Tokenization | Split text into sentences | "Hello! How are you?" → ["Hello!", "How are you?"] |
| Subword Tokenization | Split words into frequent subword units (WordPiece in BERT, byte-pair encoding in GPT) | "playing" → ["play", "##ing"] |

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Tokenization is a fundamental step in NLP. It breaks down paragraphs into sentences or words."

sentences = sent_tokenize(text)
print(sentences)
# ['Tokenization is a fundamental step in NLP.', 'It breaks down paragraphs into sentences or words.']

words = word_tokenize(text)
print(words)
# ['Tokenization', 'is', 'a', 'fundamental', 'step', ...]
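
The subword row in the table can be illustrated with a greedy longest-match tokenizer in the style of WordPiece. This is a minimal sketch: the vocabulary below is a hand-written toy set (real vocabularies are learned from a corpus), and the function name wordpiece_tokenize is hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily match the longest vocabulary piece at each position."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"play", "un", "##ing", "##ed", "##play", "##able"}  # assumed toy vocabulary
print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Splitting into known pieces is what lets subword models represent words they never saw whole during training.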

Handling Contractions

import contractions

text = "I'm happy to see you! It's a great day."
expanded = contractions.fix(text)
print(expanded)
# Output: "I am happy to see you! It is a great day."

5.5 Stemming & Lemmatization

Both techniques reduce words to their base form, but differ in approach:

| Technique | Method | Example | Output |
|---|---|---|---|
| Stemming | Chops suffixes (rule-based, fast) | "running", "runs" | "run" (a stem may not be a real word) |
| Lemmatization | Uses vocabulary + grammar (slower, more accurate) | "better" | "good" |

import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')   # the lemmatizer needs the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))       # "run"
print(lemmatizer.lemmatize("better", pos="a"))  # "good"

✅ Use lemmatization when accuracy matters. Use stemming for speed.


6. N-Grams

An n-gram is a contiguous sequence of n items (words or characters) from a given text. N-grams are used in language modeling and text prediction.

| N | Name | Example |
|---|---|---|
| 1 | Unigram | ("I"), ("love") |
| 2 | Bigram | ("I", "love") |
| 3 | Trigram | ("I", "love", "NLP") |

import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

nltk.download('punkt')

def generate_ngrams(text, n):
    tokens = word_tokenize(text)
    return list(ngrams(tokens, n))

sample_text = "I am going to the hospital."
bigrams = generate_ngrams(sample_text, 2)
trigrams = generate_ngrams(sample_text, 3)

print("Bigrams:", bigrams)
# [('I', 'am'), ('am', 'going'), ('going', 'to'), ('to', 'the'), ('the', 'hospital'), ('hospital', '.')]

print("Trigrams:", trigrams)
# [('I', 'am', 'going'), ('am', 'going', 'to'), ('going', 'to', 'the'), ('to', 'the', 'hospital'), ('the', 'hospital', '.')]

Applications: Language models, spell correction, machine translation, next-word prediction.
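
As a rough illustration of next-word prediction, the sketch below builds n-grams with plain Python and uses bigram counts to pick the most frequent follower of a word. The helper names and the toy sentence are assumptions for this example; a real language model would add smoothing and a much larger corpus.

```python
from collections import Counter, defaultdict

def make_ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

def train_bigram_model(tokens):
    """Count, for every word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for first, second in make_ngrams(tokens, 2):
        followers[first][second] += 1
    return followers

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

tokens = "i love nlp and i love python and i love nlp".split()
model = train_bigram_model(tokens)
print(make_ngrams(["I", "love", "NLP"], 2))  # [('I', 'love'), ('love', 'NLP')]
print(predict_next(model, "love"))            # nlp
```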


7. Vectorization

Vectorization converts text into numerical vectors so machine learning algorithms can process it. Machines cannot understand raw text — they operate on numbers.


7.1 Count Vectorization

Converts text documents into a matrix of token counts.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love coding.", "I love AI."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['ai' 'coding' 'love']  ("I" is dropped: the default tokenizer skips one-character tokens)
print(X.toarray())
# [[0 1 1]
#  [1 0 1]]

Limitation: Does not account for word importance — all words treated equally.


7.2 TF-IDF

TF-IDF (Term Frequency – Inverse Document Frequency) reflects how important a word is in a document relative to a corpus.

Formulas

Term Frequency (TF): $$TF(t, d) = \frac{\text{Count of term } t \text{ in document } d}{\text{Total terms in document } d}$$

Inverse Document Frequency (IDF): $$IDF(t) = \log\left(\frac{\text{Total documents}}{\text{Documents containing } t}\right)$$

TF-IDF: $$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$

Example Calculation

  • Word "AI" appears 3 times in a 100-word document → TF = 3/100 = 0.03
  • 10 documents total, "AI" in 3 → IDF = log₁₀(10/3) ≈ 0.523 (using a base-10 logarithm)
  • TF-IDF = 0.03 × 0.523 ≈ 0.0157
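
The arithmetic above can be checked with a few lines of Python. This sketch implements the three formulas directly with a base-10 logarithm; the synthetic corpus is constructed only to match the counts in the example (10 documents, "ai" in 3 of them, 3 occurrences in a 100-token document).

```python
import math

def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens that equal `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency, base-10 log (term must occur in some document)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Synthetic corpus matching the worked example.
doc0 = ["ai"] * 3 + ["filler"] * 97
corpus = [doc0, ["ai", "rocks"], ["ai", "news"]] + [["other", "topic"] for _ in range(7)]

print(round(tf("ai", doc0), 4))              # 0.03
print(round(idf("ai", corpus), 4))           # 0.5229
print(round(tf_idf("ai", doc0, corpus), 4))  # 0.0157
```

scikit-learn's TfidfVectorizer uses a smoothed natural-log variant, ln((1 + N) / (1 + df)) + 1, followed by L2 normalization by default, so its numbers will differ from this hand calculation.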

Code Example

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love machine learning.", "Machine learning is amazing.", "I love NLP."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())

8. Word Embeddings

Word embeddings are dense, continuous vector representations of words that capture semantic relationships and contextual similarity.

Words with similar meanings → similar vectors in embedding space

Types:

| Category | Methods |
|---|---|
| Frequency-based | Bag of Words, TF-IDF, GloVe |
| Prediction-based | Word2Vec (CBOW, Skip-gram), FastText |

8.1 Word2Vec (CBOW & Skip-gram)

Word2Vec learns word representations by training a neural network on a "dummy task."

import gensim
from gensim.models import Word2Vec

sentences = [["I", "love", "NLP"], ["NLP", "is", "amazing"], ["deep", "learning", "is", "powerful"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Find similar words
similar = model.wv.most_similar("NLP")
print(similar)

CBOW (Continuous Bag of Words)

  • Task: Predict the target word given surrounding context words
  • Faster to train; slightly better accuracy for frequent words
  • Architecture: Input (context words) → Hidden Layer → Output (target word)

Skip-gram

  • Task: Predict context words given a target word (reverse of CBOW)
  • Works well with smaller training sets and represents rare words well; slower to train than CBOW
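
The two tasks differ only in how training examples are cut from a sentence. The sketch below generates the pairs each variant would train on; the function names and the window size are choices made for this illustration, and gensim performs this step internally.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the model predicts each context word from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context_words, target) pairs: the model predicts the target from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

sentence = ["I", "love", "NLP"]
print(skipgram_pairs(sentence, window=1))
# [('I', 'love'), ('love', 'I'), ('love', 'NLP'), ('NLP', 'love')]
print(cbow_pairs(sentence, window=1))
# [(['love'], 'I'), (['I', 'NLP'], 'love'), (['love'], 'NLP')]
```

Skip-gram produces one training example per (target, context) pair, while CBOW produces one per position, which is one reason CBOW trains faster on the same text.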

Training Process (Neural Network)

1. Initialize weights randomly
2. Feed input → compute output via forward pass
3. Calculate loss function (cross-entropy)
4. Backpropagate the loss to compute gradients → update weights
5. Repeat until loss converges

After training, the weights between the input layer and the hidden layer become the word vectors (embeddings).

How to improve CBOW?

  • Increase training data
  • Increase hidden layer size (more dimensions)

8.2 GloVe

GloVe (Global Vectors for Word Representation) uses global word-word co-occurrence statistics to build embeddings.

  • Creates a co-occurrence matrix from the entire corpus
  • Captures semantic relationships through co-occurrence frequency
  • Example: "ice" and "cold" co-occur often → placed close in vector space

8.3 FastText

An upgraded version of Word2Vec by Facebook AI Research.

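| Word2Vec | FastText |
|---|---|
| Word-level embeddings | Character/subword n-gram embeddings |
| Cannot handle OOV words | Handles out-of-vocabulary (OOV) words |

Example (character 3-grams): "capability" → cap, apa, pab, abi, bil, ili, lit, ity. FastText also wraps each word in boundary markers ("<capability>"), which adds "<ca" and "ty>".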

FastText is especially useful for morphologically rich languages and handling typos.
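
A short sketch of the n-gram extraction shows why unseen words are still representable: a new word shares most of its character n-grams with words seen during training. The function name char_ngrams is made up for this example, and only 3-grams are produced here (the library combines several n-gram lengths).

```python
def char_ngrams(word, n=3, add_boundaries=True):
    """Character n-grams of a word, optionally wrapped in < and > markers."""
    marked = f"<{word}>" if add_boundaries else word
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("capability", add_boundaries=False))
# ['cap', 'apa', 'pab', 'abi', 'bil', 'ili', 'lit', 'ity']
print(char_ngrams("capability"))
# ['<ca', 'cap', 'apa', 'pab', 'abi', 'bil', 'ili', 'lit', 'ity', 'ty>']

# An unseen inflection still overlaps heavily with the known word.
shared = set(char_ngrams("capability")) & set(char_ngrams("capabilities"))
print(len(shared))
```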


9. Cosine Similarity

Cosine Similarity measures the similarity between two vectors based on the angle between them — not their magnitude.

$$\cos(\theta) = \frac{\sum A_i \cdot B_i}{\sqrt{\sum A_i^2} \cdot \sqrt{\sum B_i^2}}$$

| Value | Meaning |
|---|---|
| 1 | Identical direction (most similar) |
| 0 | Perpendicular (unrelated) |
| -1 | Opposite direction (most dissimilar) |

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

A = np.array([[1, 2, 3]])
B = np.array([[4, 5, 6]])

similarity = cosine_similarity(A, B)
print(similarity)  # [[0.97463185]]

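Cosine similarity ignores vector magnitude, so a long document and a short one on the same topic can still score as highly similar. This is the main reason it is often preferred over Euclidean distance in NLP; both measures work in any number of dimensions (2D, 5D, 100D).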
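
The formula can be written out directly in plain Python. This sketch reproduces the scikit-learn result above and then scales one vector by 10 (an arbitrary factor chosen for illustration) to show that the cosine stays at 1 while the Euclidean distance grows.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1, 2, 3]
b = [4, 5, 6]
scaled_a = [10 * x for x in a]   # same direction, ten times the magnitude

print(round(cosine_similarity(a, b), 4))         # 0.9746
print(round(cosine_similarity(a, scaled_a), 4))  # 1.0
print(round(math.dist(a, scaled_a), 4))          # 33.6749
```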


10. Part-of-Speech Tagging & NER

Part-of-Speech (POS) Tagging

Definition: Labelling each word in a sentence with its grammatical role (noun, verb, adjective, etc.)

  • Uses Hidden Markov Models (HMM) or deep learning
  • Essential preprocessing step for many NLP tasks

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.pos_, token.dep_)

Named Entity Recognition (NER)

Definition: Identifying and classifying named entities (persons, organizations, locations, dates) in text.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was the 44th president of the United States.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output (spans and labels can vary with the model version):
# Barack Obama   PERSON
# 44th           ORDINAL
# United States  GPE

NER Entity Types:

| Label | Meaning |
|---|---|
| PERSON | People's names |
| ORG | Organizations |
| GPE | Geopolitical entities (countries, cities) |
| DATE | Dates and time periods |
| MONEY | Monetary values |

Applications: Information extraction, Q&A systems, chatbots, word sense disambiguation.


11. NLP Libraries & Tools

| Library | Purpose |
|---|---|
| NLTK | General NLP toolkit: tokenization, stemming, POS tagging |
| SpaCy | Industrial-strength NLP: fast, production-ready |
| Gensim | Topic modeling, Word2Vec, Doc2Vec |
| Transformers (HuggingFace) | Pre-trained models (BERT, GPT, T5) |
| scikit-learn (sklearn) | Classical ML models + vectorization (TF-IDF, CountVectorizer) |
| TensorFlow / Keras | Deep learning for NLP |
| PyTorch | Deep learning research and production |

12. Deep Learning for NLP

12.1 Activation Functions

Activation functions introduce non-linearity and determine whether a neuron "fires."

| Function | Range | Formula | Use Case |
|---|---|---|---|
| Step | {0, 1} | 1 if x > 0 else 0 | Binary classification |
| Linear | (-∞, ∞) | y = mx + c | Rarely used alone (no non-linearity) |
| Sigmoid | (0, 1) | y = 1 / (1 + e^(-x)) | Output layer for binary classification |
| Tanh | (-1, 1) | y = 2/(1 + e^(-2x)) - 1 | Hidden layers |
| ReLU | [0, ∞) | A(x) = max(0, x) | Most common in hidden layers |
| Leaky ReLU | (-∞, ∞) | max(0.01x, x) | Fixes "Dying ReLU" problem |

Dying ReLU Problem: Neurons get stuck at 0 for all inputs → fixed by Leaky ReLU (allows small negative gradient).
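
The formulas in the table translate almost line for line into Python. The sketch below uses scalars for readability (frameworks apply the same functions element-wise to tensors); the 0.01 slope for Leaky ReLU is taken from the table.

```python
import math

def step(x):
    return 1 if x > 0 else 0

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def tanh(x):
    return 2 / (1 + math.exp(-2 * x)) - 1

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return max(slope * x, x)

for x in (-2.0, 0.0, 2.0):
    print(x, step(x), round(sigmoid(x), 4), round(tanh(x), 4), relu(x), leaky_relu(x))
```

For a negative input such as -2.0, relu returns 0.0 while leaky_relu returns -0.02, which is the small non-zero signal that keeps a neuron from getting stuck at zero.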


12.2 ANN (Artificial Neural Network)

Architecture: Input Layer → Hidden Layer(s) → Output Layer

Implementation Steps:
1. Import libraries (TensorFlow/Keras or PyTorch)
2. Load and preprocess dataset
3. Initialize the ANN
4. Add Layers:
   - Input Layer: match input feature dimensions
   - Hidden Layers: choose neurons + activation functions (usually ReLU)
   - Output Layer: one sigmoid neuron for binary classification, or one softmax neuron per class for multi-class problems
5. Compile: optimizer + loss function + metrics
6. Train: fit on training data (batch_size, epochs)
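7. Evaluate on test data

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Placeholder data so the snippet runs end to end; replace with real features and labels
input_dim = 20
X_train = np.random.rand(100, input_dim)
y_train = np.random.randint(0, 2, size=100)

model = Sequential([
    Input(shape=(input_dim,)),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')   # binary classification
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=32, epochs=10)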

Hyperparameter Optimization:

  • GridSearchCV — exhaustive search over parameter grid
  • RandomizedSearchCV — random sampling (faster)
  • Manual tuning based on domain knowledge

12.3 Backpropagation & Forward Pass

| Phase | Description |
|---|---|
| Forward Pass | Input flows through network → prediction computed |
| Loss Calculation | Difference between predicted and actual output (cost function) |
| Backpropagation | Gradients computed layer by layer using chain rule |
| Weight Update | Weights adjusted using optimizer (SGD, Adam) |

  • Epoch: One complete pass through all training data
  • Iteration: One forward + backward pass on a batch
  • Stochastic Gradient Descent (SGD): Updates weights using one sample (or mini-batch) at a time
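
The four phases can be traced in a single-neuron example. This is a minimal sketch, not a general implementation: the logical-OR dataset, the learning rate of 0.5, and the 500 epochs are all illustrative choices, and with one sigmoid neuron the chain rule reduces to the gradient p - y.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Toy dataset: logical OR as (inputs, label) pairs
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w = [0.0, 0.0]   # weights
b = 0.0          # bias
lr = 0.5         # learning rate
epoch_losses = []

for epoch in range(500):                         # epoch: one pass over all samples
    total_loss = 0.0
    for (x1, x2), y in data:                     # iteration: one sample per update here
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)   # forward pass
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))  # cross-entropy loss
        grad_z = p - y                           # backpropagation (chain rule result)
        w[0] -= lr * grad_z * x1                 # weight update (SGD)
        w[1] -= lr * grad_z * x2
        b -= lr * grad_z
    epoch_losses.append(total_loss / len(data))

predictions = [round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for (x1, x2), _ in data]
print(predictions)                               # [0, 1, 1, 1]
print(epoch_losses[0] > epoch_losses[-1])        # True: the loss went down
```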

13. RNN (Recurrent Neural Network)

RNNs process sequential data by maintaining a hidden state that carries information from previous time steps.

x(t) ──► [RNN Cell] ──► y(t)
              ▲
              │ h(t) (hidden state fed back)

Issues with RNNs:

  • Vanishing Gradient Problem: Gradients shrink as they backpropagate through many time steps → early context forgotten
  • Exploding Gradient Problem: Gradients grow exponentially → unstable training
  • Short-term memory: Cannot handle long-range dependencies

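Example: predicting the next word after "Today I need..." depends only on nearby words, which an RNN handles well. When the needed context appeared many steps earlier, as in a long passage that began "Last year I had...", the RNN struggles to connect it.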
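
The vanishing gradient can be observed numerically. The sketch below assumes the standard vanilla RNN update h(t) = tanh(w_x · x(t) + w_h · h(t-1)), which the notes do not state explicitly, and uses scalar states and made-up weights for readability. The gradient of the last hidden state with respect to the first is a product of one factor per time step, so it shrinks quickly as the sequence grows.

```python
import math

def rnn_forward(xs, w_x, w_h):
    """Run a scalar vanilla RNN and return the hidden state at every step."""
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def gradient_wrt_first_state(states, w_h):
    """d h_T / d h_1 = product over later steps of w_h * (1 - h_t^2)."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_h * (1 - h * h)
    return grad

states = rnn_forward([1.0] * 30, w_x=0.5, w_h=0.5)
short_range = gradient_wrt_first_state(states[:3], w_h=0.5)   # across 2 steps
long_range = gradient_wrt_first_state(states, w_h=0.5)        # across 29 steps

print(short_range)
print(long_range)
```

With these weights every factor is at most 0.5, so the long-range gradient is many orders of magnitude smaller than the short-range one, and the first input has almost no influence on learning.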


14. LSTM (Long Short-Term Memory)

LSTM is a special type of RNN designed to mitigate the vanishing gradient problem by maintaining two memory states.

| State | Type | Purpose |
|---|---|---|
| c(t): Cell State | Long-term memory | Carries information across long sequences |
| h(t): Hidden State | Short-term memory | Used for immediate output computation |

Architecture

[Forget Gate] → [Input Gate] → [Output Gate]
      ↓               ↓              ↓
 Remove old info  Add new info  Produce output

Forget Gate

Decides what to remove from long-term memory.

$$f(t) = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

$$\text{Updated: } c_{t-1} \times f(t)$$

Input Gate

Decides what new information to add to cell state.

$$i(t) = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{c}(t) = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$

$$c(t) = f(t) \times c_{t-1} + i(t) \times \tilde{c}(t)$$

Output Gate

Computes the hidden state (output).

$$o(t) = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h(t) = o(t) \times \tanh(c(t))$$
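
The gate equations can be followed step by step in code. This is a simplified sketch with scalar states, so each product W · [h, x] becomes w_h * h + w_x * x; the parameter names in the dictionary and all numeric values are hypothetical. The biases are chosen so the forget gate is almost fully open and the input gate almost closed, which makes the cell state carry over nearly unchanged.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step with scalar states; p maps names to scalar parameters."""
    f = sigmoid(p["wf_h"] * h_prev + p["wf_x"] * x_t + p["bf"])          # forget gate
    i = sigmoid(p["wi_h"] * h_prev + p["wi_x"] * x_t + p["bi"])          # input gate
    c_tilde = math.tanh(p["wc_h"] * h_prev + p["wc_x"] * x_t + p["bc"])  # candidate memory
    c_t = f * c_prev + i * c_tilde                                       # new cell state
    o = sigmoid(p["wo_h"] * h_prev + p["wo_x"] * x_t + p["bo"])          # output gate
    h_t = o * math.tanh(c_t)                                             # new hidden state
    return h_t, c_t

params = {name: 0.5 for name in
          ("wf_h", "wf_x", "wi_h", "wi_x", "wc_h", "wc_x", "wo_h", "wo_x")}
params.update({"bf": 20.0, "bi": -20.0, "bc": 0.0, "bo": 0.0})  # keep old memory, admit little new

h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.8, p=params)
print(round(c_t, 4))  # 0.8
print(round(h_t, 4))
```

Because the cell state is updated by addition and gating rather than repeated squashing, information such as the 0.8 above can survive many time steps.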

Activation Functions in LSTM

  • Sigmoid σ: Output in (0, 1), used as a "gate" (near 0 = block, near 1 = pass)
  • Tanh: Output in (-1, 1), used to squash values

LSTM Network Architecture for NLP

Input → Embedding Layer → LSTM Layer → Dense Layer → Output

15. GRU (Gated Recurrent Unit)

GRU is a simplified version of LSTM with fewer parameters and comparable performance.

| Feature | LSTM | GRU |
|---|---|---|
| Memory states | 2 (cell + hidden) | 1 (hidden only) |
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| Parameters | More | Fewer |
| Training time | Slower | Faster |
| Performance | Slightly better on large data | Comparable |

GRU Gates

Reset Gate r(t) — controls short-term memory (how much past to forget): $$r(t) = \sigma(W_r \cdot [h_{t-1}, x_t])$$

Update Gate z(t) — balances old and new information: $$z(t) = \sigma(W_z \cdot [h_{t-1}, x_t])$$

Candidate Hidden State: $$\tilde{h}(t) = \tanh(W \cdot [r(t) \times h_{t-1}, x_t])$$

Final Hidden State: $$h(t) = (1 - z(t)) \times h_{t-1} + z(t) \times \tilde{h}(t)$$

Steps Summary

  1. Calculate Reset Gate r(t)
  2. Calculate Candidate Hidden State h̃(t)
  3. Calculate Update Gate z(t)
  4. Calculate Final Hidden State h(t)
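
The four steps map onto four lines of code. As with any scalar sketch, this simplifies each matrix product to w_h * h + w_x * x, and it omits biases because the equations above do. The parameter names and values are hypothetical; the two calls push the update gate towards 0 and towards 1 to show its effect.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step with a scalar hidden state."""
    r = sigmoid(p["wr_h"] * h_prev + p["wr_x"] * x_t)              # 1. reset gate
    h_tilde = math.tanh(p["w_h"] * (r * h_prev) + p["w_x"] * x_t)  # 2. candidate hidden state
    z = sigmoid(p["wz_h"] * h_prev + p["wz_x"] * x_t)              # 3. update gate
    return (1 - z) * h_prev + z * h_tilde                          # 4. final hidden state

base = {"wr_h": 0.5, "wr_x": 0.5, "w_h": 0.5, "w_x": 0.5, "wz_h": 0.0, "wz_x": 0.0}

kept = gru_step(1.0, 0.7, {**base, "wz_x": -30.0})     # z near 0: keep the old state
replaced = gru_step(1.0, 0.7, {**base, "wz_x": 30.0})  # z near 1: take the candidate

print(round(kept, 4))      # 0.7
print(round(replaced, 4))
```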

16. Bidirectional RNN

A Bidirectional RNN processes sequences in both forward and backward directions, capturing context from both past and future tokens.

Forward:   x₁ → x₂ → x₃ → x₄
Backward:  x₄ → x₃ → x₂ → x₁
                ↓
         Combined Output
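
A small sketch makes the "both directions" idea tangible. It assumes a scalar tanh RNN and, as a simplification, reuses the same made-up weights for both directions (real bidirectional layers learn separate parameters for each). Changing only the last input leaves the forward state at position 0 untouched but changes the backward state there.

```python
import math

def rnn_states(xs, w_x=0.5, w_h=0.5):
    """Hidden states of a scalar tanh RNN, one per input."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_states(xs):
    forward = rnn_states(xs)
    backward = rnn_states(xs[::-1])[::-1]   # run right-to-left, then realign to positions
    return list(zip(forward, backward))     # combined output: one pair per token

a = bidirectional_states([1.0, 0.0, -1.0, 0.5])
b = bidirectional_states([1.0, 0.0, -1.0, 2.0])   # only the last input differs

print(a[0][0] == b[0][0])  # True: the forward state at position 0 ignores later inputs
print(a[0][1] == b[0][1])  # False: the backward state at position 0 has seen the last input
```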

Applications:

  • Named Entity Recognition (NER)
  • Part-of-Speech Tagging
  • Machine Translation
  • Sentiment Analysis

Drawbacks:

  • Higher computational cost (more parameters)
  • Requires full sequence at inference → not suitable for real-time generation
  • Needs more data to generalize well

17. CNN for NLP

Convolutional Neural Networks (CNNs) are primarily used for image processing but are also applied in NLP for text classification and feature extraction.

CNN Pipeline for NLP

Input Text → Embedding Layer → Convolution → ReLU Activation → Pooling → Flattening → Fully Connected Layer → Output

| Layer | Role |
|---|---|
| Embedding | Convert words to dense vectors |
| Convolution | Extract local n-gram features |
| ReLU Activation | Introduce non-linearity |
| Pooling | Downsample, keeping the most important features |
| Flattening | Convert feature maps to 1D vector |
| Fully Connected | Classification / Regression |

Use Cases: Sentence classification, spam detection, sentiment analysis.
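
The convolution, ReLU, and pooling stages can be sketched without a framework. In this toy example the embeddings and the kernel are hand-picked numbers, and a kernel spanning two tokens acts as a bigram feature detector; a real model learns many such kernels.

```python
def conv1d(embeddings, kernel):
    """Slide a kernel spanning len(kernel) tokens over the sequence (no padding)."""
    width = len(kernel)
    outputs = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = sum(
            w * x
            for k_row, e_row in zip(kernel, window)
            for w, x in zip(k_row, e_row)
        )
        outputs.append(total)
    return outputs

def relu(values):
    return [max(0.0, v) for v in values]

def global_max_pool(values):
    return max(values)

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 tokens, 2 dimensions each
kernel = [[1.0, 0.0], [0.0, 1.0]]                    # spans 2 tokens (a bigram filter)

feature_map = relu(conv1d(embeddings, kernel))
print(feature_map)                   # [2.0, 1.0]
print(global_max_pool(feature_map))  # 2.0
```

Max pooling keeps only the strongest response, so the feature fires regardless of where in the sentence the matching bigram occurs.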


Quick Reference: Naive Bayes Text Classifier

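from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data for illustration only; replace with a real labelled corpus
X_train = ["I love this movie", "great film", "terrible plot", "I hate this movie"]
y_train = ["pos", "pos", "neg", "neg"]
X_test = ["great movie", "terrible film"]

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),   # text → token counts
    ('classifier', MultinomialNB())      # counts → class prediction
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions)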

Summary Cheat Sheet

NLP Pipeline:
Raw Text
   ↓
Preprocessing (case fold, remove special chars)
   ↓
Tokenization
   ↓
Stop Word Removal
   ↓
Stemming / Lemmatization
   ↓
Vectorization (Count / TF-IDF / Word2Vec / GloVe)
   ↓
Model (Naive Bayes / ANN / RNN / LSTM / GRU / CNN)
   ↓
Output (Classification / Generation / Translation)

| Concept | Key Idea |
|---|---|
| Tokenization | Split text into tokens |
| Stop Words | Remove low-info words |
| TF-IDF | Weight words by importance |
| Word2Vec | Predict word from context (CBOW) or context from word (Skip-gram) |
| GloVe | Embeddings from co-occurrence matrix |
| FastText | Subword-level embeddings |
| LSTM | Long + short term memory via gates |
| GRU | Simplified LSTM, faster training |
| Cosine Similarity | Angle-based vector similarity |
| NER | Identify entities in text |
| POS Tagging | Label words by grammatical role |