🧠 Natural Language Processing (NLP) — Complete Notes

Introduction to NLP
Applications of NLP
Challenges in NLP
Linguistics in NLP
Text Preprocessing
N-Grams
Vectorization
- Count Vectorization
- TF-IDF
Word Embeddings
Cosine Similarity
Part-of-Speech Tagging & NER
NLP Libraries & Tools
Deep Learning for NLP
RNN
LSTM
GRU
Bidirectional RNN
CNN for NLP

1. Introduction to NLP

Natural Language Processing (NLP) is a field at the intersection of linguistics, computer science, and artificial intelligence. It focuses on enabling machines to understand, interpret, and generate human language through specific algorithms and models.

NLP bridges the gap between human communication and computer understanding.

Key Goals:

Enable machines to read and understand text
Allow computers to generate meaningful human language
Facilitate seamless human-computer interaction

2. Applications of NLP

Application	Description
Voice Assistants	Siri, Alexa, Google Assistant — understand spoken commands
Chatbots	Evolved from basic Q&A to intent-aware systems using LLMs
Machine Translation	Google Translate — real-time cross-language communication
Sentiment Analysis	Classifying text as positive, negative, or neutral (e.g., product reviews)
Text Summarization	Condensing long documents into key points
Search Engines	Query understanding and document retrieval
Named Entity Recognition	Identifying persons, organizations, locations in text
Question Answering	Systems that answer questions from a knowledge base

3. Challenges in NLP

3.1 Ambiguity in Language

Human language is context-dependent. The same phrase can carry different meanings:

"I saw the man with the telescope" — who has the telescope?

3.2 Nuances and Variations

Idioms & Colloquialisms: "It's raining cats and dogs" doesn't mean animals are falling
Sarcasm & Humor: Hard for machines to detect without context
Slang: Constantly evolving and domain-specific

3.3 Data Quality and Diversity

NLP models require large, high-quality datasets
Biased or incomplete data → skewed model behavior
Must cover diverse dialects, languages, and contexts

3.4 Named Entity Recognition (NER) Challenges

Uncommon names, emerging organizations
Context-dependent entity disambiguation

3.5 Computational Challenges

Converting text → numerical data requires sophisticated techniques (embeddings)
Processing large corpora is computationally expensive

3.6 Integration with ML Models

Deep understanding of both NLP and machine learning is required
Steep learning curve for practitioners

4. Linguistics in NLP

Linguistics provides the structural foundation that NLP systems build upon:

Linguistic Field	Role in NLP
Phonetics & Phonology	Sound patterns → Speech recognition
Morphology	Word structure → Stemming, Lemmatization
Syntax	Sentence structure → Parsing, Grammar checks
Semantics	Word/sentence meaning → Word sense disambiguation
Pragmatics	Context of language use → Dialogue systems

Example: The word "home" can mean a house, a hometown, or a sense of belonging — semantics handles this variation.

5. Text Preprocessing

Preprocessing is a critical pipeline step that cleans and normalizes raw text before analysis.

Typical NLP Preprocessing Pipeline:

Raw Text → Case Folding → Special Char Removal → Tokenization → Stop Word Removal → Stemming/Lemmatization

5.1 Case Folding

Definition: Converting all characters to lowercase to ensure uniform representation.

Why it matters: "Apple" and "apple" should be treated as the same word in most contexts.

Methods in Python:

text = "Hello World"

# Method 1: lower() - standard lowercase
print(text.lower())       # "hello world"

# Method 2: casefold() - more aggressive, handles Unicode
print(text.casefold())    # "hello world" (better for non-ASCII characters)

Caution: Case folding can lose meaning for proper nouns ("US" vs "us") and abbreviations. Be careful in tasks like machine translation.
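
The two methods only diverge on non-ASCII text, which the "Hello World" example cannot show. A minimal sketch, assuming the German word "Straße" as an illustrative input (it is not part of the notes above):

```python
# lower() keeps the German sharp s, while casefold() expands it to "ss",
# so only casefold() makes the two spellings compare equal.
word_a = "Straße"
word_b = "STRASSE"

lower_match = word_a.lower() == word_b.lower()
casefold_match = word_a.casefold() == word_b.casefold()

print(word_a.lower())               # straße
print(word_a.casefold())            # strasse
print(lower_match, casefold_match)  # False True
```

For caseless matching of user input or multilingual text, casefold() is usually the safer choice.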

5.2 Special Character Removal

Purpose: Remove symbols like @, #, $, !, %, URLs, and punctuation that add noise to data.

Method 1: Using re (Regular Expressions)

import re

text = "Hello! Are you coming to the #party @ 8pm? Check out www.example.com!"
clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print(clean_text)
# Output: Hello Are you coming to the party  8pm Check out wwwexamplecom
# (the removed "@" leaves a double space, and the URL collapses into one token)
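
Stripping symbols directly, as the regex above does, fuses the URL into a single meaningless token and leaves gaps where symbols stood. One possible refinement is sketched below; the URL pattern and the helper name `clean_text` are assumptions chosen for illustration, not a complete URL matcher.

```python
import re

def clean_text(text):
    """Strip URLs first, then symbols, then collapse repeated whitespace."""
    # Illustrative URL pattern: http(s):// links or anything starting with "www."
    text = re.sub(r'(https?://\S+|www\.\S+)', ' ', text)
    # Drop everything that is not a letter, digit, or whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse the gaps left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

sample = "Hello! Are you coming to the #party @ 8pm? Check out www.example.com!"
print(clean_text(sample))
# Hello Are you coming to the party 8pm Check out
```

The order matters: once punctuation is gone, a URL can no longer be recognized.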

Method 2: Using SpaCy

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Here's an example: @user #hashtag https://example.com!"
doc = nlp(text)

clean_tokens = [token.text for token in doc if token.is_alpha]
clean_text = " ".join(clean_tokens)
print(clean_text)
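# Only purely alphabetic tokens survive: the "'s" clitic, punctuation, and the URL are dropped.
# The exact token list depends on the spaCy tokenizer version.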

Method 3: Using NLTK

import nltk
from nltk.tokenize import RegexpTokenizer

text = "Good morning! Let's meet at 5:00 pm @ the café."
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
clean_text = " ".join(tokens)
print(clean_text)
# Output: Good morning Let s meet at 5 00 pm the café

5.3 Stop Words Removal

Stop words are common, low-information words like "the", "is", "in", "and" that are typically filtered out to reduce noise.

Why remove them?

Improves signal-to-noise ratio
Reduces feature space
Helps models focus on meaningful content

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

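nltk.download('punkt')       # newer NLTK releases may require 'punkt_tab' instead
nltk.download('stopwords')

# Build the stop word set once; calling stopwords.words() per token is slow
stop_words = set(stopwords.words('english'))

sample_sentence = "This is a sample sentence showing off the stop words filtration."
tokens = word_tokenize(sample_sentence)
filtered = [word for word in tokens if word.lower() not in stop_words]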

print("Original:", sample_sentence)
print("Filtered:", " ".join(filtered))
# Filtered: sample sentence showing stop words filtration .

Note: Stop word lists are language-specific. NLTK supports multiple languages.

5.4 Tokenization

Tokenization is the process of breaking text into smaller units called tokens (words, subwords, or sentences).

Types of Tokenization

Type	Description	Example
Word Tokenization	Split text into individual words	`"Good morning!"` → `["Good", "morning", "!"]`
Sentence Tokenization	Split text into sentences	`"Hello! How are you?"` → `["Hello!", "How are you?"]`
Subword Tokenization	Split words into smaller units from a learned vocabulary (WordPiece in BERT, byte-pair encoding in GPT)	`"playing"` → `["play", "##ing"]`

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Tokenization is a fundamental step in NLP. It breaks down paragraphs into sentences or words."

sentences = sent_tokenize(text)
print(sentences)
# ['Tokenization is a fundamental step in NLP.', 'It breaks down paragraphs into sentences or words.']

words = word_tokenize(text)
print(words)
# ['Tokenization', 'is', 'a', 'fundamental', 'step', ...]
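
The subword row in the table above can be illustrated with a greedy longest-match split in the style of WordPiece. This is a toy sketch: the vocabulary is made up for the example, and real tokenizers learn theirs from a corpus.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece covers this position
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, for illustration only
vocab = {"play", "##ing", "##ed", "un", "##happy"}

print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
print(wordpiece_tokenize("unhappy", vocab))  # ['un', '##happy']
print(wordpiece_tokenize("xyz", vocab))      # ['[UNK]']
```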

Handling Contractions

import contractions

text = "I'm happy to see you! It's a great day."
expanded = contractions.fix(text)
print(expanded)
# Output: "I am happy to see you! It is a great day."

5.5 Stemming & Lemmatization

Both techniques reduce words to their base form, but differ in approach:

Technique	Method	Example	Output
Stemming	Chops suffix (rule-based, fast)	`"running"`, `"runs"`	`"run"` (may not be a real word)
Lemmatization	Uses vocabulary + grammar (slower, accurate)	`"better"`	`"good"`

import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # the lemmatizer needs the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))       # "run"
print(lemmatizer.lemmatize("better", pos="a"))  # "good"

✅ Use lemmatization when accuracy matters. Use stemming for speed.
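
The difference in approach can be seen in a toy comparison. The sketch below is not the Porter algorithm or WordNet: `SUFFIXES` and `LEMMA_TABLE` are hypothetical stand-ins, kept tiny to show why a rule-based stemmer can emit non-words while a dictionary lookup cannot.

```python
# Toy rules and dictionary, invented for this illustration
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_TABLE = {"better": "good", "ran": "run", "studies": "study"}

def toy_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a vocabulary; unknown words pass through unchanged."""
    return LEMMA_TABLE.get(word, word)

for word in ["studies", "running", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
# studies studi study
# running runn running
# better better good
```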

6. N-Grams

An n-gram is a contiguous sequence of n items (words or characters) from a given text. N-grams are used in language modeling and text prediction.

N	Name	Example
1	Unigram	`("I")`, `("love")`
2	Bigram	`("I", "love")`
3	Trigram	`("I", "love", "NLP")`

import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

nltk.download('punkt')

def generate_ngrams(text, n):
    tokens = word_tokenize(text)
    return list(ngrams(tokens, n))

sample_text = "I am going to the hospital."
bigrams = generate_ngrams(sample_text, 2)
trigrams = generate_ngrams(sample_text, 3)

print("Bigrams:", bigrams)
# [('I', 'am'), ('am', 'going'), ('going', 'to'), ('to', 'the'), ('the', 'hospital'), ('hospital', '.')]

print("Trigrams:", trigrams)
# [('I', 'am', 'going'), ('am', 'going', 'to'), ('going', 'to', 'the'), ('to', 'the', 'hospital'), ('the', 'hospital', '.')]

Applications: Language models, spell correction, machine translation, next-word prediction.
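
Next-word prediction, the last application listed, can be sketched with bigram counts alone. The example below uses only the standard library; the toy corpus and the names `make_ngrams` and `predict_next` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def make_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus, already lowercased and tokenized on whitespace
corpus = "i love nlp . i love coding . i enjoy nlp .".split()

# For each word, count which words follow it
followers = defaultdict(Counter)
for first, second in make_ngrams(corpus, 2):
    followers[first][second] += 1

def predict_next(word):
    """Most frequent follower of `word` in the toy corpus, or None if unseen."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

print(predict_next("i"))       # love
print(predict_next("python"))  # None
```

A real language model would smooth these counts and condition on longer histories, but the counting step is the same.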

7. Vectorization

Vectorization converts text into numerical vectors so machine learning algorithms can process it. Machines cannot understand raw text — they operate on numbers.

7.1 Count Vectorization

Converts text documents into a matrix of token counts.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love coding.", "I love AI."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['ai', 'coding', 'love']
print(X.toarray())
# [[0, 1, 1],
#  [1, 0, 1]]

Limitation: Does not account for word importance — all words treated equally.
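
To see where the matrix above comes from, the same counts can be rebuilt by hand. This is a sketch, assuming the regular expression below mirrors CountVectorizer's default token pattern (tokens of two or more word characters, which is why "I" is dropped); `tokenize` is a helper name introduced here.

```python
import re
from collections import Counter

corpus = ["I love coding.", "I love AI."]

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in corpus]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary term
matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)  # ['ai', 'coding', 'love']
print(matrix)      # [[0, 1, 1], [1, 0, 1]]
```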

7.2 TF-IDF

TF-IDF (Term Frequency – Inverse Document Frequency) reflects how important a word is in a document relative to a corpus.

Formulas

Term Frequency (TF): $$TF(t, d) = \frac{\text{Count of term } t \text{ in document } d}{\text{Total terms in document } d}$$

Inverse Document Frequency (IDF): $$IDF(t) = \log\left(\frac{\text{Total documents}}{\text{Documents containing } t}\right)$$

TF-IDF: $$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$

Example Calculation

Word "AI" appears 3 times in a 100-word document → TF = 3/100 = 0.03
10 documents total, "AI" in 3 → IDF = log₁₀(10/3) ≈ 0.523 (base-10 logarithm; the natural log would give ≈ 1.204)
TF-IDF = 0.03 × 0.523 ≈ 0.0157
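
The arithmetic above can be checked in a few lines. A minimal sketch, assuming the base-10 logarithm that reproduces the 0.523 figure; the function names are illustrative.

```python
import math

def term_frequency(term_count, total_terms):
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term):
    # Base-10 logarithm, matching the worked example
    return math.log10(total_docs / docs_with_term)

tf_ai = term_frequency(3, 100)
idf_ai = inverse_document_frequency(10, 3)
tfidf_ai = tf_ai * idf_ai

print(round(tf_ai, 2), round(idf_ai, 3), round(tfidf_ai, 4))
# 0.03 0.523 0.0157
```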

Code Example

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love machine learning.", "Machine learning is amazing.", "I love NLP."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
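print(X.toarray())
# Note: TfidfVectorizer uses a smoothed natural-log IDF, ln((1 + N) / (1 + df)) + 1,
# and L2-normalizes each row, so its values differ from the hand calculation above.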

8. Word Embeddings

Word embeddings are dense, continuous vector representations of words that capture semantic relationships and contextual similarity.

Words with similar meanings → similar vectors in embedding space

Types:

Category	Methods
Frequency-based	Bag of Words, TF-IDF, GloVe
Prediction-based	Word2Vec (CBOW, Skip-gram), FastText

8.1 Word2Vec (CBOW & Skip-gram)

Word2Vec learns word representations by training a neural network on a "dummy task."

import gensim
from gensim.models import Word2Vec

sentences = [["I", "love", "NLP"], ["NLP", "is", "amazing"], ["deep", "learning", "is", "powerful"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Find similar words
similar = model.wv.most_similar("NLP")
print(similar)

CBOW (Continuous Bag of Words)

Task: Predict the target word given surrounding context words
Faster to train; slightly better accuracy for frequent words
Architecture: Input (context words) → Hidden Layer → Output (target word)

Skip-gram

Task: Predict context words given a target word (reverse of CBOW)
Slower to train, but works well with smaller datasets and represents rare words well
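
The two tasks differ only in how training examples are cut from a sentence. The sketch below builds both kinds of examples from the same tokens; `training_pairs` and the window size of 1 are assumptions for illustration, and no network is trained here.

```python
def training_pairs(tokens, window=1):
    """Build CBOW examples (context -> target) and Skip-gram examples (target -> context)."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        cbow.append((context, target))                      # many inputs, one output
        skipgram.extend((target, ctx) for ctx in context)   # one input, one output each
    return cbow, skipgram

cbow, skipgram = training_pairs(["I", "love", "NLP"])
print(cbow)
# [(['love'], 'I'), (['I', 'NLP'], 'love'), (['love'], 'NLP')]
print(skipgram)
# [('I', 'love'), ('love', 'I'), ('love', 'NLP'), ('NLP', 'love')]
```

Skip-gram produces more training examples per sentence, which is one reason it trains more slowly.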

Training Process (Neural Network)

1. Initialize weights randomly
2. Feed input → compute output via forward pass
3. Calculate loss function (cross-entropy)
4. Backpropagate the loss gradient → update weights
5. Repeat until loss converges

After training, the input-to-hidden weight matrix holds the word vectors (embeddings): each row is the vector for one vocabulary word.

How to improve CBOW?

Increase training data
Increase hidden layer size (more dimensions)

8.2 GloVe

GloVe (Global Vectors for Word Representation) uses global word-word co-occurrence statistics to build embeddings.

Creates a co-occurrence matrix from the entire corpus
Captures semantic relationships through co-occurrence frequency
Example: "ice" and "cold" co-occur often → placed close in vector space

8.3 FastText

An upgraded version of Word2Vec by Facebook AI Research.

Word2Vec	FastText
Word-level embeddings	Character/subword n-gram embeddings
Cannot handle OOV words	Handles Out-Of-Vocabulary words

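Example (character trigrams): "capability" → cap, apa, pab, abi, bil, ili, lit, ity. FastText also adds word-boundary markers (< and >) and by default uses n-grams of length 3 to 6.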

FastText is especially useful for morphologically rich languages and handling typos.
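
The n-gram split above, and the reason it helps with unseen words, can be sketched without the FastText library. `char_ngrams` is an illustrative helper; the `boundaries` flag only imitates the `<` and `>` markers and is not FastText's actual API.

```python
def char_ngrams(word, n_min=3, n_max=3, boundaries=False):
    """List the character n-grams of a word, optionally wrapped in < and >."""
    token = f"<{word}>" if boundaries else word
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

print(char_ngrams("capability"))
# ['cap', 'apa', 'pab', 'abi', 'bil', 'ili', 'lit', 'ity']

# A word missing from the vocabulary still shares most n-grams with a known one
shared = set(char_ngrams("capability")) & set(char_ngrams("capabilities"))
print(sorted(shared))
# ['abi', 'apa', 'bil', 'cap', 'ili', 'lit', 'pab']
```

Because an unseen word's vector is assembled from its n-gram vectors, these shared pieces give it a sensible position in the embedding space.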

9. Cosine Similarity

Cosine Similarity measures the similarity between two vectors based on the angle between them — not their magnitude.

$$\cos(\theta) = \frac{\sum A_i \cdot B_i}{\sqrt{\sum A_i^2} \cdot \sqrt{\sum B_i^2}}$$

Value	Meaning
`1`	Identical direction (most similar)
`0`	Perpendicular (unrelated)
`-1`	Opposite direction (most dissimilar)

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

A = np.array([[1, 2, 3]])
B = np.array([[4, 5, 6]])

similarity = cosine_similarity(A, B)
print(similarity)  # [[0.97463185]]

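Because it depends only on direction and not on magnitude, cosine similarity is insensitive to document length: a long and a short document on the same topic still score as similar. This is why it is often preferred over Euclidean distance in NLP. It works in any number of dimensions (2D, 5D, 100D).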
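
The formula can be written out directly, which also makes the contrast with Euclidean distance concrete. A minimal sketch using only the standard library; the scaled vector is an illustrative input.

```python
import math

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1, 2, 3]
b = [4, 5, 6]
print(round(cosine(a, b), 4))          # 0.9746

# Same direction, ten times the magnitude
scaled = [10 * x for x in a]
print(round(cosine(a, scaled), 4))     # 1.0
print(round(math.dist(a, scaled), 2))  # 33.67
```

The scaled vector is a perfect match by cosine similarity but far away by Euclidean distance.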

10. Part-of-Speech Tagging & NER

Part-of-Speech (POS) Tagging

Definition: Labelling each word in a sentence with its grammatical role (noun, verb, adjective, etc.)

Uses Hidden Markov Models (HMM) or deep learning
Essential preprocessing step for many NLP tasks

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.pos_, token.dep_)

Named Entity Recognition (NER)

Definition: Identifying and classifying named entities (persons, organizations, locations, dates) in text.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was the 44th president of the United States.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Barack Obama   PERSON
# 44th           ORDINAL
# United States  GPE

NER Entity Types:

Label	Meaning
PERSON	People's names
ORG	Organizations
GPE	Geopolitical entities (countries, cities)
DATE	Dates and time periods
MONEY	Monetary values

Applications: Information extraction, Q&A systems, chatbots, word sense disambiguation.

11. NLP Libraries & Tools

Library	Purpose
NLTK	General NLP toolkit — tokenization, stemming, POS tagging
SpaCy	Industrial-strength NLP — fast, production-ready
Gensim	Topic modeling, Word2Vec, Doc2Vec
Transformers (HuggingFace)	Pre-trained models (BERT, GPT, T5)
sklearn	Classical ML models + vectorization (TF-IDF, CountVectorizer)
TensorFlow / Keras	Deep learning for NLP
PyTorch	Deep learning research and production

12. Deep Learning for NLP

12.1 Activation Functions

Activation functions introduce non-linearity and determine whether a neuron "fires."

Function	Range	Formula	Use Case
Step	{0, 1}	`1 if x > 0 else 0`	Binary classification
Linear	(-∞, ∞)	`y = mx + c`	Rarely used alone (no non-linearity)
Sigmoid	(0, 1)	`y = 1 / (1 + e^(-x))`	Output layer for binary classification
Tanh	(-1, 1)	`y = 2/(1 + e^(-2x)) - 1`	Hidden layers
ReLU	[0, ∞)	`A(x) = max(0, x)`	Most common in hidden layers
Leaky ReLU	(-∞, ∞)	`max(0.01x, x)`	Fixes "Dying ReLU" problem

Dying ReLU Problem: A neuron whose input is negative for every example outputs 0 and receives zero gradient, so it stops learning. Leaky ReLU fixes this by keeping a small non-zero gradient for negative inputs.
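
The formulas in the table translate directly into code. A small sketch using only the standard library; the 0.01 slope follows the table, and the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def tanh(x):
    # Same form as in the table; equivalent to math.tanh(x)
    return 2 / (1 + math.exp(-2 * x)) - 1

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return max(slope * x, x)

for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), round(tanh(x), 3), relu(x), leaky_relu(x))
```

For x = -2.0, relu returns 0.0 while leaky_relu returns -0.02, which is the small signal that keeps the neuron trainable.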

12.2 ANN (Artificial Neural Network)

Architecture: Input Layer → Hidden Layer(s) → Output Layer

Implementation Steps:
1. Import libraries (TensorFlow/Keras or PyTorch)
2. Load and preprocess dataset
3. Initialize the ANN
4. Add Layers:
   - Input Layer: match input feature dimensions
   - Hidden Layers: choose neurons + activation functions (usually ReLU)
   - Output Layer: one sigmoid neuron for binary classification, or one softmax neuron per class for multi-class
5. Compile: optimizer + loss function + metrics
6. Train: fit on training data (batch_size, epochs)
7. Evaluate on test data

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

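# X_train (features) and y_train (0/1 labels) are assumed to be preprocessed NumPy arrays
input_dim = X_train.shape[1]   # number of input features

model = Sequential([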
    Dense(64, activation='relu', input_shape=(input_dim,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')   # binary classification
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=32, epochs=10)

Hyperparameter Optimization:

GridSearchCV — exhaustive search over parameter grid
RandomizedSearchCV — random sampling (faster)
Manual tuning based on domain knowledge

12.3 Backpropagation & Forward Pass

Phase	Description
Forward Pass	Input flows through network → prediction computed
Loss Calculation	Difference between predicted and actual output (cost function)
Backpropagation	Gradients computed layer by layer using chain rule
Weight Update	Weights adjusted using optimizer (SGD, Adam)

Epoch: One complete pass through all training data
Iteration: One forward + backward pass on a batch
Stochastic Gradient Descent (SGD): Updates weights using one sample (or mini-batch) at a time
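
The four phases in the table can be traced on a single sigmoid neuron. This is a minimal sketch with made-up toy data, a learning rate of 0.1, and 500 epochs, all chosen for illustration; it uses one sample per update (pure SGD).

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Toy data (invented for this example): the label is 1 when x >= 2
data = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
w, b, lr = 0.0, 0.0, 0.1

def mean_loss(w, b):
    """Average binary cross-entropy over the toy data."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)

initial_loss = mean_loss(w, b)
for epoch in range(500):        # one epoch = one pass over all samples
    for x, y in data:           # one iteration = one sample here
        p = sigmoid(w * x + b)  # forward pass
        grad_z = p - y          # loss gradient w.r.t. z for sigmoid + cross-entropy
        w -= lr * grad_z * x    # chain rule: dz/dw = x
        b -= lr * grad_z        # chain rule: dz/db = 1
final_loss = mean_loss(w, b)

print(round(initial_loss, 3))   # 0.693
print(final_loss < initial_loss)
```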

13. RNN (Recurrent Neural Network)

RNNs process sequential data by maintaining a hidden state that carries information from previous time steps.

x(t) ──► [RNN Cell] ──► y(t)
              ▲
              │ h(t) (hidden state fed back)

Issues with RNNs:

Vanishing Gradient Problem: Gradients shrink as they backpropagate through many time steps → early context forgotten
Exploding Gradient Problem: Gradients grow exponentially → unstable training
Short-term memory: Cannot handle long-range dependencies

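Example: a dependency within a few words ("Today I need...") is easy to model, but when a sentence opens with "Last year I had..." and depends on that opening many words later, a plain RNN struggles to make the connection.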
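
The vanishing and exploding cases share one cause: backpropagation through time multiplies the gradient by a similar factor at every step. A simplified sketch, assuming a constant per-step factor (real factors vary with the weights and activations):

```python
def gradient_scale(per_step_factor, steps):
    """Size of a gradient after being multiplied by the same factor at every time step."""
    scale = 1.0
    for _ in range(steps):
        scale *= per_step_factor
    return scale

vanishing = gradient_scale(0.5, 50)   # factor below 1: shrinks toward zero
exploding = gradient_scale(1.5, 50)   # factor above 1: grows without bound

print(f"{vanishing:.1e}")  # 8.9e-16
print(f"{exploding:.1e}")  # 6.4e+08
```

After 50 steps the first gradient is too small to update early weights, and the second is large enough to destabilize training.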

14. LSTM (Long Short-Term Memory)

LSTM is a special type of RNN designed to mitigate the vanishing gradient problem by maintaining two memory states.

State	Type	Purpose
`c(t)` — Cell State	Long-term memory	Carries information across long sequences
`h(t)` — Hidden State	Short-term memory	Used for immediate output computation

Architecture

[Forget Gate] → [Input Gate] → [Output Gate]
      ↓               ↓              ↓
 Remove old info  Add new info  Produce output

Forget Gate

Decides what to remove from long-term memory.

$$f(t) = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

$$\text{Updated: } c_{t-1} \times f(t)$$

Input Gate

Decides what new information to add to cell state.

$$i(t) = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{c}(t) = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$

$$c(t) = f(t) \times c_{t-1} + i(t) \times \tilde{c}(t)$$

Output Gate

Computes the hidden state (output).

$$o(t) = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h(t) = o(t) \times \tanh(c(t))$$
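
The three gates combine into a single time step. The sketch below follows the equations above with scalar states, so each weight matrix reduces to a pair (w_h, w_x) plus a bias; the parameter values and the `params` layout are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step with scalar state; params maps a gate name to (w_h, w_x, b)."""
    def pre_activation(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("f"))        # forget gate
    i_t = sigmoid(pre_activation("i"))        # input gate
    c_tilde = math.tanh(pre_activation("c"))  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # new cell state (long-term memory)
    o_t = sigmoid(pre_activation("o"))        # output gate
    h_t = o_t * math.tanh(c_t)                # new hidden state (short-term memory)
    return h_t, c_t

# Hypothetical parameters: the same small weights for every gate
params = {gate: (0.5, 0.5, 0.0) for gate in "fico"}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(round(h, 4), round(c, 4))
```

In a real layer the states are vectors, the weights are matrices, and the gate products are element-wise.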

Activation Functions in LSTM

Sigmoid σ: Output in (0, 1), used as a "gate" (near 0 = block, near 1 = pass)
Tanh: Output in (-1, 1), used to squash values

LSTM Network Architecture for NLP

Input → Embedding Layer → LSTM Layer → Dense Layer → Output

15. GRU (Gated Recurrent Unit)

GRU is a simplified version of LSTM with fewer parameters and comparable performance.

Feature	LSTM	GRU
Memory states	2 (cell + hidden)	1 (hidden only)
Gates	3 (forget, input, output)	2 (reset, update)
Parameters	More	Fewer
Training time	Slower	Faster
Performance	Slightly better on large data	Comparable

GRU Gates

Reset Gate r(t) — controls short-term memory (how much past to forget): $$r(t) = \sigma(W_r \cdot [h_{t-1}, x_t])$$

Update Gate z(t) — balances old and new information: $$z(t) = \sigma(W_z \cdot [h_{t-1}, x_t])$$

Candidate Hidden State: $$\tilde{h}(t) = \tanh(W \cdot [r(t) \cdot h_{t-1}, x_t])$$

Final Hidden State: $$h(t) = (1 - z(t)) \cdot h_{t-1} + z(t) \cdot \tilde{h}(t)$$

Steps Summary

Calculate Reset Gate r(t)
Calculate Candidate Hidden State h̃(t)
Calculate Update Gate z(t)
Calculate Final Hidden State h(t)
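
The four steps can be sketched as one function. As in the equations above, no bias terms are used; the states are scalars, and the weight values and `weights` layout are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def gru_step(x_t, h_prev, weights):
    """One GRU time step with scalar state; weights maps a gate name to (w_h, w_x)."""
    w_h, w_x = weights["r"]
    r_t = sigmoid(w_h * h_prev + w_x * x_t)                # 1. reset gate
    w_h, w_x = weights["h"]
    h_tilde = math.tanh(w_h * (r_t * h_prev) + w_x * x_t)  # 2. candidate hidden state
    w_h, w_x = weights["z"]
    z_t = sigmoid(w_h * h_prev + w_x * x_t)                # 3. update gate
    return (1 - z_t) * h_prev + z_t * h_tilde              # 4. final hidden state

# Hypothetical weights: the same small values for every gate
weights = {"r": (0.5, 0.5), "h": (0.5, 0.5), "z": (0.5, 0.5)}

h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, weights)
print(round(h, 4))
```

Compared with an LSTM step, there is no separate cell state to carry: the single hidden state plays both roles.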

16. Bidirectional RNN

A Bidirectional RNN processes sequences in both forward and backward directions, capturing context from both past and future tokens.

Forward:   x₁ → x₂ → x₃ → x₄
Backward:  x₄ → x₃ → x₂ → x₁
                ↓
         Combined Output
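
The diagram above can be sketched with a scalar RNN run twice. For brevity both directions share the same illustrative weights here; real bidirectional layers learn separate weights per direction and usually concatenate the two states.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8):
    """Scalar RNN: h(t) = tanh(w_x * x(t) + w_h * h(t-1)); returns every h(t)."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def run_bidirectional(inputs):
    forward = run_rnn(inputs)
    # Run on the reversed sequence, then flip back so positions line up
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))

inputs = [1.0, 0.0, -1.0, 0.5]
combined = run_bidirectional(inputs)
for position, (fwd, bwd) in enumerate(combined, start=1):
    print(position, round(fwd, 3), round(bwd, 3))
```

At each position the forward value summarizes everything to the left and the backward value everything to the right.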

Applications:

Named Entity Recognition (NER)
Part-of-Speech Tagging
Machine Translation
Sentiment Analysis

Drawbacks:

Higher computational cost (more parameters)
Requires full sequence at inference → not suitable for real-time generation
Needs more data to generalize well

17. CNN for NLP

Convolutional Neural Networks (CNNs) are primarily used for image processing but are also applied in NLP for text classification and feature extraction.

CNN Pipeline for NLP

Input Text → Embedding Layer → Convolution → ReLU Activation → Pooling → Flattening → Fully Connected Layer → Output

Layer	Role
Embedding	Convert words to dense vectors
Convolution	Extract local n-gram features
ReLU Activation	Introduce non-linearity
Pooling	Downsample — keep most important features
Flattening	Convert feature maps to 1D vector
Fully Connected	Classification / Regression

Use Cases: Sentence classification, spam detection, sentiment analysis.
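
The convolution, ReLU, and pooling stages of the pipeline above can be sketched on a tiny sequence. The embeddings and the width-2 kernel below are made-up values; a trained model would learn many kernels of several widths.

```python
def conv1d_relu(sequence, kernel):
    """Slide the kernel over consecutive token vectors; each window yields one feature."""
    width = len(kernel)
    features = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        total = sum(weight * value
                    for kernel_vec, token_vec in zip(kernel, window)
                    for weight, value in zip(kernel_vec, token_vec))
        features.append(max(0.0, total))  # ReLU activation
    return features

# 4 tokens, each a 2-dimensional embedding (illustrative values)
embedded = [[0.1, 0.3], [0.8, -0.2], [0.5, 0.9], [-0.4, 0.2]]
# Kernel of width 2: looks at two neighboring tokens, like a bigram detector
kernel = [[1.0, 0.0], [0.0, 1.0]]

features = conv1d_relu(embedded, kernel)
pooled = max(features)  # max pooling keeps the strongest response

print([round(f, 2) for f in features])  # [0.0, 1.7, 0.7]
print(round(pooled, 2))                 # 1.7
```

Max pooling makes the result independent of where in the sentence the pattern occurred, which suits classification tasks.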

Quick Reference: Naive Bayes Text Classifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

Summary Cheat Sheet

NLP Pipeline:
Raw Text
   ↓
Preprocessing (case fold, remove special chars)
   ↓
Tokenization
   ↓
Stop Word Removal
   ↓
Stemming / Lemmatization
   ↓
Vectorization (Count / TF-IDF / Word2Vec / GloVe)
   ↓
Model (Naive Bayes / ANN / RNN / LSTM / GRU / CNN)
   ↓
Output (Classification / Generation / Translation)

Concept	Key Idea
Tokenization	Split text into tokens
Stop Words	Remove low-info words
TF-IDF	Weight words by importance
Word2Vec	Predict word from context (CBOW) or context from word (Skip-gram)
GloVe	Embeddings from co-occurrence matrix
FastText	Subword-level embeddings
LSTM	Long + short term memory via gates
GRU	Simplified LSTM, faster training
Cosine Similarity	Angle-based vector similarity
NER	Identify entities in text
POS Tagging	Label words by grammatical role