Natural Language Processing (NLP) is a field at the intersection of linguistics, computer science, and artificial intelligence. It focuses on enabling machines to understand, interpret, and generate human language through specific algorithms and models.
NLP bridges the gap between human communication and computer understanding.
Key Goals:
| Application | Description |
|---|---|
| Voice Assistants | Siri, Alexa, Google Assistant — understand spoken commands |
| Chatbots | Evolved from basic Q&A to intent-aware systems using LLMs |
| Machine Translation | Google Translate — real-time cross-language communication |
| Sentiment Analysis | Classifying text as positive, negative, or neutral (e.g., product reviews) |
| Text Summarization | Condensing long documents into key points |
| Search Engines | Query understanding and document retrieval |
| Named Entity Recognition | Identifying persons, organizations, locations in text |
| Question Answering | Systems that answer questions from a knowledge base |
Human language is context-dependent. The same phrase can carry different meanings:
Linguistics provides the structural foundation that NLP systems build upon:
| Linguistic Field | Role in NLP |
|---|---|
| Phonetics & Phonology | Sound patterns → Speech recognition |
| Morphology | Word structure → Stemming, Lemmatization |
| Syntax | Sentence structure → Parsing, Grammar checks |
| Semantics | Word/sentence meaning → Word sense disambiguation |
| Pragmatics | Context of language use → Dialogue systems |
Example: The word "home" can mean a house, a hometown, or a sense of belonging — semantics handles this variation.
Preprocessing is a critical pipeline step that cleans and normalizes raw text before analysis.
Typical NLP Preprocessing Pipeline:
Raw Text → Case Folding → Special Char Removal → Tokenization → Stop Word Removal → Stemming/Lemmatization
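As a rough illustration of the pipeline above, the sketch below chains its first steps using only the Python standard library. The names `STOP_WORDS` and `preprocess` are hypothetical, the stop list is deliberately tiny, and whitespace splitting stands in for a real tokenizer. Stemming and lemmatization are left out.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"the", "is", "on", "a", "an", "and", "in"}

def preprocess(text):
    """Case-fold, strip special characters, tokenize, and drop stop words."""
    text = text.casefold()                      # case folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # special character removal (ASCII only)
    tokens = text.split()                       # naive whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cat is on the mat!"))  # ['cat', 'mat']
```

In practice each stage would likely be swapped for a library call, but the order of operations stays the same.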
Definition: Converting all characters to lowercase to ensure uniform representation.
Why it matters: "Apple" and "apple" should be treated as the same word in most contexts.
Methods in Python:
text = "Hello World"
# Method 1: lower() - standard lowercase
print(text.lower()) # "hello world"
# Method 2: casefold() - more aggressive, handles Unicode
print(text.casefold()) # "hello world" (better for non-ASCII characters)
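The two methods only differ on non-ASCII text. A small sketch, using the German word "Straße" as an assumed example input:

```python
word = "Straße"

print(word.lower())     # straße
print(word.casefold())  # strasse

# casefold() lets the two spellings compare as equal; lower() does not.
print(word.casefold() == "STRASSE".casefold())  # True
print(word.lower() == "STRASSE".lower())        # False
```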
Caution: Case folding can lose meaning for proper nouns ("US" vs "us") and abbreviations. Be careful in tasks like machine translation.
Purpose: Remove symbols like @, #, $, !, %, URLs, and punctuation that add noise to data.
Method 1: Using re (Regular Expressions)
import re
text = "Hello! Are you coming to the #party @ 8pm? Check out www.example.com!"
clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print(clean_text)
# Output: Hello Are you coming to the party  8pm Check out wwwexamplecom
# (a double space remains where "@" was removed, and the URL collapses into one token)
Method 2: Using SpaCy
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Here's an example: @user #hashtag https://example.com!"
doc = nlp(text)
clean_tokens = [token.text for token in doc if token.is_alpha]
clean_text = " ".join(clean_tokens)
print(clean_text)
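# Keeps only purely alphabetic tokens, so "'s", ":", the URL and "!" are dropped.
# The exact output depends on how the tokenizer splits "@user" and "#hashtag".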
Method 3: Using NLTK
import nltk
from nltk.tokenize import RegexpTokenizer
text = "Good morning! Let's meet at 5:00 pm @ the café."
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
clean_text = " ".join(tokens)
print(clean_text)
# Output: Good morning Let s meet at 5 00 pm the café
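Stripping punctuation alone turns a URL into a meaningless token such as `wwwexamplecom`. One possible refinement, sketched here with the standard library, is to drop URLs first. The names `URL_PATTERN` and `remove_noise` are hypothetical, and the pattern is a simplification that only covers `http(s)://` and bare `www.` links.

```python
import re

# Simplified pattern: matches http(s):// links and bare www. links only.
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")

def remove_noise(text):
    """Drop URLs, then non-alphanumeric characters, then extra whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_noise("Hello! Are you coming to the #party @ 8pm? Check out www.example.com!"))
# Hello Are you coming to the party 8pm Check out
```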
Stop words are common, low-information words like "the", "is", "in", "and" that are typically filtered out to reduce noise.
Why remove them?
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
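nltk.download('punkt')      # tokenizer data; newer NLTK releases may need 'punkt_tab' instead
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
sample_sentence = "This is a sample sentence showing off the stop words filtration."
tokens = word_tokenize(sample_sentence)
filtered = [word for word in tokens if word.lower() not in stop_words]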
print("Original:", sample_sentence)
print("Filtered:", " ".join(filtered))
# Filtered: sample sentence showing stop words filtration .
Note: Stop word lists are language-specific. NLTK supports multiple languages.
Tokenization is the process of breaking text into smaller units called tokens (words, subwords, or sentences).
| Type | Description | Example |
|---|---|---|
| Word Tokenization | Split text into individual words | "Good morning!" → ["Good", "morning", "!"] |
| Sentence Tokenization | Split text into sentences | "Hello! How are you?" → ["Hello!", "How are you?"] |
| Subword Tokenization | Split words into smaller vocabulary units (WordPiece in BERT, byte-pair encoding in GPT) | "playing" → ["play", "##ing"] |
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Tokenization is a fundamental step in NLP. It breaks down paragraphs into sentences or words."
sentences = sent_tokenize(text)
print(sentences)
# ['Tokenization is a fundamental step in NLP.', 'It breaks down paragraphs into sentences or words.']
words = word_tokenize(text)
print(words)
# ['Tokenization', 'is', 'a', 'fundamental', 'step', ...]
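Subword vocabularies are learned from large corpora, but the matching step can be sketched with a toy vocabulary. The function below is a simplified greedy longest-match in the style of WordPiece. Both `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and this is not the implementation used by any particular library.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:              # no piece fits: give up on the whole word
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```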
import contractions
text = "I'm happy to see you! It's a great day."
expanded = contractions.fix(text)
print(expanded)
# Output: "I am happy to see you! It is a great day."
Both techniques reduce words to their base form, but differ in approach:
| Technique | Method | Example | Output |
|---|---|---|---|
| Stemming | Chops suffix (rule-based, fast) | "running", "runs" | "run" (may not be a real word) |
| Lemmatization | Uses vocabulary + grammar (slower, accurate) | "better" | "good" |
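import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # the lemmatizer needs the WordNet data
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("running")) # "run"
print(lemmatizer.lemmatize("better", pos="a")) # "good" (pos="a" marks the word as an adjective)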
✅ Use lemmatization when accuracy matters. Use stemming for speed.
An n-gram is a contiguous sequence of n items (words or characters) from a given text. N-grams are used in language modeling and text prediction.
| N | Name | Example |
|---|---|---|
| 1 | Unigram | ("I"), ("love") |
| 2 | Bigram | ("I", "love") |
| 3 | Trigram | ("I", "love", "NLP") |
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize
nltk.download('punkt')
def generate_ngrams(text, n):
    """Return the list of word n-grams in text."""
    tokens = word_tokenize(text)
    return list(ngrams(tokens, n))
sample_text = "I am going to the hospital."
bigrams = generate_ngrams(sample_text, 2)
trigrams = generate_ngrams(sample_text, 3)
print("Bigrams:", bigrams)
# [('I', 'am'), ('am', 'going'), ('going', 'to'), ('to', 'the'), ('the', 'hospital'), ('hospital', '.')]
print("Trigrams:", trigrams)
# [('I', 'am', 'going'), ('am', 'going', 'to'), ('going', 'to', 'the'), ('to', 'the', 'hospital'), ('the', 'hospital', '.')]
# The final period is its own token, so it appears in the last n-gram.
Applications: Language models, spell correction, machine translation, next-word prediction.
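To make the next-word prediction application concrete, here is a minimal sketch that counts bigrams and picks the most frequent follower of a word. It uses only the standard library. The toy sentence and the names `make_ngrams`, `followers` and `predict_next` are assumptions for illustration.

```python
from collections import Counter, defaultdict

def make_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Assumed toy corpus.
tokens = "i love nlp and i love coding and i love nlp".split()

# Count which word follows each word.
followers = defaultdict(Counter)
for prev, nxt in make_ngrams(tokens, 2):
    followers[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of word, or None if unseen."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

print(make_ngrams(["I", "love", "NLP"], 2))  # [('I', 'love'), ('love', 'NLP')]
print(predict_next("love"))                  # nlp
```

A real language model would smooth these counts and use longer contexts, but the counting idea is the same.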
Vectorization converts text into numerical vectors so machine learning algorithms can process it. Machines cannot understand raw text — they operate on numbers.
Converts text documents into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["I love coding.", "I love AI."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out()) # ['ai', 'coding', 'love']
print(X.toarray())
# [[0, 1, 1],
# [1, 0, 1]]
Limitation: Does not account for word importance — all words treated equally.
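The count matrix above can be reproduced by hand, which shows what the vectorizer is doing. This is a sketch with the standard library only. The regular expression mirrors CountVectorizer's default behaviour of lowercasing and dropping one-character tokens, which is why "I" is missing from the vocabulary.

```python
import re

corpus = ["I love coding.", "I love AI."]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(doc) for doc in corpus]
vocab = sorted({tok for doc in docs for tok in doc})
matrix = [[doc.count(term) for term in vocab] for doc in docs]

print(vocab)   # ['ai', 'coding', 'love']
print(matrix)  # [[0, 1, 1], [1, 0, 1]]
```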
TF-IDF (Term Frequency – Inverse Document Frequency) reflects how important a word is in a document relative to a corpus.
Term Frequency (TF): $$TF(t, d) = \frac{\text{Count of term } t \text{ in document } d}{\text{Total terms in document } d}$$
Inverse Document Frequency (IDF): $$IDF(t) = \log\left(\frac{\text{Total documents}}{\text{Documents containing } t}\right)$$
TF-IDF: $$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$
Worked example (using base-10 logarithms):
- "AI" appears 3 times in a 100-word document → TF = 3/100 = 0.03
- "AI" appears in 3 of 10 documents → IDF = log(10/3) ≈ 0.523
- TF-IDF = 0.03 × 0.523 ≈ 0.0157

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["I love machine learning.", "Machine learning is amazing.", "I love NLP."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
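The formulas above can also be evaluated directly. The sketch below uses base-10 logarithms, which is an assumption chosen to match the 0.523 figure in the worked example. Note that scikit-learn's `TfidfVectorizer` by default applies a smoothed natural-log IDF and L2-normalizes each row, so its numbers will differ from these.

```python
import math

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to term."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency (assumes term occurs in at least one document)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# The same three sentences as above, already lowercased and tokenized.
corpus = [
    ["i", "love", "machine", "learning"],
    ["machine", "learning", "is", "amazing"],
    ["i", "love", "nlp"],
]

print(round(tf_idf("nlp", corpus[2], corpus), 4))   # 0.159
print(round(tf_idf("love", corpus[2], corpus), 4))  # 0.0587
```

"nlp" occurs in one document and "love" in two, so "nlp" gets the higher weight even though both appear once in the third sentence.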
Word embeddings are dense, continuous vector representations of words that capture semantic relationships and contextual similarity.
Words with similar meanings → similar vectors in embedding space
Types:
| Category | Methods |
|---|---|
| Frequency-based | Bag of Words, TF-IDF, GloVe |
| Prediction-based | Word2Vec (CBOW, Skip-gram), FastText |
Word2Vec learns word representations by training a neural network on a "dummy task."
import gensim
from gensim.models import Word2Vec
sentences = [["I", "love", "NLP"], ["NLP", "is", "amazing"], ["deep", "learning", "is", "powerful"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
# Find similar words
similar = model.wv.most_similar("NLP")
print(similar)
1. Initialize weights randomly
2. Feed input → compute output via forward pass
3. Calculate loss function (cross-entropy)
4. If loss is high → backpropagation → update weights
5. Repeat until loss converges
The weights of the hidden layer after training become the word vectors (embeddings).
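The "dummy task" is predicting words from their neighbours. Before any training happens, the text is turned into training pairs. The sketch below shows how such pairs could be generated for both Word2Vec variants. The function names and the window size of 1 are assumptions for illustration, and this is not gensim's internal code.

```python
def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: Skip-gram predicts context from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """(context words, center) pairs: CBOW predicts the center word from context."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

sentence = ["I", "love", "NLP"]
print(skipgram_pairs(sentence))
# [('I', 'love'), ('love', 'I'), ('love', 'NLP'), ('NLP', 'love')]
print(cbow_pairs(sentence))
# [(['love'], 'I'), (['I', 'NLP'], 'love'), (['love'], 'NLP')]
```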
How to improve CBOW?
GloVe (Global Vectors for Word Representation) uses global word-word co-occurrence statistics to build embeddings.
"ice" and "cold" co-occur often → placed close in vector space
FastText is an upgraded version of Word2Vec by Facebook AI Research.
| Word2Vec | FastText |
|---|---|
| Word-level embeddings | Character/subword n-gram embeddings |
| Cannot handle OOV words | Handles Out-Of-Vocabulary words |
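Example (character 3-grams): "capability" → cap, apa, pab, abi, bil, ili, lit, ity. FastText also wraps each word in boundary markers, which adds <ca and ty> to this set.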
FastText is especially useful for morphologically rich languages and handling typos.
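The character n-gram idea is easy to reproduce. The following sketch extracts 3-grams with and without the `<` and `>` word-boundary markers, then counts how many n-grams a misspelling shares with the correct word. The function name `char_ngrams` and the misspelling "capabilty" are assumptions for illustration.

```python
def char_ngrams(word, n=3, boundaries=True):
    """Return the character n-grams of word, optionally with < > boundary markers."""
    if boundaries:
        word = f"<{word}>"
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("capability", boundaries=False))
# ['cap', 'apa', 'pab', 'abi', 'bil', 'ili', 'lit', 'ity']
print(char_ngrams("capability"))
# ['<ca', 'cap', 'apa', 'pab', 'abi', 'bil', 'ili', 'lit', 'ity', 'ty>']

# A typo still shares most of its n-grams with the correct spelling.
shared = set(char_ngrams("capability")) & set(char_ngrams("capabilty"))
print(len(shared))  # 7
```

Because an unseen word is represented through its n-grams, overlap like this is what lets a subword model produce a reasonable vector for it.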
Cosine Similarity measures the similarity between two vectors based on the angle between them — not their magnitude.
$$\cos(\theta) = \frac{\sum A_i \cdot B_i}{\sqrt{\sum A_i^2} \cdot \sqrt{\sum B_i^2}}$$
| Value | Meaning |
|---|---|
| 1 | Identical direction (most similar) |
| 0 | Perpendicular (unrelated) |
| -1 | Opposite direction (most dissimilar) |
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
A = np.array([[1, 2, 3]])
B = np.array([[4, 5, 6]])
similarity = cosine_similarity(A, B)
print(similarity) # [[0.97463185]]
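Cosine similarity ignores vector magnitude, so a long document and a short one on the same topic can still score as similar. This is why NLP often prefers it to Euclidean distance. It works in any number of dimensions (2D, 5D, 100D).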
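A from-scratch version of the formula makes the contrast with Euclidean distance visible. This is a sketch using only the standard library, and the scaled vector `A_scaled` is an assumed example.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A = [1, 2, 3]
B = [4, 5, 6]
A_scaled = [10 * x for x in A]   # same direction, ten times the magnitude

print(round(cosine_similarity(A, B), 4))          # 0.9746
print(round(cosine_similarity(A, A_scaled), 4))   # 1.0
print(round(euclidean_distance(A, A_scaled), 4))  # 33.6749
```

The scaled vector points the same way as `A`, so cosine similarity stays at 1 while the Euclidean distance is large.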
Definition: Labelling each word in a sentence with its grammatical role (noun, verb, adjective, etc.)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)
Definition: Identifying and classifying named entities (persons, organizations, locations, dates) in text.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was the 44th president of the United States.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Barack Obama PERSON
# 44th ORDINAL
# United States GPE
NER Entity Types:
| Label | Meaning |
|---|---|
| PERSON | People's names |
| ORG | Organizations |
| GPE | Geopolitical entities (countries, cities) |
| DATE | Dates and time periods |
| MONEY | Monetary values |
Applications: Information extraction, Q&A systems, chatbots, word sense disambiguation.
| Library | Purpose |
|---|---|
| NLTK | General NLP toolkit — tokenization, stemming, POS tagging |
| SpaCy | Industrial-strength NLP — fast, production-ready |
| Gensim | Topic modeling, Word2Vec, Doc2Vec |
| Transformers (HuggingFace) | Pre-trained models (BERT, GPT, T5) |
| sklearn | Classical ML models + vectorization (TF-IDF, CountVectorizer) |
| TensorFlow / Keras | Deep learning for NLP |
| PyTorch | Deep learning research and production |
Activation functions introduce non-linearity and determine whether a neuron "fires."
| Function | Range | Formula | Use Case |
|---|---|---|---|
| Step | {0, 1} | 1 if x > 0 else 0 | Binary classification |
| Linear | (-∞, ∞) | y = mx + c | Rarely used alone (no non-linearity) |
| Sigmoid | (0, 1) | y = 1 / (1 + e^(-x)) | Output layer for binary classification |
| Tanh | (-1, 1) | y = 2/(1 + e^(-2x)) - 1 | Hidden layers |
| ReLU | [0, ∞) | A(x) = max(0, x) | Most common in hidden layers |
| Leaky ReLU | (-∞, ∞) | max(0.01x, x) | Fixes "Dying ReLU" problem |
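Dying ReLU Problem: A neuron whose input stays negative outputs 0 and receives zero gradient, so it stops learning. Leaky ReLU mitigates this by keeping a small non-zero slope (0.01 in the table above) for negative inputs.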
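The formulas in the table translate directly into code. Below is a minimal sketch for scalar inputs using only the standard library. The `slope` parameter name is an assumption, with its default taken from the 0.01 in the table.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def tanh(x):
    # Written out with the table's formula; equivalent to math.tanh(x).
    return 2 / (1 + math.exp(-2 * x)) - 1

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Same as max(slope * x, x) for a slope between 0 and 1.
    return x if x > 0 else slope * x

for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 4), round(tanh(x), 4), relu(x), leaky_relu(x))
```

For a negative input, `relu` returns 0 while `leaky_relu` returns a small negative value, which is what keeps a gradient flowing.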
Architecture: Input Layer → Hidden Layer(s) → Output Layer
Implementation Steps:
1. Import libraries (TensorFlow/Keras or PyTorch)
2. Load and preprocess dataset
3. Initialize the ANN
4. Add Layers:
   - Input Layer: match input feature dimensions
   - Hidden Layers: choose neurons + activation functions (usually ReLU)
   - Output Layer: neurons = number of classes; softmax or sigmoid
5. Compile: optimizer + loss function + metrics
6. Train: fit on training data (batch_size, epochs)
7. Evaluate on test data
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# input_dim, X_train and y_train are assumed to come from the preprocessing step
model = Sequential([
    Dense(64, activation='relu', input_shape=(input_dim,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')  # binary classification
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=32, epochs=10)
Hyperparameter Optimization:
- GridSearchCV: exhaustive search over the parameter grid
- RandomizedSearchCV: random sampling (faster)

| Phase | Description |
|---|---|
| Forward Pass | Input flows through network → prediction computed |
| Loss Calculation | Difference between predicted and actual output (cost function) |
| Backpropagation | Gradients computed layer by layer using chain rule |
| Weight Update | Weights adjusted using optimizer (SGD, Adam) |
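The four phases in the table can be seen in a single-neuron example. The sketch below trains one sigmoid neuron with plain gradient descent on assumed toy data, using only the standard library. The data, learning rate and epoch count are illustrative choices, not recommendations.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Assumed toy data: the label is 1 when x is 2 or more.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0          # a single neuron: one weight and one bias
learning_rate = 0.5
losses = []

for epoch in range(500):
    # Forward pass: compute a prediction for every example.
    preds = [sigmoid(w * x + b) for x in xs]

    # Loss calculation: mean binary cross-entropy.
    loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, preds)) / len(xs)
    losses.append(loss)

    # Backpropagation: for sigmoid + cross-entropy, dLoss/dz = p - y.
    grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(p - y for p, y in zip(preds, ys)) / len(xs)

    # Weight update: plain gradient descent step.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Deep learning frameworks automate the gradient step for networks with many layers, but each training iteration follows this same sequence.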
RNNs process sequential data by maintaining a hidden state that carries information from previous time steps.
x(t) ──► [RNN Cell] ──► y(t)
             ▲
             │ h(t) (hidden state fed back)
Issues with RNNs:
Example: "Today I need..." (fine) vs "Last year I had..." (RNN struggles to connect)
LSTM (Long Short-Term Memory) is a special type of RNN designed to mitigate the vanishing gradient problem by maintaining two memory states.
| State | Type | Purpose |
|---|---|---|
| c(t) — Cell State | Long-term memory | Carries information across long sequences |
| h(t) — Hidden State | Short-term memory | Used for immediate output computation |
[Forget Gate]   →   [Input Gate]   →   [Output Gate]
      ↓                  ↓                   ↓
Remove old info     Add new info       Produce output
Forget Gate f(t): decides what to remove from long-term memory.
$$f(t) = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$\text{Updated: } c_{t-1} \times f(t)$$
Input Gate i(t): decides what new information to add to the cell state.
$$i(t) = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}(t) = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$c(t) = f(t) \cdot c_{t-1} + i(t) \cdot \tilde{c}(t)$$
Output Gate o(t): computes the hidden state (output).
$$o(t) = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h(t) = o(t) \times \tanh(c(t))$$
σ: output between 0 and 1, used as a "gate" (near 0 = block, near 1 = pass)
Input → Embedding Layer → LSTM Layer → Dense Layer → Output
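The gate equations above can be traced with a scalar LSTM cell, where every state and weight is a single number. This is a sketch for intuition only. The `params` layout and the weight values are hypothetical, and real layers use weight matrices learned during training.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step with scalar state.

    params maps each gate name to (w_h, w_x, b), so that
    w_h * h_prev + w_x * x_t + b plays the role of W · [h_{t-1}, x_t] + b.
    """
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f = sigmoid(pre("f"))            # forget gate
    i = sigmoid(pre("i"))            # input gate
    c_tilde = math.tanh(pre("c"))    # candidate cell state
    c = f * c_prev + i * c_tilde     # new cell state (long-term memory)
    o = sigmoid(pre("o"))            # output gate
    h = o * math.tanh(c)             # new hidden state (short-term memory)
    return h, c

# Hypothetical hand-picked weights; a trained model would learn these.
params = {"f": (0.5, 0.5, 0.0), "i": (0.5, 0.5, 0.0),
          "c": (0.5, 0.5, 0.0), "o": (0.5, 0.5, 0.0)}

h, c = 0.0, 0.0
for x_t in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x_t, h, c, params)
    print(f"x={x_t:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

The cell state `c` is only scaled by the forget gate and added to, which is what lets information survive across many steps.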
GRU is a simplified version of LSTM with fewer parameters and comparable performance.
| Feature | LSTM | GRU |
|---|---|---|
| Memory states | 2 (cell + hidden) | 1 (hidden only) |
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| Parameters | More | Fewer |
| Training time | Slower | Faster |
| Performance | Slightly better on large data | Comparable |
Reset Gate r(t) — controls short-term memory (how much past to forget):
$$r(t) = \sigma(W_r \cdot [h_{t-1}, x_t])$$
Update Gate z(t) — balances old and new information:
$$z(t) = \sigma(W_z \cdot [h_{t-1}, x_t])$$
Candidate Hidden State: $$\tilde{h}(t) = \tanh(W \cdot [r(t) \cdot h_{t-1}, x_t])$$
Final Hidden State: $$h(t) = (1 - z(t)) \cdot h_{t-1} + z(t) \cdot \tilde{h}(t)$$
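The same scalar treatment works for the GRU equations above. This sketch follows them term by term, including the absence of bias terms. The `params` layout and weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step with scalar state.

    params maps "r", "z" and "h" to (w_h, w_x); no bias terms are used,
    matching the equations above.
    """
    w_h, w_x = params["r"]
    r = sigmoid(w_h * h_prev + w_x * x_t)                # reset gate
    w_h, w_x = params["z"]
    z = sigmoid(w_h * h_prev + w_x * x_t)                # update gate
    w_h, w_x = params["h"]
    h_tilde = math.tanh(w_h * (r * h_prev) + w_x * x_t)  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde                # final hidden state

# Hypothetical hand-picked weights; a trained model would learn these.
params = {"r": (0.5, 0.5), "z": (0.5, 0.5), "h": (0.5, 0.5)}

h = 0.0
for x_t in [1.0, -1.0, 0.5]:
    h = gru_step(x_t, h, params)
    print(f"x={x_t:+.1f}  h={h:+.4f}")
```

Compared with the LSTM, there is a single state to carry forward and one fewer gate to compute.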
GRU flow: r(t) → h̃(t) → z(t) → h(t)
A Bidirectional RNN processes sequences in both forward and backward directions, capturing context from both past and future tokens.
Forward: x₁ → x₂ → x₃ → x₄
Backward: x₄ → x₃ → x₂ → x₁
↓
Combined Output
Applications:
Drawbacks:
Convolutional Neural Networks (CNNs) are primarily used for image processing but are also applied in NLP for text classification and feature extraction.
Input Text → Embedding Layer → Convolution → ReLU Activation → Pooling → Flattening → Fully Connected Layer → Output
| Layer | Role |
|---|---|
| Embedding | Convert words to dense vectors |
| Convolution | Extract local n-gram features |
| ReLU Activation | Introduce non-linearity |
| Pooling | Downsample — keep most important features |
| Flattening | Convert feature maps to 1D vector |
| Fully Connected | Classification / Regression |
Use Cases: Sentence classification, spam detection, sentiment analysis.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
NLP Pipeline:
Raw Text
↓
Preprocessing (case fold, remove special chars, stop words)
↓
Tokenization
↓
Stemming / Lemmatization
↓
Vectorization (Count / TF-IDF / Word2Vec / GloVe)
↓
Model (Naive Bayes / ANN / RNN / LSTM / GRU / CNN)
↓
Output (Classification / Generation / Translation)
| Concept | Key Idea |
|---|---|
| Tokenization | Split text into tokens |
| Stop Words | Remove low-info words |
| TF-IDF | Weight words by importance |
| Word2Vec | Predict word from context (CBOW) or context from word (Skip-gram) |
| GloVe | Embeddings from co-occurrence matrix |
| FastText | Subword-level embeddings |
| LSTM | Long + short term memory via gates |
| GRU | Simplified LSTM, faster training |
| Cosine Similarity | Angle-based vector similarity |
| NER | Identify entities in text |
| POS Tagging | Label words by grammatical role |