Lesson 9 of 10 · 8 min read

NLP and LLMs


What you'll build

You will tokenise text, compute TF-IDF features, call an LLM API to classify customer reviews, and build a simple retrieval-augmented prompt, all in fewer than 100 lines of code per section.


Concepts

Tokenisation

Before any model can process text it needs to be converted into numbers. Tokenisation is the process of splitting text into units (tokens) and mapping each to an integer.

# Simple word tokeniser
import re
from collections import Counter

text = "The cat sat on the mat. The cat ate the rat."

# Basic: split on spaces and punctuation
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)

vocab  = {word: i for i, word in enumerate(sorted(set(tokens)))}
ids    = [vocab[t] for t in tokens]
print(ids)

# Byte-Pair Encoding (BPE) via tiktoken (OpenAI's tokeniser)
# pip install tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 tokeniser

text2 = "Machine learning is transforming how we build software."
token_ids = enc.encode(text2)
decoded   = enc.decode(token_ids)
print(f"Tokens: {token_ids}")
print(f"Token count: {len(token_ids)}")
print(f"Decoded: {decoded}")

BPE splits rare words into subwords. "transforming" might become ["transform", "ing"]. This keeps vocabulary size manageable while handling rare words gracefully.
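
You can see the split directly by decoding each token ID on its own. A quick check using the same tiktoken encoding as above (the exact pieces depend on the tokeniser):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Decode each token ID individually to inspect the subword pieces
pieces = [enc.decode([tid]) for tid in enc.encode("transforming")]
print(pieces)   # e.g. ['transform', 'ing'] -- exact split depends on the encoding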

Word embeddings

An embedding maps each token ID to a dense vector of floats. Similar words end up close together in embedding space.

import numpy as np

# Toy embeddings: 5 words, 3-dimensional
vocab_size = 5
embed_dim  = 3
embedding_matrix = np.random.randn(vocab_size, embed_dim)

# Look up embedding for token ID 2
token_id = 2
vec = embedding_matrix[token_id]
print(f"Embedding for token {token_id}: {vec}")

# In PyTorch
import torch
import torch.nn as nn

emb   = nn.Embedding(num_embeddings=10000, embedding_dim=128)
ids   = torch.tensor([3, 7, 1, 2])
vecs  = emb(ids)               # shape (4, 128)
print(vecs.shape)

Pre-trained embeddings (Word2Vec, GloVe, fastText) encode semantic relationships. In Word2Vec, the vector for "king" minus "man" plus "woman" is close to the vector for "queen". LLMs learn contextual embeddings: the same word has different vectors depending on its context.
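
A toy illustration of that analogy with made-up 3-dimensional vectors (real Word2Vec vectors are typically 100-300 dimensional and learned from data):

import numpy as np

# Made-up toy vectors, chosen so the analogy works out
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.7, 0.2, 0.1]),
    "woman": np.array([0.7, 0.2, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)   # queen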

The transformer architecture

The transformer is the architecture behind GPT, BERT, and most modern LLMs. The key components are:

Input tokens
    |
Token Embeddings + Positional Encodings
    |
[Self-Attention]                      <- "which other tokens matter for this one?"
    |                                    Q, K, V matrices: Attention = softmax(QK^T / sqrt(d)) V
[Add & LayerNorm]
    |
[Feed-Forward Network (MLP)]
    |
[Add & LayerNorm]
    |
... (repeated N times, e.g., 12 or 96) ...
    |
Output

Self-attention lets every token "look at" every other token in the sequence. The key intuition: for the word "bank" in "river bank" vs "bank account", attention over the surrounding words determines which meaning is used.

The QK^T / sqrt(d) scaling keeps the dot products from growing too large before the softmax; without it, large values would saturate the softmax and shrink the gradients.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # scores: (batch, seq, seq) -- how strongly each token attends to every other
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ V

# Tiny example: 3 tokens, 4-dimensional keys/queries/values
Q = torch.randn(1, 3, 4)
K = torch.randn(1, 3, 4)
V = torch.randn(1, 3, 4)

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (1, 3, 4)

Calling LLM APIs

You rarely train transformers from scratch. You call an existing LLM via an API and provide context in the prompt.

# pip install openai
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY_HERE")  # better: set OPENAI_API_KEY and call OpenAI()

reviews = [
    "This product is absolutely amazing! Best purchase I have made.",
    "Terrible quality. Broke after two days. Avoid.",
    "Average product. Nothing special.",
]

def classify_sentiment(review):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You are a sentiment classifier. Reply with only one word: Positive, Negative, or Neutral."},
            {"role": "user",
             "content": review},
        ],
        max_tokens=5,
        temperature=0,
    )
    return response.choices[0].message.content.strip()

for r in reviews:
    label = classify_sentiment(r)
    print(f"[{label}] {r[:50]}...")

temperature=0 makes the output effectively deterministic: the model always picks the most likely next token. Use higher temperatures for creative tasks.
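
Under the hood, temperature divides the model's logits before the softmax. A minimal numpy illustration with made-up logits:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])   # made-up next-token scores

for T in [0.2, 1.0, 2.0]:
    print(f"T={T}: {np.round(softmax(logits / T), 3)}")

# Low T sharpens the distribution towards the top token;
# high T flattens it, so sampling becomes more varied.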

When not to fine-tune

Fine-tuning means taking a pre-trained model and further training it on your specific dataset. It is powerful but expensive. Do not fine-tune when:

  • Prompting is enough. Good system prompts and few-shot examples can handle most tasks (see the sketch after this list).
  • Your dataset is small. Fine-tuning on fewer than a few thousand examples often hurts: the model forgets general knowledge (catastrophic forgetting).
  • The task changes frequently. A fine-tuned model is static. A prompted model adapts the moment you change the prompt.
  • You cannot afford the compute. Fine-tuning GPT-4-class models costs thousands of dollars.
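
A few-shot prompt often replaces fine-tuning entirely. A minimal sketch, assuming the same OpenAI client as above and hypothetical support-ticket labels:

few_shot_messages = [
    {"role": "system",
     "content": "Classify the support ticket as Billing, Technical, or Other. Reply with one word."},
    # Hypothetical labelled examples shown to the model inside the prompt
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "Billing"},
    {"role": "user", "content": "The app crashes when I open settings."},
    {"role": "assistant", "content": "Technical"},
    # The actual query
    {"role": "user", "content": "How do I update my payment card?"},
]

# response = client.chat.completions.create(
#     model="gpt-4o-mini", messages=few_shot_messages, temperature=0)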

Fine-tune when:

  • You need a specific output format the model consistently gets wrong even with prompting.
  • You have tens of thousands of labelled examples and latency/cost matters enough to use a smaller fine-tuned model.
  • You need to teach the model domain-specific knowledge that was not in its training data (e.g., a proprietary programming language).

Hands-on

Build a simple text classification pipeline: train a scikit-learn model on TF-IDF features, then compare it against an LLM-based zero-shot classifier.

import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# --- Synthetic reviews dataset ---
positive = [
    "Love this product, works perfectly!",
    "Excellent quality, very satisfied.",
    "Great value for money, would recommend.",
    "Fast delivery and product is exactly as described.",
    "Five stars, will buy again.",
]
negative = [
    "Worst purchase ever, do not buy.",
    "Product stopped working after one week.",
    "Complete waste of money.",
    "Very disappointed, not as advertised.",
    "Broke on first use, terrible quality.",
]
neutral = [
    "Average product, nothing special.",
    "It works, I suppose.",
    "Decent enough for the price.",
    "Expected more but not bad.",
    "Just okay, not great not terrible.",
]

texts  = positive + negative + neutral
labels = [2] * 5 + [0] * 5 + [1] * 5   # 0=neg, 1=neutral, 2=pos

df = pd.DataFrame({"text": texts, "label": labels})

# --- TF-IDF + Logistic Regression ---
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.33, random_state=42, stratify=df["label"]
)

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    max_features=500,
    stop_words="english",
)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf  = tfidf.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

label_names = ["Negative", "Neutral", "Positive"]
print("TF-IDF + LR Report:")
print(classification_report(y_test, y_pred, target_names=label_names, zero_division=0))

# --- Top TF-IDF features per class ---
feature_names = tfidf.get_feature_names_out()
for i, name in enumerate(label_names):
    top_idx = clf.coef_[i].argsort()[-5:][::-1]
    top_features = [feature_names[j] for j in top_idx]
    print(f"Top features for {name}: {top_features}")

# --- Retrieval-Augmented Prompt (no API key needed to run the structure) ---
# This shows how RAG works in principle.
print("\n--- RAG prompt structure ---")
context_docs = [
    "Our return policy allows returns within 30 days for a full refund.",
    "Shipping usually takes 3-5 business days.",
    "For product defects, contact support@shop.com with your order number.",
]

question = "What do I do if my product is defective?"

# In a real RAG system: embed docs, embed question, retrieve top-k by cosine similarity
# Here we mock the retrieval step
retrieved = context_docs[2]   # pretend we retrieved the relevant doc

prompt = f"""Use the following context to answer the question.
Context: {retrieved}
Question: {question}
Answer:"""

print(prompt)
# You would then send this `prompt` to an LLM API
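
To make the retrieval step real without an API key, one crude option is to reuse TF-IDF as the embedding: vectorise the docs and the question, then take the highest cosine similarity. A sketch (production RAG systems use dense semantic embeddings instead, so "defective" would also match "defects"):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vec      = TfidfVectorizer().fit(context_docs)
doc_vecs = vec.transform(context_docs)
q_vec    = vec.transform([question])

# Pick the document most similar to the question
sims = cosine_similarity(q_vec, doc_vecs)[0]
best = int(sims.argmax())
print(f"Retrieved (score {sims[best]:.2f}): {context_docs[best]}")

With this toy corpus, the shared word "product" is enough to pick the right document.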

TF-IDF + logistic regression is still a reasonable baseline for text classification. For short texts with limited training data, it can be competitive with fine-tuned transformer models. The LLM approach shines when you have no labelled data at all.


Common pitfalls

Using too high a temperature for factual tasks. Temperature > 0 introduces randomness. For tasks like extraction or classification, use temperature=0.

Ignoring token limits. LLMs have a context window (e.g., 128k tokens for GPT-4o). If your prompt + document + expected output exceeds the limit, the API will truncate or error. Always estimate token usage before production.
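
A rough pre-flight check with tiktoken (the 128k figure below is the advertised GPT-4o context window; check your model's documentation, and use tiktoken.encoding_for_model to match the tokeniser to your model):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt, context_window=128_000, reserve_for_output=1_000):
    # Count prompt tokens and leave headroom for the model's reply
    n = len(enc.encode(prompt))
    return n + reserve_for_output <= context_window

print(fits_in_context("Use the following context to answer the question..."))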

Assuming the model knows recent events. LLMs have a training cutoff date. They do not know about events after that date unless you provide the information in the prompt (RAG).

Fine-tuning on dirty labels. If your labelled dataset has 15% label errors, your fine-tuned model will learn those errors. Clean your training data before fine-tuning.

Not tracking API costs. Each token costs money. A prompt that works fine for 10 requests can bankrupt a project at 10 million requests. Cache responses, truncate context, and monitor spend.
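
A minimal in-process caching sketch, assuming the classify_sentiment function from earlier (a production system would persist the cache and key on the full prompt):

from functools import lru_cache

@lru_cache(maxsize=10_000)
def classify_sentiment_cached(review):
    # Identical inputs hit the cache instead of paying for another API call
    return classify_sentiment(review)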


What to try next

  • Use the Hugging Face transformers library to run a small BERT model locally for sentiment analysis.
  • Try semantic search: embed sentences with sentence-transformers and use cosine similarity to find related sentences.
  • Build a simple RAG system: embed a PDF's paragraphs, store them in a vector database (ChromaDB or FAISS), and retrieve relevant chunks before calling the LLM.
  • Move on to Lesson 10 (Deploy an ML App), where you will serve a model as a web API.
