What is a Transformer?¶
A Transformer is a type of deep learning model that is designed to process sequences of data (like text, time series, or even images) efficiently. It’s the backbone of many state-of-the-art models in NLP, vision, and more, such as GPT, BERT, and Vision Transformers (ViT).
Why Transformers Are Important¶
- Handles sequences efficiently: unlike RNNs/LSTMs, Transformers process all sequence elements at once (parallelization).
- Captures long-range dependencies: the model can "see" the entire input sequence, not just the most recent elements (see the attention sketch after this list).
- State-of-the-art in NLP and beyond:
  - Text: GPT, BERT, T5
  - Images: Vision Transformers (ViT)
  - Speech, audio, and even multimodal data
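To make the "sees the entire sequence" point concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core mechanism inside a Transformer. The token embeddings are random and purely illustrative, not taken from any trained model.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each query scores every key, the scores are softmax-normalised,
    # and the output is a weighted sum of the values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# 4 "tokens" with 3-dimensional embeddings (random, purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
output, attn = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V
print(attn.round(2))  # each row sums to 1: how much each token attends to every other token

Because every row of the attention matrix covers all positions at once, the model can relate distant tokens directly instead of passing information step by step as an RNN does.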
1️⃣ Frameworks to Build Transformers
These are the main libraries for creating Transformer models:
Tool / Library: Purpose¶
PyTorch: Flexible deep learning framework, widely used to implement Transformers.
TensorFlow / Keras: High-level deep learning framework; makes it easy to build Transformer models.
Hugging Face Transformers: Pre-built state-of-the-art Transformer models (GPT, BERT, T5, etc.) for NLP (see the one-line example after this table).
JAX / Flax: Research-oriented, efficient Transformer implementations.
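As a quick taste of the Hugging Face Transformers library (assuming `transformers` plus a backend such as PyTorch are installed; the default model is downloaded on first use), a sentiment-analysis pipeline takes one line:

from transformers import pipeline  # pip install transformers torch

# Loads a small pre-trained Transformer behind the scenes (downloaded on first run)
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers read the whole sentence at once."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}] -- exact score depends on the model version

The example below does not use a real Transformer; it uses scikit-learn's MLPRegressor as a simple stand-in on a small marks dataset.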
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
import matplotlib.pyplot as plt
# -----------------------
# 1. Prepare the dataset
# -----------------------
data = {
"Math": [20,35,70,40,50,67,88,46,67,46],
"Sci": [54,60,54,34,36,67,89,90,57,67],
"Eng": [67,76,55,45,34,25,78,47,67,76],
"Dzo": [93,59,76,77,59,47,29,39,71,62],
"Total": [234,230,255,196,179,206,284,222,262,251]
}
df = pd.DataFrame(data)
X = df[["Math", "Sci", "Eng", "Dzo"]].values
y = df["Total"].values
# Scale features
scaler_X = StandardScaler()
X_scaled = scaler_X.fit_transform(X)
# -----------------------
# 2. Train-Test Split
# -----------------------
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# -----------------------
# 3. Fit Neural Network (as a proxy for Transformer)
# -----------------------
mlp = MLPRegressor(hidden_layer_sizes=(16,16), max_iter=2000, random_state=42)
mlp.fit(X_train, y_train)
# -----------------------
# 4. Predictions
# -----------------------
y_pred = mlp.predict(X_test)
# Display actual vs predicted
results = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
print(results)
# -----------------------
# 5. Plot
# -----------------------
plt.figure(figsize=(7,5))
plt.plot(range(len(y_test)), y_test, marker='o', label="Actual")
plt.plot(range(len(y_pred)), y_pred, marker='x', label="Predicted")
plt.xlabel("Test Sample Index")
plt.ylabel("Total Marks")
plt.title("Actual vs Predicted Total Marks (NN Proxy for Transformer)")
plt.legend()
plt.grid(True)
plt.show()
   Actual   Predicted
0     262  279.432371
1     230  228.867588

ConvergenceWarning: Stochastic Optimizer: Maximum iterations (2000) reached and the optimization hasn't converged yet.
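The ConvergenceWarning simply means the stochastic optimizer hit its iteration cap before converging on this tiny 10-row dataset. A minimal sketch of one way to address it (the iteration budget and tolerance here are illustrative, not tuned):

# Give the optimizer a larger iteration budget and a tighter stopping tolerance;
# on such a small dataset it may still stop at the cap, so treat these values
# as illustrative rather than tuned.
mlp = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=20000, tol=1e-7, random_state=42)
mlp.fit(X_train, y_train)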
2️⃣ Visualization Tools
Used to show how Transformers work, including attention patterns and activations:
Tool: Purpose¶
Matplotlib / Seaborn: Plot attention weights, loss curves, and accuracy trends (a toy heatmap example follows this table).
Plotly / Bokeh: Interactive plots for attention matrices or sequence relationships.
BertViz: Specifically designed to visualize attention in BERT and similar models.
Transformers Interpret / Captum: Model interpretability for Transformers (shows which tokens are important).
TensorBoard: Visualizes training metrics, attention maps, and embeddings.
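As a small illustration of what these tools display, here is a Matplotlib sketch of an attention heatmap; the 4x4 weight matrix is invented for illustration, not taken from a trained model.

import numpy as np
import matplotlib.pyplot as plt

# Invented attention weights between four tokens (rows = queries, columns = keys);
# in practice these would come from a real attention layer or a tool like BertViz.
tokens = ["The", "exam", "was", "hard"]
attn = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.15, 0.55, 0.10, 0.20],
    [0.10, 0.20, 0.50, 0.20],
    [0.05, 0.35, 0.10, 0.50],
])

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("Key (token attended to)")
ax.set_ylabel("Query (token attending)")
ax.set_title("Toy attention heatmap")
fig.colorbar(im, ax=ax)
plt.show()

The circular plot in the next code cell shows a related idea: using the MLP's first-layer weights as a rough, simulated "attention" over the input subjects.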
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
import matplotlib.pyplot as plt
# -----------------------
# Dataset
# -----------------------
data = {
"Math": [20,35,70,40,50,67,88,46,67,46],
"Sci": [54,60,54,34,36,67,89,90,57,67],
"Eng": [67,76,55,45,34,25,78,47,67,76],
"Dzo": [93,59,76,77,59,47,29,39,71,62],
"Total": [234,230,255,196,179,206,284,222,262,251]
}
df = pd.DataFrame(data)
X = df[["Math","Sci","Eng","Dzo"]]
y = df["Total"]
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# -----------------------
# MLPRegressor as proxy for Transformer
# -----------------------
mlp = MLPRegressor(hidden_layer_sizes=(16,16), max_iter=2000, random_state=42)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
# -----------------------
# Circular plot for feature importance (simulated attention)
# -----------------------
# Use absolute value of first layer weights as "attention"
attention = np.abs(mlp.coefs_[0]).mean(axis=1)
subjects = ["Math","Sci","Eng","Dzo"]
theta = np.linspace(0, 2*np.pi, len(subjects), endpoint=False)
radii = attention
width = 2*np.pi / len(subjects)
plt.figure(figsize=(6,6))
ax = plt.subplot(111, polar=True)
bars = ax.bar(theta, radii, width=width, bottom=0.0, alpha=0.6)
ax.set_xticks(theta)
ax.set_xticklabels(subjects)
ax.set_title("Simulated Attention Across Subjects (MLP Proxy)")
plt.show()
ConvergenceWarning: Stochastic Optimizer: Maximum iterations (2000) reached and the optimization hasn't converged yet.