[Dorji Tshezom] - Fab Futures - Data Science

What is a Transformer?

A Transformer is a type of deep learning model that is designed to process sequences of data (like text, time series, or even images) efficiently. It’s the backbone of many state-of-the-art models in NLP, vision, and more, such as GPT, BERT, and Vision Transformers (ViT).
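To make this concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation inside a Transformer. The matrices and dimensions below are random placeholders for illustration, not values from a real model.

import numpy as np

# Toy "sequence" of 4 tokens, each an 8-dimensional vector (random placeholders).
np.random.seed(0)
X = np.random.randn(4, 8)

# In a real Transformer, Q, K, V come from learned projection matrices;
# random matrices are used here only to show the shapes involved.
W_q, W_k, W_v = (np.random.randn(8, 8) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: every token attends to every other token
# in one matrix multiplication, which is why Transformers parallelize so well.
scores = Q @ K.T / np.sqrt(K.shape[-1])                                 # (4, 4) pairwise scores
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                                                    # (4, 8) attended representation

print(weights.round(2))  # each row sums to 1: how much each token attends to the others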

Why Transformers Are Important

Handles Sequences Efficiently

Unlike RNNs/LSTMs, Transformers process all sequence elements at once (parallelization).

Captures Long-Range Dependencies

They can attend to the entire input sequence, not just the most recent elements.

State-of-the-Art in NLP & Beyond

Text: GPT, BERT, T5

Images: Vision Transformers

Speech, Audio, and even Multimodal Data

1️⃣ Frameworks to Build Transformers

These are the main libraries for creating Transformer models (a short PyTorch sketch follows the list):

Tool / Library: Purpose

PyTorch: Flexible deep learning framework, widely used to implement Transformers.

TensorFlow / Keras: High-level deep learning framework; easy to build Transformer models.

Hugging Face Transformers: Pre-built state-of-the-art Transformer models (GPT, BERT, T5, etc.) for NLP.

JAX / Flax: For research-level efficient Transformer implementations.
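As a quick illustration of the "building" side, here is a minimal sketch using PyTorch's built-in encoder layer. The model size, number of heads, and the dummy input are arbitrary choices for the example, and the snippet assumes torch is installed.

import torch
import torch.nn as nn

# One Transformer encoder layer = multi-head self-attention + feed-forward network.
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Dummy batch: 8 sequences, 10 tokens each, 32-dimensional embeddings.
x = torch.randn(8, 10, 32)
out = encoder(x)
print(out.shape)  # torch.Size([8, 10, 32]) -- same shape, now contextualized

With Hugging Face Transformers, by contrast, the usual workflow is to load a pre-trained model (for example through its pipeline API) rather than building layers by hand.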

In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
import matplotlib.pyplot as plt

# -----------------------
# 1. Prepare the dataset
# -----------------------
data = {
    "Math": [20,35,70,40,50,67,88,46,67,46],
    "Sci": [54,60,54,34,36,67,89,90,57,67],
    "Eng": [67,76,55,45,34,25,78,47,67,76],
    "Dzo": [93,59,76,77,59,47,29,39,71,62],
    "Total": [234,230,255,196,179,206,284,222,262,251]
}

df = pd.DataFrame(data)

X = df[["Math", "Sci", "Eng", "Dzo"]].values
y = df["Total"].values

# Scale features
scaler_X = StandardScaler()
X_scaled = scaler_X.fit_transform(X)

# -----------------------
# 2. Train-Test Split
# -----------------------
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# -----------------------
# 3. Fit Neural Network (as a proxy for Transformer)
# -----------------------
mlp = MLPRegressor(hidden_layer_sizes=(16,16), max_iter=2000, random_state=42)
mlp.fit(X_train, y_train)

# -----------------------
# 4. Predictions
# -----------------------
y_pred = mlp.predict(X_test)

# Display actual vs predicted
results = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
print(results)

# -----------------------
# 5. Plot
# -----------------------
plt.figure(figsize=(7,5))
plt.plot(range(len(y_test)), y_test, marker='o', label="Actual")
plt.plot(range(len(y_pred)), y_pred, marker='x', label="Predicted")
plt.xlabel("Test Sample Index")
plt.ylabel("Total Marks")
plt.title("Actual vs Predicted Total Marks (NN Proxy for Transformer)")
plt.legend()
plt.grid(True)
plt.show()
   Actual   Predicted
0     262  279.432371
1     230  228.867588
/opt/conda/lib/python3.13/site-packages/sklearn/neural_network/_multilayer_perceptron.py:781: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (2000) reached and the optimization hasn't converged yet.
  warnings.warn(
[Figure: Actual vs Predicted Total Marks (NN Proxy for Transformer)]
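The ConvergenceWarning above means the optimizer hit the iteration cap before the loss settled, which is not surprising with only 8 training rows. One possible adjustment (not the run shown above) is to give the solver more iterations or switch to the full-batch lbfgs solver, which often converges better on tiny datasets:

# A possible tweak, not the original run: lbfgs tends to converge better on very small data.
mlp = MLPRegressor(hidden_layer_sizes=(16, 16), solver="lbfgs", max_iter=5000, random_state=42)
mlp.fit(X_train, y_train)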

🔹 2️⃣ Visualization Tools

These are used to show how Transformers work, including attention patterns and activations (a small heatmap sketch follows the list):

Tool: Purpose

Matplotlib / Seaborn: Plot attention weights, loss curves, accuracy trends.

Plotly / Bokeh: Interactive plots for attention matrices or sequence relationships.

BertViz: Specifically designed to visualize attention in BERT and similar models.

Transformers Interpret / Captum: Model interpretability for Transformers (shows which tokens are important).

TensorBoard: Visualizes training metrics, attention maps, and embeddings.
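For instance, a plain Matplotlib heatmap is often enough to inspect an attention matrix. The 4x4 matrix below is random stand-in data, not weights from a real model.

import numpy as np
import matplotlib.pyplot as plt

tokens = ["Math", "Sci", "Eng", "Dzo"]           # stand-in "tokens"
attn = np.random.dirichlet(np.ones(4), size=4)   # random rows summing to 1, like attention weights

plt.figure(figsize=(5, 4))
plt.imshow(attn, cmap="viridis")
plt.xticks(range(4), tokens)
plt.yticks(range(4), tokens)
plt.colorbar(label="Attention weight")
plt.title("Example attention heatmap (random data)")
plt.show()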

In [9]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
import matplotlib.pyplot as plt

# -----------------------
# Dataset
# -----------------------
data = {
    "Math": [20,35,70,40,50,67,88,46,67,46],
    "Sci": [54,60,54,34,36,67,89,90,57,67],
    "Eng": [67,76,55,45,34,25,78,47,67,76],
    "Dzo": [93,59,76,77,59,47,29,39,71,62],
    "Total": [234,230,255,196,179,206,284,222,262,251]
}
df = pd.DataFrame(data)

X = df[["Math","Sci","Eng","Dzo"]]
y = df["Total"]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# -----------------------
# MLPRegressor as proxy for Transformer
# -----------------------
mlp = MLPRegressor(hidden_layer_sizes=(16,16), max_iter=2000, random_state=42)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

# -----------------------
# Circular plot for feature importance (simulated attention)
# -----------------------
# Use absolute value of first layer weights as "attention"
attention = np.abs(mlp.coefs_[0]).mean(axis=1)
subjects = ["Math","Sci","Eng","Dzo"]

theta = np.linspace(0, 2*np.pi, len(subjects), endpoint=False)
radii = attention
width = 2*np.pi / len(subjects)

plt.figure(figsize=(6,6))
ax = plt.subplot(111, polar=True)
bars = ax.bar(theta, radii, width=width, bottom=0.0, alpha=0.6)
ax.set_xticks(theta)
ax.set_xticklabels(subjects)
ax.set_title("Simulated Attention Across Subjects (MLP Proxy)")
plt.show()
/opt/conda/lib/python3.13/site-packages/sklearn/neural_network/_multilayer_perceptron.py:781: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (2000) reached and the optimization hasn't converged yet.
  warnings.warn(
[Figure: Simulated Attention Across Subjects (MLP Proxy)]
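Averaging absolute first-layer weights is only a rough stand-in for attention. With an actual Transformer the attention weights can be read out directly; a sketch using Hugging Face Transformers (assuming the transformers package and the bert-base-uncased checkpoint are available) might look like this:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Transformers attend to every token at once.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each shaped (batch, num_heads, seq_len, seq_len).
first_layer_attn = outputs.attentions[0][0]   # drop the batch dimension
print(first_layer_attn.shape)                 # e.g. torch.Size([12, 10, 10])

These per-head matrices are exactly what tools like BertViz render interactively.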