Masato Takemura - Fab Futures - Data Science

Week 4: Machine Learning¶

- Class Material

- Video

What I learned¶

・These are some common activation-function curves used in machine learning:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-3,3,100)
plt.plot(x,1/(1+np.exp(-x)),label='sigmoid')
plt.plot(x,np.tanh(x),label='tanh')
plt.plot(x,np.where(x < 0,0,x),label='ReLU')
plt.plot(x,np.where(x < 0,0.1*x,x),'--',label='leaky ReLU')
plt.legend()
plt.show()
(Plot: sigmoid, tanh, ReLU, and leaky ReLU activation functions)

・Neil also suggested JAX, which is a powerful framework.
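
I hadn't used it before, so here is a minimal sketch (my own illustration, not from the class) of what makes JAX attractive: automatic differentiation combined with JIT compilation.

import jax
import jax.numpy as jnp

def loss(w):
    return jnp.sum((w - 1.0) ** 2)

grad_fn = jax.jit(jax.grad(loss))  # compiled gradient of loss
print(grad_fn(jnp.array([0.0, 2.0])))  # [-2.  2.]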

Practice¶

In this class, Neil showed us an MNIST machine-learning example, so I tried running his example code.

In [2]:
from sklearn.neural_network import MLPClassifier
import numpy as np
import matplotlib.pyplot as plt
# Load the preprocessed MNIST arrays (flattened 28x28 images)
xtrain = np.load('../class/datasets/MNIST/xtrain.npy')
ytrain = np.load('../class/datasets/MNIST/ytrain.npy')
xtest = np.load('../class/datasets/MNIST/xtest.npy')
ytest = np.load('../class/datasets/MNIST/ytest.npy')
print(f"read {xtrain.shape[1]} byte data records, {xtrain.shape[0]} training examples, {xtest.shape[0]} testing examples\n")
classifier = MLPClassifier(solver='adam', hidden_layer_sizes=(100,), activation='relu', random_state=1, verbose=True, tol=0.05)  # one hidden layer of 100 units
classifier.fit(xtrain,ytrain)
print(f"\ntest score: {classifier.score(xtest,ytest)}\n")
predictions = classifier.predict(xtest)
fig,axs = plt.subplots(1,5)
for i in range(5):
    axs[i].imshow(np.reshape(xtest[i],(28,28)))
    axs[i].axis('off')
    axs[i].set_title(f"predict: {predictions[i]}")
plt.tight_layout()
plt.show()
read 784 byte data records, 60000 training examples, 10000 testing examples

Iteration 1, loss = 3.36992820
Iteration 2, loss = 1.13264743
Iteration 3, loss = 0.67881655
Iteration 4, loss = 0.44722907
Iteration 5, loss = 0.31658618
Iteration 6, loss = 0.23506685
Iteration 7, loss = 0.19331921
Iteration 8, loss = 0.15768276
Iteration 9, loss = 0.13673548
Iteration 10, loss = 0.12379790
Iteration 11, loss = 0.10733766
Iteration 12, loss = 0.11199584
Iteration 13, loss = 0.09769195
Iteration 14, loss = 0.09220702
Iteration 15, loss = 0.09282348
Iteration 16, loss = 0.08964422
Iteration 17, loss = 0.08613192
Training loss did not improve more than tol=0.050000 for 10 consecutive epochs. Stopping.

test score: 0.9548

(Plot: the first five MNIST test images with their predicted labels)
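
To see which digits the model confuses, one could go a step further and compute a confusion matrix (a sketch, reusing predictions and ytest from above):

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(ytest, predictions)  # rows: true digit, columns: predicted digit
print(cm)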
Assignment¶

Fit a machine learning model to your data

At first I didn't know where to start, so I asked an LLM (Gemini):

I want to practice deep learning using this data.
Can you think of anything interesting we could do with this data?

ID - Unique number for each athlete;
Name - Athlete's name;
Sex - M or F;
Age - Integer;
Height - In centimeters;
Weight - In kilograms;
Team - Team name;
NOC - National Olympic Committee 3-letter code;
Games - Year and season;
Year - Integer;
Season - Summer or Winter;
City - Host city;
Sport - Sport;
Event - Event;
Medal - Gold, Silver, Bronze, or NA.
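
Before modeling, it helps to get a feel for the raw table and, in particular, for how rare medals are (a quick sketch, using the same CSV that is loaded below):

import pandas as pd

df = pd.read_csv('datasets/olympic_athlete_events.csv')
print(df.shape)
print(df['Medal'].value_counts(dropna=False))  # NaN rows are athletes without a medal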
In [16]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import os
print(os.listdir("datasets"))

# Load the athlete events and NOC region tables
data = pd.read_csv('datasets/olympic_athlete_events.csv')
regions = pd.read_csv('datasets/olympic_noc_regions.csv')

merged = pd.merge(data, regions, on='NOC', how='left')
['olympic_athlete_events.csv', 'factory_sensor_simulator_2040.csv', 'olympic_noc_regions.csv', '.gitignore']
In [5]:
goldMedals = merged[(merged.Medal == 'Gold')]
goldMedals = goldMedals.sort_values(by="Age", ascending=False)
goldMedals.head()

goldMedals.isnull().any()

clean_age = goldMedals['Age'].dropna().astype(int)  # drop missing ages, cast to int for counting
In [6]:
age_counts = clean_age.value_counts().sort_index()
x = age_counts.index
y = age_counts.values

plt.plot(x,y,'o')
#plt.figure(figsize=(8, 5))
plt.title("Olympic gold medalist vs age")
plt.xlabel("Age")
plt.ylabel("Gold medalist Count")
plt.grid(True)
plt.tight_layout()
plt.show()
(Plot: count of Olympic gold medalists by age)

1. Data Preprocessing¶

First, I converted the data into numerical format (NumPy arrays) compatible with JAX using Pandas and Scikit-Learn.

In [7]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv('datasets/olympic_athlete_events.csv')

# --- pre processing ---

# 1. Target variable: medal won = 1, no medal (NA) = 0
df['Target'] = df['Medal'].apply(lambda x: 0 if pd.isna(x) else 1)

# 2. Convert categorical variables into integer codes
# Sex, NOC, Sport, Season become 0, 1, 2, ...
cat_cols = ['Sex', 'NOC', 'Sport', 'Season']
for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))

# 3. Impute missing values in the numeric variables, then standardize them
num_cols = ['Age', 'Height', 'Weight']
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# 4. Split data for JAX/Flax
# Separate category input (x_cat) and numeric input (x_num).
X_cat = df[cat_cols].values  # integer category codes (int)
X_num = df[num_cols].values  # standardized numeric features (float)
y = df['Target'].values      # binary labels (0/1)

# Split into training and testing sets
X_cat_train, X_cat_test, X_num_train, X_num_test, y_train, y_test = train_test_split(
    X_cat, X_num, y, test_size=0.2, random_state=42
)

# Number of unique items per category (required to determine the input size of the Embedding layer)
vocab_sizes = [df[c].max() + 1 for c in cat_cols]
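
JAX prefers explicit dtypes (integer indices for embeddings, floats for features), so a quick sanity check on these arrays might look like this (a sketch, assuming the variables above):

import jax.numpy as jnp

x_cat = jnp.asarray(X_cat_train, dtype=jnp.int32)    # embedding indices
x_num = jnp.asarray(X_num_train, dtype=jnp.float32)  # numeric features
print(x_cat.shape, x_cat.dtype, x_num.shape, x_num.dtype)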
In [8]:
df.head()
Out[8]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal Target
0 1 A Dijiang 1 -0.247880 5.023699e-01 0.739392 China 41 1992 Summer 1992 0 Barcelona 8 Basketball Men's Basketball NaN 0
1 2 A Lamusi 1 -0.407095 -5.754389e-01 -0.851107 China 41 2012 Summer 2012 0 London 32 Judo Men's Extra-Lightweight NaN 0
2 3 Gunnar Nielsen Aaby 1 -0.247880 3.063317e-15 0.000000 Denmark 55 1920 Summer 1920 0 Antwerpen 24 Football Men's Football NaN 0
3 4 Edgar Lindenau Aabye 1 1.344262 3.063317e-15 0.000000 Denmark/Sweden 55 1900 Summer 1900 0 Paris 61 Tug-Of-War Men's Tug-Of-War Gold 1
4 5 Christine Jacoba Aaftink 0 -0.725523 1.041274e+00 0.898442 Netherlands 145 1988 Winter 1988 1 Calgary 53 Speed Skating Women's 500 metres NaN 0

Before defining the model, I installed flax from the terminal (pip install flax).

Then, I tried to define the model.

2. Model Definition (Flax)¶

This is the core component of JAX/Flax. Using nn.Module, we create an architecture that passes categorical variables through an embedding layer and concatenates them with numerical variables.

In [10]:
import jax
import jax.numpy as jnp
from flax import linen as nn
import optax

class OlympicModel(nn.Module):
    vocab_sizes: list  # number of categories per feature (e.g., 200 for 200 countries)
    emb_dim: int = 10  # dimension of each embedding vector

    @nn.compact
    def __call__(self, x_cat, x_num):
        # 1. Handling Categorical Variables (Embedding)
        # x_cat is of shape [batch_size, 4] (Sex, NOC, Sport, Season)
        embs = []
        for i, vocab_size in enumerate(self.vocab_sizes):
            # Create an embedding layer for each category
            # Input: Integer index -> Output: 10-dimensional vector
            emb = nn.Embed(num_embeddings=vocab_size, features=self.emb_dim, name=f'emb_{i}')(x_cat[:, i])
            embs.append(emb)
        
        # Concatenate Embedded Vectors [batch_size, 4 * 10]
        x_emb = jnp.concatenate(embs, axis=1)

        # 2. Numeric Variables and Concatenation
        # [batch_size, 40] + [batch_size, 3] -> [batch_size, 43]
        x = jnp.concatenate([x_emb, x_num], axis=1)

        # --- 3. Fully connected layer (MLP) ---
        x = nn.Dense(features=64)(x)
        x = nn.relu(x)
        x = nn.Dense(features=32)(x)
        x = nn.relu(x)
        x = nn.Dense(features=1)(x)  # single output (logits)
        
        return x
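
A quick shape check confirms the wiring (a sketch with made-up vocab sizes; the real values come from vocab_sizes computed in the preprocessing step):

model = OlympicModel(vocab_sizes=[2, 230, 66, 2])  # hypothetical sizes for Sex, NOC, Sport, Season
rng = jax.random.PRNGKey(0)
dummy_cat = jnp.zeros((8, 4), dtype=jnp.int32)    # batch of 8, four categorical fields
dummy_num = jnp.zeros((8, 3), dtype=jnp.float32)  # Age, Height, Weight
params = model.init(rng, dummy_cat, dummy_num)['params']
print(model.apply({'params': params}, dummy_cat, dummy_num).shape)  # (8, 1)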

3. Defining Learning Steps (JAX Transformation)¶

We utilize JAX's distinctive features: jax.jit (Just-In-Time compilation) and jax.value_and_grad (automatic differentiation).

In [11]:
# TrainState: A class that holds the model parameters and optimizer state
from flax.training import train_state

def create_train_state(rng, learning_rate, vocab_sizes, sample_cat, sample_num):
    model = OlympicModel(vocab_sizes=vocab_sizes)
    params = model.init(rng, sample_cat, sample_num)['params']
    tx = optax.adam(learning_rate)
    return train_state.TrainState.create(apply_fn=model.apply, params=params, tx=tx)

@jax.jit
def train_step(state, x_cat, x_num, y):
    def loss_fn(params):
        logits = state.apply_fn({'params': params}, x_cat, x_num)
        # Binary cross-entropy error (including sigmoid)
        loss = optax.sigmoid_binary_cross_entropy(logits, y.reshape(-1, 1)).mean()
        return loss

    # Simultaneously calculate loss and gradient
    loss, grads = jax.value_and_grad(loss_fn)(state.params)
    # Parameter update with the optimizer
    state = state.apply_gradients(grads=grads)
    return state, loss

4. Execution Loop¶

This is where the actual learning process runs.

In [12]:
# Initialize
rng = jax.random.PRNGKey(0)
learning_rate = 0.001
state = create_train_state(rng, learning_rate, vocab_sizes, X_cat_train[:1], X_num_train[:1])

# Simplified Learning Loop (Batch processing omitted; demo uses entire dataset)
# Mini-batch learning should be used in practice
epochs = 100

print("Training Start...")
for epoch in range(epochs):
    # Large batch sizes may cause an OutOfMemoryError, so please use DataLoader in actual production
    # This is simplified for testing purposes only
    state, loss = train_step(state, X_cat_train, X_num_train, y_train)
    
    if epoch % 10 == 0:
        print(f"Epoch: {epoch}, Loss: {loss:.4f}")

print("Training Finished!")
Training Start...
Epoch: 0, Loss: 0.7669
Epoch: 10, Loss: 0.6165
Epoch: 20, Loss: 0.5129
Epoch: 30, Loss: 0.4443
Epoch: 40, Loss: 0.4186
Epoch: 50, Loss: 0.4133
Epoch: 60, Loss: 0.4068
Epoch: 70, Loss: 0.4023
Epoch: 80, Loss: 0.3989
Epoch: 90, Loss: 0.3956
Training Finished!

I could watch the loss decreasing as training progressed.
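
As the comments note, this loop feeds the entire training set each step. A minimal mini-batch variant could look like this (a sketch, reusing state and train_step from above):

batch_size = 4096
n = len(y_train)
rng_np = np.random.default_rng(0)

for epoch in range(10):
    perm = rng_np.permutation(n)  # reshuffle every epoch
    for i in range(0, n, batch_size):
        idx = perm[i:i + batch_size]
        # note: the final, shorter batch triggers one extra jit compilation
        state, loss = train_step(state, X_cat_train[idx], X_num_train[idx], y_train[idx])
    print(f"Epoch {epoch}, last batch loss: {loss:.4f}")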

In [14]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# --- Definition of the Prediction Step (JAX) ---
@jax.jit
def predict_step(state, x_cat, x_num):
    # Feed data into the model (do not use apply_gradients since this is not training)
    logits = state.apply_fn({'params': state.params}, x_cat, x_num)
    # Convert logit to probability (0.0 - 1.0)
    probs = nn.sigmoid(logits)
    return probs

# --- Run prediction on test data ---
# If memory becomes insufficient with large batch sizes, split the data and loop through it.
pred_probs = predict_step(state, X_cat_test, X_num_test)

# Convert JAX DeviceArray to NumPy array (for plotting)
pred_probs = np.array(pred_probs).flatten() # (N, 1) -> (N,)
actual_labels = y_test
In [15]:
# --- Graph 1: Distribution of Prediction Probabilities ---
plt.figure(figsize=(10, 6))

# Combine into a dataframe and plot with Seaborn
res_df = pd.DataFrame({
    'Actual': actual_labels,
    'Prediction Probability': pred_probs
})
# Replace labels for better readability
res_df['Label'] = res_df['Actual'].map({0: 'No Medal', 1: 'Medal'})

# Histogram Plotting
sns.histplot(
    data=res_df, 
    x='Prediction Probability', 
    hue='Label', 
    bins=50, 
    kde=True, # Display the approximation curve
    stat="density", # Display by density rather than frequency (to improve readability due to significant variations in item counts)
    common_norm=False, # Normalization in each class
    palette={'No Medal': 'blue', 'Medal': 'orange'},
    alpha=0.5
)

plt.title('Distribution of Predicted Probabilities by Actual Outcome')
plt.xlabel('Predicted Probability (0 = Sure No, 1 = Sure Yes)')
plt.ylabel('Density')
plt.grid(True, alpha=0.3)
plt.show()
(Plot: histogram of predicted probabilities, split by actual outcome)

What does this histogram mean?¶

It would be perfect if the blue distribution were concentrated at the left edge (0.0) and the orange distribution at the right edge (1.0).
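
One way to quantify how well the two distributions separate, independent of the 0.5 threshold, is the ROC AUC (a sketch, reusing actual_labels and pred_probs from above):

from sklearn.metrics import roc_auc_score

# 1.0 = perfectly separated distributions, 0.5 = no better than chance
print(f"ROC AUC: {roc_auc_score(actual_labels, pred_probs):.3f}")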

Analysis:¶

Because “no medal” examples are overwhelmingly more abundant, the model has adopted a sneaky strategy: rather than risk labeling some athletes as medalists and getting them wrong, it predicts “no medal” for everyone, which already yields a high overall accuracy.
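
That all-negative baseline is easy to check (a sketch, assuming y_test from above):

baseline_acc = 1 - y_test.mean()  # accuracy of always predicting "no medal"
print(f"All-negative baseline accuracy: {baseline_acc:.3f}")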

Solution:¶

Apply weights to the loss function (weighted loss): modify the JAX/Optax code slightly so that medal-winning samples incur about 10 to 15 times the penalty.

Then I asked Gemini again to improve the model.

In [17]:
import pandas as pd
import numpy as np
import jax
import jax.numpy as jnp
from flax import linen as nn
from flax.training import train_state
import optax
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# ==========================================
# 1. Data Preprocessing
# ==========================================
print("Loading and preprocessing data...")

df = pd.read_csv('datasets/olympic_athlete_events.csv')
df['Target'] = df['Medal'].apply(lambda x: 0 if pd.isna(x) else 1)

cat_cols = ['Sex', 'NOC', 'Sport', 'Season']
for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))

num_cols = ['Age', 'Height', 'Weight']
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

X_cat = df[cat_cols].values.astype(int)
X_num = df[num_cols].values.astype(float)
y = df['Target'].values.astype(int)

X_cat_train, X_cat_test, X_num_train, X_num_test, y_train, y_test = train_test_split(
    X_cat, X_num, y, test_size=0.2, random_state=42
)

# ---------------------------------------------------------
# ### NOTE: MODIFICATION 1 - Calculate Class Weight
# We calculate how many times more negative samples exist compared to positive ones.
# This value (pos_weight) will be used to penalize the model more when it misses a medal.
# ---------------------------------------------------------
n_pos = y_train.sum()
n_neg = len(y_train) - n_pos
pos_weight = n_neg / n_pos
print(f"Positive Weight calculated: {pos_weight:.2f}")

vocab_sizes = [df[c].max() + 1 for c in cat_cols]

# ==========================================
# 2. Model Definition (No changes here)
# ==========================================
class OlympicModel(nn.Module):
    vocab_sizes: list
    emb_dim: int = 10

    @nn.compact
    def __call__(self, x_cat, x_num):
        embs = []
        for i, vocab_size in enumerate(self.vocab_sizes):
            emb = nn.Embed(num_embeddings=vocab_size, features=self.emb_dim, name=f'emb_{i}')(x_cat[:, i])
            embs.append(emb)
        x_emb = jnp.concatenate(embs, axis=1)
        x = jnp.concatenate([x_emb, x_num], axis=1)
        x = nn.Dense(features=64)(x)
        x = nn.relu(x)
        x = nn.Dense(features=32)(x)
        x = nn.relu(x)
        x = nn.Dense(features=1)(x)
        return x

# ==========================================
# 3. Training Step
# ==========================================
def create_train_state(rng, learning_rate, vocab_sizes, sample_cat, sample_num):
    model = OlympicModel(vocab_sizes=vocab_sizes)
    params = model.init(rng, sample_cat, sample_num)['params']
    tx = optax.adam(learning_rate)
    return train_state.TrainState.create(apply_fn=model.apply, params=params, tx=tx)

@jax.jit
def train_step(state, x_cat, x_num, y, pos_weight): # ### NOTE: Added 'pos_weight' argument
    def loss_fn(params):
        logits = state.apply_fn({'params': params}, x_cat, x_num)
        
        # Standard Binary Cross Entropy
        bce = optax.sigmoid_binary_cross_entropy(logits, y.reshape(-1, 1))
        
        # ---------------------------------------------------------
        # ### NOTE: MODIFICATION 2 - Apply Weighted Mask
        # We create a mask where:
        # - If label is 1 (Medal), weight = pos_weight (e.g., 6.0)
        # - If label is 0 (No Medal), weight = 1.0
        # ---------------------------------------------------------
        weight_mask = jnp.where(y.reshape(-1, 1) == 1, pos_weight, 1.0)
        
        # ### NOTE: Multiply the raw loss by the mask before averaging
        loss = (bce * weight_mask).mean()
        return loss

    loss, grads = jax.value_and_grad(loss_fn)(state.params)
    state = state.apply_gradients(grads=grads)
    return state, loss

@jax.jit
def predict_step(state, x_cat, x_num):
    logits = state.apply_fn({'params': state.params}, x_cat, x_num)
    return nn.sigmoid(logits)

# ==========================================
# 4. Training Loop
# ==========================================
print("Training started...")

rng = jax.random.PRNGKey(42)
learning_rate = 0.001
epochs = 100

state = create_train_state(rng, learning_rate, vocab_sizes, X_cat_train[:1], X_num_train[:1])

for epoch in range(epochs):
    # ---------------------------------------------------------
    # ### NOTE: MODIFICATION 3 - Pass 'pos_weight' to train_step
    # ---------------------------------------------------------
    state, loss = train_step(state, X_cat_train, X_num_train, y_train, pos_weight)
    
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: Loss = {loss:.4f}")

print("Training finished.")

# ==========================================
# 5. Visualization (Standard)
# ==========================================
# (Code below is for visualization and remains mostly standard, 
# aside from plotting the result of the new weighted model)

pred_probs = predict_step(state, X_cat_test, X_num_test)
pred_probs = np.array(pred_probs).flatten()

plt.figure(figsize=(10, 6))
res_df = pd.DataFrame({'Actual': y_test, 'Prob': pred_probs})
res_df['Label'] = res_df['Actual'].map({0: 'No Medal', 1: 'Medal'})

sns.histplot(
    data=res_df, x='Prob', hue='Label', 
    bins=50, kde=True, stat="density", common_norm=False,
    palette={'No Medal': 'blue', 'Medal': 'orange'}, alpha=0.5
)
plt.title(f'Prediction Distribution (Weighted Loss, Weight={pos_weight:.1f})')
plt.xlabel('Predicted Probability')
plt.grid(True, alpha=0.3)
plt.show()

pred_labels = (pred_probs >= 0.5).astype(int)
cm = confusion_matrix(y_test, pred_labels)

plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Pred No', 'Pred Yes'],
            yticklabels=['Actual No', 'Actual Yes'])
plt.title('Confusion Matrix')
plt.show()
Loading and preprocessing data...
Positive Weight calculated: 5.81
Training started...
Epoch 0: Loss = 1.1889
Epoch 10: Loss = 1.1604
Epoch 20: Loss = 1.1431
Epoch 30: Loss = 1.1260
Epoch 40: Loss = 1.1080
Epoch 50: Loss = 1.0888
Epoch 60: Loss = 1.0696
Epoch 70: Loss = 1.0505
Epoch 80: Loss = 1.0322
Epoch 90: Loss = 1.0153
Training finished.
(Plot: prediction distribution with weighted loss)
(Plot: confusion matrix heatmap)

I got a result! Let's analyze it.

Analysis①¶

The result got better.
The predicted probabilities moved higher: the model now gives many actual medalists a score above 0.5, which means they are classified as medalists at the default threshold.

Analysis②¶

Positive aspect (bottom right, ④): the model correctly identified 5,631 medalists!

Side effect (top right, ②): however, it also incorrectly predicted that about 15,000 athletes could win medals (“You can win a medal too!”), i.e., false positives.

Conclusion¶

I got to play with data about Olympic medalists, and I built a prediction model using the deep-learning library JAX, which processed the 54,224 records in a few minutes. I was also able to improve the model by adjusting the class-weight penalty in the loss function.
