Yuichi TAMIYA - Fab Futures 2025 - Data Science

GMM-Based Clustering of Pilgrim Age Group Patterns by Year

Prompt(ChatGPT)
Generate Python code to cluster yearly age-group population patterns using a Gaussian Mixture Model.

The input is a CSV file where rows are age groups and columns are years.

Steps required:

  • Load the CSV with pandas and set age groups as index
  • Transpose the data so that each year is a sample
  • Standardize features
  • Apply GMM with 3 clusters
  • Visualize the results using PCA (2D) and matplotlib
  • Label each point with the corresponding year
  • Print the estimated GMM means

Use clear comments and best practices.

Answer
The generated code is below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# ----------------------------------------------------------------------
# STEP 1: Load CSV data
# ----------------------------------------------------------------------

csv_path = "datasets/finalproject/age.csv"
df_age = pd.read_csv(csv_path)

# Set age groups as index
df_age = df_age.set_index("Age_Group")

# After setting the index, every remaining column is a year.
# Transpose so that each year is a sample and age groups are features
X = df_age.values.T

# ----------------------------------------------------------------------
# STEP 2: Standardization
# ----------------------------------------------------------------------

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ----------------------------------------------------------------------
# STEP 3: Gaussian Mixture Model (GMM)
# ----------------------------------------------------------------------

n_components = 3
gmm = GaussianMixture(n_components=n_components, random_state=42)
gmm.fit(X_scaled)

labels = gmm.predict(X_scaled)

# ----------------------------------------------------------------------
# STEP 4: PCA visualization
# ----------------------------------------------------------------------

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=100, alpha=0.7)

# Annotate each point with the corresponding year
years = df_age.columns.values
for i, year in enumerate(years):
    plt.annotate(year, (X_pca[i, 0], X_pca[i, 1]), fontsize=9)

plt.title('Density estimation of pilgrim patterns by age group (GMM)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(scatter, label='Estimated density clusters')
plt.grid(True, alpha=0.3)
plt.show()

print("\n--- GMM-based estimation results ---")
print("Estimated means (after standardization):\n", gmm.means_)
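[Figure: PCA scatter plot "Density estimation of pilgrim patterns by age group (GMM)"; x-axis: Principal Component 1, y-axis: Principal Component 2; points colored by GMM cluster and labeled by year]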
--- GMM-based estimation results ---
Estimated means (after standardization):
 [[ 0.59564086  0.81094218  1.00403509]
 [-0.81839126 -0.86224064 -1.10803003]
 [-0.13476914 -0.43535432 -0.49854989]]
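The means above are in standardized units, so they are hard to read as pilgrim counts. A minimal sketch of one way to map them back to the original scale and to list which years fall in each cluster is shown here. Because `age.csv` is not reproduced in this notebook, the data, the year range, and the age-group labels below are hypothetical stand-ins; only the `scaler.inverse_transform` step and the grouping by `labels` carry over to the real data.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for age.csv (3 age groups x 18 years).
rng = np.random.default_rng(0)
years = [str(y) for y in range(2004, 2022)]   # hypothetical year labels
age_groups = ["Young", "Middle", "Senior"]    # hypothetical age-group labels

# Three made-up regimes of six years each (values are invented counts).
centers = np.array([[100, 300, 500],
                    [300, 500, 800],
                    [600, 900, 1200]], dtype=float)
X = np.vstack([c + rng.normal(0, 10, size=(6, 3)) for c in centers])
df_age = pd.DataFrame(X.T, index=age_groups, columns=years)

# Same pipeline as the cell above: years as samples, standardize, fit GMM.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_age.values.T)
gmm = GaussianMixture(n_components=3, random_state=42)
labels = gmm.fit(X_scaled).predict(X_scaled)

# Undo the standardization so each cluster mean is in original units.
means_original = pd.DataFrame(
    scaler.inverse_transform(gmm.means_),
    columns=df_age.index,
)
print(means_original.round(1))

# List the years assigned to each cluster.
years_by_cluster = {
    k: [y for y, lab in zip(df_age.columns, labels) if lab == k]
    for k in range(gmm.n_components)
}
for k, ys in years_by_cluster.items():
    print(f"Cluster {k}: {ys}")
```

With the real data, replacing the synthetic `df_age` with the DataFrame loaded from the CSV should be enough; the cluster numbering is arbitrary and may change between runs with a different `random_state`.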

Prompt(ChatGPT)
What does this result mean?

Answer
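The results indicate that the yearly pilgrim age distributions can be grouped into three distinct demographic patterns, suggesting structural shifts in age composition rather than a smooth, continuous trend. In standardized units, one cluster mean is above average on every feature, one is well below average, and one is moderately below average.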
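A GMM returns as many clusters as it is asked for, so the number of groups is worth checking rather than assuming. A minimal sketch of one common check, comparing the Bayesian Information Criterion (BIC) across several values of `n_components`, is shown here. The data are again a hypothetical synthetic stand-in for `age.csv`, and `covariance_type="diag"` is an assumption made to keep the parameter count small for so few samples; it is not the setting used in the cell above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in data (18 years x 3 age groups).
rng = np.random.default_rng(0)
centers = np.array([[100, 300, 500],
                    [300, 500, 800],
                    [600, 900, 1200]], dtype=float)
X = np.vstack([c + rng.normal(0, 10, size=(6, 3)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Fit one GMM per candidate component count and record its BIC.
# Lower BIC indicates a better trade-off between fit and complexity.
bic_by_k = {}
for k in range(1, 5):
    gmm = GaussianMixture(
        n_components=k,
        covariance_type="diag",  # assumption: fewer parameters for small samples
        random_state=42,
    )
    gmm.fit(X_scaled)
    bic_by_k[k] = gmm.bic(X_scaled)

for k, bic in bic_by_k.items():
    print(f"n_components={k}: BIC={bic:.1f}")

best_k = min(bic_by_k, key=bic_by_k.get)
print("Lowest BIC at n_components =", best_k)
```

With only a handful of years as samples, BIC is a rough guide at best, so the result is better read as a sanity check on the choice of three clusters than as proof of distinct regimes.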