GMM-Based Clustering of Pilgrim Age Group Patterns by Year
Prompt(ChatGPT)
Generate Python code to cluster yearly age-group population patterns using a Gaussian Mixture Model.
The input is a CSV file where rows are age groups and columns are years.
Steps required:
- Load the CSV with pandas and set age groups as index
- Transpose the data so that each year is a sample
- Standardize features
- Apply GMM with 3 clusters
- Visualize the results using PCA (2D) and matplotlib
- Label each point with the corresponding year
- Print the estimated GMM means
Use clear comments and best practices.
Answer
The generated code is below.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# ----------------------------------------------------------------------
# STEP 1: Load CSV data
# ----------------------------------------------------------------------
csv_path = "datasets/finalproject/age.csv"
df_age = pd.read_csv(csv_path)
# Set age groups as index
df_age = df_age.set_index("Age_Group")
# After set_index, the remaining columns are the year columns.
# Transpose so that each year is a sample and age groups are features
X = df_age.values.T
# ----------------------------------------------------------------------
# STEP 2: Standardization
# ----------------------------------------------------------------------
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# ----------------------------------------------------------------------
# STEP 3: Gaussian Mixture Model (GMM)
# ----------------------------------------------------------------------
n_components = 3
gmm = GaussianMixture(n_components=n_components, random_state=42)
gmm.fit(X_scaled)
labels = gmm.predict(X_scaled)
# ----------------------------------------------------------------------
# STEP 4: PCA visualization
# ----------------------------------------------------------------------
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=100, alpha=0.7)
# Annotate each point with the corresponding year
years = df_age.columns.values
for i, year in enumerate(years):
    plt.annotate(year, (X_pca[i, 0], X_pca[i, 1]), fontsize=9)
plt.title('Density estimation of pilgrim patterns by age group (GMM)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(scatter, label='Estimated density clusters')
plt.grid(True, alpha=0.3)
plt.show()
print("\n--- GMM-based estimation results ---")
print("Estimated means (after standardization):\n", gmm.means_)
--- GMM-based estimation results ---
Estimated means (after standardization):
 [[ 0.59564086  0.81094218  1.00403509]
 [-0.81839126 -0.86224064 -1.10803003]
 [-0.13476914 -0.43535432 -0.49854989]]
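The printed means are in standardized units, which makes them hard to read as pilgrim counts. The sketch below shows one way to map them back to the original scale with `StandardScaler.inverse_transform` and to list which year falls in which cluster. It does not use the real `age.csv`: the table, the age-group labels, and the year range are synthetic placeholders chosen only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for age.csv (3 age groups x 30 years).
# The labels, years, and counts below are invented for illustration only.
rng = np.random.default_rng(0)
years = [str(y) for y in range(1990, 2020)]
age_groups = pd.Index(["group_a", "group_b", "group_c"], name="Age_Group")
levels = np.repeat([1000.0, 2000.0, 3000.0], 10)  # three made-up regimes
counts = levels[:, None] * np.array([1.0, 1.5, 0.8]) + rng.normal(0, 50, size=(30, 3))
df_age = pd.DataFrame(counts.T, index=age_groups, columns=years)

# Same pipeline as the cell above: years as samples, age groups as features.
X = df_age.T.values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

gmm = GaussianMixture(n_components=3, random_state=42)
labels = gmm.fit_predict(X_scaled)

# Undo the standardization so each cluster mean is expressed in original units.
means_original = pd.DataFrame(
    scaler.inverse_transform(gmm.means_),
    columns=df_age.index,
).rename_axis("cluster")

# Cluster assignment per year, indexed by the year labels.
year_clusters = pd.Series(labels, index=df_age.columns, name="cluster")

print(means_original.round(1))
print(year_clusters.sort_values())
```

With the real data, `means_original` would give one row per cluster and one column per age group, so the clusters can be compared directly in pilgrim counts.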
Prompt(ChatGPT)
What does this result mean?
Answer
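The results indicate that yearly pilgrim age distributions can be grouped into three distinct demographic patterns, which suggests structural shifts in age composition rather than a smooth, continuous trend. In the standardized means, one cluster lies above the overall average in every age group (0.60 to 1.00), one lies well below it (-0.82 to -1.11), and one lies slightly below it (-0.13 to -0.50).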
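Because the number of clusters was fixed at 3 in advance, the model returns three groups whether or not the data support them. One common check, sketched below, is to compare the Bayesian Information Criterion (BIC) across several values of `n_components` and to inspect the soft assignments from `predict_proba`. This is not part of the original analysis, and the data here are synthetic placeholders, not the values from `age.csv`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in data (30 years x 3 age groups), invented
# for illustration; replace X with df_age.values.T for the real data.
rng = np.random.default_rng(0)
levels = np.repeat([1000.0, 2000.0, 3000.0], 10)
X = levels[:, None] * np.array([1.0, 1.5, 0.8]) + rng.normal(0, 50, size=(30, 3))
X_scaled = StandardScaler().fit_transform(X)

# Fit one GMM per candidate cluster count and record its BIC (lower is better).
bic_scores = {}
for k in range(1, 5):
    model = GaussianMixture(n_components=k, random_state=42).fit(X_scaled)
    bic_scores[k] = model.bic(X_scaled)

best_k = min(bic_scores, key=bic_scores.get)
for k, score in bic_scores.items():
    print(f"n_components={k}: BIC={score:.1f}")
print("Lowest BIC at n_components =", best_k)

# Soft assignments: each row holds one year's membership probability per cluster.
gmm = GaussianMixture(n_components=3, random_state=42).fit(X_scaled)
probs = gmm.predict_proba(X_scaled)
print(np.round(probs[:5], 3))
```

If the lowest BIC occurs at a single component, or many years have membership probabilities far from 0 or 1, the "distinct patterns" reading is weaker and a continuous trend remains plausible.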