Masato Takemura - Fab Futures - Data Science

Week 5: Probability

Course Material

  • video
  • course material

Assignment

  • Investigate the probability distribution of your data
  • Set up template notebooks and slides for your data set analysis

I tried seaborn's "pairplot" function on the dataset from the previous week, "120 years of Olympic history: athletes and results".

5_1.png

ID - Unique number for each athlete;
Name - Athlete's name;
Sex - M or F;
Age - Integer;
Height - In centimeters;
Weight - In kilograms;
Team - Team name;
NOC - National Olympic Committee 3-letter code;
Games - Year and season;
Year - Integer;
Season - Summer or Winter;
City - Host city;
Sport - Sport;
Event - Event;
Medal - Gold, Silver, Bronze, or NA.

This dataset describes Olympic athletes and the medals they won.
I aim to uncover the relationships between its variables.

I want to plot a histogram of each variable and a scatter plot for every pair of variables, then use the correlations between them to identify which variables are related.
I am using Jupyter Notebook as my tool.
In [6]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data (change the file name to match your actual CSV file)
df = pd.read_csv('datasets/olympic_athlete_events.csv')

# Inspect the data (skip if it is already loaded)
df.head()

# --- Preprocessing: convert categorical data to numbers so correlations can be computed ---

# Copy the DataFrame for analysis
df_analyze = df.copy()

# Convert 'Medal' to numbers (Gold: 3, Silver: 2, Bronze: 1, NA: 0)
medal_mapping = {'Gold': 3, 'Silver': 2, 'Bronze': 1, 'NA': 0}
df_analyze['Medal_Value'] = df_analyze['Medal'].fillna('NA').map(medal_mapping)

# Convert 'Sex' to numbers (M: 0, F: 1)
df_analyze['Sex_Value'] = df_analyze['Sex'].map({'M': 0, 'F': 1})

# Select the columns for the correlation matrix (exclude unique keys such as ID and Name)
target_columns = ['Age', 'Height', 'Weight', 'Year', 'Medal_Value', 'Sex_Value']
df_subset = df_analyze[target_columns].dropna()  # drop rows with missing values
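The `fillna('NA')` step above matters because pandas, by default, parses the literal string "NA" in a CSV file as a missing value. The following is a minimal sketch of that behavior on a hypothetical three-row CSV (the names and values are made up for illustration, not taken from the Olympic data):

```python
import io
import pandas as pd

# Hypothetical three-row CSV that mimics the Medal column of the real file.
csv_text = "Name,Medal\nAlice,Gold\nBob,NA\nCarol,Bronze\n"
toy = pd.read_csv(io.StringIO(csv_text))

# By default pandas parses the literal string "NA" as a missing value (NaN).
print(toy["Medal"].isna().tolist())

medal_mapping = {"Gold": 3, "Silver": 2, "Bronze": 1, "NA": 0}

# Without fillna, the missing entry has no key in the mapping and stays NaN.
without_fill = toy["Medal"].map(medal_mapping)

# With fillna("NA"), the missing entry becomes the string "NA" and maps to 0.
with_fill = toy["Medal"].fillna("NA").map(medal_mapping)

print(without_fill.tolist())
print(with_fill.tolist())
```

Without the fill, `dropna()` would silently discard every athlete who did not win a medal, so the correlation matrix would describe medalists only.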
In [8]:
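# Thin out the data for visualization (the full dataset can be too heavy)
# If there are more than 10,000 rows, sampling like this is recommended
sample_data = df_analyze.sample(n=5000, random_state=42)

# Label missing Medal values so that non-medalists get their own hue category
plot_data = sample_data[['Age', 'Height', 'Weight', 'Year', 'Medal']].fillna({'Medal': 'No medal'})

# Draw the pair plot; hue='Medal' colors the points by medal
sns.pairplot(plot_data, hue='Medal', palette='viridis')
plt.show()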
[Figure: pair plot of Age, Height, Weight, and Year, colored by Medal]
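The diagonal panels of a pair plot show the distribution of each variable on its own. One way to go a step further is to compare a variable against a normal distribution fitted to it. The sketch below does this with NumPy only. It uses synthetic heights as a stand-in for the real Height column, so the numbers (mean 175, standard deviation 10, 30 bins) are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic stand-in for the Height column (assumption: not the real data).
rng = np.random.default_rng(42)
heights = rng.normal(loc=175.0, scale=10.0, size=5000)

# Estimate the parameters of a normal distribution from the sample.
mu = heights.mean()
sigma = heights.std(ddof=1)

# Empirical density: a histogram normalised so that its total area is 1.
density, edges = np.histogram(heights, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Normal density evaluated at the bin centers.
gaussian = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# A small gap suggests the normal model is a reasonable description.
max_gap = np.abs(density - gaussian).max()
print(f"mu={mu:.1f}, sigma={sigma:.1f}, max gap={max_gap:.4f}")
```

To try this on the real data, `heights` could be replaced with `df['Height'].dropna().to_numpy()`. A large gap would suggest that the variable is skewed or is a mixture of groups (for example, different sports or sexes).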
In [9]:
plt.figure(figsize=(10, 8))
# Compute the correlation coefficients
correlation_matrix = df_subset.corr()

# Draw the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Olympic Data')
plt.show()
[Figure: heatmap titled "Correlation Matrix of Olympic Data"]
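`DataFrame.corr()` computes the Pearson correlation by default, which is the covariance divided by the product of the two standard deviations and which measures linear association only. For an ordinal code such as Medal_Value, a rank-based (Spearman) correlation may be worth comparing. The sketch below shows the difference on a hypothetical toy table where `y` grows monotonically but not linearly with `x`:

```python
import pandas as pd

# Hypothetical toy data: y increases with x, but not along a straight line.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

# Pearson correlation (the default of DataFrame.corr).
pearson = toy.corr().loc["x", "y"]

# The same value computed from the covariance and the standard deviations.
pearson_manual = toy.cov().loc["x", "y"] / (toy["x"].std() * toy["y"].std())

# Spearman correlation works on ranks, so any monotonic relation scores 1.
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f}, manual={pearson_manual:.3f}, spearman={spearman:.3f}")
```

On the notebook's data the equivalent call would be `df_subset.corr(method='spearman')`.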
In [14]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import os
print(os.listdir("datasets"))

# Load the athlete events and the NOC region lookup table
data = pd.read_csv('datasets/olympic_athlete_events.csv')
regions = pd.read_csv('datasets/olympic_noc_regions.csv')

merged = pd.merge(data, regions, on='NOC', how='left')
merged.head(5)
['olympic_athlete_events.csv', 'factory_sensor_simulator_2040.csv', 'olympic_noc_regions.csv', '.gitignore']
Out[14]:
  | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region | notes
0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN | China | NaN
1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN | China | NaN
2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN | Denmark | NaN
3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN
4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN | Netherlands | NaN
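A left merge keeps every athlete row even when its NOC code has no entry in the regions table, in which case `region` is NaN. A quick way to find such codes is the `indicator` parameter of `pd.merge`. The sketch below uses two hypothetical miniature tables (the names, and the code "XXX", are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (not the real data).
athletes = pd.DataFrame({"Name": ["A", "B", "C"], "NOC": ["CHN", "DEN", "XXX"]})
noc_regions = pd.DataFrame({"NOC": ["CHN", "DEN"], "region": ["China", "Denmark"]})

# indicator=True adds a _merge column saying where each row found a match.
checked = pd.merge(athletes, noc_regions, on="NOC", how="left", indicator=True)

# Rows marked "left_only" had no matching NOC in the regions table.
unmatched = checked.loc[checked["_merge"] == "left_only", "NOC"].unique().tolist()
print(unmatched)
```

The same check could be run on the real tables by adding `indicator=True` to the merge of `data` and `regions`.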
In [ ]:
# Keep only the rows where a gold medal was won, oldest athletes first
goldMedals = merged[merged.Medal == 'Gold']
goldMedals = goldMedals.sort_values(by="Age", ascending=False)
goldMedals.head()

# Check which columns contain missing values
goldMedals.isnull().any()

#plt.figure(figsize=(50, 10))
# Drop missing ages and convert the rest to integers
clean_age = goldMedals['Age'].dropna().astype(int)

# Count the gold medals won at each age
age_counts = clean_age.value_counts().sort_index()
x = age_counts.index
y = age_counts.values

plt.plot(x, y, 'o')
#plt.figure(figsize=(8, 5))
plt.title("Number of Olympic gold medalists by age")
plt.xlabel("Age")
plt.ylabel("Gold medalist count")
plt.grid(True)
plt.tight_layout()
plt.show()
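The counts per age plotted above can be turned into an empirical probability distribution by dividing by the total. The following is a minimal sketch on a hypothetical list of ten ages standing in for `clean_age` (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical ages standing in for clean_age (assumption: not the real data).
ages = pd.Series([22, 23, 23, 24, 24, 24, 25, 25, 30, 41])

# normalize=True turns counts into relative frequencies that sum to 1.
pmf = ages.value_counts(normalize=True).sort_index()
values = pmf.index.to_numpy()
probs = pmf.to_numpy()

# Mean and variance of the empirical distribution.
mean_age = (values * probs).sum()
var_age = ((values - mean_age) ** 2 * probs).sum()

# Probability that a randomly chosen entry in this toy sample is 25 or older.
p_25_plus = probs[values >= 25].sum()

print(pmf)
print(f"mean={mean_age:.2f}, std={var_age ** 0.5:.2f}, P(age>=25)={p_25_plus:.2f}")
```

On the real data, `clean_age.value_counts(normalize=True).sort_index()` would give the probability of each age among gold-medal entries, which can be plotted in the same way as the counts.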