< Home
Week 5: Probability
This is an example file to introduce you to JupyterLab and show you how you can organise and document your work. Feel free to edit this page as you please. The topmost cell is a small navigation link back to the home page. Optionally, you can also link the following week here (i.e. week 2) once you start working on it, to help visitors.
Course Material
Assignment
Investigate the probability distribution of your data.
Set up template notebooks and slides for your data set analysis.
I tried to use "pairplot" from seaborn. I used the data from the previous week, "120 years of Olympic history: athletes and results". The columns are:
ID - Unique number for each athlete;
Name - Athlete's name;
Sex - M or F;
Age - Integer;
Height - In centimeters;
Weight - In kilograms;
Team - Team name;
NOC - National Olympic Committee 3-letter code;
Games - Year and season;
Year - Integer;
Season - Summer or Winter;
City - Host city;
Sport - Sport;
Event - Event;
Medal - Gold, Silver, Bronze, or NA.
This data covers Olympic athletes and medalists.
I aim to uncover the relationships between the variables in this data.
To identify correlated variables, I want to plot histograms and pairwise scatter plots for every combination of variables, together with a correlation matrix.
I am using Jupyter Notebook as my tool.
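Before looking at the real data, here is a minimal sketch of how covariance and Pearson correlation relate. The data below is synthetic (it is not the Olympic data), and the names `toy`, `cov_hw`, `corr_hw`, and `manual_corr` are made up for this illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the Olympic data): weight loosely depends on height.
rng = np.random.default_rng(0)
height = rng.normal(175, 10, size=500)
weight = 0.9 * height - 90 + rng.normal(0, 8, size=500)
toy = pd.DataFrame({"Height": height, "Weight": weight})

# Covariance keeps the units of both variables (cm * kg here).
cov_hw = toy["Height"].cov(toy["Weight"])

# Pearson correlation is the covariance rescaled to the range [-1, 1].
corr_hw = toy["Height"].corr(toy["Weight"])

# The same value computed by hand: cov / (std_x * std_y).
manual_corr = cov_hw / (toy["Height"].std() * toy["Weight"].std())

print(f"cov={cov_hw:.2f}, corr={corr_hw:.3f}, manual={manual_corr:.3f}")
```

Because correlation is unit-free, it is easier to compare across pairs of variables than raw covariance, which is presumably why a correlation matrix is the more convenient summary here.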
In [6]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the data (change the file name to match your actual csv file)
df = pd.read_csv('datasets/olympic_athlete_events.csv')
# Check the contents of the data (skip if already loaded)
df.head()
# --- Preprocessing: convert categorical data to numbers so correlations can be computed ---
# Copy the data frame for analysis
df_analyze = df.copy()
# Encode 'Medal' as a number (Gold: 3, Silver: 2, Bronze: 1, NA: 0)
# pandas reads the string 'NA' in the csv as NaN by default, so fill it back in before mapping
medal_mapping = {'Gold': 3, 'Silver': 2, 'Bronze': 1, 'NA': 0}
df_analyze['Medal_Value'] = df_analyze['Medal'].fillna('NA').map(medal_mapping)
# Encode 'Sex' as a number (M: 0, F: 1)
df_analyze['Sex_Value'] = df_analyze['Sex'].map({'M': 0, 'F': 1})
# Select the columns for the correlation matrix (unique keys such as ID and Name are excluded)
target_columns = ['Age', 'Height', 'Weight', 'Year', 'Medal_Value', 'Sex_Value']
df_subset = df_analyze[target_columns].dropna()  # drop rows with missing values
In [8]:
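# Thin out the data before plotting (the full data set can be too heavy to draw)
# Sampling like this is recommended when there are more than 10,000 rows
sample_data = df_analyze.sample(n=5000, random_state=42)
# Label athletes without a medal explicitly; seaborn otherwise leaves rows
# with a missing hue value out of the plot
plot_data = sample_data[['Age', 'Height', 'Weight', 'Year', 'Medal']].fillna({'Medal': 'No medal'})
# Draw the pair plot
# hue='Medal' colors the points by medal type, including 'No medal'
sns.pairplot(plot_data, hue='Medal', palette='viridis')
plt.show()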
In [9]:
plt.figure(figsize=(10, 8))
# Compute the correlation coefficients
correlation_matrix = df_subset.corr()
# Draw the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Olympic Data')
plt.show()
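A heatmap is easy to scan by eye, but it can also help to list the variable pairs ordered by the strength of their correlation. The following is a sketch on synthetic data; the frame `toy` and its columns `a`, `b`, `c` are hypothetical, and the same steps could presumably be applied to `correlation_matrix` from the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is built from 'a', while 'c' is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=300),
    "c": rng.normal(size=300),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to a Series indexed by (row, column) and sort by absolute correlation.
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a plain descending sort would push to the bottom.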
In [14]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import os
print(os.listdir("datasets"))
# Load both datasets and attach region names to each athlete row via the NOC code
data = pd.read_csv('datasets/olympic_athlete_events.csv')
regions = pd.read_csv('datasets/olympic_noc_regions.csv')
merged = pd.merge(data, regions, on='NOC', how='left')
merged.head(5)
['olympic_athlete_events.csv', 'factory_sensor_simulator_2040.csv', 'olympic_noc_regions.csv', '.gitignore']
Out[14]:
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN | China | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN | China | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN | Denmark | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN | Netherlands | NaN |
In [ ]:
# Keep only the gold-medal rows, oldest athletes first
goldMedals = merged[merged.Medal == 'Gold']
goldMedals = goldMedals.sort_values(by="Age", ascending=False)
goldMedals.head()
goldMedals.isnull().any()
# Count gold medals per integer age, ignoring rows with a missing Age
clean_age = goldMedals['Age'].dropna().astype(int)
age_counts = clean_age.value_counts().sort_index()
x = age_counts.index
y = age_counts.values
# Plot the count for each age as a point
plt.plot(x, y, 'o')
plt.title("Olympic gold medalists by age")
plt.xlabel("Age")
plt.ylabel("Gold medalist count")
plt.grid(True)
plt.tight_layout()
plt.show()
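The counts plotted above can be turned into an empirical probability distribution by dividing each count by the total. Below is a minimal sketch of that step. The `ages` series holds made-up values standing in for `goldMedals['Age']`, and the names `pmf`, `mean_age`, and `std_age` are chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for goldMedals['Age']; swap in the real column.
ages = pd.Series([19, 21, 21, 23, 23, 23, 24, 24, 26, 30, np.nan])

# Drop missing ages, then normalise the counts so they sum to 1.
clean = ages.dropna().astype(int)
pmf = clean.value_counts(normalize=True).sort_index()

# Mean and standard deviation computed directly from the distribution.
values = pmf.index.to_numpy()
probs = pmf.to_numpy()
mean_age = (values * probs).sum()
std_age = np.sqrt(((values - mean_age) ** 2 * probs).sum())

print(pmf)
print(f"mean={mean_age:.2f}, std={std_age:.2f}")
```

Each entry of `pmf` is the fraction of rows with that age, so plotting `pmf` instead of raw counts gives the same shape with a y-axis that reads as a probability.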