Masato Takemura - Fab Futures - Data Science

Week 3: Fitting

This week is about fitting functions to data.

About this week

- Class material
- Video

Practice

In class, Neil showed us an example of fitting polynomials to data and plotting the result.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
xmin = 0
xmax = 2
noise = 0.05
npts = 100
a = 0.5
b = 1
c = -.3
np.random.seed(0)
x = xmin+(xmax-xmin)*np.random.rand(npts) # generate random x
y = a+b*x+c*x*x+np.random.normal(0,noise,npts) # evaluate polynomial at x and add noise

plt.plot(x,y,'o')
Out[1]:
[<matplotlib.lines.Line2D at 0xe3d0af3860d0>]
[Figure: scatter plot of the noisy data points]
In [2]:
coeff1 = np.polyfit(x,y,1) # fit first-order polynomial
coeff2 = np.polyfit(x,y,2) # fit second-order polynomial
xfit = np.arange(xmin,xmax,(xmax-xmin)/npts)
pfit1 = np.poly1d(coeff1)
yfit1 = pfit1(xfit) # evaluate first-order fit
print(f"first-order fit coefficients: {coeff1}")
pfit2 = np.poly1d(coeff2)
yfit2 = pfit2(xfit) # evaluate second-order fit
print(f"second-order fit coefficients: {coeff2}")
plt.plot(x,y,'o')
plt.plot(xfit,yfit1,'g-',label='linear')
plt.plot(xfit,yfit2,'r-',label='quadratic')
plt.legend()
plt.show()
first-order fit coefficients: [0.41918275 0.69084816]
second-order fit coefficients: [-0.3225953   1.04205042  0.49756991]
[Figure: data points with the linear fit (green) and the quadratic fit (red)]
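
One way to put a number on what the plot shows is to compare the root-mean-square (RMS) residual of the two fits. The following is a minimal sketch that regenerates the same synthetic data as the practice cell; the helper name `rms_residual` is my own and not part of the class material.

```python
import numpy as np

# Same synthetic data as in the practice cell
xmin, xmax, noise, npts = 0, 2, 0.05, 100
a, b, c = 0.5, 1, -0.3
np.random.seed(0)
x = xmin + (xmax - xmin) * np.random.rand(npts)
y = a + b * x + c * x * x + np.random.normal(0, noise, npts)

def rms_residual(order):
    """RMS difference between the data and a polynomial fit of this order (hypothetical helper)."""
    coeff = np.polyfit(x, y, order)
    return np.sqrt(np.mean((y - np.polyval(coeff, x)) ** 2))

rms1 = rms_residual(1)
rms2 = rms_residual(2)
print(f"first-order RMS residual:  {rms1:.4f}")
print(f"second-order RMS residual: {rms2:.4f}")
```

Because the data was generated from a quadratic, the second-order residual should come out close to the noise level (0.05), while the first-order residual stays larger.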

Assignment

This week's assignment is "Fit a function to your data", so I tried fitting a function to a dataset I picked.

1. Import the data

This week I picked a dataset from Kaggle about the Olympic Games.

The data has the following columns.

  • ID - Unique number for each athlete;
  • Name - Athlete's name;
  • Sex - M or F;
  • Age - Integer;
  • Height - In centimeters;
  • Weight - In kilograms;
  • Team - Team name;
  • NOC - National Olympic Committee 3-letter code;
  • Games - Year and season;
  • Year - Integer;
  • Season - Summer or Winter;
  • City - Host city;
  • Sport - Sport;
  • Event - Event;
  • Medal - Gold, Silver, Bronze, or NA.
In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
In [4]:
import os
print(os.listdir("datasets"))
['olympic_athlete_events.csv', 'factory_sensor_simulator_2040.csv', 'olympic_noc_regions.csv', '.gitignore']
In [5]:
#import datasets
data = pd.read_csv('datasets/olympic_athlete_events.csv')
regions = pd.read_csv('datasets/olympic_noc_regions.csv')
In [6]:
print(data)
            ID                      Name Sex   Age  Height  Weight  \
0            1                 A Dijiang   M  24.0   180.0    80.0   
1            2                  A Lamusi   M  23.0   170.0    60.0   
2            3       Gunnar Nielsen Aaby   M  24.0     NaN     NaN   
3            4      Edgar Lindenau Aabye   M  34.0     NaN     NaN   
4            5  Christine Jacoba Aaftink   F  21.0   185.0    82.0   
...        ...                       ...  ..   ...     ...     ...   
271111  135569                Andrzej ya   M  29.0   179.0    89.0   
271112  135570                  Piotr ya   M  27.0   176.0    59.0   
271113  135570                  Piotr ya   M  27.0   176.0    59.0   
271114  135571        Tomasz Ireneusz ya   M  30.0   185.0    96.0   
271115  135571        Tomasz Ireneusz ya   M  34.0   185.0    96.0   

                  Team  NOC        Games  Year  Season            City  \
0                China  CHN  1992 Summer  1992  Summer       Barcelona   
1                China  CHN  2012 Summer  2012  Summer          London   
2              Denmark  DEN  1920 Summer  1920  Summer       Antwerpen   
3       Denmark/Sweden  DEN  1900 Summer  1900  Summer           Paris   
4          Netherlands  NED  1988 Winter  1988  Winter         Calgary   
...                ...  ...          ...   ...     ...             ...   
271111        Poland-1  POL  1976 Winter  1976  Winter       Innsbruck   
271112          Poland  POL  2014 Winter  2014  Winter           Sochi   
271113          Poland  POL  2014 Winter  2014  Winter           Sochi   
271114          Poland  POL  1998 Winter  1998  Winter          Nagano   
271115          Poland  POL  2002 Winter  2002  Winter  Salt Lake City   

                Sport                                     Event Medal  
0          Basketball               Basketball Men's Basketball   NaN  
1                Judo              Judo Men's Extra-Lightweight   NaN  
2            Football                   Football Men's Football   NaN  
3          Tug-Of-War               Tug-Of-War Men's Tug-Of-War  Gold  
4       Speed Skating          Speed Skating Women's 500 metres   NaN  
...               ...                                       ...   ...  
271111           Luge                Luge Mixed (Men)'s Doubles   NaN  
271112    Ski Jumping  Ski Jumping Men's Large Hill, Individual   NaN  
271113    Ski Jumping        Ski Jumping Men's Large Hill, Team   NaN  
271114      Bobsleigh                      Bobsleigh Men's Four   NaN  
271115      Bobsleigh                      Bobsleigh Men's Four   NaN  

[271116 rows x 15 columns]
In [7]:
print(regions)
     NOC       region                 notes
0    AFG  Afghanistan                   NaN
1    AHO      Curacao  Netherlands Antilles
2    ALB      Albania                   NaN
3    ALG      Algeria                   NaN
4    AND      Andorra                   NaN
..   ...          ...                   ...
225  YEM        Yemen                   NaN
226  YMD        Yemen           South Yemen
227  YUG       Serbia            Yugoslavia
228  ZAM       Zambia                   NaN
229  ZIM     Zimbabwe                   NaN

[230 rows x 3 columns]
In [8]:
data.head(5)
Out[8]:
   ID                      Name Sex   Age  Height  Weight  \
0   1                 A Dijiang   M  24.0   180.0    80.0
1   2                  A Lamusi   M  23.0   170.0    60.0
2   3       Gunnar Nielsen Aaby   M  24.0     NaN     NaN
3   4      Edgar Lindenau Aabye   M  34.0     NaN     NaN
4   5  Christine Jacoba Aaftink   F  21.0   185.0    82.0

             Team  NOC        Games  Year  Season       City  \
0           China  CHN  1992 Summer  1992  Summer  Barcelona
1           China  CHN  2012 Summer  2012  Summer     London
2         Denmark  DEN  1920 Summer  1920  Summer  Antwerpen
3  Denmark/Sweden  DEN  1900 Summer  1900  Summer      Paris
4     Netherlands  NED  1988 Winter  1988  Winter    Calgary

           Sport                             Event Medal
0     Basketball       Basketball Men's Basketball   NaN
1           Judo      Judo Men's Extra-Lightweight   NaN
2       Football           Football Men's Football   NaN
3     Tug-Of-War       Tug-Of-War Men's Tug-Of-War  Gold
4  Speed Skating  Speed Skating Women's 500 metres   NaN
In [9]:
data.describe()
Out[9]:
                  ID            Age         Height         Weight           Year
count  271116.000000  261642.000000  210945.000000  208241.000000  271116.000000
mean    68248.954396      25.556898     175.338970      70.702393    1978.378480
std     39022.286345       6.393561      10.518462      14.348020      29.877632
min         1.000000      10.000000     127.000000      25.000000    1896.000000
25%     34643.000000      21.000000     168.000000      60.000000    1960.000000
50%     68205.000000      24.000000     175.000000      70.000000    1988.000000
75%    102097.250000      28.000000     183.000000      79.000000    2002.000000
max    135571.000000      97.000000     226.000000     214.000000    2016.000000
In [10]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
In [11]:
regions.head(5)
Out[11]:
   NOC       region                 notes
0  AFG  Afghanistan                   NaN
1  AHO      Curacao  Netherlands Antilles
2  ALB      Albania                   NaN
3  ALG      Algeria                   NaN
4  AND      Andorra                   NaN
In [12]:
merged = pd.merge(data, regions, on='NOC', how='left')

The call pd.merge(data, regions, on='NOC', how='left') merges the two datasets into one.

  • on='NOC' - the common column used as the key;
  • how='left' - keep every row of the left dataset (data).

With these parameters, the two datasets are joined on NOC (the National Olympic Committee code).
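
To see what how='left' does in practice, here is a small sketch with made-up rows (these are not rows from the Kaggle files, and the names `athletes` and `noc_regions` are only for illustration). It compares a left merge with an inner merge when one NOC code has no match.

```python
import pandas as pd

# Toy stand-ins for the athlete and region tables (made-up rows)
athletes = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C"],
    "NOC": ["CHN", "DEN", "XXX"],   # "XXX" has no match in the region table
})
noc_regions = pd.DataFrame({
    "NOC": ["CHN", "DEN", "NED"],
    "region": ["China", "Denmark", "Netherlands"],
})

# how='left' keeps every row of the left table, matched or not
left = pd.merge(athletes, noc_regions, on="NOC", how="left")
# how='inner' keeps only rows whose key appears in both tables
inner = pd.merge(athletes, noc_regions, on="NOC", how="inner")

print(left)
print(inner)
```

With a left merge, the unmatched athlete stays in the result and gets NaN in the region column, so no athlete rows are lost.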

In [13]:
merged.head(5)
Out[13]:
   ID                      Name Sex   Age  Height  Weight  \
0   1                 A Dijiang   M  24.0   180.0    80.0
1   2                  A Lamusi   M  23.0   170.0    60.0
2   3       Gunnar Nielsen Aaby   M  24.0     NaN     NaN
3   4      Edgar Lindenau Aabye   M  34.0     NaN     NaN
4   5  Christine Jacoba Aaftink   F  21.0   185.0    82.0

             Team  NOC        Games  Year  Season       City  \
0           China  CHN  1992 Summer  1992  Summer  Barcelona
1           China  CHN  2012 Summer  2012  Summer     London
2         Denmark  DEN  1920 Summer  1920  Summer  Antwerpen
3  Denmark/Sweden  DEN  1900 Summer  1900  Summer      Paris
4     Netherlands  NED  1988 Winter  1988  Winter    Calgary

           Sport                             Event Medal       region  notes
0     Basketball       Basketball Men's Basketball   NaN        China    NaN
1           Judo      Judo Men's Extra-Lightweight   NaN        China    NaN
2       Football           Football Men's Football   NaN      Denmark    NaN
3     Tug-Of-War       Tug-Of-War Men's Tug-Of-War  Gold      Denmark    NaN
4  Speed Skating  Speed Skating Women's 500 metres   NaN  Netherlands    NaN
In [14]:
goldMedals = merged[(merged.Medal == 'Gold')]
goldMedals.head()
Out[14]:
    ID                     Name Sex   Age  Height  Weight  \
3    4     Edgar Lindenau Aabye   M  34.0     NaN     NaN
42  17  Paavo Johannes Aaltonen   M  28.0   175.0    64.0
44  17  Paavo Johannes Aaltonen   M  28.0   175.0    64.0
48  17  Paavo Johannes Aaltonen   M  28.0   175.0    64.0
60  20       Kjetil Andr Aamodt   M  20.0   176.0    85.0

              Team  NOC        Games  Year  Season         City  \
3   Denmark/Sweden  DEN  1900 Summer  1900  Summer        Paris
42         Finland  FIN  1948 Summer  1948  Summer       London
44         Finland  FIN  1948 Summer  1948  Summer       London
48         Finland  FIN  1948 Summer  1948  Summer       London
60          Norway  NOR  1992 Winter  1992  Winter  Albertville

            Sport                             Event Medal   region  notes
3      Tug-Of-War       Tug-Of-War Men's Tug-Of-War  Gold  Denmark    NaN
42     Gymnastics  Gymnastics Men's Team All-Around  Gold  Finland    NaN
44     Gymnastics      Gymnastics Men's Horse Vault  Gold  Finland    NaN
48     Gymnastics  Gymnastics Men's Pommelled Horse  Gold  Finland    NaN
60  Alpine Skiing       Alpine Skiing Men's Super G  Gold   Norway    NaN
In [15]:
goldMedals = goldMedals.sort_values(by="Age", ascending=False)
goldMedals.head()
Out[15]:
            ID                                     Name Sex   Age  Height  Weight  \
233390  117046                        Oscar Gomer Swahn   M  64.0     NaN     NaN
105199   53238                          Charles Jacobus   M  64.0     NaN     NaN
226374  113773             Galen Carter "G. C." Spencer   M  63.0   165.0     NaN
190952   95906  Lida Peyton "Eliza" Pollock (McMillen-)   F  63.0     NaN     NaN
261102  130662                  Robert W. Williams, Jr.   M  63.0     NaN     NaN

                      Team  NOC        Games  Year  Season       City  \
233390              Sweden  SWE  1912 Summer  1912  Summer  Stockholm
105199       United States  USA  1904 Summer  1904  Summer  St. Louis
226374     Potomac Archers  USA  1904 Summer  1904  Summer  St. Louis
190952  Cincinnati Archers  USA  1904 Summer  1904  Summer  St. Louis
261102     Potomac Archers  USA  1904 Summer  1904  Summer  St. Louis

           Sport                                             Event Medal  region  notes
233390  Shooting  Shooting Men's Running Target, Single Shot, Team  Gold  Sweden    NaN
105199     Roque                               Roque Men's Singles  Gold     USA    NaN
226374   Archery                          Archery Men's Team Round  Gold     USA    NaN
190952   Archery                        Archery Women's Team Round  Gold     USA    NaN
261102   Archery                          Archery Men's Team Round  Gold     USA    NaN
In [16]:
goldMedals.isnull().any()
Out[16]:
ID        False
Name      False
Sex       False
Age        True
Height     True
Weight     True
Team      False
NOC       False
Games     False
Year      False
Season    False
City      False
Sport     False
Event     False
Medal     False
region     True
notes      True
dtype: bool
In [17]:
plt.figure(figsize=(50, 10))
clean_age = goldMedals['Age'].dropna().astype(int)
<Figure size 5000x1000 with 0 Axes>

↑ At first I ran into an error in this cell because I had not called dropna().
dropna() removes the NaN entries from the data.
Without it, I could not make the count chart.
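
The reason dropna() matters here is that NaN has no integer representation, so astype(int) fails on a column that still contains missing values. A minimal sketch with made-up ages (not the real goldMedals['Age'] column):

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value, standing in for goldMedals['Age']
ages = pd.Series([24.0, np.nan, 23.0, 34.0], name="Age")

try:
    ages.astype(int)           # NaN cannot be converted to an integer
    cast_failed = False
except ValueError as err:      # pandas raises a ValueError (or a subclass of it)
    cast_failed = True
    print(f"astype(int) failed: {err}")

clean = ages.dropna().astype(int)   # drop the NaN first, then convert
print(clean.tolist())
```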

In [18]:
sns.countplot(x=clean_age)
Out[18]:
<Axes: xlabel='Age', ylabel='count'>
[Figure: count plot of gold medalists by age]
In [19]:
age_counts = clean_age.value_counts().sort_index()
x = age_counts.index
y = age_counts.values
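
The value_counts().sort_index() step turns the list of ages into one (age, count) pair per distinct age, which is the x and y that polyfit needs. A small sketch on made-up ages shows the shape of the result; the `ages` values here are invented for illustration.

```python
import pandas as pd

# Made-up ages; in the notebook this would be clean_age
ages = pd.Series([24, 21, 24, 20, 21, 24], name="Age")

counts = ages.value_counts().sort_index()  # count per age, ordered by age
x = counts.index.to_numpy()    # distinct ages
y = counts.to_numpy()          # how many times each age occurs

print(x)
print(y)
```

Without sort_index(), value_counts() orders the rows by count rather than by age, which would make a line plot of the fit jump back and forth.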
In [20]:
coeff1 = np.polyfit(x,y,1) # fit first-order polynomial
pfit1 = np.poly1d(coeff1)

coeff2 = np.polyfit(x,y,2) # fit second-order polynomial
pfit2 = np.poly1d(coeff2)

coeff3 = np.polyfit(x,y,3) # fit third-order polynomial
pfit3 = np.poly1d(coeff3)

coeff4 = np.polyfit(x,y,4) # fit fourth-order polynomial
pfit4 = np.poly1d(coeff4)

coeff5 = np.polyfit(x,y,5) # fit fifth-order polynomial
pfit5 = np.poly1d(coeff5)

coeff6 = np.polyfit(x,y,6) # fit sixth-order polynomial
pfit6 = np.poly1d(coeff6)

coeff10 = np.polyfit(x,y,10) # fit tenth-order polynomial
pfit10 = np.poly1d(coeff10)
In [23]:
plt.plot(x, y, 'o', label='data')
# one dashed line per fit order, each with its own color and legend entry
plt.plot(x, pfit1(x), "r--", label='order 1')
plt.plot(x, pfit2(x), "g--", label='order 2')
plt.plot(x, pfit3(x), "y--", label='order 3')
plt.plot(x, pfit4(x), "b--", label='order 4')
plt.plot(x, pfit5(x), "m--", label='order 5')
plt.plot(x, pfit6(x), "c--", label='order 6')
plt.plot(x, pfit10(x), "k--", label='order 10')
plt.legend()
plt.title("Olympic gold medalists by age")
plt.xlabel("Age")
plt.ylabel("Gold medalist count")
plt.show()
[Figure: gold medalist count versus age, with the polynomial fits overlaid as dashed lines]
In [22]:
print(f"10th-order fit coefficients: {coeff10}")
10th-order fit coefficients: [-2.30147944e-10  9.07035869e-08 -1.56937561e-05  1.56559886e-03
 -9.94100686e-02  4.18290898e+00 -1.17642875e+02  2.17458732e+03
 -2.51913162e+04  1.64841363e+05 -4.62779587e+05]
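
np.polyfit returns the coefficients ordered from the highest power down to the constant term, so in the printed array the first number multiplies x**10 and the last one is the constant. A small noise-free check of that ordering, reusing the quadratic from the practice section:

```python
import numpy as np

x = np.linspace(0, 2, 20)
y = 0.5 + 1.0 * x - 0.3 * x**2      # constant 0.5, linear 1.0, quadratic -0.3

coeff = np.polyfit(x, y, 2)
print(coeff)                          # ordered as [x**2 term, x term, constant]

p = np.poly1d(coeff)                  # poly1d uses the same ordering
print(p(1.0))                         # evaluates 0.5 + 1.0 - 0.3
```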

Result of assignment

I was able to fit a 10th-order polynomial to the data. I used NumPy's polyfit function to get the coefficients.
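
Raising the order never increases the residual on the points used for the fit, so it can help to look at how much each extra order actually gains. The sketch below is only an illustration under assumptions: it uses made-up count data with a single peak (not the Olympic counts), and it uses `numpy.polynomial.Polynomial.fit`, which rescales x internally and tends to be better conditioned than `np.polyfit` at high order. The variable names are my own.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic stand-in for the age counts: one peak near age 25 (assumed shape)
ages = np.arange(13, 65)
counts = 1200.0 * np.exp(-0.5 * ((ages - 25) / 5.0) ** 2)

orders = (1, 2, 3, 4, 5, 6, 10)
rms = {}
for order in orders:
    p = Polynomial.fit(ages, counts, order)   # x is mapped to [-1, 1] internally
    rms[order] = np.sqrt(np.mean((counts - p(ages)) ** 2))
    print(f"order {order:2d}: RMS residual = {rms[order]:.1f}")
```

A table like this makes it easier to judge whether going from 6th to 10th order is worth the extra coefficients for a given dataset.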
