Practice¶
In this class, Neil showed us an example of fitting polynomials to noisy data and plotting the results.
import numpy as np
import matplotlib.pyplot as plt
xmin = 0
xmax = 2
noise = 0.05
npts = 100
a = 0.5
b = 1
c = -.3
np.random.seed(0)
x = xmin+(xmax-xmin)*np.random.rand(npts) # generate random x
y = a+b*x+c*x*x+np.random.normal(0,noise,npts) # evaluate polynomial at x and add noise
plt.plot(x,y,'o')
[<matplotlib.lines.Line2D at 0xe3d0af3860d0>]
coeff1 = np.polyfit(x,y,1) # fit first-order polynomial
coeff2 = np.polyfit(x,y,2) # fit second-order polynomial
xfit = np.arange(xmin,xmax,(xmax-xmin)/npts)
pfit1 = np.poly1d(coeff1)
yfit1 = pfit1(xfit) # evaluate first-order fit
print(f"first-order fit coefficients: {coeff1}")
pfit2 = np.poly1d(coeff2)
yfit2 = pfit2(xfit) # evaluate second-order fit
print(f"second-order fit coefficients: {coeff2}")
plt.plot(x,y,'o')
plt.plot(xfit,yfit1,'g-',label='linear')
plt.plot(xfit,yfit2,'r-',label='quadratic')
plt.legend()
plt.show()
first-order fit coefficients: [0.41918275 0.69084816]
second-order fit coefficients: [-0.3225953 1.04205042 0.49756991]
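The two fits can also be compared numerically rather than only by eye. The following is a minimal sketch that rebuilds the same synthetic data and computes the residual sum of squares (RSS) for each order. The helper name `rss` is my own and was not part of the class example.

```python
import numpy as np

# Same synthetic data as in the practice cell
xmin, xmax, noise, npts = 0, 2, 0.05, 100
a, b, c = 0.5, 1, -0.3
np.random.seed(0)
x = xmin + (xmax - xmin) * np.random.rand(npts)
y = a + b * x + c * x * x + np.random.normal(0, noise, npts)

def rss(degree):
    """Residual sum of squares of a polynomial fit of the given degree (hypothetical helper)."""
    coeff = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeff, x)
    return float(np.sum(residuals ** 2))

rss1 = rss(1)
rss2 = rss(2)
print(f"first-order RSS:  {rss1:.4f}")
print(f"second-order RSS: {rss2:.4f}")
```

Because the data were generated from a quadratic, the second-order fit should leave a smaller RSS than the first-order one.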
Assignment¶
This week's assignment is "Fit a function to your data", so I tried fitting a function to a dataset I picked.
1. Import the data¶
This week, I picked a dataset from Kaggle about the Olympic Games.
The dataset has the following columns.
- ID - Unique number for each athlete;
- Name - Athlete's name;
- Sex - M or F;
- Age - Integer;
- Height - In centimeters;
- Weight - In kilograms;
- Team - Team name;
- NOC - National Olympic Committee 3-letter code;
- Games - Year and season;
- Year - Integer;
- Season - Summer or Winter;
- City - Host city;
- Sport - Sport;
- Event - Event;
- Medal - Gold, Silver, Bronze, or NA.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import os
print(os.listdir("datasets"))
['olympic_athlete_events.csv', 'factory_sensor_simulator_2040.csv', 'olympic_noc_regions.csv', '.gitignore']
#import datasets
data = pd.read_csv('datasets/olympic_athlete_events.csv')
regions = pd.read_csv('datasets/olympic_noc_regions.csv')
print(data)
ID Name Sex Age Height Weight \
0 1 A Dijiang M 24.0 180.0 80.0
1 2 A Lamusi M 23.0 170.0 60.0
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0
... ... ... .. ... ... ...
271111 135569 Andrzej ya M 29.0 179.0 89.0
271112 135570 Piotr ya M 27.0 176.0 59.0
271113 135570 Piotr ya M 27.0 176.0 59.0
271114 135571 Tomasz Ireneusz ya M 30.0 185.0 96.0
271115 135571 Tomasz Ireneusz ya M 34.0 185.0 96.0
Team NOC Games Year Season City \
0 China CHN 1992 Summer 1992 Summer Barcelona
1 China CHN 2012 Summer 2012 Summer London
2 Denmark DEN 1920 Summer 1920 Summer Antwerpen
3 Denmark/Sweden DEN 1900 Summer 1900 Summer Paris
4 Netherlands NED 1988 Winter 1988 Winter Calgary
... ... ... ... ... ... ...
271111 Poland-1 POL 1976 Winter 1976 Winter Innsbruck
271112 Poland POL 2014 Winter 2014 Winter Sochi
271113 Poland POL 2014 Winter 2014 Winter Sochi
271114 Poland POL 1998 Winter 1998 Winter Nagano
271115 Poland POL 2002 Winter 2002 Winter Salt Lake City
Sport Event Medal
0 Basketball Basketball Men's Basketball NaN
1 Judo Judo Men's Extra-Lightweight NaN
2 Football Football Men's Football NaN
3 Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 Speed Skating Speed Skating Women's 500 metres NaN
... ... ... ...
271111 Luge Luge Mixed (Men)'s Doubles NaN
271112 Ski Jumping Ski Jumping Men's Large Hill, Individual NaN
271113 Ski Jumping Ski Jumping Men's Large Hill, Team NaN
271114 Bobsleigh Bobsleigh Men's Four NaN
271115 Bobsleigh Bobsleigh Men's Four NaN
[271116 rows x 15 columns]
print(regions)
     NOC       region                 notes
0    AFG  Afghanistan                   NaN
1    AHO      Curacao  Netherlands Antilles
2    ALB      Albania                   NaN
3    ALG      Algeria                   NaN
4    AND      Andorra                   NaN
..   ...          ...                   ...
225  YEM        Yemen                   NaN
226  YMD        Yemen           South Yemen
227  YUG       Serbia            Yugoslavia
228  ZAM       Zambia                   NaN
229  ZIM     Zimbabwe                   NaN
[230 rows x 3 columns]
data.head(5)
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
data.describe()
| | ID | Age | Height | Weight | Year |
|---|---|---|---|---|---|
| count | 271116.000000 | 261642.000000 | 210945.000000 | 208241.000000 | 271116.000000 |
| mean | 68248.954396 | 25.556898 | 175.338970 | 70.702393 | 1978.378480 |
| std | 39022.286345 | 6.393561 | 10.518462 | 14.348020 | 29.877632 |
| min | 1.000000 | 10.000000 | 127.000000 | 25.000000 | 1896.000000 |
| 25% | 34643.000000 | 21.000000 | 168.000000 | 60.000000 | 1960.000000 |
| 50% | 68205.000000 | 24.000000 | 175.000000 | 70.000000 | 1988.000000 |
| 75% | 102097.250000 | 28.000000 | 183.000000 | 79.000000 | 2002.000000 |
| max | 135571.000000 | 97.000000 | 226.000000 | 214.000000 | 2016.000000 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   ID      271116 non-null  int64
 1   Name    271116 non-null  object
 2   Sex     271116 non-null  object
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object
 7   NOC     271116 non-null  object
 8   Games   271116 non-null  object
 9   Year    271116 non-null  int64
 10  Season  271116 non-null  object
 11  City    271116 non-null  object
 12  Sport   271116 non-null  object
 13  Event   271116 non-null  object
 14  Medal   39783 non-null   object
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
regions.head(5)
| | NOC | region | notes |
|---|---|---|---|
| 0 | AFG | Afghanistan | NaN |
| 1 | AHO | Curacao | Netherlands Antilles |
| 2 | ALB | Albania | NaN |
| 3 | ALG | Algeria | NaN |
| 4 | AND | Andorra | NaN |
merged = pd.merge(data, regions, on='NOC', how='left')
The merge function combines two datasets into one.
"on='NOC'": the common column used as the key
"how='left'": keep every row of the left dataset (data); rows with no matching NOC get NaN in the new columns
These parameters join the two datasets on NOC (the National Olympic Committee code).
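To see what `how='left'` changes, here is a small sketch on two made-up tables. The frames `athletes` and `noc_regions` and their values are hypothetical and only illustrate the behaviour; they are not taken from the Kaggle files.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (illustrative values only)
athletes = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C"],
    "NOC": ["CHN", "DEN", "XXX"],  # "XXX" is a made-up code with no match
})
noc_regions = pd.DataFrame({
    "NOC": ["CHN", "DEN", "NED"],
    "region": ["China", "Denmark", "Netherlands"],
})

# Left join: every athlete row is kept; unmatched codes get NaN in "region"
left = pd.merge(athletes, noc_regions, on="NOC", how="left")

# Inner join (the default): only rows whose NOC appears in both tables survive
inner = pd.merge(athletes, noc_regions, on="NOC", how="inner")

print(left)
print(inner)
```

With a left join the athlete table never loses rows, which is why it suits attaching region names to every athlete record.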
merged.head(5)
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN | China | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN | China | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN | Denmark | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN | Netherlands | NaN |
goldMedals = merged[(merged.Medal == 'Gold')]
goldMedals.head()
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN |
| 42 | 17 | Paavo Johannes Aaltonen | M | 28.0 | 175.0 | 64.0 | Finland | FIN | 1948 Summer | 1948 | Summer | London | Gymnastics | Gymnastics Men's Team All-Around | Gold | Finland | NaN |
| 44 | 17 | Paavo Johannes Aaltonen | M | 28.0 | 175.0 | 64.0 | Finland | FIN | 1948 Summer | 1948 | Summer | London | Gymnastics | Gymnastics Men's Horse Vault | Gold | Finland | NaN |
| 48 | 17 | Paavo Johannes Aaltonen | M | 28.0 | 175.0 | 64.0 | Finland | FIN | 1948 Summer | 1948 | Summer | London | Gymnastics | Gymnastics Men's Pommelled Horse | Gold | Finland | NaN |
| 60 | 20 | Kjetil Andr Aamodt | M | 20.0 | 176.0 | 85.0 | Norway | NOR | 1992 Winter | 1992 | Winter | Albertville | Alpine Skiing | Alpine Skiing Men's Super G | Gold | Norway | NaN |
goldMedals = goldMedals.sort_values(by="Age", ascending=False)
goldMedals.head()
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 233390 | 117046 | Oscar Gomer Swahn | M | 64.0 | NaN | NaN | Sweden | SWE | 1912 Summer | 1912 | Summer | Stockholm | Shooting | Shooting Men's Running Target, Single Shot, Team | Gold | Sweden | NaN |
| 105199 | 53238 | Charles Jacobus | M | 64.0 | NaN | NaN | United States | USA | 1904 Summer | 1904 | Summer | St. Louis | Roque | Roque Men's Singles | Gold | USA | NaN |
| 226374 | 113773 | Galen Carter "G. C." Spencer | M | 63.0 | 165.0 | NaN | Potomac Archers | USA | 1904 Summer | 1904 | Summer | St. Louis | Archery | Archery Men's Team Round | Gold | USA | NaN |
| 190952 | 95906 | Lida Peyton "Eliza" Pollock (McMillen-) | F | 63.0 | NaN | NaN | Cincinnati Archers | USA | 1904 Summer | 1904 | Summer | St. Louis | Archery | Archery Women's Team Round | Gold | USA | NaN |
| 261102 | 130662 | Robert W. Williams, Jr. | M | 63.0 | NaN | NaN | Potomac Archers | USA | 1904 Summer | 1904 | Summer | St. Louis | Archery | Archery Men's Team Round | Gold | USA | NaN |
goldMedals.isnull().any()
ID        False
Name      False
Sex       False
Age        True
Height     True
Weight     True
Team      False
NOC       False
Games     False
Year      False
Season    False
City      False
Sport     False
Event     False
Medal     False
region     True
notes      True
dtype: bool
plt.figure(figsize=(50, 10))
clean_age = goldMedals['Age'].dropna().astype(int)
<Figure size 5000x1000 with 0 Axes>
↑ I ran into a problem here at first because I had left out "dropna()".
This function removes the NaN values from the data.
Without it, I couldn't make the count chart.
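One likely cause of that problem (an assumption on my part, since I did not keep the error message) is that NaN cannot be cast to an integer. A minimal sketch with made-up ages:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value (not taken from the real dataset)
ages = pd.Series([24.0, 23.0, np.nan, 34.0], name="Age")

# Casting straight to int fails because NaN has no integer representation
try:
    ages.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Dropping NaN first makes the cast safe
clean = ages.dropna().astype(int)

print(cast_failed)
print(clean.tolist())
```

Calling `dropna()` before `astype(int)` is the same order used for `clean_age` in the cell above.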
sns.countplot(x=clean_age)
<Axes: xlabel='Age', ylabel='count'>
age_counts = clean_age.value_counts().sort_index()
x = age_counts.index
y = age_counts.values
coeff1 = np.polyfit(x,y,1) # fit first-order polynomial
pfit1 = np.poly1d(coeff1)
coeff2 = np.polyfit(x,y,2) # fit second-order polynomial
pfit2 = np.poly1d(coeff2)
coeff3 = np.polyfit(x,y,3) # fit third-order polynomial
pfit3 = np.poly1d(coeff3)
coeff4 = np.polyfit(x,y,4) # fit fourth-order polynomial
pfit4 = np.poly1d(coeff4)
coeff5 = np.polyfit(x,y,5) # fit fifth-order polynomial
pfit5 = np.poly1d(coeff5)
coeff6 = np.polyfit(x,y,6) # fit sixth-order polynomial
pfit6 = np.poly1d(coeff6)
coeff10 = np.polyfit(x,y,10) # fit tenth-order polynomial
pfit10 = np.poly1d(coeff10)
plt.plot(x,y,'o',label='data')
# one distinct color and label per fit so the curves can be told apart
plt.plot(x, pfit1(x), "r--", label='order 1')
plt.plot(x, pfit2(x), "g--", label='order 2')
plt.plot(x, pfit3(x), "y--", label='order 3')
plt.plot(x, pfit4(x), "b--", label='order 4')
plt.plot(x, pfit5(x), "m--", label='order 5')
plt.plot(x, pfit6(x), "c--", label='order 6')
plt.plot(x, pfit10(x), "k--", label='order 10')
plt.legend()
plt.title("Olympic gold medalists by age")
plt.xlabel("Age")
plt.ylabel("Gold medalist count")
plt.show()
print(f"10th-order fit coefficients: {coeff10}")
10th-order fit coefficients: [-2.30147944e-10 9.07035869e-08 -1.56937561e-05 1.56559886e-03 -9.94100686e-02 4.18290898e+00 -1.17642875e+02 2.17458732e+03 -2.51913162e+04 1.64841363e+05 -4.62779587e+05]
Result of assignment¶
I was able to fit polynomials of several orders to the data, up to a 10th-order polynomial. I used NumPy's polyfit function to get the coefficients.
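As a possible next step, the orders could be compared by their residual sum of squares. The sketch below uses made-up bell-shaped counts instead of the real medal counts, and it swaps in `numpy.polynomial.Polynomial.fit` rather than `np.polyfit`, because that class rescales the x values internally, which tends to help high-order fits numerically. The names `ages`, `counts`, and `rss_by_degree` are my own.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical age/count data shaped like a bell curve (not the real counts)
ages = np.arange(13, 65)
counts = 1500.0 * np.exp(-0.5 * ((ages - 25) / 5.0) ** 2)

rss_by_degree = {}
for degree in (1, 2, 4, 6, 10):
    # Polynomial.fit maps x onto [-1, 1] internally, which keeps
    # high-degree least-squares fits better conditioned than raw powers of age
    p = Polynomial.fit(ages, counts, degree)
    residuals = counts - p(ages)
    rss_by_degree[degree] = float(np.sum(residuals ** 2))

for degree, rss in rss_by_degree.items():
    print(f"degree {degree:2d}: RSS = {rss:.1f}")
```

A smaller RSS only says the curve passes closer to these points. A high-order polynomial can still behave badly between or beyond them, so the lowest RSS is not automatically the best model.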