[Sonam Dendup] - Fab Futures - Data Science

< Home

Week 1: Playground

This is an example file that introduces JupyterLab and shows how you can organise and document your work. Feel free to edit this page as you please. The topmost cell is a small navigation link back to the home page; once you start working on the following week (i.e. week 2), you can optionally link it there too, to help visitors.

Python code

In [77]:
print("Hello world! Greetings from Bhutan")
Hello world! Greetings from Bhutan

Week 01 (Introduction and Tools)

Import dataset 01: Kaggle

In [117]:
# Import relevant libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
In [118]:
data = pd.read_csv('~/work/sonam-dendup/datasets/StudentsPerformance.csv')  # Dataset downloaded from Kaggle
In [119]:
data.head(3) 
Out[119]:
   gender  race/ethnicity  parental level of education     lunch  test preparation course  math score  reading score  writing score
0  female         group B            bachelor's degree  standard                     none          72             72             74
1  female         group C                 some college  standard                completed          69             90             88
2  female         group B              master's degree  standard                     none          90             95             93
In [120]:
data.columns
Out[120]:
Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')
In [121]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
In [122]:
data.describe()
Out[122]:
       math score  reading score  writing score
count  1000.00000    1000.000000    1000.000000
mean     66.08900      69.169000      68.054000
std      15.16308      14.600192      15.195657
min       0.00000      17.000000      10.000000
25%      57.00000      59.000000      57.750000
50%      66.00000      70.000000      69.000000
75%      77.00000      79.000000      79.000000
max     100.00000     100.000000     100.000000
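The summary statistics reported by `describe()` can be reproduced by hand. The following is a minimal sketch on a small made-up sample (the `scores` values are illustrative and not taken from the dataset). It shows that `describe()` reports the sample standard deviation (`ddof=1`) and quartiles computed with linear interpolation:

```python
import pandas as pd

# Hypothetical sample, NOT taken from StudentsPerformance.csv
scores = pd.Series([50, 60, 70, 80, 90])

summary = scores.describe()

# describe() reports the sample standard deviation (ddof=1) ...
manual_std = scores.std(ddof=1)
# ... and quartiles computed with linear interpolation
manual_q1 = scores.quantile(0.25)
manual_q3 = scores.quantile(0.75)

print(summary)
print(manual_std, manual_q1, manual_q3)
```

The same reasoning applies to each numeric column of `data`, since `describe()` summarises every numeric column independently.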
In [123]:
data['gender'].nunique()
Out[123]:
2
In [124]:
data['parental level of education'].nunique() # check the unique data
Out[124]:
6
In [125]:
data['parental level of education'].value_counts()
Out[125]:
parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64
In [126]:
data['lunch'].nunique()
Out[126]:
2
In [127]:
data['gender'].value_counts()
Out[127]:
gender
female    518
male      482
Name: count, dtype: int64
In [133]:
data.isnull().sum()
Out[133]:
gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64
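This dataset has no missing values, so no cleaning is needed here. For datasets that do have gaps, a minimal sketch of the usual options follows. The `toy` frame and its values are hypothetical and only reuse two column names from this dataset for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; the real dataset has none
toy = pd.DataFrame({
    "math score": [72, np.nan, 90],
    "lunch": ["standard", None, "standard"],
})

# Count missing values per column, as in data.isnull().sum()
missing = toy.isnull().sum()

# Option 1: drop every row that contains at least one missing value
complete_rows = toy.dropna()

# Option 2: fill a numeric column with its mean and leave other columns untouched
filled = toy.fillna({"math score": toy["math score"].mean()})

print(missing)
print(complete_rows)
print(filled)
```

Which option is appropriate depends on how much data is missing and why; dropping rows is simplest but discards information.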

Data Visualization with Matplotlib (pyplot API)

In [90]:
x_first_50 = data['math score'].iloc[0:50]  # first 50 rows by position

# Line plot using the sliced data
plt.plot(x_first_50, color="black", marker='o', linestyle="-", linewidth=1.5)
plt.xlabel("Student Index (First 50)")
plt.ylabel("Math score")
plt.title("Student Performance Math (First 50 Students)")
#plt.savefig("fig3.png", dpi=300, bbox_inches='tight', transparent=True)
plt.show()
[Figure: line plot of math score for the first 50 students]
In [91]:
# BOX Plot
plt.boxplot(data['reading score'])
plt.show()
[Figure: box plot of reading score]
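The reading-score box plot can be cross-checked against the quartiles that `data.describe()` reports for that column. This is a minimal sketch, assuming Matplotlib's default whisker setting (`whis=1.5`); the variable names are illustrative:

```python
# Quartiles and extremes of 'reading score' as reported by data.describe()
q1, q3 = 59.0, 79.0
minimum, maximum = 17.0, 100.0

# Interquartile range: the height of the box
iqr = q3 - q1

# With the default whis=1.5, points beyond these fences are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

has_low_outliers = minimum < lower_fence
has_high_outliers = maximum > upper_fence

print(iqr, lower_fence, upper_fence)
print(has_low_outliers, has_high_outliers)
```

Because the minimum (17) lies below the lower fence (29), the plot should show at least one outlier marker under the lower whisker, and none above the upper one.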
In [92]:
data['reading score'].shape
Out[92]:
(1000,)
In [93]:
# Univariate => categorical data

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
In [94]:
# Pie Chart
count = data['parental level of education'].value_counts()
count
Out[94]:
parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64
In [95]:
plt.pie(count, labels = count.index, autopct = "%1.0f", explode =[0,0.05,0.05,0,0.05,0])
plt.axis('equal')
plt.title("Parental level of education")
#plt.savefig("fig2.png", dpi=300, bbox_inches='tight', transparent=True)
plt.show()
[Figure: pie chart of parental level of education]
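The `autopct="%1.0f"` argument formats each wedge's percentage with no decimals and without a percent sign. The sketch below approximates those labels in plain Python from the counts shown in `Out[94]`; the variable names are illustrative, and Matplotlib's internal arithmetic may differ in the last floating-point digits:

```python
# Counts copied from the value_counts() output above
counts = {
    "some college": 226,
    "associate's degree": 222,
    "high school": 196,
    "some high school": 179,
    "bachelor's degree": 118,
    "master's degree": 59,
}

total = sum(counts.values())

# Percentage share of each category
percent = {name: 100 * n / total for name, n in counts.items()}

# A format string passed as autopct is applied as: fmt % pct
labels = {name: "%1.0f" % p for name, p in percent.items()}

# "%%" inside the format string prints a literal percent sign
labels_with_sign = {name: "%1.0f%%" % p for name, p in percent.items()}

print(labels)
print(labels_with_sign)
```

Note that whole-number labels need not add up to 100: here the rounded values sum to 101. Using `"%1.1f%%"` would keep one decimal place and avoid that.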
In [48]:
# Count Plot

gen = data['gender'].value_counts()       
gen
Out[48]:
gender
female    518
male      482
Name: count, dtype: int64
In [49]:
#Barplot
plt.bar(gen.index, gen, color=['gray', 'black'])
plt.title("Gender count")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()
[Figure: bar plot of gender count]
In [64]:
sort_read = data.sort_values("reading score")
In [65]:
data.head(2)
Out[65]:
   gender  race/ethnicity  parental level of education     lunch  test preparation course  math score  reading score  writing score
0  female         group B            bachelor's degree  standard                     none          72             72             74
1  female         group C                 some college  standard                completed          69             90             88
In [66]:
# Bivariate => numerical + categorical
mal_mat = data[data["gender"] == "male"]["math score"]  # math scores of male students
mal_mat
Out[66]:
3      47
4      76
7      40
8      64
10     58
       ..
985    57
987    81
990    86
994    63
996    62
Name: math score, Length: 482, dtype: int64
In [67]:
fem_mat = data[data["gender"] == "female"]["math score"]  # math scores of female students
fem_mat
Out[67]:
0      72
1      69
2      90
5      71
6      88
       ..
993    62
995    88
997    59
998    68
999    77
Name: math score, Length: 518, dtype: int64
In [68]:
# Box Plot
plt.boxplot([mal_mat, fem_mat], tick_labels=["Male", "Female"])  # 'labels' was renamed 'tick_labels' in Matplotlib 3.9
plt.title("Math Score by Gender")
plt.show()
[Figure: box plots of math score for male and female students]
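Filtering once per category, as done for `mal_mat` and `fem_mat`, works well for two groups. With more categories, `groupby` can do the splitting in one step. This is a minimal sketch on a made-up frame (the `toy` values are hypothetical):

```python
import pandas as pd

# Hypothetical data reusing two column names from the dataset
toy = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "math score": [60, 70, 80, 90],
})

# One Series of scores per group; groupby sorts the group keys by default
groups = {name: grp["math score"] for name, grp in toy.groupby("gender")}

# Per-group summary without any manual filtering
means = toy.groupby("gender")["math score"].mean()

print(list(groups))
print(means)
```

The values of `groups` could then be passed to `plt.boxplot` as a list, with the keys used as tick labels.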
In [128]:
color_map = {"male": "skyblue", "female": "gray"}

for gender, color in color_map.items():
    df_gender = data[data["gender"] == gender]
    
    # Plot only the filtered data (df_gender) using the corresponding color and label
    plt.scatter(df_gender["reading score"], df_gender["writing score"], c=color, label=gender) 
    
plt.legend()
plt.xlabel("Reading score")
plt.ylabel("Writing score")
plt.title("Reading vs Writing Score by Gender")
#plt.savefig("fig2.png", dpi=300, bbox_inches='tight', transparent=True)
plt.show() 
[Figure: scatter plot of writing score against reading score, colored by gender]
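A scatter plot of two score columns shows their relationship visually; the Pearson correlation coefficient puts a number on it. This is a minimal sketch on made-up values (the `toy` frame is hypothetical and is not the real dataset):

```python
import pandas as pd

# Hypothetical scores reusing two column names from the dataset
toy = pd.DataFrame({
    "reading score": [50, 60, 70, 80],
    "writing score": [52, 58, 73, 79],
})

# Series.corr computes the Pearson correlation by default
r = toy["reading score"].corr(toy["writing score"])

print(round(r, 3))
```

On the real data, the same call would be `data["reading score"].corr(data["writing score"])`; a value close to 1 indicates a strong positive linear relationship.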

Object-Oriented API (Application Programming Interface)

In [162]:
fig, axs = plt.subplots(1,3, figsize = (15,5))

axs[0].plot(data["reading score"].iloc[0:50], color='black', marker='o', linestyle='-', markersize=2, linewidth=2)
axs[0].grid()
axs[0].set_title('LINE PLOT')
axs[0].set_ylabel('Reading Score')  # was a second set_xlabel call, which overwrote the x label
axs[0].set_xlabel('Index')

axs[1].hist(data["math score"], bins=10, color='skyblue', edgecolor='black', linewidth=2)
axs[1].set_title('HISTOGRAM')
axs[1].set_xlabel('Math Score')
axs[1].set_ylabel('Frequency') 
                 
axs[2].boxplot(data["math score"])
axs[2].set_title('BOX PLOT')
axs[2].set_xlabel('Math Score')
# plt.savefig("fig.png") 
plt.show() 
[Figure: three panels: line plot of reading score (first 50 students), histogram of math score, box plot of math score]
In [134]:
import kagglehub
from kagglehub import KaggleDatasetAdapter
file_path = "Exam_Score_Prediction.csv"

# Load the latest version
# dataset_load() replaces the deprecated load_dataset()
df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "kundanbedmutha/exam-score-prediction-dataset",
  file_path,
)
Downloading from https://www.kaggle.com/api/v1/datasets/download/kundanbedmutha/exam-score-prediction-dataset?dataset_version_number=2&file_name=Exam_Score_Prediction.csv...
100%|██████████| 1.37M/1.37M [00:00<00:00, 3.07MB/s]
In [167]:
import seaborn as sns  
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6)) 
sns.scatterplot(data=df.loc[:100], x='study_hours', y='exam_score', hue='gender')  # .loc slicing is label-based and includes label 100
plt.title('Exam Score vs Study Hours by Gender')
plt.show()
[Figure: scatter plot of exam_score against study_hours, colored by gender]
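The notebook slices with both `.iloc[0:50]` and `.loc[:100]`, and the two behave differently at the end of the range. This is a minimal sketch on a made-up frame with a default integer index (the `frame` name and its contents are illustrative):

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex 0..199
frame = pd.DataFrame({"value": range(200)})

# .loc is label-based and includes the end label
by_label = frame.loc[:100]

# .iloc is position-based and excludes the end position, like Python slices
by_position = frame.iloc[:100]

print(len(by_label), len(by_position))
```

So `df.loc[:100]` plots 101 rows when the index is the default one, while `df.iloc[:100]` would plot exactly 100.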
In [194]:
import seaborn as sns
import matplotlib.pyplot as plt


plt.figure(figsize=(7, 5))
sns.regplot(x="study_hours", y="exam_score", data=df.loc[:100])

plt.title("Linear Best Fit Line (Default)")
plt.show()
[Figure: exam_score against study_hours with a linear regression line]
In [207]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(7, 5))
sns.regplot(
    x="study_hours",
    y="exam_score",
    data=df.loc[:100],
    order=2,  # fit a second-degree polynomial instead of a straight line
)
plt.title("Non-Linear Best Fit Line (Polynomial Order 2)")
plt.show()
[Figure: exam_score against study_hours with a second-order polynomial fit]
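When `order` is greater than 1, `sns.regplot` estimates the curve with `numpy.polyfit`. The sketch below shows the same kind of fit directly in NumPy, comparing a straight line with a second-degree polynomial. The data is synthetic and follows an exact quadratic, so it only illustrates the mechanics and says nothing about the exam-score dataset:

```python
import numpy as np

# Synthetic data on an exact quadratic; NOT from the exam-score dataset
x = np.arange(0, 10, dtype=float)
y = 2.0 + 3.0 * x - 0.2 * x**2

# Least-squares fits; coefficients are returned highest power first
linear = np.polyfit(x, y, deg=1)
quadratic = np.polyfit(x, y, deg=2)

# Sum of squared residuals for each fit
linear_sse = np.sum((np.polyval(linear, x) - y) ** 2)
quadratic_sse = np.sum((np.polyval(quadratic, x) - y) ** 2)

print(quadratic)
print(linear_sse, quadratic_sse)
```

A higher-order fit never has a larger squared error on the data it was fitted to, so a lower error alone does not show that the polynomial is the better model; comparing the fits on held-out data is a safer check.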