Week 2 Assignment¶
In data science, fitting means using given data to find a model that explains how different values are connected. The goal is to find a line, curve, or formula that matches the data as closely as possible. Such a model helps us understand patterns, make predictions, and see how one variable changes when another changes. In short, fitting is about finding the rule that best describes the data.
What is a "Fit"?¶
In data science, a fit means adjusting a function so that it matches the given data. When we fit data, we create a model that explains what we observe and shows the relationship between values.
Why Do We Do Fitting?¶
Fitting helps us understand data better and gain useful insights. It is used to predict results (regression), estimate missing values within the range of the data (interpolation), predict unknown values beyond the data (extrapolation), and build models that support decision-making, as the sketch below illustrates.
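To make this concrete, here is a minimal sketch with made-up (x, y) observations (the numbers are assumptions chosen only for illustration); one fitted line serves regression, interpolation, and extrapolation:
import numpy as np
# Hypothetical observations (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
# Regression: fit a straight line y ≈ a*x + b
a, b = np.polyfit(x, y, 1)
print(f"fitted line: y = {a:.2f}x + {b:.2f}")
# Interpolation: predict inside the observed range
print(f"y(2.5) ≈ {np.polyval((a, b), 2.5):.2f}")
# Extrapolation: predict beyond the observed range
print(f"y(6.0) ≈ {np.polyval((a, b), 6.0):.2f}")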
Important Data Science Terminologies¶
- Regression is a method that finds the relationship between a number (output) and one or more factors (inputs). It is the process of discovering, from data, a formula that connects "hours" (input) with "score" (output), for example score = 10 × hours + 30.
- Function is a rule or formula that takes an input and produces an output. Once found, the equation above is simply a function.
- Model is a function that has been learned (trained and tested) from data; it is the tool that makes predictions.
- Coefficient tells how strongly an input affects the output; it is the number in front of a variable. In the model above, the coefficient is 10, indicating how strongly hours affect the score.
- Prediction is the value the model thinks will happen: the model's guess for a given input.
- Intercept is the value the model predicts when all inputs are zero. The model above predicts that a student who studies zero hours will score 30.
- Error is the difference between what really happened and what the model predicted.
- Residual is another word for error in data science. Residual = Actual value − Predicted value. It tells how wrong the prediction was and whether the model guessed too high or too low, as the sketch below demonstrates.
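To tie these terms together, a minimal sketch with made-up hours/score pairs (the data are assumptions) recovers a coefficient and intercept close to the example model above and prints the residuals:
import numpy as np
# Hypothetical study-hours vs. exam-score pairs (illustrative only)
hours = np.array([1, 2, 3, 4, 5])
score = np.array([41, 49, 62, 70, 79])
# Fit score ≈ coefficient * hours + intercept (the model)
coefficient, intercept = np.polyfit(hours, score, 1)
predicted = coefficient * hours + intercept   # predictions
residuals = score - predicted                 # actual − predicted
print(f"coefficient = {coefficient:.1f}, intercept = {intercept:.1f}")
print("residuals:", np.round(residuals, 1))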
import pandas as pd
df = pd.read_excel("datasets/Large Numbers.xlsx")
print("--- First 5 Rows ---")
--- First 5 Rows ---
print(df.head())
Name Accuracy Time (total) Score (780) Score (%) \ 0 Abhishek Subba 0.691000 6951 748 0.958974 1 Abishek Adhikari 0.637108 4985 785 1.006410 2 Anjana Subba 0.820000 5311 846 1.084615 3 Arpan Rai 0.828077 5547 790 1.012821 4 Arpana Ghimirey 0.783438 4773 509 0.652564 Exercises started Trophies Easy Moderate Hard Last submission date 0 29 Gold 4 0 0 2025-10-22T14:53:12 1 30 Diamond 4 0 0 2025-08-18T11:21:05 2 33 Diamond 2 2 0 2025-09-10T13:22:29 3 29 Diamond 4 0 0 2025-08-09T18:04:17 4 21 Bronze 1 0 0 2025-10-22T12:40:02
distinct_count = df.drop_duplicates().shape[0]
print(distinct_count)
28
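For reference, an equivalent check (a sketch, assuming the df loaded above) counts the duplicated rows directly:
# Rows that are exact copies of an earlier row
n_dupes = df.duplicated().sum()
print(f"{n_dupes} duplicate rows; {len(df) - n_dupes} distinct rows")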
data = [
["69.10%", "Gold"],
["63.71%", "Diamond"],
["82.00%", "Diamond"],
["82.81%", "Diamond"],
["78.34%", "Bronze"],
["70.38%", "Bronze"],
["73.07%", "Diamond"],
["77.57%", "Diamond"],
["75.23%", "Diamond"],
["75.83%", "Diamond"],
["75.55%", "Diamond"],
["79.70%", "Diamond"],
["62.28%", "Bronze"],
["65.09%", "Diamond"],
["83.23%", "Bronze"],
["71.27%", "Bronze"],
["67.59%", "Diamond"],
["56.79%", "Diamond"],
["77.49%", "Diamond"],
["72.74%", "Diamond"],
["76.60%", "Diamond"],
["73.73%", "Diamond"],
["81.97%", "Diamond"],
["85.39%", "Bronze"],
["80.88%", "Diamond"],
["85.73%", "Bronze"],
["80.36%", "Diamond"],
]
# Parse accuracy & trophies
acc = [float(r[0].rstrip('%')) for r in data]
tro = [{"Bronze":1, "Gold":2, "Diamond":3}[r[1]] for r in data]
print("Max Acc:", max(acc))
print("Min Acc:", min(acc))
print("Max Trophy:", max(tro), "(Diamond)")
print("Min Trophy:", min(tro), "(Bronze)")
Max Acc: 85.73
Min Acc: 56.79
Max Trophy: 3 (Diamond)
Min Trophy: 1 (Bronze)
import matplotlib.pyplot as plt
# Data: [accuracy_float, trophy_rank]
data = [
[69.10, 2], # Gold = 2
[63.71, 3], # Diamond = 3
[82.00, 3],
[82.81, 3],
[78.34, 1], # Bronze = 1
[70.38, 1],
[73.07, 3],
[77.57, 3],
[75.23, 3],
[75.83, 3],
[75.55, 3],
[79.70, 3],
[62.28, 1],
[65.09, 3],
[83.23, 1],
[71.27, 1],
[67.59, 3],
[56.79, 3],
[77.49, 3],
[72.74, 3],
[76.60, 3],
[73.73, 3],
[81.97, 3],
[85.39, 1],
[80.88, 3],
[85.73, 1],
[80.36, 3],
]
acc, tro = zip(*data)
# Map trophy numbers back to labels for y-tick
trophy_labels = {1: "Bronze", 2: "Gold", 3: "Diamond"}
plt.figure(figsize=(8, 4))
plt.scatter(acc, tro, c=tro, cmap='viridis', s=60, alpha=0.8)
plt.yticks(list(trophy_labels.keys()), list(trophy_labels.values()))
plt.xlabel("Accuracy (%)")
plt.ylabel("Trophy Level")
plt.title("Accuracy vs Trophy Level")
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
# Raw data from PDF
data = [
["Abhishek Subba", "69.10%", 6951, 748, "95.90%", 29, "Gold", 4, 0, 0, "2025-10-22T14:53:12"],
["Abishek Adhikari", "63.71%", 4985, 785, "100.64%", 30, "Diamond", 4, 0, 0, "2025-08-18T11:21:05"],
["Anjana Subba", "82.00%", 5311, 846, "108.46%", 33, "Diamond", 2, 2, 0, "2025-09-10T13:22:29"],
["Arpan Rai", "82.81%", 5547, 790, "101.28%", 29, "Diamond", 4, 0, 0, "2025-08-09T18:04:17"],
["Arpana Ghimirey", "78.34%", 4773, 509, "65.26%", 21, "Bronze", 1, 0, 0, "2025-10-22T12:40:02"],
["Chimi Dolma Gurung", "70.38%", 4093, 468, "60.00%", 23, "Bronze", 1, 0, 0, "2025-10-01T12:20:50"],
["Dawa Kelwang Keltshok", "73.07%", 4601, 782, "100.26%", 31, "Diamond", 4, 0, 0, "2025-05-20T13:15:02"],
["Jamyang Gurung", "77.57%", 5469, 781, "100.13%", 30, "Diamond", 4, 0, 0, "2025-05-15T20:20:30"],
["Jamyang Tenzin Namgyel", "75.23%", 5180, 797, "102.18%", 30, "Diamond", 2, 3, 0, "2025-09-03T14:34:27"],
["Jigme Tenzin Wangpo", "75.83%", 5037, 782, "100.26%", 30, "Diamond", 2, 0, 0, "2025-10-22T08:31:26"],
["Karma Dema Chokey", "75.55%", 16432, 788, "101.03%", 30, "Diamond", 4, 0, 0, "2025-09-25T13:18:29"],
["Kishan Rai", "79.70%", 4460, 800, "102.56%", 31, "Diamond", 0, 3, 0, "2025-09-29T12:12:10"],
["Kuenga Rinchen", "62.28%", 9502, 451, "57.82%", 22, "Bronze", 1, 0, 0, "2025-08-08T17:23:50"],
["Leki Tshomo", "65.09%", 15455, 782, "100.26%", 30, "Diamond", 1, 3, 0, "2025-11-03T20:48:32"],
["Lhakey Choden", "83.23%", 2665, 459, "58.85%", 20, "Bronze", 0, 2, 0, "2025-09-09T13:46:40"],
["Melan Rai", "71.27%", 7520, 448, "57.44%", 21, "Bronze", 1, 1, 0, "2025-08-28T13:22:57"],
["Mercy Jeshron Subba", "67.59%", 7630, 786, "100.77%", 31, "Diamond", 3, 0, 0, "2025-10-15T15:00:19"],
["Najimul Mia", "56.79%", 10148, 788, "101.03%", 30, "Diamond", 3, 1, 1, "2025-08-29T19:06:48"],
["Nima Kelwang Keltshok", "77.49%", 5491, 785, "100.64%", 30, "Diamond", 4, 0, 0, "2025-05-13T17:56:59"],
["Radha Dulal", "72.74%", 7431, 800, "102.56%", 31, "Diamond", 3, 1, 0, "2025-09-10T17:06:07"],
["Rigyel Singer", "76.60%", 10525, 787, "100.90%", 30, "Diamond", 0, 4, 1, "2025-10-08T13:28:29"],
["Susil Acharja", "73.73%", 5372, 794, "101.79%", 31, "Diamond", 4, 0, 0, "2025-06-08T19:19:10"],
["Tashi Tshokey Wangmo", "81.97%", 9897, 800, "102.56%", 30, "Diamond", 4, 0, 0, "2025-08-20T12:29:57"],
["Tashi Wangchuk", "85.39%", 5708, 472, "60.51%", 22, "Bronze", 0, 3, 0, "2025-09-08T12:30:39"],
["Tenzin Sonam Dolkar", "80.88%", 9247, 808, "103.59%", 31, "Diamond", 1, 2, 0, "2025-09-29T13:38:06"],
["Yeshey Tshoki", "85.73%", 2958, 412, "52.82%", 19, "Bronze", 1, 0, 0, "2025-08-06T14:36:48"],
["Yogira Kami", "80.36%", 7782, 783, "100.38%", 31, "Diamond", 2, 0, 0, "2025-10-08T13:25:35"],
]
# Create DataFrame
df = pd.DataFrame(data, columns=[
"Name", "Accuracy", "Time (total)", "Score (780)", "Score (%)",
"Exercises started", "Trophies", "Easy", "Moderate", "Hard", "Last submission date"
])
# Clean numeric columns
df["Accuracy"] = df["Accuracy"].str.rstrip('%').astype(float)
df["Score (%)"] = df["Score (%)"].str.rstrip('%').astype(float)
df["TrophyValue"] = df["Trophies"].map({"Bronze": 1, "Gold": 2, "Diamond": 3})
# Choose relevant numeric columns
cols = [
"Accuracy",
"Time (total)",
"Score (%)",
"Exercises started",
"TrophyValue",
"Score (780)"
]
# Scatter matrix
pd.plotting.scatter_matrix(df[cols], figsize=(12, 12), alpha=0.7, diagonal="hist")
plt.suptitle("Scatter Matrix: Performance Metrics", y=1.02)
plt.tight_layout()
plt.show()
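To quantify what the scatter matrix shows visually, a small follow-up (a sketch, assuming the df and cols defined above) prints the pairwise Pearson correlations:
# Pairwise Pearson correlations for the same columns as the scatter matrix
print(df[cols].corr().round(2))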
Probability Distribution Analysis of Key Student Data Features¶
This notebook examines how the main numerical values in the student dataset are spread. Understanding these distributions helps us see patterns in the data and choose the right statistical techniques and machine learning models.
# Define key numerical features from the PDF data
KEY_NUMERIC_FEATURES = [
"Accuracy", # Converted from "69.10%" β 69.10 (float)
"Time (total)", # Total time spent (seconds)
"Score (%)", # Final score percentage
"Exercises started",
"Score (780)", # Raw score out of 780
"TrophyValue" # Mapped: Bronze=1, Gold=2, Diamond=3
]
# Reuse the cleaned DataFrame (df) built in the cells above: the same
# 27-student table, with Accuracy and Score (%) as floats and TrophyValue mapped
# Select & clean
df_clean = df[KEY_NUMERIC_FEATURES].dropna()
print(f"Data shape for analysis: {df_clean.shape}")
print("Features to analyze:", KEY_NUMERIC_FEATURES)
Data shape for analysis: (27, 6)
Features to analyze: ['Accuracy', 'Time (total)', 'Score (%)', 'Exercises started', 'Score (780)', 'TrophyValue']
1. Visual Inspection: Histograms and Q-Q Plots¶
I will use histograms to see the shape of each distribution and Quantile-Quantile (Q-Q) plots to check how closely the data follow a normal distribution. If the data are normally distributed, the points on the Q-Q plot should fall near the straight reference line.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
# Reuse df and KEY_NUMERIC_FEATURES from the cells above (same 27-student
# table, percentage columns cleaned, TrophyValue mapped)
df_clean = df[KEY_NUMERIC_FEATURES].copy() # all rows complete
# Define plotting function
def plot_distribution(data, feature):
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    # Histogram + KDE
    sns.histplot(data[feature], kde=True, ax=axes[0], color='steelblue', bins=12)
    axes[0].set_title(f'Histogram of {feature}', fontsize=14)
    axes[0].set_xlabel(feature)
    axes[0].set_ylabel('Frequency')
    # Q-Q Plot
    stats.probplot(data[feature], dist="norm", plot=axes[1])
    axes[1].set_title(f'Q-Q Plot of {feature}', fontsize=14)
    axes[1].get_lines()[0].set_markerfacecolor('orange')
    axes[1].get_lines()[0].set_markersize(6)
    plt.tight_layout()
    plt.show()

# Generate plots for all key features
for feature in KEY_NUMERIC_FEATURES:
    plot_distribution(df_clean, feature)
2. Statistical Testing: Shapiro-Wilk Test for Normality¶
The Shapiro-Wilk test is a statistical method used to check whether data follows a normal distribution.
Null hypothesis (H₀): The data comes from a normal distribution.
Alternative hypothesis (H₁): The data does not come from a normal distribution.
If the p-value is smaller than the chosen level (for example, 0.05), we reject the null hypothesis and say the data is not normally distributed.
import pandas as pd
import numpy as np
from scipy.stats import shapiro
# --- 1. Reuse the cleaned data ---
# df from the cells above already holds the 27-student table with the
# percentage columns converted to floats
# Select numeric columns for normality testing
numeric_cols = ['Accuracy', 'Time (total)', 'Score (780)', 'Score (%)', 'Exercises started']
# --- 2. Perform Shapiro-Wilk test ---
results = []
alpha = 0.05
for col in numeric_cols:
    data = df[col].dropna()
    if len(data) < 3:
        w, p = np.nan, np.nan
        conclusion = 'Insufficient data (<3)'
    else:
        w, p = shapiro(data)
        conclusion = 'Normal' if p > alpha else 'Not Normal'
    results.append({
        'Variable': col,
        'W_statistic': round(w, 6) if not np.isnan(w) else np.nan,
        'p_value': round(p, 6) if not np.isnan(p) else np.nan,
        'Conclusion (α=0.05)': conclusion
    })
# Create results DataFrame
results_df = pd.DataFrame(results)
# --- 3. Export to CSV ---
output_file = 'shapiro_wilk_results.csv'
results_df.to_csv(output_file, index=False)
print(f"β
Shapiro-Wilk test results saved to '{output_file}'")
# Optional: display results
print("\nShapiro-Wilk Test Results:")
print(results_df.to_string(index=False))
✅ Shapiro-Wilk test results saved to 'shapiro_wilk_results.csv'
Shapiro-Wilk Test Results:
Variable W_statistic p_value Conclusion (α=0.05)
Accuracy 0.958028 0.332961 Normal
Time (total) 0.861891 0.001999 Not Normal
Score (780) 0.659634 0.000001 Not Normal
Score (%) 0.659573 0.000001 Not Normal
Exercises started 0.736337 0.000013 Not Normal
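Since Time (total) is strongly right-skewed, a natural follow-up (a sketch, assuming the df and alpha from above; the outcome is not asserted here) is to re-test it after a log transform:
# Re-run Shapiro-Wilk on the log-transformed total time
w_log, p_log = shapiro(np.log(df['Time (total)']))
print(f"log Time (total): W={w_log:.4f}, p={p_log:.4f} -> "
      f"{'Normal' if p_log > alpha else 'Not Normal'} (α={alpha})")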
Predictive Modeling and Visualization: Score (%) Prediction¶
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
# ✅ Reuse the full 27-student DataFrame built above
# (Accuracy and Score (%) already converted to floats)
# Use only rows with valid numeric data (all 27 are valid)
X = df[["Accuracy", "Time (total)", "Exercises started", "Trophies"]]
y = df["Score (%)"]
# ✅ Preprocessor with handle_unknown="ignore"
preprocessor = ColumnTransformer(
transformers=[
("num", "passthrough", ["Accuracy", "Time (total)", "Exercises started"]),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["Trophies"]),
],
remainder="drop"
)
model = Pipeline([
("prep", preprocessor),
("rf", RandomForestRegressor(n_estimators=50, random_state=42))
])
# ✅ Use cv=min(5, n_samples-1) to avoid error
n_samples = len(df)
cv = min(5, n_samples - 1) # max folds = 26, but 5 is safe for n=27
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(f"β
CV RΒ² scores ({cv}-fold):", [round(s, 4) for s in cv_scores])
print("β
Mean CV RΒ²:", round(np.mean(cv_scores), 4))
✅ CV R² scores (5-fold): [np.float64(0.9047), np.float64(-0.3168), np.float64(0.972), np.float64(-0.6609), np.float64(0.9702)]
✅ Mean CV R²: 0.3738
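With only 27 samples, a single 5-fold estimate swings widely from fold to fold, as the scores above show. One way to probe that instability (a sketch, assuming the model, X, and y defined above) is to repeat the cross-validation over many shuffles:
from sklearn.model_selection import RepeatedKFold
# Average R² over 10 repeats of 5-fold CV to smooth out split luck
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
rep_scores = cross_val_score(model, X, y, cv=rkf, scoring="r2")
print(f"mean R² over {len(rep_scores)} folds: {np.mean(rep_scores):.3f} "
      f"(± {np.std(rep_scores):.3f})")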
# %% [markdown]
# # Predictive Modeling: Score (%) Prediction from Large Numbers Dataset
# %%
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
# %% [markdown]
# ## 1. Load and Inspect Data
# %%
# Reuse the cleaned 27-student DataFrame built in the cells above
# (percentage columns already converted to floats)
print("β
Data loaded successfully.")
print("Original Data Shape:", df.shape)
# %% [markdown]
# ## 2. Feature Engineering & Encoding
# %%
# Define features
numeric_features = ["Accuracy", "Time (total)", "Exercises started", "Easy", "Moderate", "Hard"]
categorical_features = ["Trophies"]
# Preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
("num", "passthrough", numeric_features),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features)
],
remainder="drop"
)
# Fit preprocessor to get feature names
X_temp = df[numeric_features + categorical_features]
X_encoded = preprocessor.fit_transform(X_temp)
# Get final feature names
ohe = preprocessor.named_transformers_["cat"]
cat_feature_names = ohe.get_feature_names_out(categorical_features).tolist()
feature_names = numeric_features + cat_feature_names
print("Data Shape after Encoding:", X_encoded.shape)
print("Features used for modeling:", feature_names)
# %% [markdown]
# ## 3. Train-Test Split
# %%
X = df[numeric_features + categorical_features]
y = df["Score (%)"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, shuffle=True
)
print("Training set size:", len(X_train), "samples")
print("Testing set size:", len(X_test), "samples")
# %% [markdown]
# ## 4. Model Training (Random Forest)
# %%
model = Pipeline([
("preprocessor", preprocessor),
("regressor", RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42))
])
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
# Evaluate
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
test_mae = mean_absolute_error(y_test, y_pred_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
# Cross-validation (safe: min(5, n-1))
cv_folds = min(5, len(X_train) - 1)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv_folds, scoring="r2")
cv_r2_mean = np.mean(cv_scores)
print(f"\nβ
Model Performance:")
print(f" Train RΒ²: {train_r2:.4f}")
print(f" Test RΒ²: {test_r2:.4f}")
print(f" Test MAE: {test_mae:.2f} %")
print(f" Test RMSE: {test_rmse:.2f} %")
print(f" CV RΒ² (mean): {cv_r2_mean:.4f} (Β±{np.std(cv_scores):.3f})")
# %% [markdown]
# ## 5. Visualization
# %%
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Actual vs Predicted
axes[0].scatter(y_test, y_pred_test, alpha=0.8, s=60, edgecolor='k')
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label="Ideal")
axes[0].set_xlabel("Actual Score (%)")
axes[0].set_ylabel("Predicted Score (%)")
axes[0].set_title(f"Test Set: RΒ² = {test_r2:.3f}")
axes[0].legend()
axes[0].grid(True)
# Residuals
residuals = y_test - y_pred_test
axes[1].scatter(y_pred_test, residuals, alpha=0.8, s=60, edgecolor='k')
axes[1].axhline(0, color='r', linestyle='--', lw=2)
axes[1].set_xlabel("Predicted Score (%)")
axes[1].set_ylabel("Residuals")
axes[1].set_title("Residual Plot")
axes[1].grid(True)
plt.tight_layout()
plt.show()
# %% [markdown]
# ## 6. Export Results
# %%
# Predictions on full dataset
y_pred_full = model.predict(X)
results_df = df[["Name", "Score (%)"]].copy()
results_df["Predicted_Score_%"] = y_pred_full
results_df["Error (%)"] = results_df["Predicted_Score_%"] - results_df["Score (%)"]
results_df["Abs_Error (%)"] = np.abs(results_df["Error (%)"])
results_df.to_csv("score_prediction_results.csv", index=False)
print("β
Results saved to 'score_prediction_results.csv'")
✅ Data loaded successfully.
Original Data Shape: (27, 11)
Data Shape after Encoding: (27, 9)
Features used for modeling: ['Accuracy', 'Time (total)', 'Exercises started', 'Easy', 'Moderate', 'Hard', 'Trophies_Bronze', 'Trophies_Diamond', 'Trophies_Gold']
Training set size: 21 samples
Testing set size: 6 samples

✅ Model Performance:
 Train R²: 0.9960
 Test R²: 0.1688
 Test MAE: 1.35 %
 Test RMSE: 2.04 %
 CV R² (mean): -1.4825 (±4.896)
✅ Results saved to 'score_prediction_results.csv'
# Display top 5 worst predictions (by absolute error)
print("\nπ Largest Errors:")
results_display = results_df[["Name", "Score (%)", "Predicted_Score_%", "Error (%)"]].copy()
results_display["Abs_Error"] = results_display["Error (%)"].abs()
print(
results_display
.sort_values("Abs_Error", ascending=False)
.head()[["Name", "Score (%)", "Predicted_Score_%", "Error (%)"]]
.to_string(index=False)
)
2. Model Training and Evaluation¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# ✅ Reuse the cleaned 27-student DataFrame from the cells above
# (Accuracy and Score (%) already converted to floats)
# ✅ Select features and target
X = df[["Accuracy", "Time (total)", "Exercises started", "Trophies"]]
y = df["Score (%)"]
# ✅ Preprocessing with safe encoding
preprocessor = ColumnTransformer(
transformers=[
("num", "passthrough", ["Accuracy", "Time (total)", "Exercises started"]),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["Trophies"])
]
)
X_processed = preprocessor.fit_transform(X)
# ✅ Train-test split (80/20 → 21 train, 6 test)
X_train, X_test, y_train, y_test = train_test_split(
X_processed, y, test_size=0.2, random_state=42
)
# ✅ Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# ✅ Make predictions on the test set
y_pred = model.predict(X_test)
# ✅ Evaluate the model
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"R-squared (R2) Score: {r2:.4f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
# ✅ Interpretation
print("\nπ Interpretation:")
if r2 > 0.7:
print("β Strong fit: the model explains a high proportion of variance in Score (%).")
elif r2 > 0.4:
print("β Moderate fit: the model captures meaningful patterns, but room for improvement.")
else:
print("β Weak fit: linear assumptions may not hold; consider non-linear models or feature engineering.")
print(f"β On average, predicted scores are off by {rmse:.1f} percentage points.")
R-squared (R2) Score: -3.0405
Mean Squared Error (MSE): 20.21
Root Mean Squared Error (RMSE): 4.50

Interpretation:
→ Weak fit: linear assumptions may not hold; consider non-linear models or feature engineering.
→ On average, predicted scores are off by 4.5 percentage points.
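Following that suggestion, a quick comparison (a sketch, assuming the X_train, X_test, y_train, y_test split from above) swaps a non-linear regressor into the same split:
from sklearn.ensemble import RandomForestRegressor
# Same split, non-linear model, for comparison with the linear fit
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest test R²: {r2_score(y_test, rf.predict(X_test)):.4f}")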
3. Visualization: Actual vs. Predicted Score¶
To see how well the model works, we plot the actual Score (%) against the predicted Score (%) using a scatter plot. If the model were perfect, all the points would lie on the diagonal line y = x (shown as the red line). The closer the points are to this line, the more accurate the model is.
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg') # Safe for all environments (no GUI needed)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
# ✅ Reuse the cleaned 27-student DataFrame from the cells above
# (percentage columns already converted to floats)
# Define features & target
X = df[["Accuracy", "Time (total)", "Exercises started", "Trophies"]]
y = df["Score (%)"]
# Preprocessing (safe for rare categories like 'Gold')
preprocessor = ColumnTransformer(
transformers=[
("num", "passthrough", ["Accuracy", "Time (total)", "Exercises started"]),
("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["Trophies"])
]
)
X_enc = preprocessor.fit_transform(X)
# Train-test split (6 test points → clear visualization)
X_train, X_test, y_train, y_test = train_test_split(
X_enc, y, test_size=6, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Plot: Actual vs. Predicted Score (%)
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7, color='steelblue', s=80, edgecolor='k', label='Predicted')
# Perfect prediction line: y = x
min_val = min(y_test.min(), y_pred.min()) - 1
max_val = max(y_test.max(), y_pred.max()) + 1
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect Fit (y = x)')
plt.title('Actual vs. Predicted Score (%)', fontsize=14)
plt.xlabel('Actual Score (%)', fontsize=12)
plt.ylabel('Predicted Score (%)', fontsize=12)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
# Save plot
plt.savefig("actual_vs_predicted_score.png", dpi=150, bbox_inches='tight')
print("β
Plot saved as 'actual_vs_predicted_score.png'")
✅ Plot saved as 'actual_vs_predicted_score.png'