Kelzang Tobgyel - Fab Futures - Data Science

Week 2: November 25

Session objectives:

In the third session of Data Science, we were taken through the types of variables and mathematical functions that can be fitted to our data. The session covered the following topics and their application in understanding the data:

  1. Types of Variables (scalar, vector and matrix)
  2. Functions (linear, affine, polynomial, nonlinear, sums and integrals)
  3. Error estimation (model estimation and model mismatch)
  4. Linear least squares
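
To make the idea of model mismatch concrete, here is a minimal sketch. It assumes synthetic data (a quadratic trend plus noise, not the assignment dataset) and uses the root-mean-square error (RMSE) as the error measure; both choices are mine for illustration.

```python
import numpy as np

# Hypothetical data: a quadratic trend plus a little noise (not from the assignment dataset).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)

def rmse_of_fit(degree):
    """Fit a polynomial of the given degree by least squares and return its RMSE."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.sqrt(np.mean(residuals**2)))

rmse_linear = rmse_of_fit(1)     # model mismatch: a line cannot follow the curvature
rmse_quadratic = rmse_of_fit(2)  # same functional form as the data-generating trend

print(f"RMSE, linear fit:    {rmse_linear:.3f}")
print(f"RMSE, quadratic fit: {rmse_quadratic:.3f}")
```

The linear fit leaves a large error because its form does not match the data, while the remaining error of the quadratic fit comes mostly from the noise.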

Self Learning (Before Doing the Assignment)

I explored the following topics:

  1. How functions are applied to a dataset.
  2. The fitting process in Machine Learning.

Assignment 3: Fitting a function to the dataset

Objective: My objective was to find the correlation between average salary and year for Data Engineers in the USA, and to extrapolate the future salaries of Data Engineers.

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
try:
    df_full = pd.read_csv("datasets/test_data.csv")
except FileNotFoundError:
    # Stop with a clear message instead of calling the interactive-only exit()
    raise SystemExit("Error: 'datasets/test_data.csv' not found.")

# Filter for Data Engineers and required columns
df_filtered = df_full[['work_year', 'salary_in_usd', 'job_title']]
df_de = df_filtered[df_filtered['job_title'] == 'Data Engineer']

# Calculate average salaries per year
avg_salaries_df = df_de.groupby('work_year')['salary_in_usd'].mean().reset_index()

# Extract the years and average salaries as NumPy arrays.
# A year with no Data Engineer rows is simply absent from the groupby result, which is fine for the plot.
year = avg_salaries_df['work_year'].to_numpy()
avg_salaries = avg_salaries_df['salary_in_usd'].to_numpy()

# Fit a cubic (third-degree) polynomial by least squares
coefficients = np.polyfit(year, avg_salaries, 3)
poly_function = np.poly1d(coefficients)

# Creating a smooth x-range for the fitted curve
x_fit = np.linspace(year.min() - 0.1, year.max() + 0.1, 100)
y_fit = poly_function(x_fit)

# Plotting the results
plt.figure(figsize=(10, 6))

# Plotting the original data points
plt.scatter(year, avg_salaries, label='Average Salary Data Points', color='C0', zorder=5)

# Plot the fitted cubic function
plt.plot(x_fit, y_fit, label='Cubic Fit ($3^{rd}$ Degree)', color='C3', linestyle='--')

# Set titles and labels
plt.title("Data Engineers Average Salary: Cubic Polynomial Fit")
plt.xlabel("Year")
plt.ylabel("Average Salary (USD)")
plt.legend()
plt.grid(True, linestyle=':', alpha=0.6)
plt.gca().ticklabel_format(style='plain', axis='y') # Prevent scientific notation on Y-axis
plt.xlim(year.min() - 0.2, year.max() + 0.2) # Set x-limits to slightly hug the data

plt.savefig('data_engineer_salary_cubic_fit.png')
print("Polynomial coefficients (Cubic function $ax^3 + bx^2 + cx + d$):")
print(f"a: {coefficients[0]:.4f}")
print(f"b: {coefficients[1]:.4f}")
print(f"c: {coefficients[2]:.4f}")
print(f"d: {coefficients[3]:.4f}")
Polynomial coefficients (Cubic function $ax^3 + bx^2 + cx + d$):
a: -3948.5170
b: 23947204.9986
c: -48412120011.4833
d: 32623595550884.8320
[Figure: "Data Engineers Average Salary: Cubic Polynomial Fit", showing average salary (USD) against year with the fitted cubic curve]
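
The very large coefficients printed above are typical when a polynomial is fitted directly to raw year values around 2020. The sketch below shows one possible way to do the extrapolation step named in the objective: center the years before fitting, then evaluate the fit for a future year. The salary values, the target year 2025, and the linear comparison model are hypothetical placeholders of mine, not results from the assignment dataset.

```python
import numpy as np

# Hypothetical placeholder values, NOT the averages from the assignment dataset.
years = np.array([2020, 2021, 2022, 2023, 2024], dtype=float)
salaries = np.array([100_000, 110_000, 125_000, 135_000, 140_000], dtype=float)

# Center the years so the fit works with small numbers (-2 ... 2) instead of ~2020.
year_offset = years.mean()
x = years - year_offset

cubic_coeffs = np.polyfit(x, salaries, 3)
linear_coeffs = np.polyfit(x, salaries, 1)

# Extrapolate one year past the data; the same offset must be applied to the target year.
target = 2025 - year_offset
cubic_2025 = float(np.polyval(cubic_coeffs, target))
linear_2025 = float(np.polyval(linear_coeffs, target))

print("Centered cubic coefficients:", np.round(cubic_coeffs, 2))
print(f"Cubic extrapolation for 2025:  {cubic_2025:,.0f}")
print(f"Linear extrapolation for 2025: {linear_2025:,.0f}")
```

With only five points, a cubic has four free parameters, so the two models can disagree noticeably outside the data range. Comparing them is one way to judge how far an extrapolation can be trusted.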

Concepts and information explored while doing the assignment

The following are the topics I explored in depth, based on the needs of the assignment.

  1. I learnt what scalars, vectors and matrices are and how they mirror the way data is organized.
  2. I also learnt matrix operations and their application in data manipulation, transformation, and solving the equations that govern the model, as well as how they are applied in machine learning.
  3. I also learnt about linear regression, where I explored the least squares method and what it means. I also explored the terminology related to linear regression.
  4. I also explored the inverse and the pseudoinverse and when each applies in terms of the matrix determinant. I also explored finding the coefficients for bias and slope using SVD.
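
As a small illustration of the last item, the sketch below finds the bias and slope of a straight line in two ways: with the pseudoinverse and with an explicit SVD. The data points are hypothetical and lie exactly on y = 1 + 2x, so both routes should recover a bias of 1 and a slope of 2.

```python
import numpy as np

# Hypothetical points lying exactly on y = 1 + 2x (not from the assignment dataset).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix: a column of ones for the bias and a column of x for the slope.
A = np.column_stack([np.ones_like(x), x])

# Route 1: the pseudoinverse gives the least-squares solution even though A is not square.
bias_pinv, slope_pinv = np.linalg.pinv(A) @ y

# Route 2: explicit SVD. If A = U diag(s) V^T, then pinv(A) = V diag(1/s) U^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
bias_svd, slope_svd = Vt.T @ ((U.T @ y) / s)

print(f"pinv: bias={bias_pinv:.3f}, slope={slope_pinv:.3f}")
print(f"SVD:  bias={bias_svd:.3f}, slope={slope_svd:.3f}")
```

A is 4 by 2, so it has no ordinary inverse or determinant; the pseudoinverse is what makes the least-squares solution possible here.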

Coding concepts explored

  1. How to load a CSV file using pandas.
  2. How to filter out unwanted data fields and work only on the fields needed for plotting.
  3. Explored different parameters of the plotting methods in the matplotlib library.
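
The loading and filtering steps can be tried without the real file. This minimal sketch assumes a tiny in-memory CSV with the same column names as the assignment code; the rows themselves are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical rows; only the column names match the assignment dataset.
csv_text = """work_year,salary_in_usd,job_title
2020,100000,Data Engineer
2020,120000,Data Engineer
2021,90000,Data Analyst
2021,130000,Data Engineer
"""

# read_csv accepts a file path or any file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# Keep only Data Engineer rows and the two columns needed for plotting.
mask = df["job_title"] == "Data Engineer"
df_de = df.loc[mask, ["work_year", "salary_in_usd"]]

# Average salary per year.
avg_by_year = df_de.groupby("work_year")["salary_in_usd"].mean()
print(avg_by_year)
```

Selecting rows and columns in a single `.loc` call keeps the filter in one place, which makes it easier to change the job title later.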

Assistance sought

After plotting the data, I asked Gemini to recommend the best function to fit the data I had plotted.

Videos referred to during the assignment

[Two images; no descriptions were provided]
