Aristarco - Fab Futures - Data Science

Week 2: Assignment 3 - Fitting

Fitting a curve is the process of finding a mathematical function or equation that best represents a set of data points, aiming to model the underlying trend, smooth out noise, or predict values. It helps visualize relationships, interpolate missing data, or extrapolate beyond the known range, often using methods like least squares to minimize the error between the curve and the data.

Key Concepts

  • Objective: To find a function (like a line, polynomial, or more complex model) that closely matches observed data.
  • Methods: Techniques include interpolation (exact fit through points) or smoothing (approximate fit). Common algorithms include linear regression, polynomial regression, and non-linear fitting.
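As a minimal sketch of the least squares idea mentioned above, the snippet below fits a straight line to synthetic data. The data (a line y = 2x + 1 plus random noise) and all variable names are assumptions made up for illustration; they are not part of the assignment dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a line y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Least squares: pick the slope and intercept that minimize
# sum((y - (slope * x + intercept)) ** 2).
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals are the vertical distances between the data and the fitted line.
residuals = y - (slope * x + intercept)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"sum of squared residuals = {np.sum(residuals ** 2):.3f}")
```

The recovered slope and intercept should land close to the values used to generate the data, but not exactly on them, because the noise is part of what is being fitted.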

Why it's Used

  • Data Visualization: To see trends and patterns clearly.
  • Prediction: To estimate values for inputs not in the original data (interpolation inside the observed range, extrapolation beyond it).
  • Noise Reduction: To smooth out random errors or outliers in data.
  • Model Building: To find physical or statistical relationships between variables, providing insights beyond just the numbers.

Important Consideration: Overfitting

A major challenge is overfitting, where a model fits the noise in the training data too closely and therefore performs poorly on new, unseen data. A good fit balances accuracy with simplicity (the bias-variance tradeoff) and is often motivated by a realistic physical model rather than just mathematical perfection.
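One way to see overfitting is to fit polynomials of increasing degree on part of the data and measure the error on the points that were held out. The sketch below does this on synthetic data; the sine-shaped signal, the noise level, the every-other-point split, and the helper name `rmse_for_degree` are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic noisy data (an assumption for illustration).
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=x.size)

# Hold out every other point: even positions train the model, odd positions test it.
train = np.arange(x.size) % 2 == 0
test = ~train


def rmse_for_degree(deg):
    """Fit a polynomial of the given degree on the training points and
    return (training RMSE, held-out RMSE)."""
    coefs = np.polyfit(x[train], y[train], deg)
    err = np.polyval(coefs, x) - y
    return np.sqrt(np.mean(err[train] ** 2)), np.sqrt(np.mean(err[test] ** 2))


results = {deg: rmse_for_degree(deg) for deg in (1, 3, 9)}
for deg, (train_rmse, test_rmse) in results.items():
    print(f"degree {deg}: train RMSE = {train_rmse:.3f}, held-out RMSE = {test_rmse:.3f}")
```

The training error can only shrink as the degree grows, because a higher-degree polynomial contains the lower-degree ones as special cases. The held-out error is the number to watch: when it stops improving or gets worse while the training error keeps falling, the extra flexibility is fitting noise.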

Source: Wikipedia, Wall Street Mojo, Ask science

Using a new set of data. To make data management easier and to learn curve fitting better, I switched my dataset to the traffic accidents database for Mexico from INEGI, the Mexican government's statistical data source.

With this database I planned to plot the road accidents per year together with a fitted curve that shows the observed trend.

In [1]:
import os
os.listdir("datasets")
Out[1]:
['Titanic_test.csv',
 'TENBIARE.csv',
 'enbiare_2021_fd.xlsx',
 'historico_accidentes.csv',
 'Titanic_train.csv',
 '.ipynb_checkpoints',
 'denue_inegi_21_.csv',
 'submission.csv',
 '.gitignore']
In [2]:
import pandas as pd

ruta = "/home/jovyan/work/aristarco-cortes/datasets/historico_accidentes.csv"

df = pd.read_csv(ruta, encoding="utf-8")

df.head()
Out[2]:
cve_entidad desc_entidad cve_municipio desc_municipio id_indicador indicador 1997 1998 1999 2000 ... 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
0 0 Estados Unidos Mexicanos 0 Estados Unidos Mexicanos 1006000039 Accidentes de tránsito terrestre en zonas urba... 248114.0 262687.0 285494.0 311938.0 ... 382066.0 360051.0 367789.0 365281.0 362729.0 301678.0 340415.0 377231.0 381048.0 374949.0
1 0 Estados Unidos Mexicanos 0 Estados Unidos Mexicanos 6200009438 Víctimas muertas en los accidentes de tránsito 6039.0 4986.0 5525.0 5263.0 ... 4636.0 4559.0 4394.0 4227.0 4127.0 3826.0 4401.0 5181.0 4803.0 4791.0
2 0 Estados Unidos Mexicanos 0 Estados Unidos Mexicanos 6200009439 Víctimas heridas en los accidentes de tránsito 98435.0 99238.0 103784.0 116502.0 ... 107202.0 97614.0 91157.0 89220.0 91761.0 71935.0 82466.0 91501.0 90500.0 85846.0
3 1 Aguascalientes 0 Estatal 1006000039 Accidentes de tránsito terrestre en zonas urba... 3864.0 3863.0 4263.0 4967.0 ... 4204.0 3944.0 3975.0 3773.0 3766.0 4173.0 4540.0 4533.0 4473.0 4272.0
4 1 Aguascalientes 0 Estatal 6200009438 Víctimas muertas en los accidentes de tránsito 300.0 118.0 139.0 65.0 ... 74.0 83.0 71.0 81.0 100.0 76.0 82.0 85.0 93.0 98.0

5 rows × 34 columns

To see the data type of each column in my dataset, I use df.dtypes

In [3]:
df.dtypes
Out[3]:
cve_entidad         int64
desc_entidad       object
cve_municipio       int64
desc_municipio     object
id_indicador        int64
indicador          object
1997              float64
1998              float64
1999              float64
2000              float64
2001              float64
2002              float64
2003              float64
2004              float64
2005              float64
2006              float64
2007              float64
2008              float64
2009              float64
2010              float64
2011              float64
2012              float64
2013              float64
2014              float64
2015              float64
2016              float64
2017              float64
2018              float64
2019              float64
2020              float64
2021              float64
2022              float64
2023              float64
2024              float64
dtype: object

I define a new dataframe with only the national road accident results

In [4]:
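# Gemini prompt: "How do I keep a single row and drop 7 columns from my dataframe?" (I ran out of tokens in ChatGPT, so I switched to Gemini)
# Drop the six identifier columns (positions 0-5) and keep only the first row, which holds the national accident totals.
accidentes_df = df.drop(df.columns[0:6], axis=1).head(1)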
print(accidentes_df)
       1997      1998      1999      2000      2001      2002      2003  \
0  248114.0  262687.0  285494.0  311938.0  364869.0  399002.0  424490.0   

       2004      2005      2006  ...      2015      2016      2017      2018  \
0  443607.0  452233.0  471272.0  ...  382066.0  360051.0  367789.0  365281.0   

       2019      2020      2021      2022      2023      2024  
0  362729.0  301678.0  340415.0  377231.0  381048.0  374949.0  

[1 rows x 28 columns]
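Selecting with `.head(1)` works here because the national accident row happens to come first in the file. A possible alternative, sketched below, selects the row by its identifiers, so the result would not depend on row order. The small `df_demo` frame is a three-row excerpt typed in from the `df.head()` output above with only four year columns; the names `df_demo`, `es_nacional`, `es_accidentes`, and `year_cols` are hypothetical.

```python
import pandas as pd

# Three-row excerpt copied from the df.head() output above (four year columns only).
df_demo = pd.DataFrame({
    "cve_entidad": [0, 0, 1],
    "desc_entidad": ["Estados Unidos Mexicanos", "Estados Unidos Mexicanos", "Aguascalientes"],
    "id_indicador": [1006000039, 6200009438, 1006000039],
    "1997": [248114.0, 6039.0, 3864.0],
    "1998": [262687.0, 4986.0, 3863.0],
    "1999": [285494.0, 5525.0, 4263.0],
    "2000": [311938.0, 5263.0, 4967.0],
})

# Boolean masks: national level (cve_entidad == 0) and the accident-count indicator.
es_nacional = df_demo["cve_entidad"] == 0
es_accidentes = df_demo["id_indicador"] == 1006000039

# Year columns are the ones whose name is made only of digits.
year_cols = [col for col in df_demo.columns if col.isdigit()]

nacional = df_demo.loc[es_nacional & es_accidentes, year_cols]
print(nacional)
```

Filtering by identifier also makes the intent readable in the code itself: the row is chosen because of what it contains, not because of where it sits.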

Great! Now I have only the national traffic accident data in one row

In [5]:
# Gemini prompt: "How do I transpose the table I generated?"
# 1. Transpose the DataFrame with .T
accidentes_df = accidentes_df.T
print(accidentes_df)
             0
1997  248114.0
1998  262687.0
1999  285494.0
2000  311938.0
2001  364869.0
2002  399002.0
2003  424490.0
2004  443607.0
2005  452233.0
2006  471272.0
2007  476279.0
2008  466435.0
2009  428467.0
2010  427267.0
2011  387185.0
2012  390411.0
2013  385772.0
2014  380573.0
2015  382066.0
2016  360051.0
2017  367789.0
2018  365281.0
2019  362729.0
2020  301678.0
2021  340415.0
2022  377231.0
2023  381048.0
2024  374949.0

Yay! Now my dataframe has the years as the index and the accidents in a single column. Next, I turn the index into a column and rename both columns

In [6]:
# Gemini prompt: "How do I rename the columns so I have year and accidents?"
# 1. Move the years out of the index and into a regular column
accidentes_df = accidentes_df.reset_index()

# 2. The columns are now called "index" and "0". Rename them:
accidentes_df.columns = ['Año', 'Accidentes']

print(accidentes_df)
     Año  Accidentes
0   1997    248114.0
1   1998    262687.0
2   1999    285494.0
3   2000    311938.0
4   2001    364869.0
5   2002    399002.0
6   2003    424490.0
7   2004    443607.0
8   2005    452233.0
9   2006    471272.0
10  2007    476279.0
11  2008    466435.0
12  2009    428467.0
13  2010    427267.0
14  2011    387185.0
15  2012    390411.0
16  2013    385772.0
17  2014    380573.0
18  2015    382066.0
19  2016    360051.0
20  2017    367789.0
21  2018    365281.0
22  2019    362729.0
23  2020    301678.0
24  2021    340415.0
25  2022    377231.0
26  2023    381048.0
27  2024    374949.0
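The transpose, reset_index, and rename steps above can also be collapsed into a single reshape with `DataFrame.melt`. The snippet below is a sketch of that alternative on a three-year excerpt of the national row; the variable names `wide` and `long` are hypothetical, and the column names match the ones used above.

```python
import pandas as pd

# One-row, wide-format excerpt of the national data shown above.
wide = pd.DataFrame({"1997": [248114.0], "1998": [262687.0], "1999": [285494.0]})

# melt turns each year column into its own row:
# the old column names go into 'Año', the values into 'Accidentes'.
long = wide.melt(var_name="Año", value_name="Accidentes")

# Column names read from a CSV header are text, so convert the years to integers.
long["Año"] = long["Año"].astype(int)

print(long)
```

Both routes give the same two-column table; melt is mainly convenient when more than one row (for example, several states) has to be reshaped at once, by passing the identifying columns as `id_vars`.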

To plot the graph I prompted Gemini: "I want to make a scatter plot with accidents on the y axis and year on the x axis. I also need a fitting curve through the points."

In [7]:
import matplotlib.pyplot as plt
import numpy as np

# 1. DATA PREPARATION
# Make sure the years are numbers (integers) and not text; otherwise the fit will fail.
x = accidentes_df['Año'].astype(int)
y = accidentes_df['Accidentes']

# 2. COMPUTE THE FITTING CURVE (TREND LINE)
# 'polyfit' finds the polynomial that best approximates the points in the least-squares sense.
# The '3' is the degree of the polynomial (change it to 1 for a straight line or 2 for a simple curve).
coeficientes = np.polyfit(x, y, 3)
curva_tendencia = np.poly1d(coeficientes)

# 3. DRAW THE PLOT
plt.figure(figsize=(10, 6))

# A) The actual data points (scatter plot)
plt.scatter(x, y, color='blue', label='Datos Reales')

# B) The fitted line
# Evaluate the 'curva_tendencia' function created above at every year
plt.plot(x, curva_tendencia(x), color='red', linestyle='--', linewidth=2, label='Tendencia (Ajuste)')

# 4. LABELS AND STYLING
plt.title('Evolución de Accidentes por Año con Curva de Ajuste')
plt.xlabel('Año')
plt.ylabel('Número de Accidentes')
plt.legend()               # Show the legend box with the series names
plt.grid(True, alpha=0.3)  # Light background grid

plt.show()
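[Figure: scatter plot of national accidents per year (blue points) with the degree-3 polynomial trend line (red dashed), titled "Evolución de Accidentes por Año con Curva de Ajuste"]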
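Two optional refinements to the fit are sketched below, as suggestions rather than as part of the original notebook. First, raising years around 2000 to the third power produces very large numbers, so subtracting the mean year before fitting keeps the calculation numerically better behaved while describing the same curve. Second, the coefficient of determination R² gives a single number for how much of the variation the curve explains. The values are the national series printed above; the names `years`, `accidents`, `x_centered`, and `r2` are hypothetical.

```python
import numpy as np

# National accident counts for 1997-2024, copied from the table above.
years = np.arange(1997, 2025)
accidents = np.array([
    248114, 262687, 285494, 311938, 364869, 399002, 424490,
    443607, 452233, 471272, 476279, 466435, 428467, 427267,
    387185, 390411, 385772, 380573, 382066, 360051, 367789,
    365281, 362729, 301678, 340415, 377231, 381048, 374949,
], dtype=float)

# Center the years so the polynomial works with small numbers (-13.5 ... 13.5).
x_centered = years - years.mean()

# Same degree-3 least-squares fit as in the notebook, on the centered axis.
coefs = np.polyfit(x_centered, accidents, 3)
fitted = np.polyval(coefs, x_centered)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((accidents - fitted) ** 2)
ss_tot = np.sum((accidents - accidents.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"R^2 of the degree-3 fit: {r2:.3f}")
```

An R² close to 1 would mean the cubic follows the data closely. It says nothing about how the curve behaves outside 1997-2024, so extrapolating it to future years should be done with caution.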

Learning points

  • Choose wisely: It is very important to choose the right database, and the specific data to plot from it, so that the result is clear.
  • Select: More data does not always mean better results.
  • Clarity: Knowing from the outset what I want to achieve makes it easier to develop the work step by step.

To be done: Make the same graph for Puebla State only.
