Week 2: Assignment 3 - Fitting
Fitting a curve is the process of finding a mathematical function or equation that best represents a set of data points, aiming to model the underlying trend, smooth out noise, or predict values. It helps visualize relationships, interpolate missing data, or extrapolate beyond the known range, often using methods like least squares to minimize the errors between the curve and the data.
Key Concepts
- Objective: To find a function (like a line, polynomial, or more complex model) that closely matches observed data.
- Methods: Techniques include interpolation (exact fit through points) or smoothing (approximate fit). Common algorithms include linear regression, polynomial regression, and non-linear fitting.
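The difference between an exact fit and an approximate fit can be sketched in a few lines. This is only an illustration: the five sample points are made up, and `interpolate` and `fitted_line` are hypothetical helper names. `np.interp` draws a piecewise-linear curve through every point, while `np.polyfit` with degree 1 finds the least-squares straight line, which leaves small residuals.

```python
import numpy as np

# Made-up sample points, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.9, 5.2, 6.8, 9.1])


def interpolate(x_new):
    """Piecewise-linear interpolation: passes exactly through every point."""
    return np.interp(x_new, x, y)


# Least-squares straight line: approximates the points instead of hitting them.
slope, intercept = np.polyfit(x, y, 1)


def fitted_line(x_new):
    """Evaluate the least-squares line at x_new."""
    return slope * x_new + intercept


interp_residuals = y - interpolate(x)
line_residuals = y - fitted_line(x)

print("max |residual|, interpolation:", np.max(np.abs(interp_residuals)))
print("max |residual|, least-squares line:", np.max(np.abs(line_residuals)))
```

The interpolation residuals are zero by construction; the line trades that exactness for a simpler description of the trend.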
Why it's Used
- Data Visualization: To see trends and patterns clearly.
- Prediction: To estimate values for inputs not in the original data, either between known points (interpolation) or beyond them (extrapolation).
- Noise Reduction: To smooth out random errors or outliers in data.
- Model Building: To find physical or statistical relationships between variables, providing insights beyond just the numbers.
Important Consideration: Overfitting
A major challenge is overfitting, where a model fits the noise in the training data too closely, making it perform poorly on new, unseen data. A good fit balances accuracy with simplicity (the bias-variance tradeoff) and is often motivated by a realistic physical model rather than just mathematical perfection.
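One way to see overfitting is to hold some points back and compare the error on the points used for the fit with the error on the held-out ones. The sketch below uses synthetic data (a straight line plus random noise, which is an assumption made only for this illustration) and a hypothetical `rmse` helper. The training error cannot increase as the polynomial degree grows, but the held-out error often stops improving or gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): a straight line plus noise.
x = np.linspace(-1.0, 1.0, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

# Hold out every third point to play the role of "new, unseen data".
test_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]


def rmse(coeffs, xs, ys):
    """Root-mean-square error of the polynomial `coeffs` on (xs, ys)."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, xs) - ys) ** 2)))


train_errors, test_errors = {}, {}
for degree in (1, 3, 9):
    # Fit only on the training points, then score on both subsets.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = rmse(coeffs, x_train, y_train)
    test_errors[degree] = rmse(coeffs, x_test, y_test)
    print(f"degree {degree}: train RMSE = {train_errors[degree]:.3f}, "
          f"held-out RMSE = {test_errors[degree]:.3f}")
```

Comparing the two columns of numbers is a simple check on whether a higher degree is describing the trend or just the noise.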
Source: Wikipedia, Wall Street Mojo, Ask science
Using a new set of data. To make data management easier and to better learn how to fit a curve, I changed my dataset to the traffic accidents in Mexico database from INEGI, the Mexican government's statistical data source.
With this database I planned to plot the road accidents per year together with a fitted curve showing the observed trend.
import os
os.listdir("datasets")
['Titanic_test.csv', 'TENBIARE.csv', 'enbiare_2021_fd.xlsx', 'historico_accidentes.csv', 'Titanic_train.csv', '.ipynb_checkpoints', 'denue_inegi_21_.csv', 'submission.csv', '.gitignore']
import pandas as pd
ruta = "/home/jovyan/work/aristarco-cortes/datasets/historico_accidentes.csv"
df = pd.read_csv(ruta, encoding="utf-8")
df.head()
| | cve_entidad | desc_entidad | cve_municipio | desc_municipio | id_indicador | indicador | 1997 | 1998 | 1999 | 2000 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Estados Unidos Mexicanos | 0 | Estados Unidos Mexicanos | 1006000039 | Accidentes de tránsito terrestre en zonas urba... | 248114.0 | 262687.0 | 285494.0 | 311938.0 | ... | 382066.0 | 360051.0 | 367789.0 | 365281.0 | 362729.0 | 301678.0 | 340415.0 | 377231.0 | 381048.0 | 374949.0 |
| 1 | 0 | Estados Unidos Mexicanos | 0 | Estados Unidos Mexicanos | 6200009438 | Víctimas muertas en los accidentes de tránsito | 6039.0 | 4986.0 | 5525.0 | 5263.0 | ... | 4636.0 | 4559.0 | 4394.0 | 4227.0 | 4127.0 | 3826.0 | 4401.0 | 5181.0 | 4803.0 | 4791.0 |
| 2 | 0 | Estados Unidos Mexicanos | 0 | Estados Unidos Mexicanos | 6200009439 | Víctimas heridas en los accidentes de tránsito | 98435.0 | 99238.0 | 103784.0 | 116502.0 | ... | 107202.0 | 97614.0 | 91157.0 | 89220.0 | 91761.0 | 71935.0 | 82466.0 | 91501.0 | 90500.0 | 85846.0 |
| 3 | 1 | Aguascalientes | 0 | Estatal | 1006000039 | Accidentes de tránsito terrestre en zonas urba... | 3864.0 | 3863.0 | 4263.0 | 4967.0 | ... | 4204.0 | 3944.0 | 3975.0 | 3773.0 | 3766.0 | 4173.0 | 4540.0 | 4533.0 | 4473.0 | 4272.0 |
| 4 | 1 | Aguascalientes | 0 | Estatal | 6200009438 | Víctimas muertas en los accidentes de tránsito | 300.0 | 118.0 | 139.0 | 65.0 | ... | 74.0 | 83.0 | 71.0 | 81.0 | 100.0 | 76.0 | 82.0 | 85.0 | 93.0 | 98.0 |
5 rows × 34 columns
To see what data types my dataset has, I use df.dtypes
df.dtypes
cve_entidad         int64
desc_entidad       object
cve_municipio       int64
desc_municipio     object
id_indicador        int64
indicador          object
1997              float64
1998              float64
1999              float64
2000              float64
2001              float64
2002              float64
2003              float64
2004              float64
2005              float64
2006              float64
2007              float64
2008              float64
2009              float64
2010              float64
2011              float64
2012              float64
2013              float64
2014              float64
2015              float64
2016              float64
2017              float64
2018              float64
2019              float64
2020              float64
2021              float64
2022              float64
2023              float64
2024              float64
dtype: object
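I define a new dataframe with only the national results for road accidents
# Gemini prompt (translated from Spanish): How do I keep a single row and drop 7 columns from my dataframe? (I ran out of tokens in ChatGPT, so I switched to Gemini)
# Drop the six identifier columns (positions 0 to 5) and keep only the first row,
# which holds the national total of road accidents.
accidentes_df = df.drop(df.columns[0:6], axis=1).head(1)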
print(accidentes_df)
1997 1998 1999 2000 2001 2002 2003 \
0 248114.0 262687.0 285494.0 311938.0 364869.0 399002.0 424490.0
2004 2005 2006 ... 2015 2016 2017 2018 \
0 443607.0 452233.0 471272.0 ... 382066.0 360051.0 367789.0 365281.0
2019 2020 2021 2022 2023 2024
0 362729.0 301678.0 340415.0 377231.0 381048.0 374949.0
[1 rows x 28 columns]
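Selecting the row by position works here because the national accident total happens to be the first row. A slightly more defensive variant filters by label instead. The sketch below rebuilds a three-row excerpt of the `df.head()` table (only four year columns, with the `indicador` text left truncated as it is displayed) so that it runs on its own; the names `sample`, `id_columns`, and `national` are hypothetical.

```python
import pandas as pd

# Three-row excerpt of the table shown by df.head(), first four years only.
sample = pd.DataFrame({
    "cve_entidad": [0, 0, 1],
    "desc_entidad": ["Estados Unidos Mexicanos", "Estados Unidos Mexicanos", "Aguascalientes"],
    "cve_municipio": [0, 0, 0],
    "desc_municipio": ["Estados Unidos Mexicanos", "Estados Unidos Mexicanos", "Estatal"],
    "id_indicador": [1006000039, 6200009438, 1006000039],
    # Indicator text is kept truncated exactly as it appears in the output above.
    "indicador": [
        "Accidentes de tránsito terrestre en zonas urba...",
        "Víctimas muertas en los accidentes de tránsito",
        "Accidentes de tránsito terrestre en zonas urba...",
    ],
    "1997": [248114.0, 6039.0, 3864.0],
    "1998": [262687.0, 4986.0, 3863.0],
    "1999": [285494.0, 5525.0, 4263.0],
    "2000": [311938.0, 5263.0, 4967.0],
})

# The first six columns identify the row; everything after them is a year.
id_columns = list(sample.columns[:6])

# Select by label: national level (cve_entidad == 0) and the accident indicator.
is_national = sample["cve_entidad"] == 0
is_accidents = sample["id_indicador"] == 1006000039
national = sample.loc[is_national & is_accidents].drop(columns=id_columns)

print(national)
```

With this form the result does not depend on the order of the rows in the CSV file.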
Great! Now I have only the national traffic accident data in one row
# Gemini prompt (translated from Spanish): How do I transpose the table I generated?
# 1. Transpose the DataFrame using .T
accidentes_df = accidentes_df.T
print(accidentes_df)
             0
1997  248114.0
1998  262687.0
1999  285494.0
2000  311938.0
2001  364869.0
2002  399002.0
2003  424490.0
2004  443607.0
2005  452233.0
2006  471272.0
2007  476279.0
2008  466435.0
2009  428467.0
2010  427267.0
2011  387185.0
2012  390411.0
2013  385772.0
2014  380573.0
2015  382066.0
2016  360051.0
2017  367789.0
2018  365281.0
2019  362729.0
2020  301678.0
2021  340415.0
2022  377231.0
2023  381048.0
2024  374949.0
Yay! Now my dataframe has the years as its index and the accidents in a single column. Next I turn the index into a column and rename the columns
# Gemini prompt (translated from Spanish): How do I rename the columns to get year and accidents?
# 1. Take the years out of the index and turn them into a column
accidentes_df = accidentes_df.reset_index()
# 2. The columns are now named "index" and 0. Rename them:
accidentes_df.columns = ['Año', 'Accidentes']
print(accidentes_df)
     Año  Accidentes
0   1997    248114.0
1   1998    262687.0
2   1999    285494.0
3   2000    311938.0
4   2001    364869.0
5   2002    399002.0
6   2003    424490.0
7   2004    443607.0
8   2005    452233.0
9   2006    471272.0
10  2007    476279.0
11  2008    466435.0
12  2009    428467.0
13  2010    427267.0
14  2011    387185.0
15  2012    390411.0
16  2013    385772.0
17  2014    380573.0
18  2015    382066.0
19  2016    360051.0
20  2017    367789.0
21  2018    365281.0
22  2019    362729.0
23  2020    301678.0
24  2021    340415.0
25  2022    377231.0
26  2023    381048.0
27  2024    374949.0
To plot the graph I gave Gemini this prompt (translated from Spanish): I want to make a scatter plot with accidents on the y-axis and year on the x-axis. I also need a fitting curve through the points.
import matplotlib.pyplot as plt
import numpy as np
# 1. DATA PREPARATION
# Make sure the years are numbers (integers) and not text; otherwise the fit will fail.
x = accidentes_df['Año'].astype(int)
y = accidentes_df['Accidentes']
# 2. COMPUTE THE FITTING CURVE (TREND LINE)
# 'polyfit' finds the polynomial that best fits the points in the least-squares sense.
# The number '3' is the degree of the polynomial (change it to 1 for a straight line or 2 for a simple curve).
coeficientes = np.polyfit(x, y, 3)
curva_tendencia = np.poly1d(coeficientes)
# 3. BUILD THE PLOT
plt.figure(figsize=(10, 6))
# A) The actual data points (scatter plot)
plt.scatter(x, y, color='blue', label='Datos Reales')
# B) The fitted line
# Use the 'curva_tendencia' function created above
plt.plot(x, curva_tendencia(x), color='red', linestyle='--', linewidth=2, label='Tendencia (Ajuste)')
# 4. LABELS AND STYLING
plt.title('Evolución de Accidentes por Año con Curva de Ajuste')
plt.xlabel('Año')
plt.ylabel('Número de Accidentes')
plt.legend() # Show the legend box with the series names
plt.grid(True, alpha=0.3) # Light background grid
plt.show()
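The plot shows the curve but not how well it fits. A common summary is the coefficient of determination, R², which compares the residual error of the fit with the spread of the data around its mean. The sketch below is an addition to the notebook, not part of it: it retypes the 28 values printed above, uses a hypothetical `r_squared` helper, and centers the years before fitting, since fitting a cubic on raw values near 2000 tends to be numerically poorly conditioned.

```python
import numpy as np

# National road accidents per year, 1997 to 2024, copied from the table above.
years = np.arange(1997, 2025)
accidents = np.array([
    248114, 262687, 285494, 311938, 364869, 399002, 424490,
    443607, 452233, 471272, 476279, 466435, 428467, 427267,
    387185, 390411, 385772, 380573, 382066, 360051, 367789,
    365281, 362729, 301678, 340415, 377231, 381048, 374949,
], dtype=float)

# Center the years so the powers of x stay small and the fit is better conditioned.
x_centered = years - years.mean()


def r_squared(degree):
    """R^2 of a least-squares polynomial of the given degree."""
    coeffs = np.polyfit(x_centered, accidents, degree)
    residuals = accidents - np.polyval(coeffs, x_centered)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((accidents - accidents.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


scores = {degree: r_squared(degree) for degree in (1, 2, 3)}
for degree, score in scores.items():
    print(f"degree {degree}: R^2 = {score:.3f}")
```

R² on the same points used for the fit can only stay equal or rise as the degree grows, so a higher value alone does not show that a higher degree is the better model.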
Learning points
- Choose wisely: It is very important to choose the right database and the specific data we want to plot in order to get a clear result.
- Select: More data does not always mean better results.
- Clarity: Being clear from the outset about what I want to achieve makes it easier to develop the work step by step.
To do: make the graph for Puebla State only
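A possible starting point for the state-level graph is sketched below. It assumes that state totals are the rows where `desc_municipio` is "Estatal", as in the Aguascalientes rows of `df.head()`, and that Puebla is labeled "Puebla" in `desc_entidad`; both should be checked against the real file. `state_series` is a hypothetical helper, and the demo uses the Aguascalientes values shown earlier because no Puebla values appear in this notebook.

```python
import pandas as pd

ACCIDENTS_ID = 1006000039  # id_indicador of the road-accident rows in df.head()


def state_series(df, state_name):
    """Return an Año/Accidentes table for one state (hypothetical helper)."""
    mask = (
        (df["desc_entidad"] == state_name)
        & (df["desc_municipio"] == "Estatal")  # assumed marker of state totals
        & (df["id_indicador"] == ACCIDENTS_ID)
    )
    # Year columns are the ones whose name is made of digits only.
    year_columns = [c for c in df.columns if str(c).isdigit()]
    row = df.loc[mask, year_columns]
    if len(row) != 1:
        raise ValueError(f"expected one row for {state_name!r}, found {len(row)}")
    # Reshape the single wide row into two columns and make the years integers.
    return row.melt(var_name="Año", value_name="Accidentes").astype({"Año": int})


# Excerpt of df.head() (first four years only) used as stand-in data.
excerpt = pd.DataFrame({
    "desc_entidad": ["Estados Unidos Mexicanos", "Aguascalientes", "Aguascalientes"],
    "desc_municipio": ["Estados Unidos Mexicanos", "Estatal", "Estatal"],
    "id_indicador": [1006000039, 1006000039, 6200009438],
    "1997": [248114.0, 3864.0, 300.0],
    "1998": [262687.0, 3863.0, 118.0],
    "1999": [285494.0, 4263.0, 139.0],
    "2000": [311938.0, 4967.0, 65.0],
})

aguascalientes = state_series(excerpt, "Aguascalientes")
print(aguascalientes)
```

On the full dataframe the call would be `state_series(df, "Puebla")`, and the resulting table can be passed to the same plotting code used for the national data.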