Naldi Carrion - Fab Lab ESAN - Fab Futures - Data Science

Week 3 - 2nd Class: Density Estimation¶

This class introduces a fundamental idea in data analysis: how to understand the "shape" of a distribution without assuming a rigid formula. Instead of saying "the data is definitely normal" or "the data is definitely uniform," we learn to estimate how our data is actually distributed using graphical and mathematical methods.

Histograms: The Starting Point¶

The histogram is the most basic tool for displaying distributions. A histogram isn't the actual distribution; it's just an approximation, and it depends entirely on the bin size.

A bin that's too large "flattens" the data, while a bin that's too small overrepresents noise. Thus, a histogram is sensitive to the bin-width parameter, and choosing the wrong value can be visually misleading, as the quick sketch below shows.
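A quick sketch of this effect (illustrative code, not from the class; the sample size and bin counts are arbitrary choices): the same normal sample drawn with too few and too many bins.

import numpy as np
import matplotlib.pyplot as plt

# One sample, two bin choices: too coarse vs. too fine
x = np.random.normal(size=300)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(x, bins=4)
axes[0].set_title("bins=4: flattens the data")
axes[1].hist(x, bins=120)
axes[1].set_title("bins=120: overrepresents noise")
plt.show()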

Kernel Density Estimation (KDE)¶

KDE uses a smooth curve to describe the underlying distribution, and it is a more elegant way to display density. The idea is to replace each point of the dataset with a bell shape (a Gaussian kernel); adding up all those bell shapes yields a smooth curve. Hence, KDE gives us a continuous curve, not a blocky shape.
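Formally, given samples x₁, …, xₙ, the KDE is the average of the kernels centered at each sample:

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where K is the kernel (here a Gaussian bell) and h is the bandwidth discussed below.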

Advantages:

  • It's less sensitive to arbitrary binning choices.
  • It lets us see trends and dense areas more clearly.

However, KDE depends on the bandwidth parameter (h):

  • Large bandwidth → very smooth curve.
  • Small bandwidth → noisy curve.

The "shape" of the distribution depends on the bandwidth. Thus, we need to choose it well.

Parzen Windows¶

Parzen windows are the general mathematical version of KDE: a method for approximating a distribution without making any assumptions, by placing small functions (windows) around each point.

KDE = Parzen windows with a Gaussian kernel.

It's a nonparametric method (it doesn't assume a fixed shape), but it requires a lot of data for the curve to be reliable.

Nonparametric estimation lets the data speak, without imposing mathematical forms.
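A hand-rolled sketch of the idea (illustrative code, not the class code), using a uniform "box" window: each evaluation point counts the samples that fall inside a window of width h around it. Swapping the box for a Gaussian bell recovers the KDE above.

import numpy as np

def parzen_estimate(x_eval, samples, h):
    """Parzen-window density estimate with a uniform (box) kernel."""
    n = len(samples)
    # (m, n) mask: True where sample j lies inside the window around point i
    inside = np.abs(x_eval[:, None] - samples[None, :]) <= h / 2
    return inside.sum(axis=1) / (n * h)   # each sample contributes area 1/n

samples = np.random.normal(size=500)
xs = np.linspace(-4, 4, 200)
density = parzen_estimate(xs, samples, h=0.5)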

Maximum Likelihood Estimation (MLE)¶

MLE is for when we do want to assume a form. The class contrasts:

  • KDE = you don't assume a shape
  • MLE = you assume a shape and calculate the parameters

For example: “I assume the distribution is normal. MLE tells me which mean and standard deviation best describe it.”

MLE finds the optimal parameters of a model, useful when we do want to assume a theoretical distribution. MLE is used to fit a model; KDE is used to discover the actual shape.
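A minimal illustration of the contrast (simulated sample with assumed parameters, not class data): scipy's norm.fit returns the maximum-likelihood mean and standard deviation under the assumed normal model, whereas gaussian_kde would make no such assumption.

import numpy as np
from scipy.stats import norm

# Simulate data with known parameters, then recover them by MLE
x = np.random.normal(loc=4.0, scale=1.5, size=1000)

mu_hat, sigma_hat = norm.fit(x)   # MLE of mean and std under a normal model
print(mu_hat, sigma_hat)          # should land close to 4.0 and 1.5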

Visual comparison: histogram vs KDE¶

  • The histogram is discontinuous.
  • KDE is smooth and allows us to see patterns.
  • Both depend on parameters:
    • Histogram → bins
    • KDE → bandwidth
  • KDE is generally better for exploring data, but it doesn't replace the histogram.

Example¶

From the code reviewed in class we can identify:

  • gaussian_kde constructs the KDE curve.
  • linspace generates points to draw the smooth curve.
  • hist(... density=True) normalizes to compare the histogram and the KDE curve.
  • alpha=0.3 provides transparency.

Visual contrast is key: block vs. smooth curve.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Draw 1,000 samples from a standard normal and fit a KDE to them
x = np.random.normal(size=1000)
kde = gaussian_kde(x)

# Evaluate the smooth curve on a grid and overlay the normalized histogram
xs = np.linspace(-5, 5, 200)
plt.plot(xs, kde(xs))
plt.hist(x, bins=30, density=True, alpha=0.3)  # density=True puts both on the same scale
plt.show()
[Figure: normalized histogram with the smooth KDE curve overlaid]

Key Final Concept: Bias–Variance Tradeoff¶

The class displays graphs of curves that are either too smooth or too noisy. The key things to learn:

  • Small bandwidth → little smoothing → high variance (the curve wiggles too much).
  • Large bandwidth → a lot of smoothing → high bias (we miss details).

There is an optimal point between the two.
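For reference, a common starting point is Silverman's rule of thumb, which assumes the data are roughly Gaussian:

$$h \approx 1.06\,\hat{\sigma}\,n^{-1/5}$$

where σ̂ is the sample standard deviation and n the sample size. scipy's gaussian_kde picks a similar Scott/Silverman-style factor by default.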

Assignment¶

Generate a histogram and a KDE¶

The dataset comes from the pilot survey with entrepreneurs and is stored as a CSV file.

  • File: datasets/2nd_Class_Assignmt_Data/Entrepreneurs.csv
  • Separator: semicolon (;)
  • Each row is one entrepreneur; each column is an item or variable (socio-demographic, FRUG, BRIC, INNOV, etc.).
In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Load the pilot dataset
df = pd.read_csv("datasets/2nd_Class_Assignmt_Data/Entrepreneurs.csv", sep=";")   # <-- change the file name if necessary

df.head()
Out[4]:
Marca temporal NOM GEN EAGE FOUND CAGE1 AFOUND CBASED CSECT EEXP ... INNOV2 INNOV3 INNOV4 CAGE2 TECHBS ETEAM EAOS SEEDF OPERF INCC
0 4/4/2025 18:10:28 iFurniture 2 35 1 2 1 2 9 1 ... 4 2 4 1 1 1 1 1 1 1
1 4/6/2025 13:09:46 Salvy Natural - Indes Perú 2 37 1 2 1 2 12 1 ... 5 5 5 1 1 1 1 0 0 0
2 4/7/2025 16:07:37 AVR Technology 1 23 1 2 1 2 15 0 ... 4 4 4 0 1 1 1 1 1 1
3 4/7/2025 21:49:59 AIO SENSORS 1 32 1 1 1 3 9 0 ... 4 4 4 0 1 1 1 0 1 1
4 4/8/2025 17:54:07 Face Me 1 30 1 2 1 3 5 0 ... 4 4 4 1 1 0 1 1 1 1

5 rows × 41 columns

Building latent-variable mean scores¶

The questionnaire includes multiple Likert-scale items for each construct:

  • Frugality: FRUG1 to FRUG7
  • Bricolage: BRIC1 to BRIC8
  • Innovative Behaviour: INNOV1 to INNOV4

I compute a simple mean score per respondent for each construct, creating:

  • FRUG_mean
  • BRIC_mean
  • INNOV_mean
In [5]:
import pandas as pd

df = pd.read_csv("datasets/2nd_Class_Assignmt_Data/Entrepreneurs.csv", sep=";")
# Keep only the active entrepreneurs (AFOUND == 1)
df_active = df[df["AFOUND"] == 1].copy()
print(df_active.shape)  # quick check of the filtered sample size

# Item lists for each construct
frug_items  = [f'FRUG{i}' for i in range(1, 8)]
bric_items  = [f'BRIC{i}' for i in range(1, 9)]
innov_items = [f'INNOV{i}' for i in range(1, 5)]

# Mean scores
df_active['FRUG_mean']  = df_active[frug_items].mean(axis=1)
df_active['BRIC_mean']  = df_active[bric_items].mean(axis=1)
df_active['INNOV_mean'] = df_active[innov_items].mean(axis=1)

df_active[['FRUG_mean','BRIC_mean','INNOV_mean']].head()
Out[5]:
FRUG_mean BRIC_mean INNOV_mean
0 5.000000 4.125 3.25
1 3.571429 4.750 5.00
2 4.000000 4.000 4.00
3 4.000000 4.375 4.00
4 4.285714 4.375 4.00

Distributions of latent-variable scores¶

To see how the constructs are distributed among active entrepreneurs, I plot histograms for each mean score.

In [6]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))

# FRUG
plt.subplot(3, 1, 1)
plt.hist(df_active['FRUG_mean'], bins=10, edgecolor='black')
plt.title("Frugality (FRUG_mean)")

# BRIC
plt.subplot(3, 1, 2)
plt.hist(df_active['BRIC_mean'], bins=10, edgecolor='black')
plt.title("Bricolage (BRIC_mean)")

# INNOV
plt.subplot(3, 1, 3)
plt.hist(df_active['INNOV_mean'], bins=10, edgecolor='black')
plt.title("Innovation (INNOV_mean)")

plt.tight_layout()
plt.show()
[Figure: histograms of FRUG_mean, BRIC_mean, and INNOV_mean]

KDE + Multivariate KDE (Frugality + Bricolage)¶

Here the idea is to generate a KDE for each variable and then use a two-dimensional KDE (very useful for visualizing the latent relationship between variables). It's similar to the fitting technique I used in my pilot analysis, when I graphed the three-dimensional relationship among the variables, but now from a density-estimation perspective.

In [13]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# List of latent variables
vars_latentes = ["FRUG_mean", "BRIC_mean", "INNOV_mean"]

for var in vars_latentes:
    x = df_active[var].dropna()   # use the filtered df_active data

    kde = gaussian_kde(x)
    xs = np.linspace(x.min(), x.max(), 200)

    plt.figure(figsize=(8,5))
    plt.hist(x, bins=10, density=True, alpha=0.3, edgecolor='black', label="Histogram")
    plt.plot(xs, kde(xs), linewidth=2, label="KDE")
    plt.title(f"{var} – Histogram + KDE")
    plt.xlabel(var)
    plt.ylabel("Density")
    plt.legend()
    plt.show()
[Figures: histogram + KDE overlay for each of FRUG_mean, BRIC_mean, and INNOV_mean]

Kernel Density Estimation (KDE) provided a smooth approximation of the underlying distributions of the three latent variables in my pilot dataset.

  • Frugality showed a unimodal distribution with a concentration around medium-high scores (≈4.0), suggesting moderate variability.
  • Bricolage displayed the most compact and homogeneous distribution, centered around 4.3, indicating consistent bricolage behavior among participants.
  • Innovative Behavior presented a slightly bimodal distribution, hinting at two subgroups: a conservative innovator profile (3.0–3.4) and a more proactive one (4.0–4.6).
In [15]:
from scipy.stats import gaussian_kde
import numpy as np
import matplotlib.pyplot as plt

# Keep only rows where both values are present
sub = df_active[["FRUG_mean", "BRIC_mean"]].dropna()

x = sub["FRUG_mean"].values
y = sub["BRIC_mean"].values

xy = np.vstack([x, y])
kde2d = gaussian_kde(xy)

# Create a grid of evaluation points
xmin, xmax = x.min(), x.max()
ymin, ymax = y.min(), y.max()

xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
grid_coords = np.vstack([xx.ravel(), yy.ravel()])
zz = kde2d(grid_coords).reshape(xx.shape)

plt.figure(figsize=(7,6))
plt.pcolormesh(xx, yy, zz, shading='auto', cmap="viridis")
plt.scatter(x, y, s=20, c='white', alpha=0.6)
plt.title("2D KDE – Frugality (FRUG_mean) vs. Bricolage (BRIC_mean)")
plt.xlabel("FRUG_mean")
plt.ylabel("BRIC_mean")
plt.colorbar(label="Density")
plt.show()
[Figure: 2D KDE density map of FRUG_mean vs. BRIC_mean, with participants shown as points]

The 2D KDE for Frugality vs. Bricolage revealed a smooth elliptical density region, suggesting a mild positive association between the two constructs, with no evidence of multiple clusters. This reinforces the assumption made in the pilot SEM that the relationship is continuous and stable across participants.
