Naldi Carrion - Fab Lab ESAN - Fab Futures - Data Science

Week 3 - 2nd Class: Density Estimation¶

This class introduces a fundamental idea in data analysis: how to understand the "shape" of a distribution without assuming a rigid formula. Instead of saying "the data is definitely normal" or "the data is definitely uniform," we learn to estimate how our data is actually distributed using graphical and mathematical methods.

Histograms: The Starting Point¶

The histogram is the most basic tool for displaying distributions. A histogram isn't the actual distribution; it's just an approximation, and it depends entirely on the bin size.

A bin that's too large "flattens" the data, while a bin that's too small overrepresents noise. Thus, a histogram is sensitive to the bin-width parameter, and choosing the wrong value can be visually misleading, as the quick sketch below shows.
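A quick sketch of this effect (illustrative code, not from the class; the sample size and bin counts are arbitrary choices): the same normal sample drawn with too few and too many bins.

import numpy as np
import matplotlib.pyplot as plt

# One sample, two bin choices: too coarse vs. too fine
x = np.random.normal(size=300)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(x, bins=4)
axes[0].set_title("bins=4: flattens the data")
axes[1].hist(x, bins=120)
axes[1].set_title("bins=120: overrepresents noise")
plt.show()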

Kernel Density Estimation (KDE)¶

KDE uses a smooth curve to describe the underlying distribution, and it is a more elegant way to display density. The idea is to replace each point of the dataset with a bell shape (a Gaussian kernel); adding up all those bell shapes yields a smooth curve. Hence, KDE gives us a continuous curve, not a blocky shape.
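Formally, given samples x₁, …, xₙ, the KDE is the average of the kernels centered at each sample:

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where K is the kernel (here a Gaussian bell) and h is the bandwidth discussed below.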

Advantages:

  • It's less sensitive to arbitrary binning choices.
  • It lets us see trends and dense areas more clearly.

However, KDE depends on the bandwidth parameter (h):

  • Large bandwidth → very smooth curve.
  • Small bandwidth → noisy curve.

The "shape" of the distribution depends on the bandwidth. Thus, we need to choose it well.

Parzen Windows¶

Parzen windows are the general mathematical version of KDE: a method for approximating a distribution without making any assumptions, by placing small functions (windows) around each point.

KDE = Parzen windows with a Gaussian kernel.

It's a nonparametric method (it doesn't assume a fixed shape), but it requires a lot of data for the curve to be reliable.

Nonparametric estimation lets the data speak, without imposing mathematical forms.
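A hand-rolled sketch of the idea (illustrative code, not the class code), using a uniform "box" window: each evaluation point counts the samples that fall inside a window of width h around it. Swapping the box for a Gaussian bell recovers the KDE above.

import numpy as np

def parzen_estimate(x_eval, samples, h):
    """Parzen-window density estimate with a uniform (box) kernel."""
    n = len(samples)
    # (m, n) mask: True where sample j lies inside the window around point i
    inside = np.abs(x_eval[:, None] - samples[None, :]) <= h / 2
    return inside.sum(axis=1) / (n * h)   # each sample contributes area 1/n

samples = np.random.normal(size=500)
xs = np.linspace(-4, 4, 200)
density = parzen_estimate(xs, samples, h=0.5)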

Maximum Likelihood Estimation (MLE)¶

MLE is for when we do want to assume a form. The class contrasts:

  • KDE = you don't assume a shape
  • MLE = you assume a shape and calculate the parameters

For example: “I assume the distribution is normal. MLE tells me which mean and standard deviation best describe it.”

MLE finds the optimal parameters of a model, useful when we do want to assume a theoretical distribution. MLE is used to fit a model; KDE is used to discover the actual shape.
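A minimal illustration of the contrast (simulated sample with assumed parameters, not class data): scipy's norm.fit returns the maximum-likelihood mean and standard deviation under the assumed normal model, whereas gaussian_kde would make no such assumption.

import numpy as np
from scipy.stats import norm

# Simulate data with known parameters, then recover them by MLE
x = np.random.normal(loc=4.0, scale=1.5, size=1000)

mu_hat, sigma_hat = norm.fit(x)   # MLE of mean and std under a normal model
print(mu_hat, sigma_hat)          # should land close to 4.0 and 1.5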

Visual comparison: histogram vs KDE¶

  • The histogram is discontinuous.
  • KDE is smooth and allows us to see patterns.
  • Both depend on parameters:
    • Histogram → bins
    • KDE → bandwidth
  • KDE is generally better for exploring data, but it doesn't replace the histogram.

Example¶

From the code reviewed in class we can identify:

  • gaussian_kde constructs the KDE curve.
  • linspace generates points to draw the smooth curve.
  • hist(... density=True) normalizes to compare the histogram and the KDE curve.
  • alpha=0.3 provides transparency.

Visual contrast is key: block vs. smooth curve.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Draw 1,000 samples from a standard normal and fit a KDE to them
x = np.random.normal(size=1000)
kde = gaussian_kde(x)

# Evaluate the smooth curve on a grid and overlay the normalized histogram
xs = np.linspace(-5, 5, 200)
plt.plot(xs, kde(xs))
plt.hist(x, bins=30, density=True, alpha=0.3)  # density=True puts both on the same scale
plt.show()
[Figure: normalized histogram with the smooth KDE curve overlaid]

Key Final Concept: Bias–Variance Tradeoff¶

The class displays graphs of curves that are either too smooth or too noisy. The key things to learn:

  • Small bandwidth → little smoothing → high variance (the curve wiggles too much).
  • Large bandwidth → a lot of smoothing → high bias (we miss details).

There is an optimal point between the two.
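For reference, a common starting point is Silverman's rule of thumb, which assumes the data are roughly Gaussian:

$$h \approx 1.06\,\hat{\sigma}\,n^{-1/5}$$

where σ̂ is the sample standard deviation and n the sample size. scipy's gaussian_kde picks a similar Scott/Silverman-style factor by default.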

Assignment¶

Generate a histogram and a KDE¶

The dataset comes from the pilot survey with entrepreneurs and is stored as a CSV file.

  • File: datasets/2nd_Class_Assignmt_Data/Entrepreneurs.csv
  • Separator: semicolon (;)
  • Each row is one entrepreneur; each column is an item or variable (socio-demographic, FRUG, BRIC, INNOV, etc.).
In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Load the pilot dataset
df = pd.read_csv("datasets/2nd_Class_Assignmt_Data/Entrepreneurs.csv", sep=";")   # <-- change the file name if necessary

df.head()
Out[4]:
Marca temporal NOM GEN EAGE FOUND CAGE1 AFOUND CBASED CSECT EEXP ... INNOV2 INNOV3 INNOV4 CAGE2 TECHBS ETEAM EAOS SEEDF OPERF INCC
0 4/4/2025 18:10:28 iFurniture 2 35 1 2 1 2 9 1 ... 4 2 4 1 1 1 1 1 1 1
1 4/6/2025 13:09:46 Salvy Natural - Indes Perú 2 37 1 2 1 2 12 1 ... 5 5 5 1 1 1 1 0 0 0
2 4/7/2025 16:07:37 AVR Technology 1 23 1 2 1 2 15 0 ... 4 4 4 0 1 1 1 1 1 1
3 4/7/2025 21:49:59 AIO SENSORS 1 32 1 1 1 3 9 0 ... 4 4 4 0 1 1 1 0 1 1
4 4/8/2025 17:54:07 Face Me 1 30 1 2 1 3 5 0 ... 4 4 4 1 1 0 1 1 1 1

5 rows × 41 columns

Building latent-variable mean scores¶

The questionnaire includes multiple Likert-scale items for each construct:

  • Frugality: FRUG1 to FRUG7
  • Bricolage: BRIC1 to BRIC8
  • Innovative Behaviour: INNOV1 to INNOV4

I compute a simple mean score per respondent for each construct, creating:

  • FRUG_mean
  • BRIC_mean
  • INNOV_mean
In [5]:
import pandas as pd

df = pd.read_csv("datasets/2nd_Class_Assignmt_Data/Entrepreneurs.csv", sep=";")
# Keep only the active entrepreneurs (AFOUND == 1)
df_active = df[df["AFOUND"] == 1].copy()
print(df_active.shape)  # quick check of the filtered sample size

# Item lists for each construct
frug_items  = [f'FRUG{i}' for i in range(1, 8)]
bric_items  = [f'BRIC{i}' for i in range(1, 9)]
innov_items = [f'INNOV{i}' for i in range(1, 5)]

# Mean scores
df_active['FRUG_mean']  = df_active[frug_items].mean(axis=1)
df_active['BRIC_mean']  = df_active[bric_items].mean(axis=1)
df_active['INNOV_mean'] = df_active[innov_items].mean(axis=1)

df_active[['FRUG_mean','BRIC_mean','INNOV_mean']].head()
Out[5]:
FRUG_mean BRIC_mean INNOV_mean
0 5.000000 4.125 3.25
1 3.571429 4.750 5.00
2 4.000000 4.000 4.00
3 4.000000 4.375 4.00
4 4.285714 4.375 4.00

Distributions of latent-variable scores¶

To see how the constructs are distributed among active entrepreneurs, I plot histograms for each mean score.

In [6]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))

# FRUG
plt.subplot(3, 1, 1)
plt.hist(df_active['FRUG_mean'], bins=10, edgecolor='black')
plt.title("Frugality (FRUG_mean)")

# BRIC
plt.subplot(3, 1, 2)
plt.hist(df_active['BRIC_mean'], bins=10, edgecolor='black')
plt.title("Bricolage (BRIC_mean)")

# INNOV
plt.subplot(3, 1, 3)
plt.hist(df_active['INNOV_mean'], bins=10, edgecolor='black')
plt.title("Innovation (INNOV_mean)")

plt.tight_layout()
plt.show()
[Figure: histograms of FRUG_mean, BRIC_mean, and INNOV_mean]

KDE + Multivariate KDE (Frugality + Bricolage)¶

Here the idea is to generate a KDE for each variable and then use a two-dimensional KDE (very useful for visualizing the latent relationship between variables). It's similar to the fitting technique I used in my pilot analysis, when I graphed the three-dimensional relationship among the variables, but now from a density-estimation perspective.

In [13]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# List of latent variables
vars_latentes = ["FRUG_mean", "BRIC_mean", "INNOV_mean"]

for var in vars_latentes:
    x = df_active[var].dropna()   # use the filtered df_active data

    kde = gaussian_kde(x)
    xs = np.linspace(x.min(), x.max(), 200)

    plt.figure(figsize=(8,5))
    plt.hist(x, bins=10, density=True, alpha=0.3, edgecolor='black', label="Histogram")
    plt.plot(xs, kde(xs), linewidth=2, label="KDE")
    plt.title(f"{var} – Histogram + KDE")
    plt.xlabel(var)
    plt.ylabel("Density")
    plt.legend()
    plt.show()
[Figures: histogram + KDE overlay for each of FRUG_mean, BRIC_mean, and INNOV_mean]

Kernel Density Estimation (KDE) provided a smooth approximation of the underlying distributions of the three latent variables in my pilot dataset.

  • Frugality showed a unimodal distribution with a concentration around medium-high scores (≈4.0), suggesting moderate variability.
  • Bricolage displayed the most compact and homogeneous distribution, centered around 4.3, indicating consistent bricolage behavior among participants.
  • Innovative Behavior presented a slightly bimodal distribution, hinting at two subgroups: a conservative innovator profile (3.0–3.4) and a more proactive one (4.0–4.6).
In [15]:
from scipy.stats import gaussian_kde
import numpy as np
import matplotlib.pyplot as plt

# Keep only rows where both values are present
sub = df_active[["FRUG_mean", "BRIC_mean"]].dropna()

x = sub["FRUG_mean"].values
y = sub["BRIC_mean"].values

xy = np.vstack([x, y])
kde2d = gaussian_kde(xy)

# Create a grid of evaluation points
xmin, xmax = x.min(), x.max()
ymin, ymax = y.min(), y.max()

xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
grid_coords = np.vstack([xx.ravel(), yy.ravel()])
zz = kde2d(grid_coords).reshape(xx.shape)

plt.figure(figsize=(7,6))
plt.pcolormesh(xx, yy, zz, shading='auto', cmap="viridis")
plt.scatter(x, y, s=20, c='white', alpha=0.6)
plt.title("2D KDE – Frugality (FRUG_mean) vs. Bricolage (BRIC_mean)")
plt.xlabel("FRUG_mean")
plt.ylabel("BRIC_mean")
plt.colorbar(label="Density")
plt.show()
[Figure: 2D KDE density map of FRUG_mean vs. BRIC_mean, with participants shown as points]

The 2D KDE for Frugality vs. Bricolage revealed a smooth elliptical density region, suggesting a mild positive association between the two constructs, with no evidence of multiple clusters. This reinforces the assumption made in the pilot SEM that the relationship is continuous and stable across participants.
