Data Science/Session 6 > Density Estimation¶
Neil did warn us that this session was going to be a doozy...and it sure lived up to his promise. Having gone on this rollercoaster ride 5 times already, I found the scary bits in the class produced laughter, not fear. I guess that's a good sign. I knew that no matter how difficult the material Neil shared was, with enough research effort I would get past the fog of confusion and begin to understand it.
In the latest session, I made it about 66% of the way through before Neil left me in the dust. Neil talks us through equations like a composer would a difficult orchestral score. The logic was almost clear, but the symbols and notation were indecipherable to me. It's been a minute since math class. I know that Neil sees 'music' when he looks at the formulas, but for those of us unused to those notations, they are like hieroglyphics, pretty but beyond comprehension. Looking forward to not having it all look like gibberish.
My approach to getting from 'Lost in Space' to 'Some Semblance of Understanding' will need to start with understanding the key terminology...I think. If I understood it correctly, the big concept for the session was this: last session we learned how to generate outcome probabilities. This session we go one step further and do the work to see whether the prediction probabilities we generate are of good or bad quality.
I asked ChatGPT how the Density Estimation techniques that Neil shared might be useful to gain insights into stock price predictions and it offered the following:
- Help to understand the underlying distribution of stock-related variables
- Improve forecasting, risk evaluation, and strategy design
Specific examples include:
- Help to understand the Distribution of Returns, since stock returns are not 'normally distributed' but instead show 'fat tails' (extreme moves are common) and 'skewness' (asymmetric price moves). A KDE curve reveals the above and helps quantify the risk of large swings.
- Estimate the Probability of Future Price Movements. With a density estimate of returns: the probability of tomorrow's return exceeding 1%, the probability of a loss worse than -2%, the probability of hitting a stop-loss or take-profit level
- Price Paths can be simulated from the estimated distribution for option pricing, Monte Carlo simulations, risk management, and algo-trading strategies
- Improve Volatility Models with KDE, such as 20-day rolling volatility and the likelihood of high-volatility spikes...adapting to periods of high/low volatility and improving prediction accuracy
- Detect Regime Shifts as KDE curves change shape suddenly or over time...narrow, tall KDE in calm markets; wide, fat-tailed KDE during a crisis
- Feature Engineering for Machine Learning Models, such as tail risk, skewness/kurtosis, peak distribution zones, and density peaks near support/resistance levels. ML models can perform better when fed features extracted from KDE rather than raw prices
- Identify Anomalies & Outliers in the values, such as an unusually large return, a rare volume spike, or excessive volatility...essential for cleaning data before training prediction models
Sounds good to me. Let's go!!
Assignment Work¶
"We can do this the easy way...or we can do it the hard way"
Neil gave us 3 options to tackle the assignment, each progressively more difficult (and more statistically rewarding?) than the last.
Easy > Expectation Maximization
Medium > Gaussian Mixture
Advanced > Cluster-Weighted Modeling
Let's start with easy.
1. Density Estimation¶
ChatGPT provides the following example code to generate a Kernel Density Estimation (KDE) for a stock's closing price. Let's run it with Nvidia data!
# import yahoo finance library
import yfinance as yf
nvda = yf.Ticker('NVDA') #specify stock ticker, assign to a variable
data = nvda.history(interval="1d", period="1y") #pull historical data, specify interval and period
data #show data
| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2024-12-05 00:00:00-05:00 | 145.078404 | 146.508085 | 143.918653 | 145.028412 | 172621200 | 0.01 | 0.0 |
| 2024-12-06 00:00:00-05:00 | 144.568530 | 145.668281 | 141.279238 | 142.408997 | 188505600 | 0.00 | 0.0 |
| 2024-12-09 00:00:00-05:00 | 138.939723 | 139.919506 | 137.100128 | 138.779755 | 189308600 | 0.00 | 0.0 |
| 2024-12-10 00:00:00-05:00 | 138.979717 | 141.789118 | 133.760853 | 135.040588 | 210020900 | 0.00 | 0.0 |
| 2024-12-11 00:00:00-05:00 | 137.330076 | 140.139461 | 135.180550 | 139.279648 | 184905200 | 0.00 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-11-28 00:00:00-05:00 | 179.009995 | 179.289993 | 176.500000 | 177.000000 | 121332800 | 0.00 | 0.0 |
| 2025-12-01 00:00:00-05:00 | 174.759995 | 180.300003 | 173.679993 | 179.919998 | 188131000 | 0.00 | 0.0 |
| 2025-12-02 00:00:00-05:00 | 181.759995 | 185.660004 | 180.000000 | 181.460007 | 182632200 | 0.00 | 0.0 |
| 2025-12-03 00:00:00-05:00 | 181.080002 | 182.449997 | 179.110001 | 179.589996 | 165138000 | 0.00 | 0.0 |
| 2025-12-04 00:00:00-05:00 | 181.619995 | 184.520004 | 179.960007 | 183.380005 | 167025900 | 0.00 | 0.0 |
250 rows × 7 columns
! pip install seaborn
# KDE of Daily Returns
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
returns = np.log(data['Close'] / data['Close'].shift(1)).dropna()
sns.kdeplot(returns)
plt.title("Kernel Density Estimate of Daily Log Returns")
plt.show()
So what are we looking at? What does the above plot tell us about Nvidia's stock return over time?
According to ChatGPT, this curve shows the probability shape of returns, which can then be used for simulation or probability estimation.
The curve shows how likely different daily returns for Nvidia have been historically.
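Before digging into the shape of the curve, here is a minimal sketch of that 'probability estimation' use. It fits its own KDE with scipy's `gaussian_kde` and integrates the right tail; it assumes the `returns` series from the cell above is still in memory, and the +1% threshold is just an illustrative choice.
# Sketch: estimate P(tomorrow's log return > +1%) from a KDE of historical returns
# Assumes `returns` (daily log returns) from the previous cell is still in memory
from scipy.stats import gaussian_kde
kde = gaussian_kde(returns.values)                     # fit a Gaussian KDE to the 1-D return series
threshold = 0.01                                       # illustrative threshold: a +1% daily log return
prob_above = kde.integrate_box_1d(threshold, np.inf)   # area under the KDE curve to the right of the threshold
print(f"Estimated P(daily log return > {threshold:.0%}): {prob_above:.1%}")
The same `integrate_box_1d` call with different bounds would give the downside probabilities from the list above, such as a loss worse than -2%.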
Observations:
- Extremely tall, narrow peak at 0 return (Leptokurtic distribution)
- NVDA usually moves very little per day
- Distribution highly concentrated around zero
- similar behavior to most large tech stocks > many calm days, a few very volatile ones
- The curve is NOT symmetric...has heavier tails
- tall and narrow peak...but density spreads out much farther on both sides than a normal distribution would
- Nvidia has a 'fat-tailed' distribution
- ...more frequent large positive/negative moves than a normal distribution would predict
- ...sensitive to earnings, news, market crashes, hype cycles, etc.
- Left tail extends further than the right...downside risk exists
- Big downside crashes have happened
- its biggest daily drops have been larger (in magnitude) than its biggest daily gains
- Nvidia has higher likelihood of sharp downward shocks
- Right tail still present...NVDA has had many large positive days
- NVDA also has huge positive rallies
- The KDE confirms that NVDA's returns are NOT Normally Distributed
- Classic Gaussian assumptions fail...since the distribution is not normal
- Fat Tails (a probability distribution that has more extreme outcomes than a normal Gaussian distribution would predict)
- Nvidia tails extend to -0.45 and +0.35...classic Fat Tail behavior
- Volatility clusters, regime shifts, sudden jumps...are more common for Nvidia
- For stocks...Fat Tail means extreme price moves are more common
- Implication for Prediction & Modeling
- Nvidia behaves like a two-regime asset
- Regime 1 Calm Days: small daily moves, in stable market, most trading days
- Regime 2 Volatile Days: earnings announcements, macro events, technology releases
- GMM is very appropriate for Nvidia because it can capture these 2 regimes
Simply Stated > Nvidia stock returns have 2 different personalities...calm versus volatile. Most of the time, Nvidia stock shows a return pattern with tiny changes, positive or negative. This is represented by the big, tall spike in the graph. However, every so often, Nvidia will deliver very big price changes...sometimes positive and sometimes negative. This is represented by the Fat Tails on either side of the spike.
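To put rough numbers behind the 'Leptokurtic' and 'Fat Tail' observations above, here is a minimal sketch using `scipy.stats` (again assuming the `returns` series from the KDE cell is still in memory; a normal distribution would score roughly 0 on both measures).
# Sketch: quantify asymmetry and tail weight of the daily return distribution
# Assumes `returns` (daily log returns) from the KDE cell is still in memory
from scipy.stats import skew, kurtosis
print(f"Skewness        : {skew(returns):.3f}")      # negative values suggest a heavier left (downside) tail
print(f"Excess kurtosis : {kurtosis(returns):.3f}")  # positive values suggest fatter tails than a Gaussian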
2. Gaussian Mixture Model¶
Based on my research on the Gaussian Mixture Model (GMM), I understand that a density estimate can actually be built by combining the Normal Distributions of two or more underlying groups that influence the data.
In the previous example, the outcome showed that Nvidia's stock price returns can be categorized into 2 distinct Regimes or volatility cases, calm or volatile. As such, the density estimate that was plotted can actually be thought of as the combination of 2 different data distributions, one that describes Nvidia's stock price changes in calm markets and the other that describes changes in volatile markets.
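In symbols, the mixture idea is just a weighted sum of two Normal curves, with weights that add up to one:
$$
p(x) = \pi_{\text{calm}}\,\mathcal{N}\!\left(x \mid \mu_{\text{calm}}, \sigma_{\text{calm}}^{2}\right) + \pi_{\text{volatile}}\,\mathcal{N}\!\left(x \mid \mu_{\text{volatile}}, \sigma_{\text{volatile}}^{2}\right), \qquad \pi_{\text{calm}} + \pi_{\text{volatile}} = 1
$$
Fitting the GMM means estimating those weights, means, and variances from the return data, which is what the Expectation Maximization algorithm (the 'easy' option above) does under the hood of scikit-learn's GaussianMixture.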
Let's look at the GMM breakdown of Nvidia's price return distribution
Regime Inspection for Nvidia with Gaussian Mixture Model¶
ChatGPT had suggested that a GMM would be ideal for looking at the distribution probabilities of Nvidia's 2 distinct regimes, simplistically described as Calm Days and Volatile Days. Let's walk through the code that ChatGPT offers up to illustrate this thesis.
! pip install --upgrade numpy pandas scikit-learn matplotlib yfinance
import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
# Download NVDA daily data (adjust as you like)
ticker = "NVDA"
data = yf.download(ticker, start="2010-01-01") # long history
data = data.dropna()
data
FutureWarning: YF.download() has changed argument auto_adjust default to True
[*********************100%***********************]  1 of 1 completed
| Date | Close (NVDA) | High (NVDA) | Low (NVDA) | Open (NVDA) | Volume (NVDA) |
|---|---|---|---|---|---|
| 2010-01-04 | 0.423807 | 0.426786 | 0.415097 | 0.424265 | 800204000 |
| 2010-01-05 | 0.429995 | 0.434580 | 0.422202 | 0.422202 | 728648000 |
| 2010-01-06 | 0.432746 | 0.433663 | 0.425641 | 0.429766 | 649168000 |
| 2010-01-07 | 0.424265 | 0.432287 | 0.421056 | 0.430454 | 547792000 |
| 2010-01-08 | 0.425182 | 0.428162 | 0.418306 | 0.420827 | 478168000 |
| ... | ... | ... | ... | ... | ... |
| 2025-12-02 | 181.449905 | 185.649669 | 179.989980 | 181.749876 | 182632200 |
| 2025-12-03 | 179.580002 | 182.439843 | 179.100033 | 181.069924 | 165138000 |
| 2025-12-04 | 183.380005 | 184.520004 | 179.960007 | 181.619995 | 167364900 |
| 2025-12-05 | 182.410004 | 184.660004 | 180.910004 | 183.889999 | 143971100 |
| 2025-12-08 | 185.550003 | 188.000000 | 182.399994 | 182.639999 | 201696500 |
4008 rows × 5 columns
# Compute daily log returns
data['LogReturn'] = np.log(data['Close'] / data['Close'].shift(1))
returns = data['LogReturn'].dropna().values.reshape(-1, 1) # shape (n_samples, 1)
# Fit 2-Regime GMM > Calm vs Volatile Periods
# sklearn fit distribution of NVDA returns as a mixture of 2 Gaussian Distros
# The fit will split returns into a calm regime and a volatile regime
# (which numeric label, 0 or 1, ends up being which is only known after fitting...see the printout below)
# Configuring the Model
gmm2 = GaussianMixture(
    n_components=2,          # 2 regimes
    covariance_type='full',  # full covariance per regime (just a single variance here, since the data is 1-D)
    init_params='kmeans',    # initialize component means using k-means clustering
    n_init=10,               # 10 independent random initializations; sklearn keeps the best fit (should this be increased?)
    random_state=42          # set random seed for reproducibility
)
# Fit the returns
gmm2.fit(returns)
# Regime labels and posterior probabilities
labels2 = gmm2.predict(returns) # hard cluster label specification (0 or 1) for each data point
probs2 = gmm2.predict_proba(returns) # soft probabilities of each regime for each (daily) data point...the two probabilities sum to 1
print("2-regime means:", gmm2.means_.ravel()) # print means...ravel(), a NumPy method, flattens the array into a 1D view
print("2-regime stds :", np.sqrt(gmm2.covariances_.ravel())) # print standard deviations (square roots of the fitted variances)
print("2-regime weights:", gmm2.weights_) # print weights
2-regime means: [0.00133091 0.00164009]
2-regime stds : [0.04112189 0.01586472]
2-regime weights: [0.39555542 0.60444458]
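A small follow-up sketch to make that printout easier to read: label the lower-variance component 'Calm' and the higher-variance one 'Volatile', and check what share of trading days the hard labels assign to each regime (reusing `gmm2` and `labels2` from the cells above).
# Sketch: name the regimes by fitted volatility and count days per regime
# Reuses gmm2 and labels2 from the cells above
stds = np.sqrt(gmm2.covariances_.ravel())
calm_idx, vol_idx = int(np.argmin(stds)), int(np.argmax(stds))
for name, idx in [("Calm", calm_idx), ("Volatile", vol_idx)]:
    share = np.mean(labels2 == idx)                  # fraction of days hard-assigned to this regime
    print(f"{name:8s} regime: mean={gmm2.means_.ravel()[idx]:+.4f}, "
          f"std={stds[idx]:.4f}, weight={gmm2.weights_[idx]:.2f}, share of days={share:.1%}")
Going by the fitted standard deviations printed above, Regime 1 (std ≈ 1.6%) plays the 'Calm' role and Regime 0 (std ≈ 4.1%) the 'Volatile' one.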
# Visualize Mixture vs Empirical Distribution
x = np.linspace(returns.min(), returns.max(), 1000).reshape(-1, 1)
# Mixture pdf for 2 regimes
log_prob_2 = gmm2.score_samples(x) # returns the log of the combined mixture probability distribution function...adding 2 Gaussians together
pdf_2 = np.exp(log_prob_2) # converting to probability distribution function
# log-probabilities of each component
log_probs_components = gmm2._estimate_log_prob(x) # per-component log densities (note: uses a private sklearn helper), shape (1000, 2)
comp_pdfs = np.exp(log_probs_components) # converting to probability distribution function
weighted_pdfs = comp_pdfs * gmm2.weights_ # Weight each component by its regime weight
plt.figure(figsize=(10, 5))
plt.hist(returns, bins=80, density=True, alpha=0.4, label='Empirical Histogram')
# plot mixture (combined) PDF
plt.plot(x, pdf_2, color='orange', linewidth=2, label='GMM pdf')
# plot the 2 regimes individually
plt.plot(x, weighted_pdfs[:, 0], 'r--', label='Regime 0 pdf')
plt.plot(x, weighted_pdfs[:, 1], 'g--', label='Regime 1 pdf')
plt.title("NVDA Daily Log Returns + 2-Regime GMM Fit")
plt.xlabel("Log return")
plt.ylabel("Density")
plt.legend()
plt.show()
Observations: Nvidia's Daily Log Returns, 2 Regime GMM Fit
The plot above shows several things...
- A Histogram of NVDA's return distribution > blue bars
- A Gaussian Distribution of Volatile Returns > red dashed line
- A Gaussian Distribution of Calm Returns > green dashed line
- The Gaussian Mixture probability distribution function (the weighted sum of the two) > orange line
So Nvidia's return history since 2010, analyzed for volatility using GMM techniques, tells us the following story:
Nvidia trades like a boring stock most of the time: the green dashed line has high density, concentrated between roughly -0.5% and +0.5% daily returns. But, less commonly (though definitely not rarely), Nvidia has dramatic return days, roughly in the -2.2% to +2.7% range, captured by the less dense red dashed line.
For an investor in Nvidia stock, this plot provides empirical confirmation that Nvidia is NOT a low-risk stock to invest in, and that it has meaningful price volatility characteristics that must be considered.
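One practical follow-up for that investor view is to ask which regime the model thinks the most recent sessions belong to. Here is a short sketch reusing `data`, `gmm2`, and `probs2` from the cells above; the 5-day window is an arbitrary choice.
# Sketch: posterior probability of the volatile regime for the last few trading days
# Reuses data, gmm2, and probs2 from the cells above; the 5-day window is arbitrary
import pandas as pd
vol_idx = int(np.argmax(np.sqrt(gmm2.covariances_.ravel())))   # higher-variance component = "Volatile"
recent = pd.Series(probs2[-5:, vol_idx], index=data.index[-5:], name="P(Volatile)")
print(recent.round(3))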
Adding Dimensionality¶
In class review, Neil commented that the work I did above has only a single dimension...and that it would be interesting if I added more dimensions to the analysis to get even more insights. Suggested dimensions included market movement, etc.
I have only looked at Return so far.
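As a first stab at that suggestion, here is a hedged sketch of a two-dimensional, 2-regime GMM that models NVDA's daily log return jointly with the broader market's. Using SPY as the 'market movement' dimension is my own assumption for illustration; everything else mirrors the 1-D fit above.
# Sketch: 2-regime GMM on two dimensions > NVDA return and market (SPY) return
# SPY as the market proxy is an assumption for illustration
import numpy as np
import yfinance as yf
from sklearn.mixture import GaussianMixture
pair = yf.download(["NVDA", "SPY"], start="2010-01-01")["Close"].dropna()
log_ret = np.log(pair / pair.shift(1)).dropna()        # daily log returns, columns NVDA and SPY
gmm2d = GaussianMixture(n_components=2, covariance_type='full', n_init=10, random_state=42)
gmm2d.fit(log_ret[["NVDA", "SPY"]].values)             # fit on shape (n_samples, 2)
print("Regime means (NVDA, SPY):\n", gmm2d.means_)
print("Regime weights:", gmm2d.weights_)
The per-regime covariance matrices in gmm2d.covariances_ would then show how strongly NVDA co-moves with the market in calm versus volatile periods.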
Learning Outcome¶
- The processes and algorithms for this week have too many different names...making it hard to do research. "Cluster Weighted Model" for example.
- So I understand this week's work to be parsing a dataset into like groups...clusters, regimes
- ...then generating a probability distribution for each of those clusters
- ...the procedure allows the analysis of the component effects within an aggregate distribution