FAB Futures - Data Science

Data Science/Session 6 > Density Estimation¶

Neil did warn us that this session was going to be a doozy...and it sure lived up to his promise. Having gone on this rollercoaster ride 5 times already, the scary bits in the class produced laughter, not fear. I guess that's a good sign. I knew that no matter how difficult the material Neil shared was...with enough research effort, I would get past the fog of confusion and begin to understand it.

In the latest session, I made it about 66% of the way through before Neil left me in the dust. Neil talks us through equations like a composer would a difficult orchestral score. The logic was almost clear, but the symbols and notation were indecipherable to me. It's been a minute since math class. I know that Neil sees 'music' when he looks at the formulas, but for those of us unused to the notation, they are like hieroglyphics: pretty, but beyond comprehension. Looking forward to not having it all look like gibberish.

My approach to getting from 'Lost in Space' to 'Some Semblance of Understanding' will need to start with understanding the key terminology...I think. If I understood it correctly, the big concept for the session was this: last session we learned how to generate outcome probabilities. This session we go one step further and check whether the prediction probabilities we generate are of good or bad quality.

I asked ChatGPT how the Density Estimation techniques that Neil shared might be useful to gain insights into stock price predictions and it offered the following:

  • Help to understand the underlying distribution of stock-related variables
  • Improve forecasting, risk evaluation, and strategy design

Specific examples include:

  • Help to understand the Distribution of Returns, since stock returns are not 'normally distributed' but instead show 'fat tails' (extreme moves are more common than a normal distribution predicts) and 'skewness' (asymmetric price moves). A KDE curve reveals this and helps quantify the risk of large swings.
  • Estimate the Probability of Future Price Movements. With a density estimate of returns: the probability of tomorrow's return exceeding 1%, the probability of a loss worse than -2%, the probability of hitting a stop-loss or take-profit level (a quick sketch of this follows the list)
  • Simulate Price Paths using the estimated distribution, for option pricing, Monte Carlo simulations, risk management, and algo-trading strategies
  • Improve Volatility Models with KDE, such as 20-day rolling volatility and the likelihood of high-volatility spikes...adapting to periods of high/low volatility and improving prediction accuracy
  • Detect Regime Shifts as KDE curves change shape suddenly or over time...a narrow, tall KDE in calm markets, a wide, fat-tailed one during a crisis
  • Feature Engineering for Machine Learning Models, such as tail risk, skewness/kurtosis, peak distribution zones, and density peaks near support/resistance levels. ML models can perform better when fed features extracted from a KDE rather than raw prices
  • Identify Anomalies & Outliers, such as an unusually large return, a rare volume spike, or excessive volatility...essential for cleaning data before training prediction models
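As a quick sanity check on the second bullet, here is a minimal sketch of estimating those probabilities from a KDE. It uses synthetic, heavy-tailed returns (the fake_returns array is made up purely for illustration) and scipy's gaussian_kde, not anything Neil covered in the session.

In [ ]:
# A minimal sketch (synthetic data, not real NVDA returns): estimate tail
# probabilities from a fitted kernel density estimate.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
fake_returns = rng.standard_t(df=3, size=1000) * 0.02  # heavy-tailed stand-in for daily returns

kde = gaussian_kde(fake_returns)                      # fit a 1-D kernel density estimate
p_up_1pct = kde.integrate_box_1d(0.01, np.inf)        # P(return > +1%)
p_down_2pct = kde.integrate_box_1d(-np.inf, -0.02)    # P(return < -2%)

print(f"P(return > +1%): {p_up_1pct:.3f}")
print(f"P(loss worse than -2%): {p_down_2pct:.3f}")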

Sounds good to me. Let's go!!

Research¶

  • Expectation Maximization
  • Gaussian Mixture
  • Cluster-Weighted Modeling

Assignment Work¶

"We can do this the easy way...or we can do it the hard way"

Neil gave us 3 options to tackle the assignment, each progressively more difficult (and more statistically rewarding?) than the last.

Easy > Expectation Maximization
Medium > Gaussian Mixture
Advanced > Cluster-Weighted Modeling

Let's start with easy.

1. Density Estimation¶

ChatGPT provides the following example code to generate a Kernel Density Estimate (KDE) from a stock's closing prices (specifically, of its daily log returns). Let's run it with Nvidia data!

In [6]:
# import yahoo finance library
import yfinance as yf

nvda = yf.Ticker('NVDA') #specify stock ticker, assign to a variable
data = nvda.history(interval="1d", period="1y") #pull historical data, specify interval and period
data #show data
Out[6]:
Open High Low Close Volume Dividends Stock Splits
Date
2024-12-05 00:00:00-05:00 145.078404 146.508085 143.918653 145.028412 172621200 0.01 0.0
2024-12-06 00:00:00-05:00 144.568530 145.668281 141.279238 142.408997 188505600 0.00 0.0
2024-12-09 00:00:00-05:00 138.939723 139.919506 137.100128 138.779755 189308600 0.00 0.0
2024-12-10 00:00:00-05:00 138.979717 141.789118 133.760853 135.040588 210020900 0.00 0.0
2024-12-11 00:00:00-05:00 137.330076 140.139461 135.180550 139.279648 184905200 0.00 0.0
... ... ... ... ... ... ... ...
2025-11-28 00:00:00-05:00 179.009995 179.289993 176.500000 177.000000 121332800 0.00 0.0
2025-12-01 00:00:00-05:00 174.759995 180.300003 173.679993 179.919998 188131000 0.00 0.0
2025-12-02 00:00:00-05:00 181.759995 185.660004 180.000000 181.460007 182632200 0.00 0.0
2025-12-03 00:00:00-05:00 181.080002 182.449997 179.110001 179.589996 165138000 0.00 0.0
2025-12-04 00:00:00-05:00 181.619995 184.520004 179.960007 183.380005 167025900 0.00 0.0

250 rows × 7 columns

In [ ]:
! pip install seaborn
In [8]:
# KDE of Daily Returns

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

returns = np.log(data['Close'] / data['Close'].shift(1)).dropna()

sns.kdeplot(returns)
plt.title("Kernel Density Estimate of Daily Log Returns")
plt.show()
[Figure: seaborn KDE plot of NVDA daily log returns]

So what are we looking at? What does the above plot tell us about Nvidia's stock returns over time?

According to ChatGPT, this curve shows the probability shape of returns, which can then be used for simulation or probability estimation.

The curve shows how likely different daily returns for Nvidia have been historically.
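Before going through the observations, here is a rough sketch of the "simulation or probability estimation" idea, assuming the returns series and data frame from the cells above are still in scope. scipy's gaussian_kde can both score and resample, so we can bootstrap hypothetical 30-day price paths (the path count and horizon are arbitrary choices of mine, not something from the session).

In [ ]:
# Sketch: resample daily log returns from the KDE to simulate price paths.
import numpy as np
from scipy.stats import gaussian_kde

kde = gaussian_kde(returns.values)                    # fit a KDE to the observed log returns

n_paths, horizon = 1000, 30                           # arbitrary illustration values
sampled = kde.resample(n_paths * horizon, seed=42).reshape(n_paths, horizon)
last_price = data['Close'].iloc[-1]
simulated_prices = last_price * np.exp(sampled.cumsum(axis=1))  # log returns -> price paths

print("Median simulated price after 30 days:", np.median(simulated_prices[:, -1]))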

Observations:

  1. Extremely tall, narrow peak at 0 return (Leptokurtic distribution)
    • NVDA usually moves very little per day
    • Distribution highly concentrated around zero
    • similar behavior to most large tech stocks > many calm days, a few very volatile ones
  2. The curve is NOT symmetric...has heavier tails
    • tall and narrow peak...but density spreads out much farther on both sides than a normal distribution
    • Nvidia has a 'fat-tailed' distribution
    • ...more frequent large positive/negative moves than normal
    • ...sensitive to earnings, news, market crashes, hype cycles, etc.
  3. Left tail extends further than the right...downside risk exists
    • Big downside crashes have happened
    • daily drops have been larger (in magnitude) than its biggest daily gains
    • Nvidia has higher likelihood of sharp downward shocks
  4. Right tail still present...NVDA has had many large positive days
    • NVDA also has huge positive rallies
  5. The KDE confirms that NVDA's returns are NOT Normally Distributed
    • Classic Gaussian assumptions fail...since the distribution is not normal
    • Fat Tails (a probability distribution that has more extreme outcomes than a normal Gaussian distribution would predict)
    • Nvidia tails extend to -0.45 and +0.35...classic Fat Tail behavior
    • Volatility clusters, regime shifts, sudden jumps...are more common for Nvidia
    • For stocks...Fat Tail means extreme price moves are more common
  6. Implication for Prediction & Modeling
    • Nvidia behaves like a two-regime asset
    • Regime 1 Calm Days: small daily moves, in stable market, most trading days
    • Regime 2 Volatile Days: earnings announcements, macro events, technology releases
    • GMM is very appropriate for Nvidia because it can capture these 2 regimes

Simply Stated > Nvidia stock returns have 2 different personalities...calm versus volatile. Most of the time, Nvidia stock shows a return pattern with tiny changes, positive or negative. This is represented by the big, tall spike in the graph. However, every so often, Nvidia will deliver very big price changes...sometimes positive and sometimes negative. This is represented by the Fat Tails on either side of the spike.

2. Gaussian Mixture Model¶

Based on my research on the Gaussian Mixture Model (GMM), I understand that a density estimate can actually be built by combining two or more Normal (Gaussian) distributions, each describing a different underlying process that influences the data.

In the previous example, the outcome showed that Nvidia's stock price returns can be categorized into 2 distinct Regimes or volatility cases, calm or volatile. As such, the density estimate that was plotted can actually be thought of as the combination of 2 different data distributions, one that describes Nvidia's stock price changes in calm markets and the other that describes changes in volatile markets.
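To convince myself of that, here is a toy sketch: a mixture density is just a weighted sum of the component Normal densities. The weights, means, and standard deviations below are numbers I made up for illustration, not fitted values.

In [ ]:
# Toy illustration: a 2-component Gaussian mixture built by hand.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

x = np.linspace(-0.15, 0.15, 500)

w_calm, w_vol = 0.7, 0.3                             # hypothetical regime weights
calm_pdf = norm.pdf(x, loc=0.001, scale=0.015)       # hypothetical calm-regime Gaussian
vol_pdf = norm.pdf(x, loc=0.000, scale=0.045)        # hypothetical volatile-regime Gaussian

mixture_pdf = w_calm * calm_pdf + w_vol * vol_pdf    # the mixture density

plt.plot(x, w_calm * calm_pdf, 'g--', label='calm component (weighted)')
plt.plot(x, w_vol * vol_pdf, 'r--', label='volatile component (weighted)')
plt.plot(x, mixture_pdf, color='orange', label='mixture density')
plt.title("Toy 2-component Gaussian mixture (made-up parameters)")
plt.legend()
plt.show()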

Let's look at the GMM breakdown of Nvidia's price return distribution

Regime Inspection for Nvidia with Gaussian Mixture Model¶

ChatGPT had suggested that a GMM would be ideal for looking at the distribution probabilities of Nvidia's 2 distinct regimes, simplistically described as Calm Days and Volatile Days. Let's look to understand the code that ChatGPT offers up to illustrate this thesis.

In [ ]:
pip install numpy pandas scikit-learn matplotlib --upgrade yfinance
In [2]:
import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

# Download NVDA daily data (adjust as you like)
ticker = "NVDA"
data = yf.download(ticker, start="2010-01-01")  # long history
data = data.dropna()
data
C:\Users\senna\AppData\Local\Temp\ipykernel_8244\3726011903.py:9: FutureWarning: YF.download() has changed argument auto_adjust default to True
  data = yf.download(ticker, start="2010-01-01")  # long history
[*********************100%***********************]  1 of 1 completed
Out[2]:
Price Close High Low Open Volume
Ticker NVDA NVDA NVDA NVDA NVDA
Date
2010-01-04 0.423807 0.426786 0.415097 0.424265 800204000
2010-01-05 0.429995 0.434580 0.422202 0.422202 728648000
2010-01-06 0.432746 0.433663 0.425641 0.429766 649168000
2010-01-07 0.424265 0.432287 0.421056 0.430454 547792000
2010-01-08 0.425182 0.428162 0.418306 0.420827 478168000
... ... ... ... ... ...
2025-12-02 181.449905 185.649669 179.989980 181.749876 182632200
2025-12-03 179.580002 182.439843 179.100033 181.069924 165138000
2025-12-04 183.380005 184.520004 179.960007 181.619995 167364900
2025-12-05 182.410004 184.660004 180.910004 183.889999 143971100
2025-12-08 185.550003 188.000000 182.399994 182.639999 201696500

4008 rows × 5 columns

In [3]:
# Compute daily log returns
data['LogReturn'] = np.log(data['Close'] / data['Close'].shift(1))
returns = data['LogReturn'].dropna().values.reshape(-1, 1)  # shape (n_samples, 1)
In [4]:
# Fit 2-Regime GMM > Calm vs Volatile Periods

# sklearn fit distribution of NVDA returns as a mixture of 2 Gaussian Distros
# GMM component labels (0 and 1) come out in arbitrary order, so identify
# calm vs volatile afterwards by comparing the fitted standard deviations

# Configuring the Model 
gmm2 = GaussianMixture(
    n_components=2, # 2 regimes
    covariance_type='full', # each regime gets its own covariance; with 1-D data this is just a variance
    init_params='kmeans', # initialize cluster centers using k-means clustering
    n_init=10, # run 10 different initializations and keep the best fit
    random_state=42 # set random seed
)

# Fit the returns
gmm2.fit(returns)

# Regime labels and posterior probabilities
labels2 = gmm2.predict(returns)                  # hard cluster label (0 or 1) for each data point
probs2  = gmm2.predict_proba(returns)            # soft probabilities for each regime for each (daily) data point...they sum to 1 across the two regimes

print("2-regime means:", gmm2.means_.ravel()) # print means...'ravel' a numpy method turns array into 1D (flattened) view
print("2-regime stds :", np.sqrt(gmm2.covariances_.ravel())) # print standard deviation
print("2-regime weights:", gmm2.weights_) # print weights
2-regime means: [0.00133091 0.00164009]
2-regime stds : [0.04112189 0.01586472]
2-regime weights: [0.39555542 0.60444458]
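One thing that tripped me up: in the printed output, component 0 actually has the larger standard deviation, so it is the volatile regime even though I had pencilled in calm first. A small follow-up sketch, assuming gmm2 and labels2 from the cell above are in scope, identifies calm vs volatile from the fitted standard deviations and counts the days assigned to each regime.

In [ ]:
# Identify which GMM component is calm vs volatile (component order is arbitrary).
import numpy as np

stds = np.sqrt(gmm2.covariances_.ravel())
volatile_idx = int(np.argmax(stds))               # larger std = volatile regime
calm_idx = 1 - volatile_idx

print(f"Calm regime    : mean={gmm2.means_.ravel()[calm_idx]:.5f}, "
      f"std={stds[calm_idx]:.5f}, weight={gmm2.weights_[calm_idx]:.2f}")
print(f"Volatile regime: mean={gmm2.means_.ravel()[volatile_idx]:.5f}, "
      f"std={stds[volatile_idx]:.5f}, weight={gmm2.weights_[volatile_idx]:.2f}")

n_calm = int((labels2 == calm_idx).sum())
n_volatile = int((labels2 == volatile_idx).sum())
print(f"Days labelled calm: {n_calm}, volatile: {n_volatile}")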
In [5]:
# Visualize Mixture vs Empirical Distribution  

x = np.linspace(returns.min(), returns.max(), 1000).reshape(-1, 1)

# Mixture pdf for 2 regimes
log_prob_2 = gmm2.score_samples(x) # log of the combined mixture density (the 2 weighted Gaussians added together)
pdf_2 = np.exp(log_prob_2) # convert log-density to probability density

# log-probabilities of each component (unweighted); note _estimate_log_prob is a private sklearn helper
log_probs_components = gmm2._estimate_log_prob(x)   # shape (1000, 2)
comp_pdfs = np.exp(log_probs_components) # converting to probability distribution function
weighted_pdfs = comp_pdfs * gmm2.weights_ # Weight each component by its regime weight


plt.figure(figsize=(10, 5))
plt.hist(returns, bins=80, density=True, alpha=0.4, label='Empirical Histogram')

# plot mixture (combined) PDF
plt.plot(x, pdf_2, color='orange', linewidth=2, label='GMM pdf') 
# plot the 2 regimes individually
plt.plot(x, weighted_pdfs[:, 0], 'r--', label='Regime 0 pdf')
plt.plot(x, weighted_pdfs[:, 1], 'g--', label='Regime 1 pdf')

plt.title("NVDA Daily Log Returns + 2-Regime GMM Fit")
plt.xlabel("Log return")
plt.ylabel("Density")
plt.legend()
plt.show()
[Figure: histogram of NVDA daily log returns with the fitted 2-regime GMM pdf and the two weighted component pdfs]
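A side note on the code above: _estimate_log_prob is a private scikit-learn helper, so it could change between versions. Here is a sketch of rebuilding the same weighted component pdfs from public attributes only (assuming x and gmm2 from the cells above are in scope); it should match weighted_pdfs up to floating-point noise.

In [ ]:
# Rebuild the weighted component pdfs using only public GMM attributes.
import numpy as np
from scipy.stats import norm

x_flat = x.ravel()
means = gmm2.means_.ravel()
stds = np.sqrt(gmm2.covariances_.ravel())

weighted_pdfs_public = np.column_stack([
    gmm2.weights_[k] * norm.pdf(x_flat, loc=means[k], scale=stds[k])
    for k in range(gmm2.n_components)
])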

Observations: Nvidia's Daily Log Returns, 2 Regime GMM Fit

The plot above shows several things...

  • A Histogram of NVDA's return distributions > blue bar graph
  • A Gaussian Distribution of Volatile Returns > red dash line
  • A Gaussian Distribution of Calm Returns > green dash line
  • A Gaussian Mixture Probability Distribution Function > orange line

So Nvidia's return history since 2010, when analyzed for volatility using GMM techniques, tells us the following story:

Nvidia trades like a boring stock most of the time: the green dashed line has high density, concentrated between -0.5% and +0.5% daily returns. Less commonly, but definitely not rarely, Nvidia has dramatic return days, roughly in the -2.2% to +2.7% range, captured by the less dense red dashed line.

As an investor in Nvidia stock, this plot provides empirical confirmation that Nvidia is NOT a low risk stock to invest in, that it has meaningful price volatility characteristics that must be considered.

Adding Dimensionality¶

In the class review, Neil commented that the work I did above has only a single dimension...and that it would be interesting if I added more dimensions to the analysis to get even more insights. Suggested dimensions included market movement, etc.

I have only looked at Return so far.
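As a first stab at Neil's suggestion, here is a rough sketch of adding a market dimension: fit the same 2-regime GMM on pairs of [NVDA return, market return] instead of NVDA returns alone. Using SPY as the market proxy is my assumption, and I have not run or interpreted this yet.

In [ ]:
# Sketch: 2-regime GMM on [NVDA return, SPY return] pairs (2 dimensions).
import numpy as np
import yfinance as yf
from sklearn.mixture import GaussianMixture

pair = yf.download(["NVDA", "SPY"], start="2010-01-01")["Close"].dropna()
pair_returns = np.log(pair / pair.shift(1)).dropna()   # daily log returns, one column per ticker

gmm_2d = GaussianMixture(n_components=2, covariance_type='full',
                         n_init=10, random_state=42)
gmm_2d.fit(pair_returns.values)

print("Columns:", list(pair_returns.columns))
print("Regime means:\n", gmm_2d.means_)
print("Regime covariance matrices:\n", gmm_2d.covariances_)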

Learning Outcome¶

  • The processes and algorithms for this week have too many different names...making it hard to do research. "Cluster Weighted Model" for example.
  • So I understand this week's work to be to parse a dataset into like groups: clusters, regimes
  • ...then generate a probability distribution for each of those clusters (see the closing sketch below)
  • the procedure allows analysis of the component effects that make up an aggregate distribution
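To close, here is a toy sketch of that last idea as I understand it, on synthetic data. It uses KMeans on absolute returns plus per-cluster Normal fits purely as an illustration; the session's methods (EM / GMM / cluster-weighted modeling) do the grouping and the distribution fitting jointly.

In [ ]:
# Toy illustration: split data into clusters, then fit a distribution per cluster.
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import norm

rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(0.0, 0.01, 300),     # synthetic "calm" returns
                         rng.normal(0.0, 0.04, 100)])    # synthetic "volatile" returns

# cluster on absolute return so the split is calm vs volatile, not negative vs positive
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.abs(sample).reshape(-1, 1))

for k in range(2):
    mu, sigma = norm.fit(sample[clusters == k])          # fit a Normal to each cluster
    print(f"Cluster {k}: n={np.sum(clusters == k)}, mean={mu:.4f}, std={sigma:.4f}")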