Data Science/Session 6 > Density Estimation¶
Neil did warn us that this session was going to be a doozy...and it sure lived up to his promise. Having gone on this rollercoaster ride 5 times already, I found the scary bits in the class produced laughter, not fear. I guess that's a good sign. I knew that no matter how difficult the material Neil shared was, with enough research effort I would get past the fog of confusion and begin to understand it.
In the latest session, I made it about 66% of the way through before Neil left me in the dust. Neil talks us through equations like a composer would a difficult orchestral score. The logic was almost clear, but the symbols and notation were indecipherable to me. It's been a minute since math class. I know that Neil sees 'music' when he looks at the formulas, but for those of us unused to those notations, they are like hieroglyphics, pretty but beyond comprehension. Looking forward to not having it all look like gibberish.
My approach to getting from 'Lost in Space' to 'Some Semblance of Understanding' will need to start with understanding the key terminology...I think. If I understood it correctly, the big concept for the session was this: last session we learned how to generate outcome probabilities. This session we go one step further and do the work to see whether the prediction probabilities we generate are of good or bad quality.
I asked ChatGPT how the Density Estimation techniques that Neil shared might be useful to gain insights into stock price predictions and it offered the following:
- Help to understand the underlying distribution of stock-related variables
- Improve forecasting, risk evaluation, and strategy design
Specific examples include:
- Help to understand the Distribution of Returns, since stock returns are not 'normally distributed' but instead show 'fat tails' (extreme moves are common) and 'skewness' (asymmetric price moves). A KDE curve reveals the above and helps quantify the risk of large swings.
- Estimate the Probability of Future Price Movements. With a density estimate of returns: the probability of tomorrow's return exceeding 1%, the probability of a loss worse than -2%, the probability of hitting a stop-loss or take-profit level
- Price Paths can be simulated from the estimated distribution for option pricing, Monte Carlo simulations, risk management, and algo-trading strategies
- Improve Volatility Models with KDE, such as 20-day rolling volatility and the likelihood of high-volatility spikes...adapting to periods of high/low volatility and improving prediction accuracy
- Detect Regime Shifts as KDE curves change shape suddenly or over time...narrow, tall KDE in calm markets; wide, fat-tailed KDE during a crisis
- Feature Engineering for Machine Learning Models, such as tail risk, skewness/kurtosis, peak distribution zones, and density peaks near support/resistance levels. ML models can perform better when fed features extracted from KDE rather than raw prices
- Identify Anomalies & Outliers in the values, such as an unusually large return, a rare volume spike, or excessive volatility...essential for cleaning data before training prediction models
Sounds good to me. Let's go!!
Assignment Work¶
"We can do this the easy way...or we can do it the hard way"
Neil gave us 3 options to tackle the assignment, each progressively more difficult (and more statistically rewarding?) than the last.
Easy > Expectation Maximization
Medium > Gaussian Mixture
Advanced > Cluster-Weighted Modeling
Let's start with easy.
1. Density Estimation¶
ChatGPT provides the following example code to generate a Kernel Density Estimation (KDE) for a stock's closing price. Let's run it with Nvidia data!
# import yahoo finance library
import yfinance as yf
nvda = yf.Ticker('NVDA') #specify stock ticker, assign to a variable
data = nvda.history(interval="1d", period="1y") #pull historical data, specify interval and period
data #show data
| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2024-12-05 00:00:00-05:00 | 145.078404 | 146.508085 | 143.918653 | 145.028412 | 172621200 | 0.01 | 0.0 |
| 2024-12-06 00:00:00-05:00 | 144.568530 | 145.668281 | 141.279238 | 142.408997 | 188505600 | 0.00 | 0.0 |
| 2024-12-09 00:00:00-05:00 | 138.939723 | 139.919506 | 137.100128 | 138.779755 | 189308600 | 0.00 | 0.0 |
| 2024-12-10 00:00:00-05:00 | 138.979717 | 141.789118 | 133.760853 | 135.040588 | 210020900 | 0.00 | 0.0 |
| 2024-12-11 00:00:00-05:00 | 137.330076 | 140.139461 | 135.180550 | 139.279648 | 184905200 | 0.00 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-11-28 00:00:00-05:00 | 179.009995 | 179.289993 | 176.500000 | 177.000000 | 121332800 | 0.00 | 0.0 |
| 2025-12-01 00:00:00-05:00 | 174.759995 | 180.300003 | 173.679993 | 179.919998 | 188131000 | 0.00 | 0.0 |
| 2025-12-02 00:00:00-05:00 | 181.759995 | 185.660004 | 180.000000 | 181.460007 | 182632200 | 0.00 | 0.0 |
| 2025-12-03 00:00:00-05:00 | 181.080002 | 182.449997 | 179.110001 | 179.589996 | 165138000 | 0.00 | 0.0 |
| 2025-12-04 00:00:00-05:00 | 181.619995 | 184.520004 | 179.960007 | 183.380005 | 167025900 | 0.00 | 0.0 |
250 rows × 7 columns
! pip install seaborn
# KDE of Daily Returns
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
returns = np.log(data['Close'] / data['Close'].shift(1)).dropna()
sns.kdeplot(returns)
plt.title("Kernel Density Estimate of Daily Log Returns")
plt.show()
So what are we looking at? What does the above plot tell us about Nvidia's stock return over time?
According to ChatGPT, this curve shows the probability shape of returns, which can then be used for simulation or probability estimation.
The curve shows how likely different daily returns for Nvidia have been historically.
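Before digging into the shape of the curve, here is a minimal sketch of that 'probability estimation' use. It fits its own KDE with scipy's `gaussian_kde` and integrates the right tail; it assumes the `returns` series from the cell above is still in memory, and the +1% threshold is just an illustrative choice.
# Sketch: estimate P(tomorrow's log return > +1%) from a KDE of historical returns
# Assumes `returns` (daily log returns) from the previous cell is still in memory
from scipy.stats import gaussian_kde
kde = gaussian_kde(returns.values)                     # fit a Gaussian KDE to the 1-D return series
threshold = 0.01                                       # illustrative threshold: a +1% daily log return
prob_above = kde.integrate_box_1d(threshold, np.inf)   # area under the KDE curve to the right of the threshold
print(f"Estimated P(daily log return > {threshold:.0%}): {prob_above:.1%}")
The same `integrate_box_1d` call with different bounds would give the downside probabilities from the list above, such as a loss worse than -2%.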
Observations:
- Extremely tall, narrow peak at 0 return (Leptokurtic distribution)
- NVDA usually moves very little per day
- Distribution highly concentrated around zero
- similar behavior to most large tech stocks > many calm days, a few very volatile ones
- The curve is NOT symmetric...has heavier tails
- tall and narrow peak...but density spreads out much farther on both sides than a normal distribution would
- Nvidia has a 'fat-tailed' distribution
- ...more frequent large positive/negative moves than a normal distribution would predict
- ...sensitive to earnings, news, market crashes, hype cycles, etc.
- Left tail extends further than the right...downside risk exists
- Big downside crashes have happened
- its biggest daily drops have been larger (in magnitude) than its biggest daily gains
- Nvidia has higher likelihood of sharp downward shocks
- Right tail still present...NVDA has had many large positive days
- NVDA also has huge positive rallies
- The KDE confirms that NVDA's returns are NOT Normally Distributed
- Classic Gaussian assumptions fail...since the distribution is not normal
- Fat Tails (a probability distribution that has more extreme outcomes than a normal Gaussian distribution would predict)
- Nvidia tails extend to -0.45 and +0.35...classic Fat Tail behavior
- Volatility clusters, regime shifts, sudden jumps...are more common for Nvidia
- For stocks...Fat Tail means extreme price moves are more common
- Implication for Prediction & Modeling
- Nvidia behaves like a two-regime asset
- Regime 1 Calm Days: small daily moves, in stable market, most trading days
- Regime 2 Volatile Days: earnings announcements, macro events, technology releases
- GMM is very appropriate for Nvidia because it can capture these 2 regimes
Simply Stated > Nvidia stock returns have 2 different personalities...calm versus volatile. Most of the time, Nvidia stock shows a return pattern with tiny changes, positive or negative. This is represented by the big, tall spike in the graph. However, every so often, Nvidia will deliver very big price changes...sometimes positive and sometimes negative. This is represented by the Fat Tails on either side of the spike.
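To put rough numbers behind the 'Leptokurtic' and 'Fat Tail' observations above, here is a minimal sketch using `scipy.stats` (again assuming the `returns` series from the KDE cell is still in memory; a normal distribution would score roughly 0 on both measures).
# Sketch: quantify asymmetry and tail weight of the daily return distribution
# Assumes `returns` (daily log returns) from the KDE cell is still in memory
from scipy.stats import skew, kurtosis
print(f"Skewness        : {skew(returns):.3f}")      # negative values suggest a heavier left (downside) tail
print(f"Excess kurtosis : {kurtosis(returns):.3f}")  # positive values suggest fatter tails than a Gaussian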
2. Gaussian Mixture Model¶
Based on my research on the Gaussian Mixture Model (GMM), I understand that a density estimate can actually be built by combining the Normal Distributions of two or more underlying groups that influence the data.
In the previous example, the outcome showed that Nvidia's stock price returns can be categorized into 2 distinct Regimes or volatility cases, calm or volatile. As such, the density estimate that was plotted can actually be thought of as the combination of 2 different data distributions, one that describes Nvidia's stock price changes in calm markets and the other that describes changes in volatile markets.
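In symbols, the mixture idea is just a weighted sum of two Normal curves, with weights that add up to one:
$$
p(x) = \pi_{\text{calm}}\,\mathcal{N}\!\left(x \mid \mu_{\text{calm}}, \sigma_{\text{calm}}^{2}\right) + \pi_{\text{volatile}}\,\mathcal{N}\!\left(x \mid \mu_{\text{volatile}}, \sigma_{\text{volatile}}^{2}\right), \qquad \pi_{\text{calm}} + \pi_{\text{volatile}} = 1
$$
Fitting the GMM means estimating those weights, means, and variances from the return data, which is what the Expectation Maximization algorithm (the 'easy' option above) does under the hood of scikit-learn's GaussianMixture.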
Let's look at the GMM breakdown of Nvidia's price return distribution
Regime Inspection for Nvidia with Gaussian Mixture Model¶
ChatGPT had suggested that a GMM would be ideal for looking at the distribution probabilities of Nvidia's 2 distinct regimes, simplistically described as Calm Days and Volatile Days. Let's walk through the code that ChatGPT offers up to illustrate this thesis.
! pip install --upgrade numpy pandas scikit-learn matplotlib yfinance
import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
# Download NVDA daily data (adjust as you like)
ticker = "NVDA"
data = yf.download(ticker, start="2010-01-01") # long history
data = data.dropna()
data
FutureWarning: YF.download() has changed argument auto_adjust default to True
[*********************100%***********************]  1 of 1 completed
| Date | Close (NVDA) | High (NVDA) | Low (NVDA) | Open (NVDA) | Volume (NVDA) |
|---|---|---|---|---|---|
| 2010-01-04 | 0.423807 | 0.426786 | 0.415097 | 0.424265 | 800204000 |
| 2010-01-05 | 0.429995 | 0.434580 | 0.422202 | 0.422202 | 728648000 |
| 2010-01-06 | 0.432746 | 0.433663 | 0.425641 | 0.429766 | 649168000 |
| 2010-01-07 | 0.424265 | 0.432287 | 0.421056 | 0.430454 | 547792000 |
| 2010-01-08 | 0.425182 | 0.428162 | 0.418306 | 0.420827 | 478168000 |
| ... | ... | ... | ... | ... | ... |
| 2025-12-02 | 181.449905 | 185.649669 | 179.989980 | 181.749876 | 182632200 |
| 2025-12-03 | 179.580002 | 182.439843 | 179.100033 | 181.069924 | 165138000 |
| 2025-12-04 | 183.380005 | 184.520004 | 179.960007 | 181.619995 | 167364900 |
| 2025-12-05 | 182.410004 | 184.660004 | 180.910004 | 183.889999 | 143971100 |
| 2025-12-08 | 185.550003 | 188.000000 | 182.399994 | 182.639999 | 201696500 |
4008 rows × 5 columns
# Compute daily log returns
data['LogReturn'] = np.log(data['Close'] / data['Close'].shift(1))
returns = data['LogReturn'].dropna().values.reshape(-1, 1) # shape (n_samples, 1)
# Fit 2-Regime GMM > Calm vs Volatile Periods
# sklearn fit distribution of NVDA returns as a mixture of 2 Gaussian Distros
# The fit will split returns into a calm regime and a volatile regime
# (which numeric label, 0 or 1, ends up being which is only known after fitting...see the printout below)
# Configuring the Model
gmm2 = GaussianMixture(
    n_components=2,          # 2 regimes
    covariance_type='full',  # full covariance per regime (just a single variance here, since the data is 1-D)
    init_params='kmeans',    # initialize component means using k-means clustering
    n_init=10,               # 10 independent random initializations; sklearn keeps the best fit (should this be increased?)
    random_state=42          # set random seed for reproducibility
)
# Fit the returns
gmm2.fit(returns)
# Regime labels and posterior probabilities
labels2 = gmm2.predict(returns) # hard cluster label specification (0 or 1) for each data point
probs2 = gmm2.predict_proba(returns) # soft probabilities of each regime for each (daily) data point...the two probabilities sum to 1
print("2-regime means:", gmm2.means_.ravel()) # print means...ravel(), a NumPy method, flattens the array into a 1D view
print("2-regime stds :", np.sqrt(gmm2.covariances_.ravel())) # print standard deviations (square roots of the fitted variances)
print("2-regime weights:", gmm2.weights_) # print weights
2-regime means: [0.00133091 0.00164009]
2-regime stds : [0.04112189 0.01586472]
2-regime weights: [0.39555542 0.60444458]
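A small follow-up sketch to make that printout easier to read: label the lower-variance component 'Calm' and the higher-variance one 'Volatile', and check what share of trading days the hard labels assign to each regime (reusing `gmm2` and `labels2` from the cells above).
# Sketch: name the regimes by fitted volatility and count days per regime
# Reuses gmm2 and labels2 from the cells above
stds = np.sqrt(gmm2.covariances_.ravel())
calm_idx, vol_idx = int(np.argmin(stds)), int(np.argmax(stds))
for name, idx in [("Calm", calm_idx), ("Volatile", vol_idx)]:
    share = np.mean(labels2 == idx)                  # fraction of days hard-assigned to this regime
    print(f"{name:8s} regime: mean={gmm2.means_.ravel()[idx]:+.4f}, "
          f"std={stds[idx]:.4f}, weight={gmm2.weights_[idx]:.2f}, share of days={share:.1%}")
Going by the fitted standard deviations printed above, Regime 1 (std ≈ 1.6%) plays the 'Calm' role and Regime 0 (std ≈ 4.1%) the 'Volatile' one.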
# Visualize Mixture vs Empirical Distribution
x = np.linspace(returns.min(), returns.max(), 1000).reshape(-1, 1)
# Mixture pdf for 2 regimes
log_prob_2 = gmm2.score_samples(x) # returns the log of the combined mixture probability distribution function...adding 2 Gaussians together
pdf_2 = np.exp(log_prob_2) # converting to probability distribution function
# log-probabilities of each component
log_probs_components = gmm2._estimate_log_prob(x) # per-component log densities (note: uses a private sklearn helper), shape (1000, 2)
comp_pdfs = np.exp(log_probs_components) # converting to probability distribution function
weighted_pdfs = comp_pdfs * gmm2.weights_ # Weight each component by its regime weight
plt.figure(figsize=(10, 5))
plt.hist(returns, bins=80, density=True, alpha=0.4, label='Empirical Histogram')
# plot mixture (combined) PDF
plt.plot(x, pdf_2, color='orange', linewidth=2, label='GMM pdf')
# plot the 2 regimes individually
plt.plot(x, weighted_pdfs[:, 0], 'r--', label='Regime 0 pdf')
plt.plot(x, weighted_pdfs[:, 1], 'g--', label='Regime 1 pdf')
plt.title("NVDA Daily Log Returns + 2-Regime GMM Fit")
plt.xlabel("Log return")
plt.ylabel("Density")
plt.legend()
plt.show()
Observations: Nvidia's Daily Log Returns, 2 Regime GMM Fit
The plot above shows several things...
- A Histogram of NVDA's return distribution > blue bars
- A Gaussian Distribution of Volatile Returns > red dashed line
- A Gaussian Distribution of Calm Returns > green dashed line
- The Gaussian Mixture probability distribution function (the weighted sum of the two) > orange line
So Nvidia's return history since 2010, analyzed for volatility using GMM techniques, tells us the following story:
Nvidia trades like a boring stock most of the time: the green dashed line has high density, concentrated between roughly -0.5% and +0.5% daily returns. But, less commonly (though definitely not rarely), Nvidia has dramatic return days, roughly in the -2.2% to +2.7% range, captured by the less dense red dashed line.
For an investor in Nvidia stock, this plot provides empirical confirmation that Nvidia is NOT a low-risk stock to invest in, and that it has meaningful price volatility characteristics that must be considered.
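One practical follow-up for that investor view is to ask which regime the model thinks the most recent sessions belong to. Here is a short sketch reusing `data`, `gmm2`, and `probs2` from the cells above; the 5-day window is an arbitrary choice.
# Sketch: posterior probability of the volatile regime for the last few trading days
# Reuses data, gmm2, and probs2 from the cells above; the 5-day window is arbitrary
import pandas as pd
vol_idx = int(np.argmax(np.sqrt(gmm2.covariances_.ravel())))   # higher-variance component = "Volatile"
recent = pd.Series(probs2[-5:, vol_idx], index=data.index[-5:], name="P(Volatile)")
print(recent.round(3))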
Adding Dimensionality¶
In class review, Neil commented that the work I did above has only a single dimension...and that it would be interesting if I added more dimensions to the analysis to get even more insights. Suggested dimensions included market movement, etc.
I have only looked at Return so far.
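As a first stab at that suggestion, here is a hedged sketch of a two-dimensional, 2-regime GMM that models NVDA's daily log return jointly with the broader market's. Using SPY as the 'market movement' dimension is my own assumption for illustration; everything else mirrors the 1-D fit above.
# Sketch: 2-regime GMM on two dimensions > NVDA return and market (SPY) return
# SPY as the market proxy is an assumption for illustration
import numpy as np
import yfinance as yf
from sklearn.mixture import GaussianMixture
pair = yf.download(["NVDA", "SPY"], start="2010-01-01")["Close"].dropna()
log_ret = np.log(pair / pair.shift(1)).dropna()        # daily log returns, columns NVDA and SPY
gmm2d = GaussianMixture(n_components=2, covariance_type='full', n_init=10, random_state=42)
gmm2d.fit(log_ret[["NVDA", "SPY"]].values)             # fit on shape (n_samples, 2)
print("Regime means (NVDA, SPY):\n", gmm2d.means_)
print("Regime weights:", gmm2d.weights_)
The per-regime covariance matrices in gmm2d.covariances_ would then show how strongly NVDA co-moves with the market in calm versus volatile periods.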
Learning Outcome¶
- The processes and algorithms for this week have too many different names...making it hard to do research. "Cluster Weighted Model" for example.
- So I understand this week's work to be parsing a dataset into like groups...clusters, regimes
- ...then generating a probability distribution for each of those clusters
- ...the procedure allows the analysis of the component effects within an aggregate distribution