Density Estimation
After the class on Density Estimation, I could vaguely see how to make the GMM I fitted in my previous class fit the empirical data better. However, I still needed to watch a few more tutorial videos on YouTube to understand things properly.
According to one of the YouTube videos I watched on Expectation-Maximization, probability density estimation is about constructing an estimate of the underlying distribution based on the empirical data. The process involves selecting a probability distribution function and the parameters of that function that best describe the joint probability of the observed data.
What is Maximum Likelihood?
Maximum likelihood estimation is the method of determining the parameters of a probability distribution by maximizing the likelihood function, so that the observed data are most probable under the model. However, a key limitation is that maximum likelihood assumes that all variables relevant to the data are present.
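In symbols, for observed data points $x_1, \dots, x_n$ assumed independent, the maximum likelihood estimate is the parameter value that maximizes the likelihood (or, equivalently, the log-likelihood):

$$\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)$$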
In some datasets, some of these relevant variables remain unobserved; such variables are called hidden or latent variables. For example, in a mixture model the cluster that each observation came from is a latent variable. In the presence of latent variables, the conventional maximum likelihood estimator cannot be applied directly, and that is where the Expectation-Maximization (EM) algorithm comes in.
How does Expectation-Maximization work?
K-means
From the videos I referred to on YouTube, K-means is a clustering algorithm that partitions data into k clusters by minimizing the distance between data points and their assigned cluster centers. This provides a good initial guess for the cluster locations (essentially the means in the data), and using K-means first can help EM converge faster.
Why am I doing this?
I initially tried fitting a Gaussian Mixture Model (GMM) directly to the BTC price data, which can be found in my previous assignment. While the model ran successfully, the resulting probability density function (PDF) was too smooth and didn’t capture the finer details of the empirical data distribution. To improve the fit, I decided to decompose the problem: first, identify clusters in the data using K-means, then refine the Gaussian components using the Expectation-Maximization (EM) algorithm, and finally fit the GMM using these parameters.
K-means clustering
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Load and clean BTC data
df = pd.read_csv("datasets/BTC_USD_full_data.csv")
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Close'] = pd.to_numeric(df['Close'], errors='coerce')
df = df.dropna(subset=['Date', 'Close'])
prices = df['Close'].values.reshape(-1, 1) # Reshape to 2D array
#Initialize K-means parameters
k = 3 # number of clusters
np.random.seed(42)
initial_indices = np.random.choice(len(prices), k, replace=False)
centers = prices[initial_indices] # randomly pick k points as initial centers
print("Initial Centers:", centers.flatten())
# Step 3: Iterative K-means
max_iters = 10
for iteration in range(max_iters):
    # Assign each point to the nearest center
    distances = np.abs(prices - centers.T)  # shape (n_points, k)
    labels = np.argmin(distances, axis=1)   # closest center
    # Update centers
    new_centers = np.array([prices[labels == i].mean() for i in range(k)]).reshape(-1, 1)
    # Print iteration info
    print(f"Iteration {iteration+1}: Centers = {new_centers.flatten()}")
    # Check for convergence
    if np.allclose(centers, new_centers):
        print("Converged!")
        break
    centers = new_centers
# Visualize K-means clusters
plt.figure(figsize=(12,6))
for i in range(k):
    cluster_points = prices[labels == i]
    plt.hist(cluster_points, bins=50, alpha=0.6, label=f"Cluster {i+1}")
plt.axvline(centers[0], color='r', linestyle='--', lw=2, label="Cluster Centers")
plt.axvline(centers[1], color='g', linestyle='--', lw=2)
plt.axvline(centers[2], color='b', linestyle='--', lw=2)
plt.xlabel("BTC Closing Price (USD)")
plt.ylabel("Frequency")
plt.title(f"K-Means Clustering of BTC Prices ({k} clusters)")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
Initial Centers: [67578.09375 43798.1171875 84347.0234375]
Iteration 1: Centers = [ 63062.93260609 33113.59833688 101735.25941918]
Iteration 2: Centers = [ 60929.71829243 31260.1176401 102395.06667493]
Iteration 3: Centers = [ 59373.82064559 30284.4421901 102342.1364817 ]
Iteration 4: Centers = [ 58887.74958882 29993.3433042 102286.14849918]
Iteration 5: Centers = [ 58714.19798712 29910.20505149 102229.23220144]
Iteration 6: Centers = [ 58540.97149884 29827.64773661 102172.28182264]
Iteration 7: Centers = [ 58540.97149884 29827.64773661 102172.28182264]
Converged!
The K-means clustering of the BTC prices separated the data into three main clusters. These clusters roughly correspond to low, medium, and high price ranges in the data. The cluster centers will serve as initial estimates for the Gaussian components in the EM algorithm.
Next, I will use the EM algorithm to refine the clusters. EM (Expectation-Maximization) performs soft clustering, where each point has a probability of belonging to each cluster, which allows a more accurate modeling of the BTC price distribution. The K-means centers are used to initialize the EM algorithm.
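Concretely, for a one-dimensional Gaussian mixture with weights $\pi_i$, means $\mu_i$ and variances $\sigma_i^2$, each EM iteration performs the two updates that the code below implements:

$$\text{E-step:}\quad \gamma_{ni} = \frac{\pi_i \, \mathcal{N}(x_n \mid \mu_i, \sigma_i^2)}{\sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \sigma_j^2)}$$

$$\text{M-step:}\quad N_i = \sum_{n} \gamma_{ni}, \qquad \mu_i = \frac{1}{N_i}\sum_{n} \gamma_{ni}\, x_n, \qquad \sigma_i^2 = \frac{1}{N_i}\sum_{n} \gamma_{ni}\,(x_n - \mu_i)^2, \qquad \pi_i = \frac{N_i}{N},$$

where $N$ is the total number of data points.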
EM algorithm
In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load BTC data and make sure dates and prices are correct
df = pd.read_csv("datasets/BTC_USD_full_data.csv")
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Close'] = pd.to_numeric(df['Close'], errors='coerce')
df = df.dropna(subset=['Date', 'Close']) # remove rows with missing values
prices = df['Close'].values.reshape(-1, 1) # convert to 2D array
n = len(prices)
# Initialize EM parameters from K-means results
k = 3 # number of clusters
means = np.array([58540.97, 29827.65, 102172.28], dtype=float) # starting centers
variances = np.array([np.var(prices)]*k) # start with overall variance
weights = np.array([1/k]*k) # assume equal weight for each cluster
# Repeat EM steps multiple times
max_iters = 60
for iteration in range(max_iters):
    # E-step: calculate how likely each point belongs to each cluster
    resp = np.zeros((n, k))
    for i in range(k):
        resp[:, i] = weights[i] * (1/np.sqrt(2*np.pi*variances[i])) * \
                     np.exp(-(prices.flatten() - means[i])**2 / (2*variances[i]))
    resp = resp / resp.sum(axis=1, keepdims=True)  # make sure probabilities add up to 1
    # M-step: update cluster centers, variances, and weights based on probabilities
    for i in range(k):
        Nk = resp[:, i].sum()  # total weight of this cluster
        means[i] = (resp[:, i] @ prices.flatten()) / Nk  # new mean
        variances[i] = (resp[:, i] @ (prices.flatten() - means[i])**2) / Nk  # new variance
        weights[i] = Nk / n  # new weight
    print(f"Iteration {iteration+1}: Means = {means}, Variances = {variances}, Weights = {weights}")
# Plot histogram and the fitted EM Gaussians
x = np.linspace(prices.min(), prices.max(), 1000)
pdf = np.zeros_like(x)
for i in range(k):
    pdf += weights[i] * (1/np.sqrt(2*np.pi*variances[i])) * \
           np.exp(-(x - means[i])**2 / (2*variances[i]))
plt.figure(figsize=(12,6))
plt.hist(prices, bins=50, density=True, alpha=0.6, color='skyblue', label="BTC Prices")
plt.plot(x, pdf, 'r-', lw=2, label="EM Gaussian Fit")
plt.xlabel("BTC Closing Price (USD)")
plt.ylabel("Probability Density")
plt.title("BTC Price Distribution (EM Algorithm)")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
Iteration 1: Means = [51094.76645625 37095.1121068 88539.68232879], Variances = [5.88628421e+08 2.84952908e+08 6.48023829e+08], Weights = [0.38338957 0.39554994 0.2210605 ] Iteration 2: Means = [50301.34644594 36187.45544699 91592.85669591], Variances = [5.51207784e+08 2.00141424e+08 5.42476484e+08], Weights = [0.35713726 0.41532636 0.22753637] Iteration 3: Means = [50227.35626683 35201.907338 93623.98544598], Variances = [4.88686091e+08 1.69835073e+08 4.35404759e+08], Weights = [0.33452403 0.43257575 0.23290022] Iteration 4: Means = [50347.63350486 34423.5285382 95228.41786393], Variances = [4.17775337e+08 1.54376796e+08 3.61925911e+08], Weights = [0.32142797 0.44351174 0.23506028] Iteration 5: Means = [50624.42806815 33763.83898027 96629.28195177], Variances = [3.54189321e+08 1.44170921e+08 3.09463348e+08], Weights = [0.31846895 0.4476763 0.23385475] Iteration 6: Means = [51108.82635631 33103.13201589 97950.46471259], Variances = [3.05000850e+08 1.35081473e+08 2.65194579e+08], Weights = [0.32363641 0.44652583 0.22983776] Iteration 7: Means = [51809.21922987 32377.89283026 99274.6136694 ], Variances = [2.68517179e+08 1.24791737e+08 2.23110903e+08], Weights = [0.33415846 0.44215702 0.22368452] Iteration 8: Means = [ 52658.95153418 31582.19599544 100596.24616048], Variances = [2.40084944e+08 1.12474250e+08 1.82172850e+08], Weights = [0.34722213 0.43638303 0.21639483] Iteration 9: Means = [ 53541.49154298 30766.57786884 101761.52397503], Variances = [2.16400296e+08 9.90138052e+07 1.46839793e+08], Weights = [0.35990948 0.43062053 0.20946999] Iteration 10: Means = [ 54325.4995573 30025.29321924 102559.89961219], Variances = [1.96594156e+08 8.66437503e+07 1.24077760e+08], Weights = [0.36981835 0.42582622 0.20435542] Iteration 11: Means = [ 54898.26075448 29438.44454161 102970.88725186], Variances = [1.80813543e+08 7.71984279e+07 1.13732940e+08], Weights = [0.37640653 0.42214348 0.20145 ] Iteration 12: Means = [ 55208.95845068 29019.92968602 103125.27225738], Variances = [1.68783199e+08 7.08251500e+07 1.10054242e+08], Weights = [0.38050331 0.41910825 0.20038844] Iteration 13: Means = [ 55303.61304008 28732.69050071 103136.99433509], Variances = [1.59985307e+08 6.66904459e+07 1.09279234e+08], Weights = [0.3831349 0.4163141 0.200551 ] Iteration 14: Means = [ 55267.08495537 28530.81657929 103079.74955157], Variances = [1.53759150e+08 6.39463772e+07 1.09781399e+08], Weights = [0.38506566 0.41360671 0.20132764] Iteration 15: Means = [ 55167.16500657 28380.21374983 102998.31407987], Variances = [1.49450933e+08 6.20171348e+07 1.10779170e+08], Weights = [0.38675 0.41095687 0.20229312] Iteration 16: Means = [ 55043.77150305 28259.99648579 102916.97798127], Variances = [1.46553082e+08 6.05619931e+07 1.11868353e+08], Weights = [0.38842366 0.40836229 0.20321405] Iteration 17: Means = [ 54916.26867001 28158.2288298 102846.64872253], Variances = [1.44706237e+08 5.93895261e+07 1.12856466e+08], Weights = [0.39019416 0.40581523 0.20399061] Iteration 18: Means = [ 54792.24230617 28068.07165923 102790.60684951], Variances = [1.43654311e+08 5.83924708e+07 1.13671022e+08], Weights = [0.39210151 0.40329975 0.20459874] Iteration 19: Means = [ 54673.5342353 27985.41777759 102748.39531201], Variances = [1.43208434e+08 5.75084591e+07 1.14301859e+08], Weights = [0.3941533 0.40079643 0.20505027] Iteration 20: Means = [ 54559.6112616 27907.63666963 102718.07139346], Variances = [1.43225941e+08 5.66989137e+07 1.14766997e+08], Weights = [0.39634327 0.39828661 0.20537013] Iteration 21: Means = [ 54449.19957251 27832.942653 
102697.35288238], Variances = [1.43598226e+08 5.59383746e+07 1.15094097e+08], Weights = [0.39866053 0.39575444 0.20558503] Iteration 22: Means = [ 54340.9777837 27760.07324429 102684.1242833 ], Variances = [1.44242727e+08 5.52091608e+07 1.15311346e+08], Weights = [0.40109376 0.39318742 0.20571882] Iteration 23: Means = [ 54233.81930906 27688.11475368 102676.61164784], Variances = [1.45096974e+08 5.44985936e+07 1.15443623e+08], Weights = [0.40363287 0.39057607 0.20579106] Iteration 24: Means = [ 54126.84214625 27616.39869479 102673.40318017], Variances = [1.46113881e+08 5.37974293e+07 1.15511365e+08], Weights = [0.40626945 0.38791346 0.20581708] Iteration 25: Means = [ 54019.38758535 27544.43475054 102673.40797862], Variances = [1.47258023e+08 5.30988878e+07 1.15530696e+08], Weights = [0.4089967 0.38519465 0.20580865] Iteration 26: Means = [ 53910.97986155 27471.86476317 102675.7971639 ], Variances = [1.48502716e+08 5.23980030e+07 1.15514031e+08], Weights = [0.41180913 0.38241623 0.20577464] Iteration 27: Means = [ 53801.28664961 27398.43018438 102679.94653325], Variances = [1.49827758e+08 5.16911678e+07 1.15470804e+08], Weights = [0.41470223 0.37957606 0.20572171] Iteration 28: Means = [ 53690.08656776 27323.9488345 102685.38764638], Variances = [1.51217714e+08 5.09758095e+07 1.15408126e+08], Weights = [0.41767217 0.37667298 0.20565484] Iteration 29: Means = [ 53577.24448182 27248.29838589 102691.76872239], Variances = [1.52660617e+08 5.02501586e+07 1.15331349e+08], Weights = [0.42071549 0.37370672 0.20557779] Iteration 30: Means = [ 53462.69358155 27171.40480075 102698.82450087], Variances = [1.54147005e+08 4.95130840e+07 1.15244507e+08], Weights = [0.42382885 0.37067777 0.20549337] Iteration 31: Means = [ 53346.42279649 27093.23444381 102706.35352643], Variances = [1.55669202e+08 4.87639805e+07 1.15150655e+08], Weights = [0.42700885 0.36758741 0.20540374] Iteration 32: Means = [ 53228.46821754 27013.78891763 102714.2012791 ], Variances = [1.57220786e+08 4.80026918e+07 1.15052123e+08], Weights = [0.43025183 0.36443762 0.20531055] Iteration 33: Means = [ 53108.90742592 26933.10189822 102722.24777976], Variances = [1.58796200e+08 4.72294619e+07 1.14950712e+08], Weights = [0.43355371 0.3612312 0.20521509] Iteration 34: Means = [ 52987.85586767 26851.23741196 102730.39856459], Variances = [1.60390465e+08 4.64449051e+07 1.14847821e+08], Weights = [0.43690992 0.3579717 0.20511838] Iteration 35: Means = [ 52865.46460683 26768.28910761 102738.57817055], Variances = [1.61998983e+08 4.56499904e+07 1.14744559e+08], Weights = [0.44031523 0.35466353 0.20502124] Iteration 36: Means = [ 52741.91893835 26684.38015256 102746.72548034], Variances = [1.63617384e+08 4.48460323e+07 1.14641812e+08], Weights = [0.44376366 0.35131198 0.20492435] Iteration 37: Means = [ 52617.43744885 26599.66342939 102754.79043686], Variances = [1.65241428e+08 4.40346873e+07 1.14540294e+08], Weights = [0.44724848 0.34792325 0.20482826] Iteration 38: Means = [ 52492.27118887 26514.3217366 102762.73175942], Variances = [1.66866945e+08 4.32179485e+07 1.14440582e+08], Weights = [0.45076209 0.34450447 0.20473344] Iteration 39: Means = [ 52366.7026742 26428.56771355 102770.51538409], Variances = [1.68489790e+08 4.23981370e+07 1.14343147e+08], Weights = [0.454296 0.34106372 0.20464027] Iteration 40: Means = [ 52241.04447367 26342.64322232 102778.11341652], Variances = [1.70105826e+08 4.15778867e+07 1.14248362e+08], Weights = [0.4578409 0.33761002 0.20454908] Iteration 41: Means = [ 52115.63717553 26256.81793654 102785.50343335], 
Variances = [1.71710918e+08 4.07601185e+07 1.14156521e+08], Weights = [0.46138659 0.33415327 0.20446014] Iteration 42: Means = [ 51990.84655842 26171.38691622 102792.66800373], Variances = [1.73300941e+08 3.99480030e+07 1.14067842e+08], Weights = [0.46492214 0.33070421 0.20437364] Iteration 43: Means = [ 51867.05983333 26086.66699565 102799.59432964], Variances = [1.74871793e+08 3.91449087e+07 1.13982476e+08], Weights = [0.46843594 0.32727429 0.20428977] Iteration 44: Means = [ 51744.68087329 26002.99188211 102806.27392608], Variances = [1.76419413e+08 3.83543381e+07 1.13900510e+08], Weights = [0.47191586 0.32387552 0.20420863] Iteration 45: Means = [ 51624.12441012 25920.70595904 102812.70228172], Variances = [1.77939806e+08 3.75798488e+07 1.13821974e+08], Weights = [0.47534939 0.32052032 0.20413029] Iteration 46: Means = [ 51505.80925284 25840.15690418 102818.87846008], Variances = [1.79429072e+08 3.68249657e+07 1.13746847e+08], Weights = [0.47872392 0.31722129 0.20405479] Iteration 47: Means = [ 51390.15066829 25761.68736333 102824.8046199 ], Variances = [1.80883429e+08 3.60930854e+07 1.13675063e+08], Weights = [0.48202691 0.31399095 0.20398214] Iteration 48: Means = [ 51277.55215317 25685.62604864 102830.48545212], Variances = [1.82299251e+08 3.53873794e+07 1.13606518e+08], Weights = [0.48524618 0.31084153 0.2039123 ] Iteration 49: Means = [ 51168.39691172 25612.27873998 102835.92754881], Variances = [1.83673101e+08 3.47107009e+07 1.13541085e+08], Weights = [0.48837016 0.30778462 0.20384522] Iteration 50: Means = [ 51063.03942062 25541.91973775 102841.1387342 ], Variances = [1.85001771e+08 3.40655027e+07 1.13478615e+08], Weights = [0.49138817 0.30483099 0.20378084] Iteration 51: Means = [ 50961.79750414 25474.78433051 102846.12739908], Variances = [1.86282320e+08 3.34537704e+07 1.13418957e+08], Weights = [0.49429063 0.30199028 0.20371909] Iteration 52: Means = [ 50864.94534613 25411.06279017 102850.90188515], Variances = [1.87512119e+08 3.28769768e+07 1.13361959e+08], Weights = [0.49706926 0.29927083 0.20365991] Iteration 53: Means = [ 50772.7078289 25350.89629375 102855.46996404], Variances = [1.88688892e+08 3.23360593e+07 1.13307480e+08], Weights = [0.4997173 0.29667948 0.20360322] Iteration 54: Means = [ 50685.256511 25294.37500592 102859.83844799], Variances = [1.89810761e+08 3.18314221e+07 1.13255395e+08], Weights = [0.50222955 0.29422148 0.20354896] Iteration 55: Means = [ 50602.7074461 25241.53836517 102864.01295528], Variances = [1.90876276e+08 3.13629612e+07 1.13205602e+08], Weights = [0.50460247 0.29190043 0.20349711] Iteration 56: Means = [ 50525.12091434 25192.37742586 102867.99783667], Variances = [1.91884447e+08 3.09301083e+07 1.13158019e+08], Weights = [0.50683414 0.28971825 0.20344761] Iteration 57: Means = [ 50452.5030031 25146.83894707 102871.79625192], Variances = [1.92834760e+08 3.05318903e+07 1.13112586e+08], Weights = [0.50892424 0.28767531 0.20340045] Iteration 58: Means = [ 50384.8088512 25104.83080664 102875.4103703 ], Variances = [1.93727180e+08 3.01669976e+07 1.13069265e+08], Weights = [0.51087394 0.28577046 0.2033556 ] Iteration 59: Means = [ 50321.94727306 25066.22826591 102878.84165853], Variances = [1.94562145e+08 2.98338569e+07 1.13028030e+08], Weights = [0.51268571 0.28400122 0.20331307] Iteration 60: Means = [ 50263.78641607 25030.88061461 102882.09121508], Variances = [1.95340541e+08 2.95307032e+07 1.12988868e+08], Weights = [0.51436323 0.28236394 0.20327283]
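The parameters are still drifting slightly at iteration 60, so it is hard to tell from the printout alone whether the loop has converged. A minimal diagnostic I would add (not part of the original notebook; it reuses prices, weights, means, variances, k and n from the cell above): the total log-likelihood must increase at every EM iteration and flatten out at convergence, so printing it, either after the loop or once per iteration, is an easier signal to read than the raw parameter vectors.

In [ ]:
# Total log-likelihood of the data under the current EM parameters
# (assumes prices, weights, means, variances, k and n are defined as in the EM cell above)
dens = np.zeros(n)
for i in range(k):
    dens += weights[i] * (1/np.sqrt(2*np.pi*variances[i])) * \
            np.exp(-(prices.flatten() - means[i])**2 / (2*variances[i]))
log_likelihood = np.log(dens).sum()
print("EM log-likelihood:", log_likelihood)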
Gaussian Mixture Model with EM Initialization
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
# Load and clean BTC data
df = pd.read_csv("datasets/BTC_USD_full_data.csv")
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Close'] = pd.to_numeric(df['Close'], errors='coerce')
df = df.dropna(subset=['Date', 'Close'])
prices = df['Close'].values.reshape(-1, 1)
# EM-refined parameters
means_init = np.array([[50263.78641607],
                       [25030.88061461],
                       [102882.09121508]])
variances_init = np.array([1.95340541e+08, 2.95307032e+07, 1.12988868e+08])
weights_init = np.array([0.51436323, 0.28236394, 0.20327283])
# Convert variances to covariance matrices (full covariance type)
covariances_init = variances_init.reshape(-1, 1, 1) # 3x1x1 array
# Fit GMM using sklearn, initialized with EM parameters
gmm = GaussianMixture(
    n_components=3,
    covariance_type='full',
    weights_init=weights_init,
    means_init=means_init,
    precisions_init=np.array([1/cov for cov in covariances_init]),  # inverse of each covariance matrix
    random_state=42
)
gmm.fit(prices)
# Extract final fitted parameters
fitted_means = gmm.means_.flatten()
fitted_covariances = gmm.covariances_.flatten()
fitted_weights = gmm.weights_.flatten()
print("Fitted GMM Means:", fitted_means)
print("Fitted GMM Covariances:", fitted_covariances)
print("Fitted GMM Weights:", fitted_weights)
# Visualize the fitted GMM PDF
x = np.linspace(prices.min(), prices.max(), 1000)
pdf = np.exp(gmm.score_samples(x.reshape(-1,1)))
plt.figure(figsize=(12,6))
plt.hist(prices, bins=50, density=True, alpha=0.6, color='skyblue', label="BTC Prices")
plt.plot(x, pdf, 'r-', lw=2, label="GMM Fit (initialized by EM)")
plt.xlabel("BTC Closing Price (USD)")
plt.ylabel("Probability Density")
plt.title("BTC Price Distribution (GMM fitted with EM initialization)")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
Fitted GMM Means: [ 50160.87436025 24969.25655697 102888.04970422]
Fitted GMM Covariances: [1.96733192e+08 2.90067327e+07 1.12916730e+08]
Fitted GMM Weights: [0.51733489 0.27946591 0.2031992 ]
The GMM reveals that Bitcoin prices are not centered around a single average value but instead cluster into three distinct market regimes: a low-price zone (around 25k), a mid-price zone (around 50k), and a high-price zone (around 100k). The weights of the components show how much time BTC historically spent in each regime, with the mid-price region being the most common. The model captures the multimodality and heavier tails of the data, representing Bitcoin's volatility far better than a single normal distribution would. These insights give a deeper understanding of market structure, tail risk, and volatility behavior, and provide a strong foundation for regime-based forecasting.
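As a quick follow-up (not in the original notebook), the fitted model can also assign each observation to a regime, which makes the claim about time spent in each regime easy to check directly. A minimal sketch using scikit-learn's predict and predict_proba on the gmm object fitted above; the 60,000 USD query price is just an arbitrary example value:

In [ ]:
# Hard assignment: the most likely regime for each daily close
regime_labels = gmm.predict(prices)                      # array of component indices (0, 1 or 2)
days_per_regime = np.bincount(regime_labels, minlength=3)
print("Days assigned to each regime:", days_per_regime)

# Soft assignment: regime membership probabilities for an example price of 60,000 USD
probs = gmm.predict_proba(np.array([[60000.0]]))
print("Regime probabilities at $60,000:", probs.round(3))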
References
- GeeksforGeeks. (n.d.). Expectation maximization (EM) algorithm. https://www.geeksforgeeks.org/expectation-maximization-em-algorithm/
- GeeksforGeeks. (n.d.). K-means clustering – introduction. https://www.geeksforgeeks.org/k-means-clustering-introduction/
- GeeksforGeeks. (n.d.). Probability distribution. https://www.geeksforgeeks.org/maths/probability-distribution/
- EM Algorithm in Machine Learning | Expectation-Maximization Tutorial [Video]. (2019, May 1). YouTube. https://www.youtube.com/watch?v=DIADjJXrgps