[Your-Name-Here] - Fab Futures - Data Science

Density Estimation

After the class on Density Estimation, I could vaguely see how to make the GMM fit from my previous assignment match the empirical data better. However, I still needed to watch a few more tutorial videos on YouTube to understand things properly.

According to one of the YouTube videos I watched on Expectation-Maximization, probability density estimation is about constructing an estimate of the underlying distribution based on the empirical data. The process involves selecting a probability distribution function and the parameters of that function that best describe the combined probability of the observed data.

What is Maximum Likelihood?

Maximum likelihood is a method of determining the parameters of a probability distribution by maximizing the likelihood function, i.e., choosing the parameters that make the observed data most probable under the model. A limitation, however, is that maximum likelihood assumes that all variables relevant to the data are present.

In some datasets, these variables may remain unobserved, causing inconsistencies; such variables are called hidden or latent variables. In a Gaussian mixture, for example, the component that generated each data point is a latent variable.

In the presence of latent variables, the conventional maximum likelihood estimator doesn't work directly, and that is where the Expectation-Maximization algorithm comes in.
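Before moving on to EM, here is a minimal sketch of plain maximum likelihood on synthetic data (not the BTC prices): for a single Gaussian, the likelihood is maximized by the sample mean and the (biased) sample variance, and any other parameter choice gives a lower log-likelihood.

In [ ]:
import numpy as np

# Synthetic data drawn from a known normal distribution
rng = np.random.default_rng(0)
data = rng.normal(loc=50_000, scale=10_000, size=5_000)

def gaussian_log_likelihood(x, mu, sigma2):
    """Total log-likelihood of the data under a Normal(mu, sigma2) model."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Closed-form maximum likelihood estimates for a single Gaussian:
# the sample mean and the (biased) sample variance
mu_hat = data.mean()
sigma2_hat = ((data - mu_hat) ** 2).mean()

print("MLE mean:", mu_hat, " MLE variance:", sigma2_hat)
print("Log-likelihood at the MLE:     ", gaussian_log_likelihood(data, mu_hat, sigma2_hat))
print("Log-likelihood at a worse guess:", gaussian_log_likelihood(data, mu_hat * 1.1, sigma2_hat))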

How does Expectation-Maximization work?

(Figure: screenshot illustrating the steps of the Expectation-Maximization algorithm.)
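In short, EM alternates between an E-step and an M-step until the parameters stop changing. For the one-dimensional Gaussian mixture used below, these are the updates my EM code implements:

$$
\gamma_{ik} = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x_i \mid \mu_j, \sigma_j^2)} \qquad \text{(E-step: responsibility of cluster } k \text{ for point } x_i\text{)}
$$

$$
N_k = \sum_{i=1}^{n} \gamma_{ik}, \qquad \mu_k = \frac{1}{N_k}\sum_{i=1}^{n} \gamma_{ik}\,x_i, \qquad \sigma_k^2 = \frac{1}{N_k}\sum_{i=1}^{n} \gamma_{ik}\,(x_i - \mu_k)^2, \qquad \pi_k = \frac{N_k}{n} \qquad \text{(M-step)}
$$

Each iteration is guaranteed not to decrease the overall log-likelihood, which is why the printed means and variances settle down over the iterations further below.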

K-means

From the videos that I referred to on YouTube, K-means is a clustering algorithm that partitions data into k clusters by minimizing the distance between data points and the cluster centers (anchors). Running K-means first provides a good initial guess for the cluster locations, essentially the means in the data, and can help EM converge faster.

Why am I doing this?

I initially tried fitting a Gaussian Mixture Model (GMM) directly to the BTC price data, which can be found in my previous assignment. While the model ran successfully, the resulting probability density function (PDF) was too smooth and didn't capture the finer details of the empirical data distribution. To improve the fit, I decided to decompose the problem: first, identify clusters in the data using K-means, then refine the Gaussian components using the Expectation-Maximization (EM) algorithm, and finally fit the GMM using these parameters.

K-means clustering

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Load and clean BTC data
df = pd.read_csv("datasets/BTC_USD_full_data.csv")
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Close'] = pd.to_numeric(df['Close'], errors='coerce')
df = df.dropna(subset=['Date', 'Close'])

prices = df['Close'].values.reshape(-1, 1)  # Reshape to 2D array

#Initialize K-means parameters
k = 3  # number of clusters
np.random.seed(42)
initial_indices = np.random.choice(len(prices), k, replace=False)
centers = prices[initial_indices]  # randomly pick k points as initial centers

print("Initial Centers:", centers.flatten())

# Iterative K-means: assign points to the nearest center, then update the centers
max_iters = 10
for iteration in range(max_iters):
    #Assign each point to the nearest center
    distances = np.abs(prices - centers.T)  # shape (n_points, k)
    labels = np.argmin(distances, axis=1)  # closest center
    
    # Update centers
    new_centers = np.array([prices[labels == i].mean() for i in range(k)]).reshape(-1, 1)
    
    # Print iteration info
    print(f"Iteration {iteration+1}: Centers = {new_centers.flatten()}")
    
    # Check for convergence
    if np.allclose(centers, new_centers):
        print("Converged!")
        break
    
    centers = new_centers

# Visualize K-means clusters
plt.figure(figsize=(12,6))
for i in range(k):
    cluster_points = prices[labels == i]
    plt.hist(cluster_points, bins=50, alpha=0.6, label=f"Cluster {i+1}")
    
plt.axvline(centers[0], color='r', linestyle='--', lw=2, label="Cluster Centers")
plt.axvline(centers[1], color='g', linestyle='--', lw=2)
plt.axvline(centers[2], color='b', linestyle='--', lw=2)
plt.xlabel("BTC Closing Price (USD)")
plt.ylabel("Frequency")
plt.title(f"K-Means Clustering of BTC Prices ({k} clusters)")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
Initial Centers: [67578.09375   43798.1171875 84347.0234375]
Iteration 1: Centers = [ 63062.93260609  33113.59833688 101735.25941918]
Iteration 2: Centers = [ 60929.71829243  31260.1176401  102395.06667493]
Iteration 3: Centers = [ 59373.82064559  30284.4421901  102342.1364817 ]
Iteration 4: Centers = [ 58887.74958882  29993.3433042  102286.14849918]
Iteration 5: Centers = [ 58714.19798712  29910.20505149 102229.23220144]
Iteration 6: Centers = [ 58540.97149884  29827.64773661 102172.28182264]
Iteration 7: Centers = [ 58540.97149884  29827.64773661 102172.28182264]
Converged!
[Figure: histogram of BTC closing prices split into the three K-means clusters, with dashed vertical lines marking the cluster centers]

The K-means clustering of the BTC prices separated the data into three main clusters. These clusters roughly correspond to low, medium, and high price ranges in the data. The cluster centers will serve as initial estimates for the Gaussian components in the EM algorithm.

Next, I will use the EM algorithm to refine the clusters. EM (Expectation-Maximization) performs soft clustering, where each point has a probability of belonging to each cluster, which allows more accurate modeling of the BTC price distribution. The K-means centers are used to initialize the EM algorithm.

EM algorithm

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load BTC data and make sure dates and prices are correct
df = pd.read_csv("datasets/BTC_USD_full_data.csv")
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Close'] = pd.to_numeric(df['Close'], errors='coerce')
df = df.dropna(subset=['Date', 'Close'])  # remove rows with missing values

prices = df['Close'].values.reshape(-1, 1)  # convert to 2D array
n = len(prices)

# Initialize EM parameters from K-means results
k = 3  # number of clusters
means = np.array([58540.97, 29827.65, 102172.28], dtype=float)  # starting centers
variances = np.array([np.var(prices)]*k)  # start with overall variance
weights = np.array([1/k]*k)  # assume equal weight for each cluster

# Repeat EM steps multiple times
max_iters = 60
for iteration in range(max_iters):
    # E-step: calculate how likely each point belongs to each cluster
    resp = np.zeros((n, k))
    for i in range(k):
        resp[:, i] = weights[i] * (1/np.sqrt(2*np.pi*variances[i])) * \
                     np.exp(-(prices.flatten() - means[i])**2 / (2*variances[i]))
    resp = resp / resp.sum(axis=1, keepdims=True)  # make sure probabilities add up to 1
    
    # M-step: update cluster centers, variances, and weights based on probabilities
    for i in range(k):
        Nk = resp[:, i].sum()  # total weight of this cluster
        means[i] = (resp[:, i] @ prices.flatten()) / Nk  # new mean
        variances[i] = (resp[:, i] @ (prices.flatten() - means[i])**2) / Nk  # new variance
        weights[i] = Nk / n  # new weight
    
    print(f"Iteration {iteration+1}: Means = {means}, Variances = {variances}, Weights = {weights}")

# Plot histogram and the fitted EM Gaussians
x = np.linspace(prices.min(), prices.max(), 1000)
pdf = np.zeros_like(x)
for i in range(k):
    pdf += weights[i] * (1/np.sqrt(2*np.pi*variances[i])) * \
           np.exp(-(x - means[i])**2 / (2*variances[i]))

plt.figure(figsize=(12,6))
plt.hist(prices, bins=50, density=True, alpha=0.6, color='skyblue', label="BTC Prices")
plt.plot(x, pdf, 'r-', lw=2, label="EM Gaussian Fit")
plt.xlabel("BTC Closing Price (USD)")
plt.ylabel("Probability Density")
plt.title("BTC Price Distribution (EM Algorithm)")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
Iteration 1: Means = [51094.76645625 37095.1121068  88539.68232879], Variances = [5.88628421e+08 2.84952908e+08 6.48023829e+08], Weights = [0.38338957 0.39554994 0.2210605 ]
Iteration 2: Means = [50301.34644594 36187.45544699 91592.85669591], Variances = [5.51207784e+08 2.00141424e+08 5.42476484e+08], Weights = [0.35713726 0.41532636 0.22753637]
Iteration 3: Means = [50227.35626683 35201.907338   93623.98544598], Variances = [4.88686091e+08 1.69835073e+08 4.35404759e+08], Weights = [0.33452403 0.43257575 0.23290022]
Iteration 4: Means = [50347.63350486 34423.5285382  95228.41786393], Variances = [4.17775337e+08 1.54376796e+08 3.61925911e+08], Weights = [0.32142797 0.44351174 0.23506028]
Iteration 5: Means = [50624.42806815 33763.83898027 96629.28195177], Variances = [3.54189321e+08 1.44170921e+08 3.09463348e+08], Weights = [0.31846895 0.4476763  0.23385475]
Iteration 6: Means = [51108.82635631 33103.13201589 97950.46471259], Variances = [3.05000850e+08 1.35081473e+08 2.65194579e+08], Weights = [0.32363641 0.44652583 0.22983776]
Iteration 7: Means = [51809.21922987 32377.89283026 99274.6136694 ], Variances = [2.68517179e+08 1.24791737e+08 2.23110903e+08], Weights = [0.33415846 0.44215702 0.22368452]
Iteration 8: Means = [ 52658.95153418  31582.19599544 100596.24616048], Variances = [2.40084944e+08 1.12474250e+08 1.82172850e+08], Weights = [0.34722213 0.43638303 0.21639483]
Iteration 9: Means = [ 53541.49154298  30766.57786884 101761.52397503], Variances = [2.16400296e+08 9.90138052e+07 1.46839793e+08], Weights = [0.35990948 0.43062053 0.20946999]
Iteration 10: Means = [ 54325.4995573   30025.29321924 102559.89961219], Variances = [1.96594156e+08 8.66437503e+07 1.24077760e+08], Weights = [0.36981835 0.42582622 0.20435542]
Iteration 11: Means = [ 54898.26075448  29438.44454161 102970.88725186], Variances = [1.80813543e+08 7.71984279e+07 1.13732940e+08], Weights = [0.37640653 0.42214348 0.20145   ]
Iteration 12: Means = [ 55208.95845068  29019.92968602 103125.27225738], Variances = [1.68783199e+08 7.08251500e+07 1.10054242e+08], Weights = [0.38050331 0.41910825 0.20038844]
Iteration 13: Means = [ 55303.61304008  28732.69050071 103136.99433509], Variances = [1.59985307e+08 6.66904459e+07 1.09279234e+08], Weights = [0.3831349 0.4163141 0.200551 ]
Iteration 14: Means = [ 55267.08495537  28530.81657929 103079.74955157], Variances = [1.53759150e+08 6.39463772e+07 1.09781399e+08], Weights = [0.38506566 0.41360671 0.20132764]
Iteration 15: Means = [ 55167.16500657  28380.21374983 102998.31407987], Variances = [1.49450933e+08 6.20171348e+07 1.10779170e+08], Weights = [0.38675    0.41095687 0.20229312]
Iteration 16: Means = [ 55043.77150305  28259.99648579 102916.97798127], Variances = [1.46553082e+08 6.05619931e+07 1.11868353e+08], Weights = [0.38842366 0.40836229 0.20321405]
Iteration 17: Means = [ 54916.26867001  28158.2288298  102846.64872253], Variances = [1.44706237e+08 5.93895261e+07 1.12856466e+08], Weights = [0.39019416 0.40581523 0.20399061]
Iteration 18: Means = [ 54792.24230617  28068.07165923 102790.60684951], Variances = [1.43654311e+08 5.83924708e+07 1.13671022e+08], Weights = [0.39210151 0.40329975 0.20459874]
Iteration 19: Means = [ 54673.5342353   27985.41777759 102748.39531201], Variances = [1.43208434e+08 5.75084591e+07 1.14301859e+08], Weights = [0.3941533  0.40079643 0.20505027]
Iteration 20: Means = [ 54559.6112616   27907.63666963 102718.07139346], Variances = [1.43225941e+08 5.66989137e+07 1.14766997e+08], Weights = [0.39634327 0.39828661 0.20537013]
Iteration 21: Means = [ 54449.19957251  27832.942653   102697.35288238], Variances = [1.43598226e+08 5.59383746e+07 1.15094097e+08], Weights = [0.39866053 0.39575444 0.20558503]
Iteration 22: Means = [ 54340.9777837   27760.07324429 102684.1242833 ], Variances = [1.44242727e+08 5.52091608e+07 1.15311346e+08], Weights = [0.40109376 0.39318742 0.20571882]
Iteration 23: Means = [ 54233.81930906  27688.11475368 102676.61164784], Variances = [1.45096974e+08 5.44985936e+07 1.15443623e+08], Weights = [0.40363287 0.39057607 0.20579106]
Iteration 24: Means = [ 54126.84214625  27616.39869479 102673.40318017], Variances = [1.46113881e+08 5.37974293e+07 1.15511365e+08], Weights = [0.40626945 0.38791346 0.20581708]
Iteration 25: Means = [ 54019.38758535  27544.43475054 102673.40797862], Variances = [1.47258023e+08 5.30988878e+07 1.15530696e+08], Weights = [0.4089967  0.38519465 0.20580865]
Iteration 26: Means = [ 53910.97986155  27471.86476317 102675.7971639 ], Variances = [1.48502716e+08 5.23980030e+07 1.15514031e+08], Weights = [0.41180913 0.38241623 0.20577464]
Iteration 27: Means = [ 53801.28664961  27398.43018438 102679.94653325], Variances = [1.49827758e+08 5.16911678e+07 1.15470804e+08], Weights = [0.41470223 0.37957606 0.20572171]
Iteration 28: Means = [ 53690.08656776  27323.9488345  102685.38764638], Variances = [1.51217714e+08 5.09758095e+07 1.15408126e+08], Weights = [0.41767217 0.37667298 0.20565484]
Iteration 29: Means = [ 53577.24448182  27248.29838589 102691.76872239], Variances = [1.52660617e+08 5.02501586e+07 1.15331349e+08], Weights = [0.42071549 0.37370672 0.20557779]
Iteration 30: Means = [ 53462.69358155  27171.40480075 102698.82450087], Variances = [1.54147005e+08 4.95130840e+07 1.15244507e+08], Weights = [0.42382885 0.37067777 0.20549337]
Iteration 31: Means = [ 53346.42279649  27093.23444381 102706.35352643], Variances = [1.55669202e+08 4.87639805e+07 1.15150655e+08], Weights = [0.42700885 0.36758741 0.20540374]
Iteration 32: Means = [ 53228.46821754  27013.78891763 102714.2012791 ], Variances = [1.57220786e+08 4.80026918e+07 1.15052123e+08], Weights = [0.43025183 0.36443762 0.20531055]
Iteration 33: Means = [ 53108.90742592  26933.10189822 102722.24777976], Variances = [1.58796200e+08 4.72294619e+07 1.14950712e+08], Weights = [0.43355371 0.3612312  0.20521509]
Iteration 34: Means = [ 52987.85586767  26851.23741196 102730.39856459], Variances = [1.60390465e+08 4.64449051e+07 1.14847821e+08], Weights = [0.43690992 0.3579717  0.20511838]
Iteration 35: Means = [ 52865.46460683  26768.28910761 102738.57817055], Variances = [1.61998983e+08 4.56499904e+07 1.14744559e+08], Weights = [0.44031523 0.35466353 0.20502124]
Iteration 36: Means = [ 52741.91893835  26684.38015256 102746.72548034], Variances = [1.63617384e+08 4.48460323e+07 1.14641812e+08], Weights = [0.44376366 0.35131198 0.20492435]
Iteration 37: Means = [ 52617.43744885  26599.66342939 102754.79043686], Variances = [1.65241428e+08 4.40346873e+07 1.14540294e+08], Weights = [0.44724848 0.34792325 0.20482826]
Iteration 38: Means = [ 52492.27118887  26514.3217366  102762.73175942], Variances = [1.66866945e+08 4.32179485e+07 1.14440582e+08], Weights = [0.45076209 0.34450447 0.20473344]
Iteration 39: Means = [ 52366.7026742   26428.56771355 102770.51538409], Variances = [1.68489790e+08 4.23981370e+07 1.14343147e+08], Weights = [0.454296   0.34106372 0.20464027]
Iteration 40: Means = [ 52241.04447367  26342.64322232 102778.11341652], Variances = [1.70105826e+08 4.15778867e+07 1.14248362e+08], Weights = [0.4578409  0.33761002 0.20454908]
Iteration 41: Means = [ 52115.63717553  26256.81793654 102785.50343335], Variances = [1.71710918e+08 4.07601185e+07 1.14156521e+08], Weights = [0.46138659 0.33415327 0.20446014]
Iteration 42: Means = [ 51990.84655842  26171.38691622 102792.66800373], Variances = [1.73300941e+08 3.99480030e+07 1.14067842e+08], Weights = [0.46492214 0.33070421 0.20437364]
Iteration 43: Means = [ 51867.05983333  26086.66699565 102799.59432964], Variances = [1.74871793e+08 3.91449087e+07 1.13982476e+08], Weights = [0.46843594 0.32727429 0.20428977]
Iteration 44: Means = [ 51744.68087329  26002.99188211 102806.27392608], Variances = [1.76419413e+08 3.83543381e+07 1.13900510e+08], Weights = [0.47191586 0.32387552 0.20420863]
Iteration 45: Means = [ 51624.12441012  25920.70595904 102812.70228172], Variances = [1.77939806e+08 3.75798488e+07 1.13821974e+08], Weights = [0.47534939 0.32052032 0.20413029]
Iteration 46: Means = [ 51505.80925284  25840.15690418 102818.87846008], Variances = [1.79429072e+08 3.68249657e+07 1.13746847e+08], Weights = [0.47872392 0.31722129 0.20405479]
Iteration 47: Means = [ 51390.15066829  25761.68736333 102824.8046199 ], Variances = [1.80883429e+08 3.60930854e+07 1.13675063e+08], Weights = [0.48202691 0.31399095 0.20398214]
Iteration 48: Means = [ 51277.55215317  25685.62604864 102830.48545212], Variances = [1.82299251e+08 3.53873794e+07 1.13606518e+08], Weights = [0.48524618 0.31084153 0.2039123 ]
Iteration 49: Means = [ 51168.39691172  25612.27873998 102835.92754881], Variances = [1.83673101e+08 3.47107009e+07 1.13541085e+08], Weights = [0.48837016 0.30778462 0.20384522]
Iteration 50: Means = [ 51063.03942062  25541.91973775 102841.1387342 ], Variances = [1.85001771e+08 3.40655027e+07 1.13478615e+08], Weights = [0.49138817 0.30483099 0.20378084]
Iteration 51: Means = [ 50961.79750414  25474.78433051 102846.12739908], Variances = [1.86282320e+08 3.34537704e+07 1.13418957e+08], Weights = [0.49429063 0.30199028 0.20371909]
Iteration 52: Means = [ 50864.94534613  25411.06279017 102850.90188515], Variances = [1.87512119e+08 3.28769768e+07 1.13361959e+08], Weights = [0.49706926 0.29927083 0.20365991]
Iteration 53: Means = [ 50772.7078289   25350.89629375 102855.46996404], Variances = [1.88688892e+08 3.23360593e+07 1.13307480e+08], Weights = [0.4997173  0.29667948 0.20360322]
Iteration 54: Means = [ 50685.256511    25294.37500592 102859.83844799], Variances = [1.89810761e+08 3.18314221e+07 1.13255395e+08], Weights = [0.50222955 0.29422148 0.20354896]
Iteration 55: Means = [ 50602.7074461   25241.53836517 102864.01295528], Variances = [1.90876276e+08 3.13629612e+07 1.13205602e+08], Weights = [0.50460247 0.29190043 0.20349711]
Iteration 56: Means = [ 50525.12091434  25192.37742586 102867.99783667], Variances = [1.91884447e+08 3.09301083e+07 1.13158019e+08], Weights = [0.50683414 0.28971825 0.20344761]
Iteration 57: Means = [ 50452.5030031   25146.83894707 102871.79625192], Variances = [1.92834760e+08 3.05318903e+07 1.13112586e+08], Weights = [0.50892424 0.28767531 0.20340045]
Iteration 58: Means = [ 50384.8088512   25104.83080664 102875.4103703 ], Variances = [1.93727180e+08 3.01669976e+07 1.13069265e+08], Weights = [0.51087394 0.28577046 0.2033556 ]
Iteration 59: Means = [ 50321.94727306  25066.22826591 102878.84165853], Variances = [1.94562145e+08 2.98338569e+07 1.13028030e+08], Weights = [0.51268571 0.28400122 0.20331307]
Iteration 60: Means = [ 50263.78641607  25030.88061461 102882.09121508], Variances = [1.95340541e+08 2.95307032e+07 1.12988868e+08], Weights = [0.51436323 0.28236394 0.20327283]
[Figure: BTC price histogram with the fitted EM Gaussian mixture PDF overlaid]

Gaussian Mixture Model with EM Initialization

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Load and clean BTC data
df = pd.read_csv("datasets/BTC_USD_full_data.csv")
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Close'] = pd.to_numeric(df['Close'], errors='coerce')
df = df.dropna(subset=['Date', 'Close'])

prices = df['Close'].values.reshape(-1, 1)

# EM-refined parameters
means_init = np.array([[50263.78641607],
                       [25030.88061461],
                       [102882.09121508]])
variances_init = np.array([1.95340541e+08, 2.95307032e+07, 1.12988868e+08])
weights_init = np.array([0.51436323, 0.28236394, 0.20327283])

# Convert variances to covariance matrices (full covariance type)
covariances_init = variances_init.reshape(-1, 1, 1)  # 3x1x1 array

# Fit GMM using sklearn, initialized with EM parameters
gmm = GaussianMixture(
    n_components=3,
    covariance_type='full',
    weights_init=weights_init,
    means_init=means_init,
    precisions_init=np.array([1/cov for cov in covariances_init]),  # inverse of covariance
    random_state=42
)

gmm.fit(prices)

# Extract final fitted parameters
fitted_means = gmm.means_.flatten()
fitted_covariances = gmm.covariances_.flatten()
fitted_weights = gmm.weights_.flatten()

print("Fitted GMM Means:", fitted_means)
print("Fitted GMM Covariances:", fitted_covariances)
print("Fitted GMM Weights:", fitted_weights)

# Visualize the fitted GMM PDF
x = np.linspace(prices.min(), prices.max(), 1000)
pdf = np.exp(gmm.score_samples(x.reshape(-1,1)))

plt.figure(figsize=(12,6))
plt.hist(prices, bins=50, density=True, alpha=0.6, color='skyblue', label="BTC Prices")
plt.plot(x, pdf, 'r-', lw=2, label="GMM Fit (initialized by EM)")
plt.xlabel("BTC Closing Price (USD)")
plt.ylabel("Probability Density")
plt.title("BTC Price Distribution (GMM fitted with EM initialization)")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
Fitted GMM Means: [ 50160.87436025  24969.25655697 102888.04970422]
Fitted GMM Covariances: [1.96733192e+08 2.90067327e+07 1.12916730e+08]
Fitted GMM Weights: [0.51733489 0.27946591 0.2031992 ]
[Figure: BTC price histogram with the GMM PDF (initialized from the EM parameters) overlaid]

The GMM reveals that Bitcoin prices are not centered around a single average value but instead cluster into three distinct market regimes: a low-price zone (around 25k), a mid-price zone (around 50k), and a high-price zone (around 100k). The weights of the components show how much time BTC has historically spent in each regime, with the mid-price region being the most common. The model captures the multimodality and heavy tails of the data, representing Bitcoin's volatility far better than a single normal distribution would. These insights allow a deeper understanding of market structure, tail risk, and volatility behavior, and provide a strong foundation for regime-based forecasting.
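As a quick follow-up, the fitted mixture can also be used to label each day with its most likely regime. This is a minimal sketch that reuses the gmm, prices, and df objects from the cell above (the 0/1/2 regime labels simply follow the component order of the fit):

In [ ]:
# Hard assignment: the most likely mixture component (regime) for each day
regime = gmm.predict(prices)              # array of 0/1/2 labels
df['Regime'] = regime

# Soft assignment: probability of each regime for each day
probs = gmm.predict_proba(prices)         # shape (n_days, 3)

# Example: how the most recent day in the dataset is classified
print("Latest close:", prices[-1][0])
print("Most likely regime:", regime[-1])
print("Regime probabilities:", probs[-1].round(3))

# Fraction of days spent in each regime (should be close to the fitted weights)
print("Days per regime:", np.bincount(regime) / len(regime))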

References

  1. GeeksforGeeks. (n.d.). Expectation maximization (EM) algorithm. https://www.geeksforgeeks.org/expectation-maximization-em-algorithm/
  2. GeeksforGeeks. (n.d.). K-means clustering – introduction. https://www.geeksforgeeks.org/k-means-clustering-introduction/
  3. GeeksforGeeks. (n.d.). Probability distribution. https://www.geeksforgeeks.org/maths/probability-distribution/
  4. EM Algorithm in Machine Learning | Expectation-Maximization Tutorial [Video]. (2019, May 1). YouTube. https://www.youtube.com/watch?v=DIADjJXrgps