FAB Futures - Data Science

Research > Final Project¶

Bayesian Optimization > Maximize Sharpe Ratio¶

ChatGPT Prompt: "Use Bayesian Optimization for Stock Analysis"

Response:

  • Bayesian Optimization (BO) is a method for finding the best parameter values for a model when...the model doesn't have a 'closed-form' solution; evaluations are expensive (e.g. backtesting a trading strategy); fewer, smarter experiments instead of brute-force grid/random search are desired.
  • BO is commonly used to tune...Trading Strategy Parameters (SMA/EMA window lengths, MACD variations, RSI thresholds, Stop-Loss and Take-Profit levels); Machine Learning Hyperparameters (XGBoost parameters, Neural Network architecture, GARCH model volatility parameters); Risk-Reward Optimization (Max Sharpe Ratio, Min Volatility, Max Return under constraint).
  • BO builds a probabilistic model (Gaussian Process or Tree Parzen Estimator) of the objective and chooses where to test next based on expected improvement.

Bayesian Optimization Procedure

  1. Test a few random points
  2. Model the objective (Sharpe Ratio, etc.)
  3. Predict where the best parameters might be
  4. Test only the most promising areas
  5. Converge on the optimal region with far fewer evaluations (a toy sketch of this loop follows)
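Before applying this to a real trading strategy below, here is a minimal toy sketch of that loop using the same bayes_opt API (the quadratic objective and its bounds are assumptions for illustration only, not part of the strategy in this notebook):

In [ ]:
from bayes_opt import BayesianOptimization

# Toy objective with a known maximum at x = 2, y = -1 (illustration only)
def toy_objective(x, y):
    return -(x - 2) ** 2 - (y + 1) ** 2

toy_opt = BayesianOptimization(
    f=toy_objective,
    pbounds={"x": (-5, 5), "y": (-5, 5)},  # search space for each parameter
    random_state=1,
)
toy_opt.maximize(init_points=3, n_iter=10)  # 3 random probes, then 10 BO steps
print(toy_opt.max)  # best target value and the parameters that produced it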

code example...

In [2]:
import sys
print(sys.version)
3.12.0 (tags/v3.12.0:0fb18b0, Oct  2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]
In [ ]:
pip install bayesian-optimization yfinance pandas numpy
In [4]:
# Import Libraries  
import numpy as np
import pandas as pd
import yfinance as yf
from bayes_opt import BayesianOptimization
In [5]:
# Download Stock Price
data = yf.download("NVDA", start="2020-01-01")
close = data["Close"]
C:\Users\senna\AppData\Local\Temp\ipykernel_52276\4142185428.py:2: FutureWarning: YF.download() has changed argument auto_adjust default to True
  data = yf.download("NVDA", start="2020-01-01")
[*********************100%***********************]  1 of 1 completed
In [6]:
# MACD Optimizer
def macd_strategy(fast, slow, signal):
    # Cast to int because the optimizer passes floats
    fast = int(fast)
    slow = int(slow)
    signal = int(signal)

    # Basic sanity: make sure fast < slow, otherwise skip
    if fast >= slow:
        return -1e6  # punish invalid combinations

    # Ensure we work with a 1D price Series (not a DataFrame)
    price = close.squeeze()  # if close is (n,1) DataFrame, this makes it a Series

    # Compute MACD using price
    ema_fast = price.ewm(span=fast, adjust=False).mean()
    ema_slow = price.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    macd_signal = macd.ewm(span=signal, adjust=False).mean()

    # Trading signal: long if macd > signal, short if macd < signal
    signal_array = np.where(macd > macd_signal, 1, -1)

    # 🔑 Flatten to 1D in case it's shape (n, 1)
    signal_array = np.asarray(signal_array).ravel()

    position = pd.Series(signal_array, index=price.index)

    # Next-day returns
    daily_returns = price.pct_change().shift(-1)

    # Strategy returns: position * next-day return
    strat_returns = (position * daily_returns).dropna()

    if len(strat_returns) < 2:
        return -1e6  # not enough data, punish

    mean_ret = strat_returns.mean()
    std_ret = strat_returns.std()

    if std_ret == 0 or np.isnan(std_ret):
        return -1e6  # avoid division by zero / NaN

    sharpe = mean_ret / std_ret

    # Ensure scalar float
    return float(sharpe)
In [7]:
# Set Bayesian Optimization Search Space  

pbounds = {
    'fast': (5, 20),
    'slow': (20, 60),
    'signal': (5, 20)
}
In [8]:
# Run Optimizer 

optimizer = BayesianOptimization(
    f=macd_strategy,
    pbounds=pbounds,
    random_state=42,
)

# sanity check
print(type(close), getattr(close, "shape", None))
print(macd_strategy(12, 26, 9))

# optimizer
optimizer.maximize(
    init_points=5,  # random exploration
    n_iter=25,      # number of BO steps
)
<class 'pandas.core.frame.DataFrame'> (1494, 1)
0.0015917149370049744
|   iter    |  target   |   fast    |   slow    |  signal   |
-------------------------------------------------------------
| 1         | -0.003622 | 10.618101 | 58.028572 | 15.979909 |
| 2         | -0.001716 | 13.979877 | 26.240745 | 7.3399178 |
| 3         | 0.0051059 | 5.8712541 | 54.647045 | 14.016725 |
| 4         | -0.012246 | 15.621088 | 20.823379 | 19.548647 |
| 5         | 0.0060529 | 17.486639 | 28.493564 | 7.7273745 |
| 6         | 0.0074873 | 14.177793 | 27.081731 | 9.9148254 |
| 7         | -0.005907 | 12.133381 | 40.615718 | 16.804650 |
| 8         | -0.004408 | 7.3045643 | 50.371981 | 13.944226 |
| 9         | -0.004786 | 8.0354559 | 23.621292 | 9.7255889 |
| 10        | 0.0034848 | 15.822249 | 27.928175 | 9.3486802 |
| 11        | 0.0012803 | 13.936310 | 27.480764 | 11.714182 |
| 12        | 0.0035931 | 15.089475 | 25.660809 | 10.245168 |
| 13        | 0.0008060 | 13.122292 | 28.672044 | 9.3465059 |
| 14        | 0.0007976 | 18.935903 | 27.327750 | 8.4579160 |
| 15        | -0.000840 | 17.359605 | 30.395785 | 7.5158661 |
| 16        | 0.0075519 | 13.041378 | 26.071989 | 10.094416 |
| 17        | -0.000480 | 17.419953 | 27.617790 | 6.0054842 |
| 18        | 0.0042859 | 12.893582 | 24.634095 | 11.119140 |
| 19        | -0.000795 | 11.372922 | 26.406567 | 10.739794 |
| 20        | -0.004228 | 13.336312 | 24.716421 | 9.3795091 |
| 21        | 0.0075519 | 13.775163 | 26.230229 | 10.751048 |
| 22        | 0.0028836 | 14.059227 | 25.105605 | 12.277775 |
| 23        | -.549e-06 | 5.0134866 | 56.162408 | 14.032980 |
| 24        | -0.002636 | 7.3132460 | 54.545115 | 13.250811 |
| 25        | 0.0050410 | 13.151439 | 27.033000 | 10.241574 |
| 26        | -0.001480 | 15.216094 | 27.060640 | 10.636084 |
| 27        | 0.0061376 | 13.886257 | 26.298971 | 9.7739464 |
| 28        | 0.0054361 | 12.959076 | 25.886019 | 11.197008 |
| 29        | 0.0028308 | 16.615078 | 27.678008 | 8.0534724 |
| 30        | -0.002424 | 17.453019 | 28.774511 | 9.0129120 |
=============================================================

According to ChatGPT, the table above contains the following information.

  • iter = iteration number
  • target = the Sharpe Ratio of the MACD strategy (higher value = better performance relative to volatility, i.e. a better risk-return profile)
  • fast = fast (short-term; classically 12) EMA window tested by the optimizer
  • slow = slow (long-term; classically 26) EMA window tested by the optimizer
  • signal = signal-line window (classically 9)
In [9]:
# Print Best Found Parameters 

print(optimizer.max)
{'target': np.float64(0.00755198127205693), 'params': {'fast': np.float64(13.041378353189243), 'slow': np.float64(26.0719891633693), 'signal': np.float64(10.094416661343582)}}

The code above prints the best Sharpe Ratio (target) found during the optimization run (first reached at iteration 16) and the 'fast', 'slow' and 'signal' parameters that generated it.

  • Rerun with annualized values > make the resulting Sharpe Values larger and easier to interpret
  • Now backtest and plot
  • Compare to traditional MACD
In [10]:
# MACD optimization with Annualized Sharpe Ratio

def macd_strategy(fast, slow, signal):
    # Cast to int because the optimizer passes floats
    fast = int(fast)
    slow = int(slow)
    signal = int(signal)

    # Basic sanity: make sure fast < slow, otherwise skip
    if fast >= slow:
        return -1e6  # punish invalid combinations

    # Ensure we work with a 1D price Series (not a DataFrame)
    price = close.squeeze()  # if close is (n,1) DataFrame, this makes it a Series

    # Compute MACD using price
    ema_fast = price.ewm(span=fast, adjust=False).mean()
    ema_slow = price.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    macd_signal = macd.ewm(span=signal, adjust=False).mean()

    # Trading signal: long if macd > signal, short if macd < signal
    signal_array = np.where(macd > macd_signal, 1, -1)
    signal_array = np.asarray(signal_array).ravel()  # flatten to 1D

    position = pd.Series(signal_array, index=price.index)

    # Next-day returns
    daily_returns = price.pct_change().shift(-1)

    # Strategy returns: position * next-day return
    strat_returns = (position * daily_returns).dropna()

    if len(strat_returns) < 2:
        return -1e6  # not enough data, punish

    mean_ret = strat_returns.mean()
    std_ret = strat_returns.std()

    if std_ret == 0 or np.isnan(std_ret):
        return -1e6  # avoid division by zero / NaN

    # Daily Sharpe
    sharpe_daily = mean_ret / std_ret

    # 🔑 Annualized Sharpe (using 252 trading days)
    sharpe_annual = sharpe_daily * np.sqrt(252)

    # Ensure scalar float
    return float(sharpe_annual)
In [11]:
# Set Bayesian Optimization Search Space  

pbounds = {
    'fast': (5, 20),
    'slow': (20, 60),
    'signal': (5, 20)
}
In [12]:
# Run Optimizer 

optimizer = BayesianOptimization(
    f=macd_strategy,
    pbounds=pbounds,
    random_state=42,
)

# sanity check
print(type(close), getattr(close, "shape", None))
print(macd_strategy(12, 26, 9))

# optimizer
optimizer.maximize(
    init_points=5,  # random exploration
    n_iter=25,      # number of BO steps
)
<class 'pandas.core.frame.DataFrame'> (1494, 1)
0.02526769128853202
|   iter    |  target   |   fast    |   slow    |  signal   |
-------------------------------------------------------------
| 1         | -0.057507 | 10.618101 | 58.028572 | 15.979909 |
| 2         | -0.027253 | 13.979877 | 26.240745 | 7.3399178 |
| 3         | 0.0810550 | 5.8712541 | 54.647045 | 14.016725 |
| 4         | -0.194408 | 15.621088 | 20.823379 | 19.548647 |
| 5         | 0.0960870 | 17.486639 | 28.493564 | 7.7273745 |
| 6         | 0.1188581 | 14.177793 | 27.081731 | 9.9148254 |
| 7         | -0.093776 | 12.133381 | 40.615718 | 16.804650 |
| 8         | -0.069988 | 7.3045643 | 50.371981 | 13.944226 |
| 9         | -0.075990 | 8.0354559 | 23.621292 | 9.7255889 |
| 10        | 0.0553204 | 15.822444 | 27.929174 | 9.3486489 |
| 11        | 0.0203252 | 13.946982 | 27.486160 | 11.713781 |
| 12        | 0.0570396 | 15.088161 | 25.657477 | 10.244422 |
| 13        | 0.0127950 | 13.109116 | 28.660156 | 9.3533526 |
| 14        | 0.0126629 | 18.933057 | 27.332209 | 8.4730437 |
| 15        | -0.013347 | 17.353964 | 30.392941 | 7.5172228 |
| 16        | 0.1198839 | 13.044368 | 26.070400 | 10.101693 |
| 17        | -0.007629 | 17.421949 | 27.608447 | 6.0130746 |
| 18        | 0.0680378 | 12.895039 | 24.641222 | 11.136283 |
| 19        | -0.012635 | 11.379503 | 26.402007 | 10.755725 |
| 20        | -0.067118 | 13.327823 | 24.706756 | 9.3870710 |
| 21        | 0.1198839 | 13.779659 | 26.228730 | 10.758302 |
| 22        | 0.0457760 | 14.076787 | 25.105925 | 12.291088 |
| 23        | 0.0416774 | 5.0       | 56.064969 | 13.477997 |
| 24        | -0.004194 | 5.2727304 | 54.956291 | 15.649019 |
| 25        | 0.0800235 | 13.135644 | 27.036207 | 10.239986 |
| 26        | -0.009862 | 6.6954698 | 54.777589 | 12.550407 |
| 27        | -0.023497 | 15.248773 | 27.058751 | 10.651457 |
| 28        | 0.0974315 | 13.886148 | 26.301891 | 9.7648224 |
| 29        | 0.0862956 | 12.964071 | 25.896737 | 11.223886 |
| 30        | 0.1532190 | 16.206568 | 28.284942 | 7.8577785 |
=============================================================
In [13]:
# Print Best Found Parameters 

print(optimizer.max)
{'target': np.float64(0.15321906859751228), 'params': {'fast': np.float64(16.206568886903206), 'slow': np.float64(28.284942912101357), 'signal': np.float64(7.857778507934423)}}
In [14]:
# Sanity Check
print("Annualized Sharpe for classic MACD (12, 26, 9):")
print(macd_strategy(12, 26, 9))
Annualized Sharpe for classic MACD (12, 26, 9):
0.02526769128853202

Interpreting the Outputs

Best Parameters (from optimizer.max above):

Generated annualized Sharpe Ratio ≈ 0.153
Optimized MACD fast parameter ≈ 16 days
Optimized MACD slow parameter ≈ 28 days
Optimized MACD signal parameter ≈ 8 days

Sanity Check > Classic MACD

Annualized Sharpe Ratio ≈ 0.025
MACD fast = 12 days
MACD slow = 26 days
MACD signal = 9 days

The optimized version is better, but both results are low, because...

  • a simple long/short daily strategy
  • no transaction costs (a simple cost adjustment is sketched after this list)
  • single stock over a limited time window
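The 'no transaction costs' point can be made concrete. Below is a minimal sketch of a cost adjustment (the 10 bps cost rate is a placeholder assumption, and `sharpe_with_costs` is a hypothetical helper, not something computed in this notebook); inside `macd_strategy`, the return value could then be `sharpe_with_costs(position, daily_returns)`.

In [ ]:
import numpy as np
import pandas as pd

def sharpe_with_costs(position: pd.Series, daily_returns: pd.Series,
                      cost_rate: float = 0.001) -> float:
    # Annualized Sharpe after a flat proportional cost per unit of turnover.
    # cost_rate = 0.001 (10 bps) is a placeholder assumption.
    turnover = position.diff().abs().fillna(0)   # |change in position|; a +1 -> -1 flip counts as 2
    net = (position * daily_returns - cost_rate * turnover).dropna()
    return float(net.mean() / net.std() * np.sqrt(252))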

Now, backtest & plot...

In [ ]:
! pip install matplotlib
In [17]:
import matplotlib.pyplot as plt

# define a helper backtest function  
def backtest_macd(price, fast, slow, signal, name="MACD Strategy"):
    # Ensure Series
    price = price.squeeze()

    # MACD components
    ema_fast = price.ewm(span=fast, adjust=False).mean()
    ema_slow = price.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    macd_signal = macd.ewm(span=signal, adjust=False).mean()

    # Position: long (1) if macd > signal, short (-1) otherwise
    signal_array = np.where(macd > macd_signal, 1, -1)
    signal_array = np.asarray(signal_array).ravel()
    position = pd.Series(signal_array, index=price.index)

    # Next-day returns
    daily_returns = price.pct_change().shift(-1)

    # Strategy returns
    strat_returns = (position * daily_returns).dropna()

    # Equity curve (starting from 1.0)
    equity = (1 + strat_returns).cumprod()

    # Pack into DataFrame
    result = pd.DataFrame({
        "Price": price.loc[equity.index],
        name: equity
    })

    return result
In [18]:
# run backtest with optimized MACD parameters

fast_opt   = int(round(14.177793420835693))
slow_opt   = int(round(27.081731781135897))
signal_opt = int(round(9.914825448824644))

print(f"Using optimized MACD({fast_opt}, {slow_opt}, {signal_opt})")

bt_opt = backtest_macd(close, fast_opt, slow_opt, signal_opt,
                       name=f"MACD_opt_{fast_opt}_{slow_opt}_{signal_opt}")
Using optimized MACD(14, 27, 10)
In [19]:
# plot equity curve  

plt.figure(figsize=(10, 5))
plt.plot(bt_opt.index, bt_opt.iloc[:, 1], label=bt_opt.columns[1])
plt.title("Equity Curve - Optimized MACD Strategy")
plt.xlabel("Date")
plt.ylabel("Equity (starting at 1.0)")
plt.legend()
plt.grid(True)
plt.show()
[Figure: Equity Curve - Optimized MACD Strategy]

Looks like just another price chart to me, but ChatGPT assures me that this is what it should look like. Its explanation...

Sharpe Ratio of 0.13 annualized is OK for a simple model

  • equity starts at 1
  • rises as high as 1.6
  • declines from 2024 onwards
  • by late 2025, the level is 0.55

Observations:

  • the strategy did well 2020-2023
  • Draws down early 2021
  • Recovery through 2022
  • Big surge 2023
  • Starting 2024, MACD performs poorly in sideways, choppy high-volatility, and reversal-heavy markets
  • In 2024-2025 NVDA had repeated price whipsaws. In sideways conditions, MACD flips long/short over and over...producing losses.
  • Hence the 2024 declines

Important Note!! >>> MACD is a trend indicator...not a predictive signal

Density Estimation¶

ChatGPT Prompt: "How can Density Estimation be used to help gain insights into Stock Price Predictions"

"Density estimation is a powerful statistical tool that can help you understand the underlying distribution of stock-related variables, which in turn can improve forecasting, risk evaluation, and strategy design. While density estimation does not predict prices directly, it provides insights that make prediction models more informed and robust"

"Density estimation is a method used to approximate the probability distribution of a random variable (daily returns, volatility, trading volume, price changes"

2 Common Forms:

  • Histogram (Bar Graph)
  • Kernel Density Estimation (Smooth Continuous Curve)

Helping Stock Prediction¶

  1. Help Understand the Distribution of Returns
    Stock returns are not normally distributed. They show:

    • fat tails (extreme moves more common than a normal distribution predicts)
    • skewness (price moves are not symmetric)
      A KDE curve reveals this clearly, helping quantify the risk of large swings.
  2. Estimate Probability of Future Price Movements
    A density estimate of returns can be used to compute... (a sketch follows this list)

    • Probability that tomorrow's return is > 1%
    • Probability of a loss > -2%
    • Probability of hitting Stop-Loss or Take-Profit levels > interesting!!
      Price paths can also be simulated using the estimated distribution, useful for...
    • option pricing
    • Monte Carlo simulations
    • Risk management
    • Algo-trading strategies
  3. Improve Volatility Models > interesting!!
    Stock volatility has a distribution of its own, which KDE can estimate:

    • Distribution of 20-day rolling volatility
    • Likelihood of high-volatility spikes
      A model can then adapt to periods of high or low volatility, improving prediction accuracy.
  4. Detect Regime Shifts
    KDE curves of returns change shape over time:

    • Calm Markets > narrow, tall KDE
    • Volatile Markets > wide, fat-tailed KDE
      Sudden changes in the estimated density can signal market regime changes.
  5. Feature Engineering for Machine Learning Models
    ML models often perform better when fed features extracted from a KDE rather than raw prices.
    Density-based features can improve predictive power:

    • Tail risk (probability mass in the extreme ends of the KDE)
    • skewness/kurtosis
    • peak distribution zones
    • density peaks near support/resistance levels
  6. Identify Anomalies & Outliers
    KDE helps find values that lie outside high-density regions:

    • Unusually large returns
    • Rare volume spikes
    • Excessive volatility
      Outlier detection is essential for cleaning data before training prediction models.
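Point 2 in the list above can be done directly with a fitted KDE. A minimal sketch, assuming `returns` is a series of daily log returns like the one computed a few cells below (the +1% and -2% thresholds are just the examples from the list):

In [ ]:
import numpy as np
from scipy.stats import gaussian_kde

# `returns`: daily log returns (computed later in this notebook)
kde = gaussian_kde(np.asarray(returns).ravel())

p_up_1pct   = kde.integrate_box_1d(0.01, np.inf)    # P(tomorrow's return > +1%)
p_down_2pct = kde.integrate_box_1d(-np.inf, -0.02)  # P(loss worse than -2%)

print(f"P(return > +1%): {p_up_1pct:.3f}")
print(f"P(return < -2%): {p_down_2pct:.3f}")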


In [ ]:
pip install numpy seaborn matplotlib
In [21]:
# import yahoo finance library
import yfinance as yf

nvda = yf.Ticker('NVDA') #specify stock ticker, assign to a variable
data = nvda.history(interval="1d", period="max") #pull historical data, specify interval and period
data #show data
Out[21]:
Open High Low Close Volume Dividends Stock Splits
Date
1999-01-22 00:00:00-05:00 0.040112 0.044767 0.035575 0.037605 2714688000 0.00 0.0
1999-01-25 00:00:00-05:00 0.040589 0.042021 0.037605 0.041545 510480000 0.00 0.0
1999-01-26 00:00:00-05:00 0.042021 0.042857 0.037724 0.038321 343200000 0.00 0.0
1999-01-27 00:00:00-05:00 0.038440 0.039395 0.036291 0.038202 244368000 0.00 0.0
1999-01-28 00:00:00-05:00 0.038202 0.038440 0.037843 0.038082 227520000 0.00 0.0
... ... ... ... ... ... ... ...
2025-12-04 00:00:00-05:00 181.619995 184.520004 179.960007 183.380005 167364900 0.01 0.0
2025-12-05 00:00:00-05:00 183.889999 184.660004 180.910004 182.410004 143971100 0.00 0.0
2025-12-08 00:00:00-05:00 182.639999 188.000000 182.399994 185.550003 204378100 0.00 0.0
2025-12-09 00:00:00-05:00 185.559998 185.720001 183.320007 184.970001 144719700 0.00 0.0
2025-12-10 00:00:00-05:00 184.970001 185.479996 182.039993 183.779999 161931100 0.00 0.0

6764 rows × 7 columns

In [22]:
# Example KDE of Daily Returns  

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

returns = np.log(data['Close'] / data['Close'].shift(1)).dropna()

sns.kdeplot(returns)
plt.title("Kernel Density Estimate of Daily Log Returns")
plt.show()
[Figure: Kernel Density Estimate of Daily Log Returns]

According to ChatGPT, this curve shows the probability shape of returns, which can then be used for simulation or probability estimation.

The curve shows how likely different daily returns for Nvidia have been historically.

Observations:

  1. Extremely tall, narrow peak at 0 return (Leptokurtic distribution)
    • NVDA usually moves very little per day
    • Distribution highly concentrated around zero
    • similar behavior to most large tech stocks > many calm days, a few very volatile ones
  2. The curve is NOT symmetric...it has heavier tails
    • tall and narrow peak...but the density spreads out much farther on both sides than a normal distribution would
    • Nvidia has a 'fat-tailed' distribution
    • ...more frequent large positive/negative moves than normal
    • ...sensitive to earnings, news, market crashes, hype cycles, etc.
  3. Left tail extends further than the right...downside risk exists
    • Big downside crashes have happened
    • daily drops have been larger (in magnitude) than its biggest daily gains
    • Nvidia has higher likelihood of sharp downward shocks
  4. Right tail still present...NVDA has had many large positive days
    • NVDA also has huge positive rallies
  5. The KDE confirms that NVDA's returns are NOT normally distributed (a quick numerical check follows this list)
    • Classic Gaussian assumptions fail, since the distribution is not normal
    • Fat Tails = a probability distribution that has more extreme outcomes than a normal Gaussian distribution would predict
    • Nvidia's tails extend to roughly -0.45 and +0.35...classic fat-tail behavior
    • Volatility clusters, regime shifts, sudden jumps...are more common for Nvidia
    • For stocks, fat tails mean extreme price moves are more common than a bell curve suggests
  6. Implication for Prediction & Modeling
    • Nvidia behaves like a two-regime asset
    • Regime 1 Calm Days: small daily moves, in stable market, most trading days
    • Regime 2 Volatile Days: earnings announcements, macro events, technology releases
    • GMM is very appropriate for Nvidia because it can capture these 2 regimes
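To put a number on observation 5, here is a quick check (assuming the `returns` Series from the KDE cell above; the 5% threshold is an arbitrary illustration): it compares how often |log return| exceeds 5% empirically with what a normal distribution of the same mean and standard deviation would predict.

In [ ]:
import numpy as np
from scipy.stats import norm

r = np.asarray(returns).ravel()
mu, sigma = r.mean(), r.std()
thresh = 0.05  # a 5% daily move (illustrative threshold)

empirical   = np.mean(np.abs(r) > thresh)
normal_pred = norm.sf(thresh, loc=mu, scale=sigma) + norm.cdf(-thresh, loc=mu, scale=sigma)

print(f"P(|log return| > 5%) empirical : {empirical:.4f}")
print(f"P(|log return| > 5%) normal fit: {normal_pred:.4f}")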

Definition: Skewness & Kurtosis

Skewness = asymmetry of a distribution (the direction of the tail). Zero skewness means perfect symmetry.
  • Negative skew (long tail on the left) = most days are small gains, but occasionally there are large losses.

Kurtosis = 'tailedness', i.e. how heavy or light the tails are compared to a normal distribution (the size of the tail). A normal distribution has kurtosis = 3 (excess kurtosis = 0). High kurtosis means 'fat tails' (leptokurtic): a tall peak and heavy tails.
  • Extreme moves happen more often than the normal distribution predicts.
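Both statistics can be computed directly for the return series above (a minimal sketch; note that scipy's `kurtosis` returns excess kurtosis by default, so a normal distribution would give roughly 0):

In [ ]:
import numpy as np
from scipy.stats import skew, kurtosis

r = np.asarray(returns).ravel()
print("Skewness:       ", skew(r))      # 0 for a perfectly symmetric distribution
print("Excess kurtosis:", kurtosis(r))  # 0 for a normal distribution; > 0 means fat tails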

Regime Inspection for Nvidia with Gaussian Mixture Model¶

ChatGPT had suggested that a GMM would be ideal for looking at the distribution probabilities of Nvidia's 2 distinct regimes, simplistically described as Calm Days and Volatile Days. Let's look at the code that ChatGPT offers to illustrate this thesis.

In [ ]:
pip install numpy pandas scikit-learn matplotlib --upgrade yfinance
In [105]:
import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

# Download NVDA daily data (adjust as you like)
ticker = "NVDA"
data = yf.download(ticker, start="2010-01-01")  # long history
data = data.dropna()
data
C:\Users\senna\AppData\Local\Temp\ipykernel_52276\939044389.py:9: FutureWarning: YF.download() has changed argument auto_adjust default to True
  data = yf.download(ticker, start="2010-01-01")  # long history
[*********************100%***********************]  1 of 1 completed
Out[105]:
Price Close High Low Open Volume
Ticker NVDA NVDA NVDA NVDA NVDA
Date
2010-01-04 0.423807 0.426786 0.415097 0.424265 800204000
2010-01-05 0.429995 0.434579 0.422202 0.422202 728648000
2010-01-06 0.432746 0.433663 0.425640 0.429766 649168000
2010-01-07 0.424265 0.432287 0.421056 0.430454 547792000
2010-01-08 0.425182 0.428162 0.418306 0.420827 478168000
... ... ... ... ... ...
2025-12-04 183.380005 184.520004 179.960007 181.619995 167364900
2025-12-05 182.410004 184.660004 180.910004 183.889999 143971100
2025-12-08 185.550003 188.000000 182.399994 182.639999 204378100
2025-12-09 184.970001 185.720001 183.320007 185.559998 144719700
2025-12-10 183.779999 185.479996 182.039993 184.970001 161931100

4010 rows × 5 columns

In [25]:
# Compute daily log returns
data['LogReturn'] = np.log(data['Close'] / data['Close'].shift(1))
returns = data['LogReturn'].dropna().values.reshape(-1, 1)  # shape (n_samples, 1)
In [26]:
# Fit 2-Regime GMM > Calm vs Volatile Periods

# sklearn fits the distribution of NVDA returns as a mixture of 2 Gaussian distros
# The 0/1 regime labels are arbitrary; from the fitted stds printed below,
# Regime 0 turns out to be the volatile component and Regime 1 the calm one

# Configuring the Model 
gmm2 = GaussianMixture(
    n_components=2, # 2 regimes
    covariance_type='full', # full 1D covariance matrix for each regime
    init_params='kmeans', # initialize cluster centers using k-means clustering
    n_init=10, # run EM 10 times from different initializations and keep the best fit
    random_state=42 # set random seed
)

# Fit the returns
gmm2.fit(returns)

# Regime labels and posterior probabilities
labels2 = gmm2.predict(returns)                  # hard cluster label (0 or 1) for each data point
probs2  = gmm2.predict_proba(returns)            # soft probabilities for each regime for each (daily) data point...they sum to 1

print("2-regime means:", gmm2.means_.ravel()) # print means...'ravel' a numpy method turns array into 1D (flattened) view
print("2-regime stds :", np.sqrt(gmm2.covariances_.ravel())) # print standard deviation
print("2-regime weights:", gmm2.weights_) # print weights
2-regime means: [0.00135982 0.00161422]
2-regime stds : [0.04124314 0.01595234]
2-regime weights: [0.39139666 0.60860334]

The fit command above does a lot, according to ChatGPT. The explanation is as follows:

  1. Runs K-means to find rough cluster centers
  2. Initializes 2 Gaussian distributions
  3. Runs the Expectation-Maximization (EM) algorithm:
    • E-step: computes probabilities each return belongs to each regime
    • M-step: updates means, variances, weights
  4. Repeats until convergence

After fit, the GMM now has...

  • means of each regime
  • variances of each regime
  • mixture weights
  • log-likelihood of the fit (these can be inspected directly, as shown below)
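These fitted quantities can be read straight off the sklearn object; a quick inspection sketch (all attributes and methods below are standard `GaussianMixture` ones):

In [ ]:
# Inspect the fitted 2-regime GMM
print("Converged:        ", gmm2.converged_)     # did EM converge?
print("EM iterations:    ", gmm2.n_iter_)        # iterations used by the best run
print("Log-likelihood LB:", gmm2.lower_bound_)   # lower bound reached by the best EM run
print("AIC:", gmm2.aic(returns), " BIC:", gmm2.bic(returns))  # model-selection scores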
In [27]:
# Visualize Mixture vs Empirical Distribution  

x = np.linspace(returns.min(), returns.max(), 1000).reshape(-1, 1)

# Mixture pdf for 2 regimes
log_prob_2 = gmm2.score_samples(x) # returns the log of the combined mixture probability distribution function...adding 2 Gaussians together  
pdf_2 = np.exp(log_prob_2) # converting to probability distribution function

# log-probabilities of each component 
log_probs_components = gmm2._estimate_log_prob(x)   # shape (1000, 2)
comp_pdfs = np.exp(log_probs_components) # converting to probability distribution function
weighted_pdfs = comp_pdfs * gmm2.weights_ # Weight each component by its regime weight


plt.figure(figsize=(10, 5))
plt.hist(returns, bins=80, density=True, alpha=0.4, label='Empirical Histogram')

# plot mixture (combined) PDF
plt.plot(x, pdf_2, color='orange', linewidth=2, label='GMM pdf') 
# plot the 2 regimes individually
plt.plot(x, weighted_pdfs[:, 0], 'r--', label='Regime 0 pdf')
plt.plot(x, weighted_pdfs[:, 1], 'g--', label='Regime 1 pdf')

plt.title("NVDA Daily Log Returns + 2-Regime GMM Fit")
plt.xlabel("Log return")
plt.ylabel("Density")
plt.legend()
plt.show()
[Figure: NVDA Daily Log Returns + 2-Regime GMM Fit]
In [28]:
# Visualize Regimes over Time  

# assumes you already have:
# data    -> DataFrame with 'Close'
# returns -> log returns (len = len(data) - 1)
# gmm2    -> fitted 2-regime GMM on `returns`
labels2 = gmm2.predict(returns)                # shape (len(returns),)
idx_ret = data.index[1:]                       # align with returns

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))

# faint full price line in background
# ax.plot(data.index, data['Close'], color='black', linewidth=1, label='Close (background)')

# color-coded points by regime
colors = ['tab:blue', 'tab:orange']
for k, c in zip(range(2), colors):
    mask = (labels2 == k)
    ax.scatter(idx_ret[mask],
               data['Close'].iloc[1:][mask],
               s=1,
               color=c,
               label=f"Regime {k}")

ax.set_title("NVDA Close Price Colored by 2-Regime GMM")
ax.set_ylabel("Price")
ax.legend()
plt.tight_layout()
plt.show()
[Figure: NVDA Close Price Colored by 2-Regime GMM]
In [29]:
probs2 = gmm2.predict_proba(returns)   # shape (n_samples, 2)

fig, ax1 = plt.subplots(figsize=(14, 6))

# ------------------------------
# 1. Regime Probability Background
# ------------------------------
regime_labels = [f"Regime {k}" for k in range(probs2.shape[1])]

ax1.stackplot(
    idx_ret,
    probs2.T,
    labels=regime_labels,
    alpha=0.6
)

ax1.set_ylabel("Regime Probability")
ax1.set_ylim(0, 1)
ax1.set_title("NVDA Price Overlaid on 2-Regime GMM Probabilities")
ax1.legend(loc="upper left")


# ------------------------------
# 2. Overlay NVDA Price (using second y-axis)
# ------------------------------
ax2 = ax1.twinx()     # share x-axis, new y-axis

ax2.plot(data.index, data['Close'], color='black', linewidth=1.2, label="NVDA Close Price")

ax2.set_ylabel("Price ($)")
ax2.legend(loc="upper right")


plt.tight_layout()
plt.show()
[Figure: NVDA Price Overlaid on 2-Regime GMM Probabilities]

3D visualization

In [30]:
# after fitting 2-regime GMM
# probs2 = gmm2.predict_proba(returns)        # shape (n_samples, 2)
# idx_ret = data.index[1:]                    # align with returns
# prices_ret = data['Close'].iloc[1:].values  # same length as probs2
In [31]:
# from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, needed for 3D
# import matplotlib.dates as mdates
In [32]:
#t_num = mdates.date2num(idx_ret)  # numeric time for 3D plotting
In [33]:
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401
import matplotlib.dates as mdates
import numpy as np
import matplotlib.pyplot as plt

# --- align everything to returns length ---
idx_ret = data.index[1:]                       # index for returns
prices_ret = data['Close'].iloc[1:].to_numpy() # shape (N,)
probs2 = gmm2.predict_proba(returns)           # shape (N, 2)

p_reg1 = probs2[:, 1]                          # shape (N,)

# convert dates to numeric for 3D
t_num = mdates.date2num(idx_ret.to_pydatetime())  # shape (N,)

# just to be safe, flatten all of them
t_num    = np.asarray(t_num).ravel()
prices_ret = np.asarray(prices_ret).ravel()
p_reg1   = np.asarray(p_reg1).ravel()

print("t_num shape     :", t_num.shape)
print("prices_ret shape:", prices_ret.shape)
print("p_reg1 shape    :", p_reg1.shape)
t_num shape     : (4009,)
prices_ret shape: (4009,)
p_reg1 shape    : (4009,)
In [34]:
#3d Scatterplot with Consistent Sizes 

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

sc = ax.scatter(
    t_num,
    prices_ret,
    p_reg1,
    c=p_reg1,          # color by probability
    cmap='viridis',
    s=5
)

ax.set_xlabel("Date")
ax.set_ylabel("Price ($)")
ax.set_zlabel("P(Regime 1)")
ax.set_title("NVDA – 3D View: Time, Price, Probability of Regime 1")

ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
fig.colorbar(sc, ax=ax, label='P(Regime 1)')

plt.tight_layout()
plt.show()
[Figure: NVDA – 3D View: Time, Price, Probability of Regime 1]

Option B

In [35]:
import numpy as np

n_samples, n_reg = probs2.shape

# meshgrid: time (x) × regime index (y)
T, R = np.meshgrid(t_num, np.arange(n_reg))   # T,R shape (n_reg, n_samples)

# Z needs same shape: each row = probs for one regime
Z = probs2.T                                  # shape (n_reg, n_samples)

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

surf = ax.plot_surface(
    T, R, Z,
    cmap='viridis',
    edgecolor='none',
    alpha=0.9
)

ax.set_xlabel("Date")
ax.set_ylabel("Regime index")
ax.set_zlabel("Probability")
ax.set_title("NVDA – Regime Probabilities in 3D (2-Regime GMM)")
ax.yaxis.set_ticks([0, 1])

ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
fig.colorbar(surf, ax=ax, label='Probability')

plt.tight_layout()
plt.show()
[Figure: NVDA – Regime Probabilities in 3D (2-Regime GMM)]

Multi-Dimensional Data Analysis¶

So far I have only looked at price return data. Neil suggests that I add 'more dimensions' to my analysis...market volatility, etc.

With too little time to get this work done, I go to ChatGPT for help.

"Go from 1D to multivariate: more features per day"

  • build a feature matrix 'x' with multiple columns
  • features could be...
    • return
    • rolling volatility
    • volume surprise
    • price position in range
    • lagged return
  • conceptually, clustering points in space

ChatGPT provides this example combining returns, rolling volatility, volume surprise, range position, and a lagged return. Each regime is defined by the joint behavior of these features.

In [ ]:
! pip install scikit-learn
In [ ]:
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
In [ ]:
# NOTE: this assumes `returns` is a pandas Series of log returns with a DatetimeIndex
# (the reshaped numpy array created for the earlier GMM fit would not work here)
X = pd.DataFrame(index=returns.index)

X["ret"] = returns
X["vol20"] = X["ret"].rolling(20).std() * np.sqrt(252)

X["vol_z"] = (
    (data["Volume"] - data["Volume"].rolling(20).mean())
    / data["Volume"].rolling(20).std()
)

X["pos_range20"] = (
    (data["Close"] - data["Close"].rolling(20).min())
    / (data["Close"].rolling(20).max() - data["Close"].rolling(20).min())
)

X["ret_lag1"] = X["ret"].shift(1)

X = X.dropna()
The Kernel crashed while executing code in the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure.
In [ ]:
# GMM fit on a 5-D feature space

gmm2 = GaussianMixture(
    n_components=2,
    covariance_type="full",
    init_params="kmeans", # using k-means to initialize component means
    n_init=10, # runs EM algo 10x with different k-means initializations...can be increased 
    random_state=42,
)

gmm2.fit(X)  # or X.values

labels2 = gmm2.predict(X)
probs2  = gmm2.predict_proba(X)
In [71]:
# Plot 2 Feature GMM > Return & 20-day Rolling Volatility

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))

scatter = ax.scatter(
    X["ret"],
    X["vol20"],
    c=labels2,
    cmap="viridis",  # consistent color map
    alpha=0.6,
    s=20
)

ax.set_xlabel("Daily return")
ax.set_ylabel("20-day annualized volatility")
ax.set_title("GMM regimes in (return, vol20) space")

# Legend with matching colors
unique_labels = sorted(set(labels2))
handles = []
for lab in unique_labels:
    handles.append(
        plt.Line2D(
            [], [], marker="o", linestyle="", markersize=10,
            label=f"Regime {lab}",
            color=scatter.cmap(scatter.norm(lab))
        )
    )

ax.legend(handles=handles)
plt.tight_layout()
plt.show()
[Figure: GMM regimes in (return, vol20) space]
In [75]:
# With decision boundary line between the 2 regimes

import numpy as np
import matplotlib.pyplot as plt

# Select the two features we’re plotting
x_feat = "ret"
y_feat = "vol20"

x = X[x_feat].values
y = X[y_feat].values

# Grid over feature space
x_min, x_max = x.min(), x.max()
y_min, y_max = y.min(), y.max()

xx, yy = np.meshgrid(
    np.linspace(x_min, x_max, 200),
    np.linspace(y_min, y_max, 200)
)

# Build grid points in the full feature space:
# Start from mean of all features, then vary only the two we care about.
X_mean = X.mean().values
grid = np.tile(X_mean, (xx.size, 1))

# Find column indices for x_feat and y_feat
cols = list(X.columns)
i_x = cols.index(x_feat)
i_y = cols.index(y_feat)

grid[:, i_x] = xx.ravel()
grid[:, i_y] = yy.ravel()

# Get probabilities from GMM
probs_grid = gmm2.predict_proba(grid)
# difference between regime 1 and regime 0
diff = probs_grid[:, 1] - probs_grid[:, 0]
diff = diff.reshape(xx.shape)

fig, ax = plt.subplots(figsize=(8, 6))

# Scatter
scatter = ax.scatter(x, y, c=labels2, cmap="viridis", alpha=0.6, s=20)

# Decision contour where p1 = p0 -> diff = 0
cs = ax.contour(xx, yy, diff, levels=[0], linewidths=2, colors='red')

ax.set_xlabel(x_feat)
ax.set_ylabel(y_feat)
ax.set_title("GMM regimes with decision boundary")

# Legend
unique_labels = sorted(set(labels2))
handles = []
for lab in unique_labels:
    handles.append(
        plt.Line2D(
            [], [], marker="o", linestyle="", markersize=10,
            label=f"Regime {lab}",
            color=scatter.cmap(scatter.norm(lab))
        )
    )
handles.append(
    plt.Line2D([], [], color="black", linestyle="-", label="Decision boundary")
)
ax.legend(handles=handles)

plt.tight_layout()
plt.show()
c:\Users\senna\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py:2691: UserWarning: X does not have valid feature names, but GaussianMixture was fitted with feature names
  warnings.warn(
[Figure: GMM regimes with decision boundary]
In [ ]:
# Gaussian Ellipses for each Component > show covariance shapes

from matplotlib.patches import Ellipse

def draw_ellipse(mean, cov, ax, edgecolor, n_std=2.0):
    # Eigen-decomposition for orientation
    vals, vecs = np.linalg.eigh(cov)
    order = vals.argsort()[::-1]
    vals, vecs = vals[order], vecs[:, order]  # reorder eigenvalues together with their (column) eigenvectors

    theta = np.degrees(np.arctan2(*vecs[:, 0][::-1]))
    width, height = 2 * n_std * np.sqrt(vals)

    ell = Ellipse(
        xy=mean,
        width=width,
        height=height,
        angle=theta,
        edgecolor=edgecolor,
        facecolor="none",
        linewidth=2
    )
    ax.add_patch(ell)

x_feat = "ret"
y_feat = "vol20"
cols = list(X.columns)
ix = cols.index(x_feat)
iy = cols.index(y_feat)

fig, ax = plt.subplots(figsize=(8, 6))

scatter = ax.scatter(
    X[x_feat], X[y_feat],
    c=labels2, cmap="viridis", alpha=0.6, s=20
)

# Draw ellipses per component
for k in range(gmm2.n_components):
    mean_k = gmm2.means_[k, [ix, iy]]
    cov_k  = gmm2.covariances_[k][np.ix_([ix, iy], [ix, iy])]
    color = scatter.cmap(scatter.norm(k))
    draw_ellipse(mean_k, cov_k, ax, edgecolor=color, n_std=2.0)

ax.set_xlabel(x_feat)
ax.set_ylabel(y_feat)
ax.set_title("GMM regimes with Gaussian ellipses")

# Legend
unique_labels = sorted(set(labels2))
handles = []
for lab in unique_labels:
    handles.append(
        plt.Line2D(
            [], [], marker="o", linestyle="", markersize=10,
            label=f"Regime {lab}",
            color=scatter.cmap(scatter.norm(lab))
        )
    )
ax.legend(handles=handles)

plt.tight_layout()
plt.show()
[Figure: GMM regimes with Gaussian ellipses]
In [78]:
# Marginal Distributions per Regime (Histograms)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for k in [0, 1]:
    mask = labels2 == k
    axes[0].hist(X.loc[mask, "ret"], bins=50, alpha=0.5, label=f"Regime {k}")
    axes[1].hist(X.loc[mask, "vol20"], bins=50, alpha=0.5, label=f"Regime {k}")

axes[0].set_title("Return distribution by regime")
axes[0].set_xlabel("ret")

axes[1].set_title("Vol20 distribution by regime")
axes[1].set_xlabel("vol20")

for ax in axes:
    ax.legend()

plt.tight_layout()
plt.show()
[Figure: Return and vol20 distributions by regime]
In [80]:
# Scatter colored by probability of regime

prob_regime1 = probs2[:, 1]

fig, ax = plt.subplots(figsize=(8, 6))

cmap = "viridis"

sc = ax.scatter(
    X["ret"],
    X["vol20"],
    c=prob_regime1,
    cmap=cmap,
    alpha=0.7,
    s=20
)

ax.set_xlabel("Daily return")
ax.set_ylabel("20-day annualized volatility")
ax.set_title("P(Regime 1) over (ret, vol20)")

cbar = plt.colorbar(sc, ax=ax)
cbar.set_label("P(Regime 1)")

plt.tight_layout()
plt.show()
[Figure: P(Regime 1) over (ret, vol20)]
In [ ]:
# Time Encoding: Colors by Date, Outline by Regime 
# Nice!

dates = X.index
t = np.linspace(0, 1, len(dates))  # normalized time

fig, ax = plt.subplots(figsize=(8, 6))

sc = ax.scatter(
    X["ret"], X["vol20"],
    c=t,
    cmap="plasma",
    alpha=0.7,
    s=20
)

# Outline by regime
for k in [0, 1]:
    mask = labels2 == k
    ax.scatter(
        X.loc[mask, "ret"],
        X.loc[mask, "vol20"],
        facecolors="none",
        edgecolors="white" if k == 0 else "black",
        linewidths=0.7,
        s=30,
        label=f"Regime {k} outline"
    )

ax.set_xlabel("Daily return")
ax.set_ylabel("20-day annualized volatility")
ax.set_title("Time-colored scatter with regime outlines")

cbar = plt.colorbar(sc, ax=ax)
cbar.set_label("Normalized time (old → new)")

ax.legend()
plt.tight_layout()
plt.show()
[Figure: Time-colored scatter with regime outlines]
In [85]:
# Pairwise Feature Scatterplot with GMM Regimes
# for sure for FP

from pandas.plotting import scatter_matrix

cols = ["ret", "vol20", "vol_z", "pos_range20", "ret_lag1"]

scatter_matrix(
    X[cols],
    figsize=(10, 10),
    diagonal="hist",
    c=labels2,
    alpha=0.5
)

plt.suptitle("Pairwise feature scatter with GMM regimes", y=1.02)
plt.show()
[Figure: Pairwise feature scatter with GMM regimes]

Interpreting the above Pair-Plot¶

  • shows every pair of features from the multivariate dataset > return, 20-day rolling volatility, volume Z-score, 20-day range position, lagged return
  • each diagonal cell is a histogram of a single feature
  • each off-diagonal cell is a scatter plot of Feature A vs Feature B
  • Purple is one regime, yellow is the other
  • Insight 1: Yellow points cluster where volatility is low > yellow is the calm, low-vol regime
  • Insight 2: Purple points spread much wider on vol20 > purple is the turbulent, high-vol regime
  • Insight 3: vol_z vs vol20 > yellow shows low volatility and low volume surprise; purple points spray outward...large volume shocks cluster in the high-vol regime > volume surprises correlate strongly with regime switching
  • Insight 4: pos_range20 = where price sits inside the recent 20-day high/low range...yellow hugs the full 0-1 range as price oscillates in normal channels; purple clusters near mid-range but with more vertical noise > the model found that range position is only weakly predictive
  • Insight 5: ret and ret_lag1 show that both regimes overlap heavily; daily returns alone rarely give clean clustering > the multivariate view shows that no single feature gives clean separation, but the combination does

Volatility is the dominant clustering driver
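That claim can be checked numerically by comparing each feature's mean and spread within the two regimes (a minimal sketch, assuming `X` and `labels2` from the cells above):

In [ ]:
# Per-regime mean and standard deviation of each feature
summary = X.groupby(labels2).agg(["mean", "std"])
print(summary.T)  # the features whose statistics differ most are the ones driving the split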

In [86]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, gaussian_kde

# ----------------------------------------------
# Inputs: X, labels2, gmm2
# X: DataFrame with your features
# labels2: regime labels (0/1)
# gmm2: fitted GaussianMixture model
# ----------------------------------------------

features = ["ret", "vol20", "vol_z", "pos_range20", "ret_lag1"]
cols = features

fig, axes = plt.subplots(len(cols), len(cols), figsize=(14, 14))

for i, feat_i in enumerate(cols):
    for j, feat_j in enumerate(cols):

        ax = axes[i, j]

        if i == j:
            # -----------------------------------------
            # DIAGONAL: HISTOGRAM + KDE + GMM DENSITIES
            # -----------------------------------------

            data_feat = X[feat_i].dropna()

            # Histogram
            ax.hist(
                data_feat,
                bins=40,
                density=True,
                alpha=0.4,
                color="lightblue"
            )

            # ---- KDE Line (SMOOTH NON-PARAMETRIC) ----
            xs = np.linspace(data_feat.min(), data_feat.max(), 400)
            kde = gaussian_kde(data_feat)
            ax.plot(xs, kde(xs), color="red", linewidth=2, label="KDE")

            # ---- GMM Component Densities ----
            comp_pdf_sum = np.zeros_like(xs)

            for k in range(gmm2.n_components):
                mean_k = gmm2.means_[k, i]
                std_k  = np.sqrt(gmm2.covariances_[k][i, i])
                weight_k = gmm2.weights_[k]

                # individual Gaussian component
                comp_pdf = weight_k * norm.pdf(xs, mean_k, std_k)
                comp_pdf_sum += comp_pdf

                # optional dashed individual curve
                ax.plot(xs, comp_pdf, linestyle="--", linewidth=1, color="gray")

            # ---- Mixture Density Curve (sum of components) ----
            ax.plot(xs, comp_pdf_sum, color="blue", linewidth=2, label="Mixture PDF")

            ax.set_title(f"{feat_i}")
            ax.set_yticks([])

        else:
            # -----------------------------------------
            # OFF-DIAGONAL: SCATTER PLOT
            # -----------------------------------------

            ax.scatter(
                X[feat_j],
                X[feat_i],
                c=labels2,
                cmap="viridis",
                alpha=0.5,
                s=8
            )

        # Label axes on edges
        if j == 0:
            ax.set_ylabel(feat_i)
        if i == len(cols) - 1:
            ax.set_xlabel(feat_j)

plt.suptitle("Pairwise Feature Matrix with KDE + GMM Component PDFs + Mixture Density", y=1.02)
plt.tight_layout()
plt.show()
[Figure: Pairwise Feature Matrix with KDE + GMM Component PDFs + Mixture Density]

List of Features

I asked ChatGPT "provide full name and definitions and relevance of each of these features ret, vol20, vol_z, pos_range20, ret_lag1"

  1. Daily Logarithmic Return = measures the percentage price change from one day to the next, expressed in log form
    • captures day-to-day price movement
    • good for detecting heavy-tailed behavior (fat tailed distributions in high vol regimes)
    • Useful input to clustering because return distribution shifts during shocks, events, and turbulence.
    • Low-vol regime → return values cluster narrowly near zero
    • High-vol regime → return distribution spreads out (large +/- swings)
  2. 20-Day Rolling Annualized Volatility = This is the classic measure of short-term volatility used in quant finance
    • Volatility is the single most important driver of regime segmentation in financial markets
    • High-vol and low-vol regimes are structurally different states
      • spreads widen
      • volume increases
      • returns behave differently
      • market microstructure shifts
    • This feature typically provides the clearest separation between regimes
  3. Volume Z-Score (20-Day) = Measures how unusual today’s volume is compared to the last 20 trading days
    • Trading volume spikes during:
      • breakouts
      • earnings
      • news shocks
      • panic selling
    • Volume is strongly correlated with volatility, but carries independent information about market participation and order flow imbalance
    • High-vol regime consistently shows elevated vol_z
    • Low-vol regime → volume stays near its mean (centered around 0)
    • Helps the model distinguish “quiet low-volatility days” vs. “active trending or panic days.”
  4. Price Position in 20-Day Range = This places the price on a 0 to 1 scale within its recent high–low range
    • 0 → price at the 20-day low
    • 1 → price at the 20-day high
    • 0.5 → middle of the recent range
    • Detects breakout vs. mean-reverting structure:
      • trending regime → prices often near upper range
      • fear/panic regime → price near lower range
      • sideways regime → price oscillates in the middle
    • Not as strong a separator as volatility or volume, but adds valuable directional context
    • High-vol spikes usually pull the price out of the range → shifts model likelihood
    • Low-vol periods often drift within a narrower band
  5. Lagged Daily Log Return (1 Day) = Just the previous day’s return
    • This adds time structure into the feature space:
      • Captures short-term momentum
      • Captures short-term mean reversion
      • Helps model transitions:
        • Large negative return → more likely entering high-vol regime
        • Consecutive small returns → more likely in calm regime
      • While returns alone are not great for clustering, lagged information improves regime persistence (like a pseudo-HMM effect)
      • If yesterday was shocking (big negative return), today’s probability of the high-vol regime typically increases
      • If yesterday was calm, model likelihood shifts toward low-vol
  6. Relative Strength Index (14-Day) = a momentum oscillator on a 0-100 scale comparing average gains to average losses over the last 14 days (values above ~70 suggest overbought, below ~30 oversold); computed in the code below


Adding RSI, MACD, Market Beta

In [103]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm, gaussian_kde
from sklearn.mixture import GaussianMixture
import yfinance as yf


# ============================================================
# 1. LOAD MARKET INDEX (SPY for Beta)
# ============================================================
market_df = yf.download("SPY", start=data.index.min(), end=data.index.max())
market_df = market_df.reindex(data.index)   # align to NVDA's index


# ============================================================
# 2. BASE FEATURES
# ============================================================

# Daily log returns (ensure 1-D Series)
ret = pd.Series(
    np.log(data["Close"] / data["Close"].shift(1)).values.flatten(),
    index=data.index,
    name="ret"
)

X = pd.DataFrame(index=data.index)
X["ret"] = ret

# 20-day annualized volatility
X["vol20"] = X["ret"].rolling(20).std() * np.sqrt(252)

# Volume Z-score (20-day)
vol_roll = data["Volume"].rolling(20)
X["vol_z"] = (data["Volume"] - vol_roll.mean()) / vol_roll.std()

# Position in 20-day high-low range
roll = data["Close"].rolling(20)
range20 = (roll.max() - roll.min()).replace(0, np.nan)
X["pos_range20"] = (data["Close"] - roll.min()) / range20

# Lagged return
X["ret_lag1"] = X["ret"].shift(1)


# ============================================================
# 3. NEW FEATURES: RSI, MACD Histogram, Market Beta
# ============================================================

# ----------- RSI(14) -----------
delta = data["Close"].diff()
gain = delta.clip(lower=0)
loss = -delta.clip(upper=0)

avg_gain = gain.rolling(14).mean()
avg_loss = loss.rolling(14).mean()

RS = avg_gain / avg_loss
X["rsi14"] = 100 - (100 / (1 + RS))


# ----------- MACD Histogram (12, 26, 9) -----------
ema12 = data["Close"].ewm(span=12, adjust=False).mean()
ema26 = data["Close"].ewm(span=26, adjust=False).mean()
MACD = ema12 - ema26
MACDsig = MACD.ewm(span=9, adjust=False).mean()
X["macdhist"] = MACD - MACDsig


# ----------- Market Beta (60-day Rolling) -----------

market_close = market_df["Close"]
if isinstance(market_close, pd.DataFrame):
    market_close = market_close.iloc[:, 0]

market_close = pd.to_numeric(market_close, errors="coerce")

# 1-D market returns and stock returns
market_ret = pd.Series(
    np.log(market_close / market_close.shift(1)).values,
    index=market_close.index,
    name="market_ret"
)

stock_ret = ret  # already a 1-D Series

window = 60

# Rolling covariance and variance (both Series)
cov_series = stock_ret.rolling(window).cov(market_ret)
var_series = market_ret.rolling(window).var()

beta60 = cov_series / var_series

X["beta60"] = beta60



# ============================================================
# 4. CLEAN DATA & FIT GMM (2 regimes)
# ============================================================

X = X.dropna()

gmm2 = GaussianMixture(
    n_components=2,
    covariance_type="full",
    init_params="kmeans",
    n_init=10,
    random_state=42,
)

gmm2.fit(X)
labels2 = gmm2.predict(X)
probs2 = gmm2.predict_proba(X)


# ============================================================
# 5. PAIRPLOT WITH KDE + GMM PDFs
# ============================================================

features = ["ret", "vol20", "vol_z", "pos_range20",
            "ret_lag1", "rsi14", "macdhist", "beta60"]

fig, axes = plt.subplots(len(features), len(features), figsize=(22, 22))

for i, feat_i in enumerate(features):
    for j, feat_j in enumerate(features):

        ax = axes[i, j]

        if i == j:
            # ---------------------------------------------------
            # DIAGONAL: HISTOGRAM + KDE + MIXTURE PDF
            # ---------------------------------------------------
            data_feat = X[feat_i]

            ax.hist(
                data_feat,
                bins=40,
                density=True,
                alpha=0.35,
                color="lightblue"
            )

            xs = np.linspace(data_feat.min(), data_feat.max(), 400)
            kde = gaussian_kde(data_feat)
            ax.plot(xs, kde(xs), color="red", linewidth=2)

            # GMM components + mixture
            mix_pdf = np.zeros_like(xs)
            for k in range(gmm2.n_components):
                mean_k = gmm2.means_[k, i]
                std_k = np.sqrt(gmm2.covariances_[k][i, i])
                weight_k = gmm2.weights_[k]

                comp_pdf = weight_k * norm.pdf(xs, mean_k, std_k)
                mix_pdf += comp_pdf

                ax.plot(xs, comp_pdf, "--", color="gray", linewidth=1)

            ax.plot(xs, mix_pdf, color="blue", linewidth=2)
            ax.set_title(feat_i)
            ax.set_yticks([])

        else:
            # ---------------------------------------------------
            # OFF-DIAGONALS: scatter colored by regime
            # ---------------------------------------------------
            ax.scatter(
                X[feat_j],
                X[feat_i],
                c=labels2,
                cmap="viridis",
                alpha=0.4,
                s=12
            )

        if j == 0:
            ax.set_ylabel(feat_i)
        if i == len(features) - 1:
            ax.set_xlabel(feat_j)

plt.suptitle(
    "Pairwise Feature Matrix with KDE + GMM Component PDFs (8-D Regime Space)",
    y=1.02
)
plt.tight_layout()
plt.show()
C:\Users\senna\AppData\Local\Temp\ipykernel_52276\78685735.py:12: FutureWarning: YF.download() has changed argument auto_adjust default to True
  market_df = yf.download("SPY", start=data.index.min(), end=data.index.max())
[*********************100%***********************]  1 of 1 completed
[Figure: Pairwise Feature Matrix with KDE + GMM Component PDFs (8-D Regime Space)]
In [57]:
from sklearn.decomposition import PCA

# If you used scaling:
# X_input = X_scaled   # numpy array
# If you did NOT use scaling:
X_input = X.values     # use raw features

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_input)

print("Explained variance ratios:", pca.explained_variance_ratio_)

fig, ax = plt.subplots(figsize=(8, 6))

scatter = ax.scatter(
    X_pca[:, 0],
    X_pca[:, 1],
    c=labels2,
    alpha=0.6
)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("GMM regimes in PCA space")

handles = []
for k in sorted(set(labels2)):
    handles.append(
        plt.Line2D([], [], marker="o", linestyle="", label=f"Regime {k}")
    )
ax.legend(handles=handles)

plt.show()
Explained variance ratios: [0.87251742 0.09891611]
[Figure: GMM regimes in PCA space]
In [58]:
regime_series = pd.Series(labels2, index=X.index, name="regime")
prob_regime1 = pd.Series(probs2[:, 1], index=X.index, name="prob_regime_1")

fig, ax = plt.subplots(2, 1, figsize=(12, 8), sharex=True)

# Price or close
ax[0].plot(data.loc[X.index, "Close"])
ax[0].set_ylabel("NVDA Close")
ax[0].set_title("NVDA price with GMM regime probability")

# Probability of regime 1
ax[1].plot(prob_regime1)
ax[1].set_ylabel("P(Regime 1)")
ax[1].set_xlabel("Date")

plt.show()
[Figure: NVDA price with GMM regime probability]
In [59]:
df_regime = X.copy()
df_regime["regime"] = labels2

for k in sorted(df_regime["regime"].unique()):
    print(f"\n=== Regime {k} ===")
    print(df_regime[df_regime["regime"] == k].describe()[["ret", "vol20", "vol_z", "pos_range20"]])
=== Regime 0 ===
               ret        vol20        vol_z  pos_range20
count  1445.000000  1445.000000  1445.000000  1445.000000
mean      0.000543     0.563806     0.678261     0.529580
std       0.042882     0.210477     1.281086     0.373168
min      -0.207711     0.129971    -2.847612     0.000000
25%      -0.028635     0.404649    -0.344320     0.173616
50%      -0.000918     0.563352     0.499215     0.533719
75%       0.030032     0.676657     1.550514     0.918644
max       0.260876     1.396895     4.072415     1.000000

=== Regime 1 ===
               ret        vol20        vol_z  pos_range20
count  2545.000000  2545.000000  2545.000000  2545.000000
mean      0.002121     0.330991    -0.413406     0.631074
std       0.015511     0.103084     0.617095     0.338578
min      -0.046106     0.088809    -2.602145     0.000000
25%      -0.008054     0.256820    -0.820109     0.341045
50%       0.002107     0.319964    -0.485011     0.726854
75%       0.012397     0.390361    -0.029610     0.946544
max       0.048913     0.651198     1.430103     1.000000
In [101]:
print(type(ret))
print(ret.shape)
<class 'pandas.core.frame.DataFrame'>
(4010, 1)

The Strategy¶

Nvidia's data is primarily time-series data. As such, I am postulating that meaningful insight will come from searching for changes in one or more variables or dimensions of the stock's data over time.

Until now, I have conducted analysis on one dimension (price returns) and over the longest timeframe (from IPO to the present). For the Final Presentation, I should:

  • Conduct analysis on a consecutive series of shorter timeframes
    • write a program to automate the analysis of one short timeframe after another (a rough sketch follows this list)
    • ...collecting the output data from each run
    • ...plotting a visualization for each run
  • Look at more variables and see if they might have some relationship
  • Present the final series of outputs as something dynamic...a moving picture or sound file...over time
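As referenced in the first bullet above, here is a rough sketch of what that automation loop could look like (the window and step sizes are placeholder assumptions, and `returns` is the daily log-return series computed earlier):

In [ ]:
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

r = np.asarray(returns).ravel()
window, step = 252, 21  # ~1 trading year per window, stepped ~1 month (placeholders)

rows = []
for start in range(0, len(r) - window + 1, step):
    chunk = r[start:start + window].reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, n_init=5, random_state=42).fit(chunk)
    stds = np.sqrt(gmm.covariances_.ravel())
    rows.append({
        "start_idx": start,                             # position of the window in the series
        "calm_weight": gmm.weights_[np.argmin(stds)],   # weight of the lower-volatility regime
        "volatile_std": stds.max(),                     # daily std of the higher-volatility regime
    })

window_stats = pd.DataFrame(rows)  # one row per window, ready to plot or animate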

Data Science Procedure

  • data selection and preparation
  • data analysis
  • data presentation (visualization, sonification)

Idea

  • For sure, I think I want to show Nvidia's analysis results dynamically
  • Let's say it is an animation...color and speed could become means of expressing data dimensions...frequency (dark vs light colors), amplitude (faster or slower)

Insight Expression > Visualization/Sonification¶

Traditional visualization is primarily 2D, or when it is 3D...most often static. What if movement over time can also reveal insights? A heartbeat monitor usually shows a steady pulse rhythm, which is not so interesting. But when the rhythm accelerates or decelerates, or when the pulse is larger or smaller than normal...we pay attention. Something unusual has occurred and we need to have a closer look.

Pure Data Sonification of Images¶

  • Image to Sound in Pure Data
  • Sonification in Pure Data
  • Sonification Algorithm
  • From Data to Music

Blender for Data Science Visualization¶

https://www.youtube.com/watch?v=wEj7cfwL6RY