Machine Learning¶
What is my data, and what did I do in the fitting assignment?¶
My data is historical Bitcoin data, focused on the daily closing price, and the goal is to see whether I can fit a model to it and predict future prices.¶
The Bitcoin price moves up and down every day, and at the daily level the series looks noisy and chaotic.¶
To find a pattern, I look at five years of BTC closing prices. Gaussian smoothing removes the daily noise so we can focus on the long-term trend.¶
Let us begin with a simple model and then advance to more complex ones.¶
Stage 1: Basic Linear Regression¶
In [1]:
# Load the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter1d
from sklearn.linear_model import LinearRegression
df = pd.read_csv("datasets/BTC_USD_full_data.csv")
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Close'] = pd.to_numeric(df['Close'], errors='coerce')
df = df.dropna(subset=['Date', 'Close'])
# Integer day index (days since the first date) as the regression feature
df['DayIndex'] = (df['Date'] - df['Date'].min()).dt.days
# Gaussian smoothing: remove daily noise, keep the long-term trend
sigma = 20
df['Smooth'] = gaussian_filter1d(df['Close'], sigma=sigma)
X = df['DayIndex'].values.reshape(-1, 1)
y = df['Smooth'].values
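Before fitting anything, it helps to see what the smoothing actually does. The quick plot below is an added sketch, not part of the original run; it overlays the raw closing price on the Gaussian-smoothed series using the variables defined above.¶
In [ ]:
# Added sanity check: visualize how much daily noise sigma=20 removes.
plt.figure(figsize=(14,6))
plt.plot(df['Date'], df['Close'], label="Raw Close", color='lightgray')
plt.plot(df['Date'], df['Smooth'], label=f"Gaussian Smoothed (sigma={sigma})", color='red')
plt.title("Raw vs. Smoothed BTC Closing Price")
plt.legend()
plt.show()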
In [2]:
model = LinearRegression()
model.fit(X, y)
df['LR_Past'] = model.predict(X)
plt.figure(figsize=(14,6))
plt.plot(df['Date'], df['Smooth'], label="Smoothed", color='red')
plt.plot(df['Date'], df['LR_Past'], label="LR Fit", color='green')
plt.title("Baseline Linear Regression Fit")
plt.legend()
plt.show()
In [3]:
try:
    future_days = 30
    last_day = df['DayIndex'].max()
    future_index = np.arange(last_day + 1, last_day + future_days + 1).reshape(-1, 1)
    future_pred = model.predict(future_index)
    future_df = pd.DataFrame({
        "Date": pd.date_range(start=df['Date'].max(), periods=future_days + 1, closed='right'),
        "LR_Forecast": future_pred
    })
    plt.figure(figsize=(14,6))
    plt.plot(df["Date"], df["Smooth"], label="Smoothed", color='red')
    plt.plot(df["Date"], df["LR_Past"], label="LR Fit", color='green')
    plt.plot(future_df["Date"], future_df["LR_Forecast"], label="LR Forecast", color='blue')
    plt.title("Linear Regression Forecast (Next 30 Days)")
    plt.legend()
    plt.show()
    print("Stage 1 Successful: Linear Regression Forecast Generated.")
except Exception as e:
    print("Stage 1 Failed, moving to Stage 2...")
    print(e)
Stage 1 Failed, moving to Stage 2...
DatetimeArray._generate_range() got an unexpected keyword argument 'closed'
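Side note on the error above: pandas deprecated the closed= keyword of pd.date_range in 1.4 and removed it in 2.0, replacing it with inclusive=. A minimal fix, sketched below as an extra cell rather than part of the original run, is to start the range the day after the last historical date, which is the same approach the LSTM cell later uses.¶
In [ ]:
# Hypothetical fix for Stage 1 (not part of the original run): build the
# 30 forecast dates without the removed closed= keyword. future_pred and
# future_days were already computed before the failing line.
future_df = pd.DataFrame({
    "Date": pd.date_range(start=df['Date'].max() + pd.Timedelta(days=1), periods=future_days),
    "LR_Forecast": future_pred
})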
Stage 2: Linear Regression with a Train/Test Split¶
In [10]:
from sklearn.metrics import mean_squared_error
import math
# Split train/test
split = int(len(df) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
model2 = LinearRegression()
model2.fit(X_train, y_train)
y_pred_test = model2.predict(X_test)
rmse = math.sqrt(mean_squared_error(y_test, y_pred_test))
print("Stage 2 RMSE:", rmse)
plt.figure(figsize=(14,6))
plt.plot(df["Date"], y, label="True Smoothed Data", color='red' )
plt.plot(df["Date"].iloc[split:], y_pred_test, label="LR Test Predictions", color='orange')
plt.title("Stage 2: Linear Regression Test Performance")
plt.legend()
plt.show()
Stage 2 RMSE: 50023.91259288457
As the plot shows, the model does produce test predictions, but they are far from the original smoothed data (RMSE of roughly 50,000 USD).¶
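To put that RMSE in context, one quick check (an added sketch, not part of the original notebook) is a naive baseline that predicts the last training value for every test day; a useful model should at least beat this.¶
In [ ]:
# Added sanity check: naive "last value" baseline, reusing the Stage 2
# variables (y_train, y_test) and the same RMSE metric.
naive_pred = np.full_like(y_test, y_train[-1])
naive_rmse = math.sqrt(mean_squared_error(y_test, naive_pred))
print("Naive last-value RMSE:", naive_rmse)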
In [5]:
!pip install tensorflow
!pip install --upgrade protobuf
Requirement already satisfied: tensorflow in /opt/conda/lib/python3.13/site-packages (2.20.0)
Requirement already satisfied: protobuf in /opt/conda/lib/python3.13/site-packages (6.33.1)
(remaining "Requirement already satisfied" dependency output omitted)
LSTM¶
In [6]:
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
prices = df['Smooth'].values.reshape(-1,1)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)
# Create sequences (e.g., 60-day windows)
window = 60
X_lstm = []
y_lstm = []
for i in range(window, len(scaled)):
    X_lstm.append(scaled[i-window:i])
    y_lstm.append(scaled[i])
X_lstm = np.array(X_lstm)
y_lstm = np.array(y_lstm)
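A quick shape check (added for illustration) confirms what the windowing produces: each sample is a 60-day window of scaled prices, and each target is the following day's scaled price.¶
In [ ]:
# Added check: X_lstm stacks 60-day windows, y_lstm holds next-day values.
print("X_lstm shape:", X_lstm.shape)  # expected: (len(scaled) - 60, 60, 1)
print("y_lstm shape:", y_lstm.shape)  # expected: (len(scaled) - 60, 1)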
In [ ]:
model_lstm = Sequential([
    LSTM(50, return_sequences=True, input_shape=(X_lstm.shape[1], 1)),
    LSTM(50),
    Dense(1)
])
model_lstm.compile(optimizer="adam", loss="mse")
history = model_lstm.fit(X_lstm, y_lstm, epochs=20, batch_size=32, verbose=1)
Epoch 1/20
/opt/conda/lib/python3.13/site-packages/keras/src/layers/rnn/rnn.py:199: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(**kwargs)
55/55 ━━━━━━━━━━━━━━━━━━━━ 5s 49ms/step - loss: 0.0153
...
Epoch 12/20
15/55 ━━━━━━━━━━━━━━━━━━━━ 1s 48ms/step - loss: 4.7401e-05
(per-epoch progress output condensed; the log cuts off mid-run at epoch 12 of 20, with loss down from 1.53e-02 to about 4.7e-05)
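The UserWarning above is cosmetic, but the modern Keras idiom (assuming Keras 3.x, as installed above) is to declare the input with an explicit Input layer instead of passing input_shape to the first LSTM. The sketch below shows the warning-free definition only; it is an addition and is not trained or used here.¶
In [ ]:
# Added sketch: warning-free model definition using an explicit Input layer.
# Defined under a different name so it does not clobber the trained model.
from tensorflow.keras.layers import Input

model_lstm_alt = Sequential([
    Input(shape=(window, 1)),
    LSTM(50, return_sequences=True),
    LSTM(50),
    Dense(1)
])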
In [11]:
future_input = scaled[-window:]
future_predictions = []
# Recursive 30-day forecast: each prediction is appended to the input
# window and fed back into the model for the next step.
for _ in range(30):
    pred = model_lstm.predict(future_input.reshape(1, window, 1))[0]
    future_predictions.append(pred[0])
    future_input = np.append(future_input[1:], pred).reshape(window, 1)
future_prices = scaler.inverse_transform(np.array(future_predictions).reshape(-1,1))
future_dates = pd.date_range(start=df['Date'].max() + pd.Timedelta(days=1), periods=30)
plt.figure(figsize=(14,6))
plt.plot(df["Date"], df["Smooth"], label="Smoothed Historical", color='red')
plt.plot(future_dates, future_prices, label="LSTM Forecast", color='purple')
plt.title("LSTM Forecast (Next 30 Days)")
plt.legend()
plt.show()
print("Stage 3 Completed: LSTM Forecast Generated.")
Stage 3 Completed: LSTM Forecast Generated.
As seen in the graph, my code only predicts the next 30 days using the last window of historical data. The model never predicts on the test portion of the existing data, so the first 30 predictions start after the last historical date. That’s why the purple line is disconnected from the historical trend.¶
In [12]:
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Prepare data (note: the scaler is fit on the full series; fitting it on
# the training split only would avoid look-ahead leakage)
prices = df['Smooth'].values.reshape(-1,1)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)
# Train/test split: first 80% for training, last 20% for testing
split = int(len(scaled) * 0.8)
train_scaled = scaled[:split]
test_scaled = scaled[split:]
window = 60
# Create sequences for LSTM
def create_sequences(data, window):
    X = []
    y = []
    for i in range(window, len(data)):
        X.append(data[i-window:i])
        y.append(data[i])
    return np.array(X), np.array(y)
X_train, y_train = create_sequences(train_scaled, window)
X_test, y_test = create_sequences(test_scaled, window)
# Build & train LSTM
model_lstm = Sequential([
    LSTM(50, return_sequences=True, input_shape=(X_train.shape[1], 1)),
    LSTM(50),
    Dense(1)
])
model_lstm.compile(optimizer="adam", loss="mse")
history = model_lstm.fit(X_train, y_train, epochs=20, batch_size=32, verbose=1)
# Predict on test set
y_pred_scaled = model_lstm.predict(X_test, verbose=0)
y_pred = scaler.inverse_transform(y_pred_scaled)
y_test_actual = scaler.inverse_transform(y_test)
# Create dates for test set
test_dates = df['Date'].iloc[split + window:].reset_index(drop=True)
# Create table of actual vs predicted
results_df = pd.DataFrame({
    "Date": test_dates,
    "Actual": y_test_actual.flatten(),
    "Predicted": y_pred.flatten()
})
print(results_df.head())
# Plot test predictions vs historical trend
plt.figure(figsize=(14,6))
plt.plot(df['Date'], df['Smooth'], label="Historical Smoothed Data", color='red')
plt.plot(test_dates, y_pred, label="LSTM Predicted (Test)", color='purple')
plt.title("LSTM: Test Predictions vs Actual")
plt.xlabel("Date")
plt.ylabel("BTC Price (USD)")
plt.legend()
plt.show()
Epoch 1/20
/opt/conda/lib/python3.13/site-packages/keras/src/layers/rnn/rnn.py:199: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(**kwargs)
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 49ms/step - loss: 0.0109
...
Epoch 20/20
44/44 ━━━━━━━━━━━━━━━━━━━━ 2s 49ms/step - loss: 1.5910e-05
(per-epoch progress output condensed; training loss fell from 1.09e-02 to about 1.6e-05)
        Date        Actual     Predicted
0 2025-02-05  96705.526548  98733.085938
1 2025-02-06  96507.887013  98541.242188
2 2025-02-07  96300.370815  98336.484375
3 2025-02-08  96083.286930  98119.078125
4 2025-02-09  95857.011925  97889.390625
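To compare this model directly with Stage 2, the same RMSE metric can be applied to the LSTM's test predictions. The cell below is an added follow-up, reusing y_test_actual and y_pred from the cell above.¶
In [ ]:
# Added follow-up: score the LSTM test predictions with the Stage 2 metric
# so the two models can be compared on the same scale.
from sklearn.metrics import mean_squared_error
import math

lstm_rmse = math.sqrt(mean_squared_error(y_test_actual, y_pred))
print("LSTM test RMSE:", lstm_rmse)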
References¶
1. StatQuest with Josh Starmer. (2018). Linear regression, clearly explained!!! [Video]. YouTube.
https://www.youtube.com/watch?v=nk2CQITm_eo
2. GeeksforGeeks. (n.d.). Linear regression.
https://www.geeksforgeeks.org/linear-regression-python-implementation/
3. Steve Brunton. (2020). Radial basis functions (RBFs) [Video]. YouTube.
https://www.youtube.com/watch?v=Oq9xYw6PZ5Y
4. StatQuest with Josh Starmer. (2018). Long short-term memory (LSTM) networks [Video]. YouTube.
https://www.youtube.com/watch?v=8HyCNIVRbSU