Week 4: Machine Learning
Goal: Fit a machine learning model to our data
Once again I wasn't sure where to start, so I asked Copilot to assist me:
Fit a machine learning model to chatgpt1.csv
It suggested predicting the number of likes of a post (LikeCount) based on other numerical attributes of the dataset, which seemed like a good place to start. Here's the breakdown of the steps it gave me:
- Load and explore the data:
    - Read the CSV.
    - Inspect the structure and handle missing values.
- Preprocess:
    - Select numerical features (e.g., ReplyCount, RetweetCount, QuoteCount, hastag_counts, CountLinks).
    - Handle any missing values (e.g., fill with 0 or mean).
    - Split into training and testing sets.
- Train the model:
    - Fit a linear regression model.
    - Evaluate with metrics like mean squared error (MSE).
- Visualize and interpret:
    - Plot predictions vs. actuals.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
# Step 1: Load the data
df = pd.read_csv('/Users/srgupta/Desktop/Academany/Data_Science/sahil-gupta/datasets/chatgpt1.csv')
# Quick exploration
print(df.head())
print(df.info())
print(df.describe())
# Step 2: Preprocess
# Select numerical features for prediction
features = ['ReplyCount', 'RetweetCount', 'QuoteCount', 'hastag_counts', 'CountLinks']
target = 'LikeCount'
# Convert to numeric, coercing errors to NaN, then fill with 0
df[features] = df[features].apply(pd.to_numeric, errors='coerce').fillna(0)
df[target] = pd.to_numeric(df[target], errors='coerce').fillna(0)
# Split data
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
# Step 4: Visualize
plt.scatter(y_test, y_pred)
plt.xlabel('Actual LikeCount')
plt.ylabel('Predicted LikeCount')
plt.title('Actual vs Predicted Likes')
plt.show()
# Optional: Print coefficients
print('Coefficients:', model.coef_)
Datetime Tweet Id \
0 2023-01-22 13:44:34+00:00 1617156270871699456
1 2023-01-22 13:44:39+00:00 1617156291046133761
2 2023-01-22 13:44:44+00:00 1617156308926349312
3 2023-01-22 13:44:49+00:00 1617156332297256961
4 2023-01-22 13:44:52+00:00 1617156345064570880
Text Username \
0 ChatGPTで遊ぶの忘れてた!!\n書類作るコード書いてみてほしいのと、\nどこまで思考整... mochico0123
1 @AlexandrovnaIng Prohibition of ChatGPT has be... Caput_LupinumSG
2 Schaut Euch an, was @fobizz @DianaKnodel alles... ciffi
3 Bow down to chatGPT 🫡..... https://t.co/ENTSzi... Vishwasrisiri
4 Profilinde vatan, Türkiye falan yazan bireyler... 0xGenetikciniz
Permalink \
0 https://twitter.com/mochico0123/status/1617156...
1 https://twitter.com/Caput_LupinumSG/status/161...
2 https://twitter.com/ciffi/status/1617156308926...
3 https://twitter.com/Vishwasrisiri/status/16171...
4 https://twitter.com/0xGenetikciniz/status/1617...
User \
0 https://twitter.com/mochico0123
1 https://twitter.com/Caput_LupinumSG
2 https://twitter.com/ciffi
3 https://twitter.com/Vishwasrisiri
4 https://twitter.com/0xGenetikciniz
Outlinks \
0 NaN
1 NaN
2 ['https://us02web.zoom.us/webinar/register/801...
3 ['https://twitter.com/agadmator/status/1617155...
4 NaN
CountLinks ReplyCount \
0 NaN 1
1 NaN 1
2 ['https://t.co/DsoeVJrPBp', 'https://t.co/HflT... 0
3 ['https://t.co/ENTSzi2AQ9'] 0
4 NaN 0
RetweetCount LikeCount QuoteCount ConversationId Language \
0 0 5 0 1617156270871699456 ja
1 0 5 0 1617148639993806848 en
2 0 4 0 1617156308926349312 de
3 0 2 0 1617156332297256961 en
4 0 4 0 1617156345064570880 tr
Source \
0 <a href="http://twitter.com/download/iphone" r...
1 <a href="http://twitter.com/download/iphone" r...
2 <a href="http://twitter.com/#!/download/ipad" ...
3 <a href="http://twitter.com/download/android" ...
4 <a href="http://twitter.com/download/iphone" r...
Media \
0 NaN
1 NaN
2 [Photo(previewUrl='https://pbs.twimg.com/media...
3 NaN
4 NaN
QuotedTweet \
0 NaN
1 NaN
2 https://twitter.com/DianaKnodel/status/1617153...
3 https://twitter.com/agadmator/status/161715501...
4 NaN
MentionedUsers hashtag \
0 NaN []
1 [User(username='AlexandrovnaIng', id=282705900... []
2 [User(username='fobizz', id=884708145792253952... ['#ChatGPT']
3 NaN []
4 NaN []
hastag_counts
0 0
1 0
2 1
3 0
4 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50001 entries, 0 to 50000
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Datetime 50001 non-null object
1 Tweet Id 50001 non-null int64
2 Text 50001 non-null object
3 Username 50001 non-null object
4 Permalink 50001 non-null object
5 User 50001 non-null object
6 Outlinks 19942 non-null object
7 CountLinks 19942 non-null object
8 ReplyCount 50001 non-null int64
9 RetweetCount 50001 non-null int64
10 LikeCount 50001 non-null int64
11 QuoteCount 50001 non-null int64
12 ConversationId 50001 non-null int64
13 Language 50001 non-null object
14 Source 50001 non-null object
15 Media 9502 non-null object
16 QuotedTweet 3563 non-null object
17 MentionedUsers 17169 non-null object
18 hashtag 50001 non-null object
19 hastag_counts 50001 non-null int64
dtypes: int64(7), object(13)
memory usage: 7.6+ MB
None
Tweet Id ReplyCount RetweetCount LikeCount QuoteCount \
count 5.000100e+04 50001.000000 50001.000000 50001.000000 50001.000000
mean 1.617493e+18 0.929141 1.498510 9.696326 0.219536
std 1.725682e+14 23.251710 46.030058 313.524215 10.356329
min 1.617156e+18 0.000000 0.000000 0.000000 0.000000
25% 1.617354e+18 0.000000 0.000000 0.000000 0.000000
50% 1.617525e+18 0.000000 0.000000 0.000000 0.000000
75% 1.617625e+18 1.000000 0.000000 2.000000 0.000000
max 1.617779e+18 3098.000000 6815.000000 56073.000000 1947.000000
ConversationId hastag_counts
count 5.000100e+04 50001.000000
mean 1.617205e+18 0.783304
std 1.005075e+16 1.975040
min 6.493609e+17 0.000000
25% 1.617302e+18 0.000000
50% 1.617504e+18 0.000000
75% 1.617607e+18 1.000000
max 1.617779e+18 28.000000
Mean Squared Error: 47256.230926211516
/Users/srgupta/Desktop/Academany/Data_Science/.venv/lib/python3.10/site-packages/sklearn/linear_model/_base.py:280: RuntimeWarning: divide by zero encountered in matmul
  return X @ coef_ + self.intercept_
/Users/srgupta/Desktop/Academany/Data_Science/.venv/lib/python3.10/site-packages/sklearn/linear_model/_base.py:280: RuntimeWarning: overflow encountered in matmul
  return X @ coef_ + self.intercept_
/Users/srgupta/Desktop/Academany/Data_Science/.venv/lib/python3.10/site-packages/sklearn/linear_model/_base.py:280: RuntimeWarning: invalid value encountered in matmul
  return X @ coef_ + self.intercept_
Coefficients: [-0.13356273 4.96697982 0.44483169 -0.68156968 0. ]
The model gives a Mean Squared Error of about 47,256, which is quite high for a target whose median is 0 and whose mean is about 9.7 likes. This makes sense given the skewness of the data we observed in Week 2.
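MSE is measured in squared likes, which makes it hard to judge on its own. A minimal sketch of one way to read it: take the square root to get the root mean squared error (RMSE) in plain likes, then compare it with the spread of LikeCount. The numbers are copied from the output above. The variable names are only illustrative, and the standard deviation comes from `df.describe()` on the full dataset, so it only approximates the spread of the 20% test split.

```python
import math

# Values copied from the notebook output above
mse_linear = 47256.230926211516  # test-set MSE of the linear regression
like_mean = 9.696326             # LikeCount mean from df.describe()
like_std = 313.524215            # LikeCount std from df.describe() (full dataset)

# RMSE is in the same unit as the target (likes)
rmse_linear = math.sqrt(mse_linear)

# A ratio near 1 would mean the model is barely better than always
# predicting the mean; this is a rough comparison, not a formal test
rmse_to_std = rmse_linear / like_std

print(f"RMSE: {rmse_linear:.1f} likes")
print(f"Mean LikeCount: {like_mean:.1f} likes")
print(f"RMSE / std of LikeCount: {rmse_to_std:.2f}")
```

Read this way, the typical error is roughly 217 likes, many times the mean LikeCount but still below its standard deviation, so the linear model does capture some of the variation.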
Copilot suggests trying a different model, such as RandomForestRegressor:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor # Updated import
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
# Step 1: Load the data
df = pd.read_csv('/Users/srgupta/Desktop/Academany/Data_Science/sahil-gupta/datasets/chatgpt1.csv')
# Quick exploration (optional, comment out if already done)
# print(df.head())
# print(df.info())
# print(df.describe())
# Step 2: Preprocess
# Select numerical features for prediction
features = ['ReplyCount', 'RetweetCount', 'QuoteCount', 'hastag_counts', 'CountLinks']
target = 'LikeCount'
# Convert to numeric, coercing errors to NaN, then fill with 0
df[features] = df[features].apply(pd.to_numeric, errors='coerce').fillna(0)
df[target] = pd.to_numeric(df[target], errors='coerce').fillna(0)
# Split data
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42) # 100 trees by default; adjust as needed
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
# Step 4: Visualize
plt.scatter(y_test, y_pred)
plt.xlabel('Actual LikeCount')
plt.ylabel('Predicted LikeCount')
plt.title('Actual vs Predicted Likes (Random Forest)')
plt.show()
# Optional: Feature importances
print('Feature Importances:', model.feature_importances_)
Mean Squared Error: 191698.90285443014
Feature Importances: [4.24266383e-02 7.64640761e-01 1.92260679e-01 6.71921221e-04 0.00000000e+00]
The mean squared error increased by a lot: from about 47,256 with linear regression to about 191,699 with the random forest, roughly four times higher.
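One possible reason the MSE swings so much between models is that it squares every error, so a few viral tweets (LikeCount goes up to 56,073) can dominate the score. The sketch below uses made-up toy numbers, not values from chatgpt1.csv, to show how a single large miss drives MSE while the mean absolute error (MAE) reacts far less. The helper functions `mse` and `mae` are written here only for illustration.

```python
# Toy numbers for illustration only; they are NOT taken from chatgpt1.csv
actual = [0, 0, 2, 1, 5000]     # four ordinary posts and one viral post
predicted = [1, 0, 1, 2, 1000]  # small misses, plus one large miss on the viral post


def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mse_all = mse(actual, predicted)
mse_without_viral = mse(actual[:-1], predicted[:-1])
mae_all = mae(actual, predicted)

print("MSE, all posts:", mse_all)
print("MSE, viral post removed:", mse_without_viral)
print("MAE, all posts:", mae_all)
```

In the notebook itself, `mean_absolute_error` from `sklearn.metrics` could be printed next to `mean_squared_error` to see whether the random forest is worse across the board or only on a handful of extreme posts.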
I guess I'll have to explore other models or techniques.
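One thing worth checking before switching models is the CountLinks feature. Its coefficient in the linear model and its importance in the random forest are both exactly 0. In the `df.head()` output the column holds text such as `['https://t.co/ENTSzi2AQ9']` rather than a number, so `pd.to_numeric(..., errors='coerce')` appears to turn every value into NaN, and `fillna(0)` then leaves a column of zeros. A minimal sketch of one possible fix, assuming each non-missing cell is a Python-style list of URL strings as in that output; `count_links` is a hypothetical helper, not part of the original notebook.

```python
import ast
import math


def count_links(cell):
    """Return the number of links in one CountLinks cell (hypothetical helper)."""
    # Missing cells arrive as NaN (a float) when pandas reads the CSV
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return 0
    try:
        # Safely parse text like "['https://t.co/ENTSzi2AQ9']" into a list
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        # Anything that is not a valid Python literal counts as no links
        return 0
    return len(parsed) if isinstance(parsed, (list, tuple)) else 0


# The first value is copied from the df.head() output; the second uses placeholder URLs
print(count_links("['https://t.co/ENTSzi2AQ9']"))
print(count_links("['https://example.com/a', 'https://example.com/b']"))
print(count_links(float("nan")))
```

In the notebook this could be applied as `df['link_count'] = df['CountLinks'].apply(count_links)`, with `link_count` (a made-up column name) replacing `'CountLinks'` in `features`. Whether it improves the fit is an open question; it only ensures the feature carries information.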