Sahil Gupta - Fab Futures - Data Science

Week 4: Machine Learning

Goal: Fit a machine learning model to our data

Again, I wasn't sure where to start, so I asked Copilot to assist me:
Fit a machine learning model to chatgpt1.csv

It suggested predicting the number of likes of a post (LikeCount) based on other numerical attributes of the dataset, which seemed like a good place to start. Here's the breakdown of the steps it gave me:

  1. Load and explore the data:
  • Read the CSV.
  • Inspect the structure and handle missing values.
  2. Preprocess:
  • Select numerical features (e.g., ReplyCount, RetweetCount, QuoteCount, hastag_counts, CountLinks).
  • Handle any missing values (e.g., fill with 0 or mean).
  • Split into training and testing sets.
  3. Train the model:
  • Fit a linear regression model.
  • Evaluate with metrics like mean squared error (MSE).
  4. Visualize and interpret:
  • Plot predictions vs. actuals.
In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Step 1: Load the data
df = pd.read_csv('/Users/srgupta/Desktop/Academany/Data_Science/sahil-gupta/datasets/chatgpt1.csv')

# Quick exploration
print(df.head())
print(df.info())
print(df.describe())

# Step 2: Preprocess
# Select numerical features for prediction
features = ['ReplyCount', 'RetweetCount', 'QuoteCount', 'hastag_counts', 'CountLinks']
target = 'LikeCount'

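# Convert to numeric, coercing errors to NaN, then fill with 0
# Note: CountLinks has dtype object (lists of URLs stored as strings), so every
# value coerces to NaN and the column ends up all zeros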
df[features] = df[features].apply(pd.to_numeric, errors='coerce').fillna(0)
df[target] = pd.to_numeric(df[target], errors='coerce').fillna(0)

# Split data
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Step 4: Visualize
plt.scatter(y_test, y_pred)
plt.xlabel('Actual LikeCount')
plt.ylabel('Predicted LikeCount')
plt.title('Actual vs Predicted Likes')
plt.show()

# Optional: Print coefficients
print('Coefficients:', model.coef_)
                    Datetime             Tweet Id  \
0  2023-01-22 13:44:34+00:00  1617156270871699456   
1  2023-01-22 13:44:39+00:00  1617156291046133761   
2  2023-01-22 13:44:44+00:00  1617156308926349312   
3  2023-01-22 13:44:49+00:00  1617156332297256961   
4  2023-01-22 13:44:52+00:00  1617156345064570880   

                                                Text         Username  \
0  ChatGPTで遊ぶの忘れてた!!\n書類作るコード書いてみてほしいのと、\nどこまで思考整...      mochico0123   
1  @AlexandrovnaIng Prohibition of ChatGPT has be...  Caput_LupinumSG   
2  Schaut Euch an, was @fobizz @DianaKnodel alles...            ciffi   
3  Bow down to chatGPT 🫡..... https://t.co/ENTSzi...    Vishwasrisiri   
4  Profilinde vatan, Türkiye falan yazan bireyler...   0xGenetikciniz   

                                           Permalink  \
0  https://twitter.com/mochico0123/status/1617156...   
1  https://twitter.com/Caput_LupinumSG/status/161...   
2  https://twitter.com/ciffi/status/1617156308926...   
3  https://twitter.com/Vishwasrisiri/status/16171...   
4  https://twitter.com/0xGenetikciniz/status/1617...   

                                  User  \
0      https://twitter.com/mochico0123   
1  https://twitter.com/Caput_LupinumSG   
2            https://twitter.com/ciffi   
3    https://twitter.com/Vishwasrisiri   
4   https://twitter.com/0xGenetikciniz   

                                            Outlinks  \
0                                                NaN   
1                                                NaN   
2  ['https://us02web.zoom.us/webinar/register/801...   
3  ['https://twitter.com/agadmator/status/1617155...   
4                                                NaN   

                                          CountLinks  ReplyCount  \
0                                                NaN           1   
1                                                NaN           1   
2  ['https://t.co/DsoeVJrPBp', 'https://t.co/HflT...           0   
3                        ['https://t.co/ENTSzi2AQ9']           0   
4                                                NaN           0   

   RetweetCount  LikeCount  QuoteCount       ConversationId Language  \
0             0          5           0  1617156270871699456       ja   
1             0          5           0  1617148639993806848       en   
2             0          4           0  1617156308926349312       de   
3             0          2           0  1617156332297256961       en   
4             0          4           0  1617156345064570880       tr   

                                              Source  \
0  <a href="http://twitter.com/download/iphone" r...   
1  <a href="http://twitter.com/download/iphone" r...   
2  <a href="http://twitter.com/#!/download/ipad" ...   
3  <a href="http://twitter.com/download/android" ...   
4  <a href="http://twitter.com/download/iphone" r...   

                                               Media  \
0                                                NaN   
1                                                NaN   
2  [Photo(previewUrl='https://pbs.twimg.com/media...   
3                                                NaN   
4                                                NaN   

                                         QuotedTweet  \
0                                                NaN   
1                                                NaN   
2  https://twitter.com/DianaKnodel/status/1617153...   
3  https://twitter.com/agadmator/status/161715501...   
4                                                NaN   

                                      MentionedUsers       hashtag  \
0                                                NaN            []   
1  [User(username='AlexandrovnaIng', id=282705900...            []   
2  [User(username='fobizz', id=884708145792253952...  ['#ChatGPT']   
3                                                NaN            []   
4                                                NaN            []   

   hastag_counts  
0              0  
1              0  
2              1  
3              0  
4              0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50001 entries, 0 to 50000
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Datetime        50001 non-null  object
 1   Tweet Id        50001 non-null  int64 
 2   Text            50001 non-null  object
 3   Username        50001 non-null  object
 4   Permalink       50001 non-null  object
 5   User            50001 non-null  object
 6   Outlinks        19942 non-null  object
 7   CountLinks      19942 non-null  object
 8   ReplyCount      50001 non-null  int64 
 9   RetweetCount    50001 non-null  int64 
 10  LikeCount       50001 non-null  int64 
 11  QuoteCount      50001 non-null  int64 
 12  ConversationId  50001 non-null  int64 
 13  Language        50001 non-null  object
 14  Source          50001 non-null  object
 15  Media           9502 non-null   object
 16  QuotedTweet     3563 non-null   object
 17  MentionedUsers  17169 non-null  object
 18  hashtag         50001 non-null  object
 19  hastag_counts   50001 non-null  int64 
dtypes: int64(7), object(13)
memory usage: 7.6+ MB
None
           Tweet Id    ReplyCount  RetweetCount     LikeCount    QuoteCount  \
count  5.000100e+04  50001.000000  50001.000000  50001.000000  50001.000000   
mean   1.617493e+18      0.929141      1.498510      9.696326      0.219536   
std    1.725682e+14     23.251710     46.030058    313.524215     10.356329   
min    1.617156e+18      0.000000      0.000000      0.000000      0.000000   
25%    1.617354e+18      0.000000      0.000000      0.000000      0.000000   
50%    1.617525e+18      0.000000      0.000000      0.000000      0.000000   
75%    1.617625e+18      1.000000      0.000000      2.000000      0.000000   
max    1.617779e+18   3098.000000   6815.000000  56073.000000   1947.000000   

       ConversationId  hastag_counts  
count    5.000100e+04   50001.000000  
mean     1.617205e+18       0.783304  
std      1.005075e+16       1.975040  
min      6.493609e+17       0.000000  
25%      1.617302e+18       0.000000  
50%      1.617504e+18       0.000000  
75%      1.617607e+18       1.000000  
max      1.617779e+18      28.000000  
Mean Squared Error: 47256.230926211516
/Users/srgupta/Desktop/Academany/Data_Science/.venv/lib/python3.10/site-packages/sklearn/linear_model/_base.py:280: RuntimeWarning: divide by zero encountered in matmul
  return X @ coef_ + self.intercept_
/Users/srgupta/Desktop/Academany/Data_Science/.venv/lib/python3.10/site-packages/sklearn/linear_model/_base.py:280: RuntimeWarning: overflow encountered in matmul
  return X @ coef_ + self.intercept_
/Users/srgupta/Desktop/Academany/Data_Science/.venv/lib/python3.10/site-packages/sklearn/linear_model/_base.py:280: RuntimeWarning: invalid value encountered in matmul
  return X @ coef_ + self.intercept_
[Figure: scatter plot of Predicted LikeCount against Actual LikeCount, titled "Actual vs Predicted Likes"]
Coefficients: [-0.13356273  4.96697982  0.44483169 -0.68156968  0.        ]

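The model gives a mean squared error of 47256.230926211516, which corresponds to a root mean squared error of about 217 likes. That is quite high for a dataset where the median LikeCount is 0 and the 75th percentile is 2. This makes sense given the skewness of the data we observed in Week 2: a handful of tweets with very large like counts (the maximum is 56073) can dominate a squared-error metric.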
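The last coefficient printed above is exactly 0. The df.info() output lists CountLinks as an object column, and the head() rows show it holding lists of URLs stored as strings, so pd.to_numeric(..., errors='coerce') turns every entry into NaN and fillna(0) leaves a constant column. Here is a minimal sketch of counting the links instead, on a made-up three-row sample rather than the real file. The helper name count_links and the second and third sample URLs are my own assumptions for illustration.

```python
import ast

import pandas as pd

# Made-up sample mimicking CountLinks: either NaN or a list of URLs stored as a string
sample = pd.Series([
    float("nan"),
    "['https://t.co/ENTSzi2AQ9']",
    "['https://t.co/a', 'https://t.co/b']",
])

# What the preprocessing step does to such a column: every entry becomes NaN, then 0
coerced = pd.to_numeric(sample, errors="coerce").fillna(0)
print(coerced.tolist())


def count_links(value):
    """Hypothetical helper: number of URLs in a stringified list, 0 if missing."""
    if pd.isna(value):
        return 0
    return len(ast.literal_eval(value))


# Per-row link counts that could be used as a numeric feature
link_counts = sample.apply(count_links)
print(link_counts.tolist())
```

Whether a link count actually helps predict LikeCount is something I have not tested; the sketch only shows how to get a usable number out of the column.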

Copilot suggested trying a different model, such as RandomForestRegressor:

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor  # Updated import
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Step 1: Load the data
df = pd.read_csv('/Users/srgupta/Desktop/Academany/Data_Science/sahil-gupta/datasets/chatgpt1.csv')

# Quick exploration (optional, comment out if already done)
# print(df.head())
# print(df.info())
# print(df.describe())

# Step 2: Preprocess
# Select numerical features for prediction
features = ['ReplyCount', 'RetweetCount', 'QuoteCount', 'hastag_counts', 'CountLinks']
target = 'LikeCount'

# Convert to numeric, coercing errors to NaN, then fill with 0
df[features] = df[features].apply(pd.to_numeric, errors='coerce').fillna(0)
df[target] = pd.to_numeric(df[target], errors='coerce').fillna(0)

# Split data
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)  # 100 trees by default; adjust as needed
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Step 4: Visualize
plt.scatter(y_test, y_pred)
plt.xlabel('Actual LikeCount')
plt.ylabel('Predicted LikeCount')
plt.title('Actual vs Predicted Likes (Random Forest)')
plt.show()

# Optional: Feature importances
print('Feature Importances:', model.feature_importances_)
Mean Squared Error: 191698.90285443014
[Figure: scatter plot of Predicted LikeCount against Actual LikeCount, titled "Actual vs Predicted Likes (Random Forest)"]
Feature Importances: [4.24266383e-02 7.64640761e-01 1.92260679e-01 6.71921221e-04
 0.00000000e+00]

The mean squared error increased (by a lot)! It is now 191698.90285443014, roughly four times the linear regression's 47256.230926211516.
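MSE is measured in squared likes, which makes it hard to judge on its own. Two checks that might make it easier to read are the root mean squared error (RMSE, in plain likes) and a comparison against a baseline that always predicts the training mean. The sketch below uses synthetic skewed data as a stand-in, not chatgpt1.csv, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): count-like features and a skewed target
rng = np.random.default_rng(42)
X = rng.poisson(1.0, size=(2000, 3)).astype(float)
y = 3.0 * X[:, 0] + rng.pareto(3.0, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: ignore the features and always predict the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

# RMSE is the square root of MSE, so it is in the same units as the target
rmse_baseline = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
rmse_model = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(f"Baseline RMSE: {rmse_baseline:.3f}")
print(f"Model RMSE:    {rmse_model:.3f}")
```

If a model's RMSE is not clearly below the baseline's, it has learned little beyond the average.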

I guess I'll have to explore other models or techniques.
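One technique that is often tried for skewed count targets is to fit the model on log1p(y) and convert predictions back with expm1. Below is a minimal sketch using scikit-learn's TransformedTargetRegressor on synthetic data. The data generation is entirely my own assumption, and I have not checked whether this helps on chatgpt1.csv.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): counts whose mean grows exponentially with the features
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(2000, 2)).astype(float)
y = rng.poisson(np.exp(0.5 * X[:, 0] + 0.3 * X[:, 1])).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the linear model on log1p(y); predictions are mapped back with expm1
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(preds[:5])
```

The predictions come back on the original scale, so they can be scored with the same mean_squared_error call as before.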