Sahil Gupta - Fab Futures - Data Science

Week 2: Tools

Goal: Visualize our dataset

In preparation for the final project, I decided to follow this tutorial on Twitter sentiment analysis: Hugging Face tutorial. However, I had trouble scraping posts from X, so for this week I used an existing dataset instead: the "ChatGPT Twitter Dataset" from Kaggle, a collection of tweets with the hashtag #chatgpt: #ChatGPT data from Kaggle.

Each post in the collection has the following properties, which we can try to visualize in some way:

  • Tweet text
  • User information (username, user ID, location, etc.)
  • Tweet timestamp
  • Retweet and favorite count
  • Hashtags used in the tweet
  • URLs
In [4]:
# Import some packages for data viz
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load the Dataset and check the first few rows
df = pd.read_csv('datasets/chatgpt1.csv', low_memory = False)
df.head()
Out[4]:
Datetime Tweet Id Text Username Permalink User Outlinks CountLinks ReplyCount RetweetCount LikeCount QuoteCount ConversationId Language Source Media QuotedTweet MentionedUsers hashtag hastag_counts
0 2023-01-22 13:44:34+00:00 1617156270871699456 ChatGPTで遊ぶの忘れてた!!\n書類作るコード書いてみてほしいのと、\nどこまで思考整... mochico0123 https://twitter.com/mochico0123/status/1617156... https://twitter.com/mochico0123 NaN NaN 1 0 5 0 1617156270871699456 ja <a href="http://twitter.com/download/iphone" r... NaN NaN NaN [] 0
1 2023-01-22 13:44:39+00:00 1617156291046133761 @AlexandrovnaIng Prohibition of ChatGPT has be... Caput_LupinumSG https://twitter.com/Caput_LupinumSG/status/161... https://twitter.com/Caput_LupinumSG NaN NaN 1 0 5 0 1617148639993806848 en <a href="http://twitter.com/download/iphone" r... NaN NaN [User(username='AlexandrovnaIng', id=282705900... [] 0
2 2023-01-22 13:44:44+00:00 1617156308926349312 Schaut Euch an, was @fobizz @DianaKnodel alles... ciffi https://twitter.com/ciffi/status/1617156308926... https://twitter.com/ciffi ['https://us02web.zoom.us/webinar/register/801... ['https://t.co/DsoeVJrPBp', 'https://t.co/HflT... 0 0 4 0 1617156308926349312 de <a href="http://twitter.com/#!/download/ipad" ... [Photo(previewUrl='https://pbs.twimg.com/media... https://twitter.com/DianaKnodel/status/1617153... [User(username='fobizz', id=884708145792253952... ['#ChatGPT'] 1
3 2023-01-22 13:44:49+00:00 1617156332297256961 Bow down to chatGPT 🫡..... https://t.co/ENTSzi... Vishwasrisiri https://twitter.com/Vishwasrisiri/status/16171... https://twitter.com/Vishwasrisiri ['https://twitter.com/agadmator/status/1617155... ['https://t.co/ENTSzi2AQ9'] 0 0 2 0 1617156332297256961 en <a href="http://twitter.com/download/android" ... NaN https://twitter.com/agadmator/status/161715501... NaN [] 0
4 2023-01-22 13:44:52+00:00 1617156345064570880 Profilinde vatan, Türkiye falan yazan bireyler... 0xGenetikciniz https://twitter.com/0xGenetikciniz/status/1617... https://twitter.com/0xGenetikciniz NaN NaN 0 0 4 0 1617156345064570880 tr <a href="http://twitter.com/download/iphone" r... NaN NaN NaN [] 0
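The `hashtag` column in the preview above holds Python-style list strings such as `['#ChatGPT']` rather than real lists. The following is a minimal sketch, assuming every non-missing value is a valid list literal, of how such strings could be parsed with `ast.literal_eval` and counted. The `sample` frame, the `parse_list` helper, and the `hashtag_list` column are hypothetical names used only for illustration; the rows are toy data, not taken from the dataset.

```python
import ast

import pandas as pd

# Toy rows mimicking the `hashtag` / `hastag_counts` columns (not real data)
sample = pd.DataFrame({
    "hashtag": ["[]", "['#ChatGPT']", "['#AI', '#ChatGPT']"],
    "hastag_counts": [0, 1, 2],
})


def parse_list(cell):
    """Turn a stringified list such as "['#ChatGPT']" into a real list."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return []  # missing values (NaN) become an empty list


sample["hashtag_list"] = sample["hashtag"].apply(parse_list)

# One row per hashtag, lower-cased so '#ChatGPT' and '#chatgpt' count together
tag_counts = (
    sample["hashtag_list"].explode().dropna().str.lower().value_counts()
)
print(tag_counts)
```

On the real data, the same steps would be applied to `df['hashtag']` instead of `sample['hashtag']`.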
In [8]:
# Get some basic stats about the numerical values in the Tweet Dataset
print(df.shape)
df.describe()
(50001, 20)
Out[8]:
             Tweet Id      ReplyCount    RetweetCount       LikeCount      QuoteCount  ConversationId   hastag_counts
count    5.000100e+04    50001.000000    50001.000000    50001.000000    50001.000000    5.000100e+04    50001.000000
mean     1.617493e+18        0.929141        1.498510        9.696326        0.219536    1.617205e+18        0.783304
std      1.725682e+14       23.251710       46.030058      313.524215       10.356329    1.005075e+16        1.975040
min      1.617156e+18        0.000000        0.000000        0.000000        0.000000    6.493609e+17        0.000000
25%      1.617354e+18        0.000000        0.000000        0.000000        0.000000    1.617302e+18        0.000000
50%      1.617525e+18        0.000000        0.000000        0.000000        0.000000    1.617504e+18        0.000000
75%      1.617625e+18        1.000000        0.000000        2.000000        0.000000    1.617607e+18        1.000000
max      1.617779e+18     3098.000000     6815.000000    56073.000000     1947.000000    1.617779e+18       28.000000
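In the summary above, the engagement columns have a median of 0 while their maxima run into the thousands, so the means are pulled up by a few very popular tweets. A compact way to see this kind of skew is to put the share of zeros next to the median and a high percentile. The sketch below uses a small toy frame (`toy` and `summary` are hypothetical names, and the values are made up), so it only illustrates the approach.

```python
import pandas as pd

# Toy engagement counts with one viral outlier (not values from the real dataset)
toy = pd.DataFrame({
    "LikeCount": [0, 0, 0, 0, 1, 2, 3, 5, 8, 1000],
    "RetweetCount": [0, 0, 0, 0, 0, 0, 1, 1, 2, 300],
})

summary = pd.DataFrame({
    "share_zero": (toy == 0).mean(),  # fraction of tweets with a count of 0
    "median": toy.median(),
    "p90": toy.quantile(0.90),        # 90th percentile
    "max": toy.max(),
})
print(summary)
```

Replacing `toy` with `df[['ReplyCount', 'RetweetCount', 'LikeCount', 'QuoteCount']]` should give the same table for the real columns.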
In [9]:
# Some basic data viz:

# Frequency of posts over time (24 equal-width bins across the full timestamp range):
df['Datetime'] = pd.to_datetime(df['Datetime'])

plt.hist(df['Datetime'], bins=24, edgecolor='black')
# The x-axis shows full timestamps (date and time), not the hour of day
plt.xlabel('Date and Time (UTC)')
plt.ylabel('Number of Tweets')
plt.title('Tweets Posted Time Range')
plt.xticks(rotation=90)

plt.show()
[Figure: histogram of the number of tweets over time]
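The histogram above bins the raw timestamps, so its x-axis runs over dates and times rather than over the 24 hours of a day. If the collection covers more than one day, two other groupings may be useful: tweets per hour of day (all days pooled) and tweets per calendar hour (days kept separate). The sketch below shows both on toy timestamps written in the same format as the `Datetime` column; `toy`, `per_hour_of_day`, and `per_calendar_hour` are hypothetical names.

```python
import pandas as pd

# Toy timestamps in the same format as the `Datetime` column (not real data)
toy = pd.DataFrame({
    "Datetime": [
        "2023-01-22 13:44:34+00:00",
        "2023-01-22 13:59:01+00:00",
        "2023-01-22 14:05:10+00:00",
        "2023-01-23 13:10:00+00:00",
    ]
})
toy["Datetime"] = pd.to_datetime(toy["Datetime"], utc=True)

# Tweets per hour of day (0-23), pooling all days together
per_hour_of_day = toy["Datetime"].dt.hour.value_counts().sort_index()

# Tweets per calendar hour, keeping the days separate
per_calendar_hour = toy["Datetime"].dt.floor("h").value_counts().sort_index()

print(per_hour_of_day)
print(per_calendar_hour)
```

Either result can be drawn with `.plot(kind='bar')`; the first one is the version whose x-axis really is the hour of day.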
In [10]:
# Boxplots showing the spread of reply, retweet, like, quote, and hashtag counts:
fig = plt.figure(figsize=(10, 10))
features = ['ReplyCount',
            'RetweetCount',
            'LikeCount',
            'QuoteCount',
            'hastag_counts']  # column name is spelled this way in the dataset
for i, feature in enumerate(features, start=1):
    plt.subplot(3, 3, i)
    plt.boxplot(df[feature])
    plt.title(feature)

plt.tight_layout()
plt.show()
[Figure: boxplots of ReplyCount, RetweetCount, LikeCount, QuoteCount, and hastag_counts]
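Because most tweets have counts of 0 while a few reach the thousands, boxplots of the raw counts tend to collapse into a flat box with a long column of outlier points. One common workaround, offered here as a suggestion rather than as part of the original analysis, is to plot `log(1 + count)` instead. The sketch below uses toy numbers and a hypothetical helper `outlier_share` to show how the transform reduces the share of points that fall beyond the whiskers (matplotlib's default whiskers use the same 1.5 × IQR rule).

```python
import numpy as np
import pandas as pd

# Toy like counts with a heavy right tail (not values from the real dataset)
likes = pd.Series([0, 0, 0, 0, 1, 1, 2, 3, 5, 8, 40, 1000], name="LikeCount")


def outlier_share(values):
    """Fraction of points beyond the boxplot whiskers (1.5 * IQR rule)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return outside.mean()


log_likes = np.log1p(likes)  # log(1 + x), which is defined for zero counts

print("raw:  ", outlier_share(likes))
print("log1p:", outlier_share(log_likes))
```

On the real data, `plt.boxplot(np.log1p(df['LikeCount']))` would draw the transformed version; the axis then reads in log units, so the title or label should say so.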
In [11]:
# Bar charts of the five most common values of each count:
fig = plt.figure(figsize=(20, 20))
for i, feature in enumerate(features, start=1):
    plt.subplot(4, 2, i)
    # Show only the five most frequent values so the x-axis stays readable
    sns.countplot(data=df, x=feature, order=df[feature].value_counts().index[:5])
    plt.title(feature)

plt.show()
[Figure: bar charts of the five most frequent values of ReplyCount, RetweetCount, LikeCount, QuoteCount, and hastag_counts]
In [13]:
# Most mentioned usernames (only the first mention in each tweet is counted):
# MentionedUsers holds strings like "[User(username='fobizz', id=...", or NaN when
# nobody is mentioned, so a regex pulls out the first username in each row.
df['MentionedUsers'] = df['MentionedUsers'].str.extract(r"username='([^']+)'", expand=False)

counts = df['MentionedUsers'].value_counts()
top = counts.nlargest(20)

top.plot(kind='bar')

plt.title('Top 20 Most Frequently Mentioned Users')
plt.xlabel('Mentioned Users')
plt.ylabel('Frequency')

plt.show()
[Figure: bar chart of the 20 most frequently mentioned users]
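The cell above keeps only the first username in each tweet, so accounts that are usually mentioned second or third are undercounted. A possible variation, sketched below on toy strings, is to collect every `username='...'` match with `str.findall` and then use `explode` to get one row per mention. The strings imitate the format visible in the preview, but the ids are made up and the other fields are omitted; `toy`, `all_mentions`, and `mention_counts` are hypothetical names.

```python
import pandas as pd

# Toy strings in the format of the `MentionedUsers` column (ids are made up)
toy = pd.Series([
    "[User(username='fobizz', id=1), User(username='DianaKnodel', id=2)]",
    "[User(username='fobizz', id=1)]",
    None,  # tweets without mentions are missing values
], name="MentionedUsers")

# findall returns every username in a row; explode gives one row per mention
all_mentions = (
    toy.str.findall(r"username='([^']+)'")
    .explode()
    .dropna()
)
mention_counts = all_mentions.value_counts()
print(mention_counts.head(20))
```

This would need to run on the original `MentionedUsers` strings, that is, before the column is overwritten with the first username.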