Week 2: Tools
Goal: Visualize our dataset
In preparation for the final project, I decided to follow a tutorial on Twitter sentiment analysis (Hugging Face tutorial). However, I had trouble scraping posts from X, so for this week I used an existing dataset instead: the "ChatGPT Twitter Dataset" from Kaggle, a collection of tweets with the hashtag #chatgpt (#ChatGPT data from Kaggle).
Each post in the collection has the following properties, which we can try to visualize in some way:
- Tweet text
- User information (username, user ID, location, etc.)
- Tweet timestamp
- Retweet and favorite count
- Hashtags used in the tweet
- URLs
In [4]:
# Import some packages for data viz
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the Dataset and check the first few rows
df = pd.read_csv('datasets/chatgpt1.csv', low_memory = False)
df.head()
Out[4]:
| | Datetime | Tweet Id | Text | Username | Permalink | User | Outlinks | CountLinks | ReplyCount | RetweetCount | LikeCount | QuoteCount | ConversationId | Language | Source | Media | QuotedTweet | MentionedUsers | hashtag | hastag_counts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-22 13:44:34+00:00 | 1617156270871699456 | ChatGPTで遊ぶの忘れてた!!\n書類作るコード書いてみてほしいのと、\nどこまで思考整... | mochico0123 | https://twitter.com/mochico0123/status/1617156... | https://twitter.com/mochico0123 | NaN | NaN | 1 | 0 | 5 | 0 | 1617156270871699456 | ja | <a href="http://twitter.com/download/iphone" r... | NaN | NaN | NaN | [] | 0 |
| 1 | 2023-01-22 13:44:39+00:00 | 1617156291046133761 | @AlexandrovnaIng Prohibition of ChatGPT has be... | Caput_LupinumSG | https://twitter.com/Caput_LupinumSG/status/161... | https://twitter.com/Caput_LupinumSG | NaN | NaN | 1 | 0 | 5 | 0 | 1617148639993806848 | en | <a href="http://twitter.com/download/iphone" r... | NaN | NaN | [User(username='AlexandrovnaIng', id=282705900... | [] | 0 |
| 2 | 2023-01-22 13:44:44+00:00 | 1617156308926349312 | Schaut Euch an, was @fobizz @DianaKnodel alles... | ciffi | https://twitter.com/ciffi/status/1617156308926... | https://twitter.com/ciffi | ['https://us02web.zoom.us/webinar/register/801... | ['https://t.co/DsoeVJrPBp', 'https://t.co/HflT... | 0 | 0 | 4 | 0 | 1617156308926349312 | de | <a href="http://twitter.com/#!/download/ipad" ... | [Photo(previewUrl='https://pbs.twimg.com/media... | https://twitter.com/DianaKnodel/status/1617153... | [User(username='fobizz', id=884708145792253952... | ['#ChatGPT'] | 1 |
| 3 | 2023-01-22 13:44:49+00:00 | 1617156332297256961 | Bow down to chatGPT 🫡..... https://t.co/ENTSzi... | Vishwasrisiri | https://twitter.com/Vishwasrisiri/status/16171... | https://twitter.com/Vishwasrisiri | ['https://twitter.com/agadmator/status/1617155... | ['https://t.co/ENTSzi2AQ9'] | 0 | 0 | 2 | 0 | 1617156332297256961 | en | <a href="http://twitter.com/download/android" ... | NaN | https://twitter.com/agadmator/status/161715501... | NaN | [] | 0 |
| 4 | 2023-01-22 13:44:52+00:00 | 1617156345064570880 | Profilinde vatan, Türkiye falan yazan bireyler... | 0xGenetikciniz | https://twitter.com/0xGenetikciniz/status/1617... | https://twitter.com/0xGenetikciniz | NaN | NaN | 0 | 0 | 4 | 0 | 1617156345064570880 | tr | <a href="http://twitter.com/download/iphone" r... | NaN | NaN | NaN | [] | 0 |
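
In the output above, the `hashtag` column prints as `[]` or `['#ChatGPT']`. Because the data comes from a CSV file, these values are presumably strings rather than Python lists, which matters for any later per-hashtag counting. Below is a minimal sketch of parsing them with `ast.literal_eval`. The toy frame, the helper `parse_list`, and the new column name `hashtag_list` are assumptions made for illustration, not part of the dataset. This approach only suits plain list literals such as `hashtag`; the `MentionedUsers` strings contain `User(...)` objects and cannot be parsed this way.

```python
import ast
import pandas as pd

# Toy frame imitating the `hashtag` column as read_csv would deliver it (strings, not lists)
sample = pd.DataFrame({"hashtag": ["[]", "['#ChatGPT']", "['#ChatGPT', '#AI']"]})

def parse_list(cell):
    """Turn a string such as "['#ChatGPT']" into a real list; non-strings become []."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return []

# Hypothetical new column holding real Python lists
sample["hashtag_list"] = sample["hashtag"].apply(parse_list)

# One row per hashtag, then count; empty lists explode to NaN and are dropped
tag_counts = sample["hashtag_list"].explode().dropna().value_counts()
print(tag_counts)
```

With the notebook's data, the same two steps would be applied to `df["hashtag"]` instead of the toy frame.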
In [8]:
# Get some basic stats about the numerical values in the Tweet Dataset
print(df.shape)
df.describe()
(50001, 20)
Out[8]:
| | Tweet Id | ReplyCount | RetweetCount | LikeCount | QuoteCount | ConversationId | hastag_counts |
|---|---|---|---|---|---|---|---|
| count | 5.000100e+04 | 50001.000000 | 50001.000000 | 50001.000000 | 50001.000000 | 5.000100e+04 | 50001.000000 |
| mean | 1.617493e+18 | 0.929141 | 1.498510 | 9.696326 | 0.219536 | 1.617205e+18 | 0.783304 |
| std | 1.725682e+14 | 23.251710 | 46.030058 | 313.524215 | 10.356329 | 1.005075e+16 | 1.975040 |
| min | 1.617156e+18 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.493609e+17 | 0.000000 |
| 25% | 1.617354e+18 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.617302e+18 | 0.000000 |
| 50% | 1.617525e+18 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.617504e+18 | 0.000000 |
| 75% | 1.617625e+18 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 1.617607e+18 | 1.000000 |
| max | 1.617779e+18 | 3098.000000 | 6815.000000 | 56073.000000 | 1947.000000 | 1.617779e+18 | 28.000000 |
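
`Tweet Id` and `ConversationId` are identifiers, so their mean and standard deviation in the table above carry no meaning. Here is a minimal sketch of restricting the summary to the count columns. The toy frame reuses the first three rows of the `df.head()` output, and the names `toy`, `id_cols`, and `stats` are assumptions for illustration.

```python
import pandas as pd

# First three rows of the head() output, reduced to three columns
toy = pd.DataFrame({
    "Tweet Id": [1617156270871699456, 1617156291046133761, 1617156308926349312],
    "ReplyCount": [1, 1, 0],
    "LikeCount": [5, 5, 4],
})

# Identifier columns are numeric only by accident; drop them before summarizing.
# errors="ignore" skips any listed column that is not present in the frame.
id_cols = ["Tweet Id", "ConversationId"]
stats = toy.drop(columns=id_cols, errors="ignore").describe()
print(stats)
```

On the full dataset the equivalent call would be `df.drop(columns=id_cols).describe()`.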
In [9]:
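# Some basic data viz:
# Frequency of posts by hour of day (timestamps are in UTC)
df['Datetime'] = pd.to_datetime(df['Datetime'])
# Bin on the hour component (0-23). Binning the raw timestamps would instead
# split the whole collection period into 24 equal slices.
plt.hist(df['Datetime'].dt.hour, bins=range(25), edgecolor='black')
plt.xlabel('Hour of Day (UTC)')
plt.ylabel('Number of Tweets')
plt.title('Tweets Posted by Hour of Day')
plt.xticks(range(0, 24))
plt.show()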
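
"Hour of day" and "hour within the collection window" are two different groupings, and it is easy to plot one while labelling the other. The sketch below contrasts them on four made-up timestamps in the same format as the `Datetime` column; the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up timestamps in the same format as the Datetime column
ts = pd.to_datetime(pd.Series([
    "2023-01-22 13:44:34+00:00",
    "2023-01-22 13:59:00+00:00",
    "2023-01-22 14:05:00+00:00",
    "2023-01-23 13:10:00+00:00",
]), utc=True)

# Hour of day: tweets from different days at 13:xx land in the same bucket
by_hour_of_day = ts.dt.hour.value_counts().sort_index()

# Hour within the collection window: each calendar hour gets its own bucket
by_calendar_hour = ts.dt.floor("60min").value_counts().sort_index()

print(by_hour_of_day)
print(by_calendar_hour)
```

The first grouping answers "when during the day do people tweet?", the second "how did activity develop over the period the data was collected?".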
In [10]:
# Boxplots showing the range of reply, retweet, like, quote, and hashtag counts:
fig = plt.figure(figsize=(10, 10))
features = ['ReplyCount',
            'RetweetCount',
            'LikeCount',
            'QuoteCount',
            'hastag_counts']
for i, feature in enumerate(features):
    plt.subplot(3, 3, i + 1)
    plt.boxplot(df[feature])
    plt.title(feature)
plt.tight_layout()
plt.show()
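
These boxplots tend to collapse into a flat box with a long column of outlier points. The `df.describe()` table explains why: for `LikeCount` the quartiles are 0 and 2, while the maximum is 56073. With matplotlib's default whisker setting of 1.5 times the interquartile range, every tweet above 2 + 1.5 × (2 − 0) = 5 likes is drawn as an outlier. Below is a small sketch of that arithmetic. The helper `upper_fence` and the toy values are assumptions for illustration; only the quartiles match the real column.

```python
import pandas as pd

def upper_fence(counts: pd.Series, whis: float = 1.5) -> float:
    """Upper whisker limit of a default boxplot: Q3 + whis * (Q3 - Q1)."""
    q1, q3 = counts.quantile([0.25, 0.75])
    return q3 + whis * (q3 - q1)

# Made-up counts whose quartiles match LikeCount (Q1 = 0, Q3 = 2)
toy_likes = pd.Series([0, 0, 0, 0, 1, 2, 2, 2, 40, 500])

fence = upper_fence(toy_likes)
share_outliers = (toy_likes > fence).mean()
print(fence, share_outliers)
```

If the outliers dominate the figure, one option is passing `showfliers=False` to `plt.boxplot`, at the cost of hiding the most popular tweets.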
In [11]:
# Same columns as bar charts, showing each column's five most common values:
fig = plt.figure(figsize=(20, 20))
for i, feature in enumerate(features):
    plt.subplot(4, 2, i + 1)
    sns.countplot(data=df, x=feature, order=df[feature].value_counts().index[:5])
    plt.title(feature)
plt.show()
In [13]:
# Most mentioned usernames (first mention in each tweet only):
# Pull the first username out of each raw MentionedUsers string; tweets without
# mentions stay NaN. Writing to a separate column keeps the raw data intact,
# so the cell can be re-run without errors.
df['FirstMention'] = df['MentionedUsers'].str.extract(r"username='([^']+)'", expand=False)
counts = df['FirstMention'].value_counts()
top = counts.nlargest(20)
top.plot(kind='bar')
plt.title('Top 20 Most Frequently Mentioned Users')
plt.xlabel('Mentioned Users')
plt.ylabel('Frequency')
plt.show()
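
The cell above keeps only the first username from each tweet, so users who are mentioned second or third are never counted. The sketch below counts every mention using `str.findall` and `explode`. The toy strings imitate the format visible in the `df.head()` output; their ids and the variable names are made up for illustration.

```python
import pandas as pd

# Toy rows imitating the raw MentionedUsers strings; the ids are made up
raw = pd.Series([
    "[User(username='fobizz', id=1), User(username='DianaKnodel', id=2)]",
    "[User(username='fobizz', id=1)]",
    None,  # tweet without mentions
])

# findall returns a list of every captured username per row (NaN for missing rows);
# explode turns each list element into its own row
all_mentions = raw.str.findall(r"username='([^']+)'").explode().dropna()
mention_counts = all_mentions.value_counts()
print(mention_counts)
```

On the real data this would be applied to the untouched `MentionedUsers` column, followed by the same `nlargest(20)` bar chart.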