Week 5: Probability
Goal: Investigate the probability distribution of our data
I investigated several distributions during Week 2 while visualizing the data.
I'll try to focus on multidimensional distributions this time:
- Mean: find the average value of a distribution
- Variance: measure the spread (how much do values deviate from the mean?)
- Covariance: how do pairs of variables vary together?
I'm also curious to understand "Entropy" and "Information" more deeply; I'll see if Copilot can help me explore those ideas using my dataset.
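Before running these on the dataset, here is a tiny sanity check of all three quantities on made-up numbers (the arrays below are invented for illustration). In particular, the covariance of a variable with itself is just its variance:

```python
import numpy as np

# Two made-up variables that tend to move together (invented for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

print(x.mean())            # 3.0 -- the average value
print(x.var(ddof=1))       # 2.5 -- sample variance (spread around the mean)
print(np.cov(x, y)[0, 1])  # 2.0 -- positive covariance: y tends to rise with x

# Sanity check: the covariance of a variable with itself is its variance
print(np.isclose(np.cov(x, x)[0, 1], x.var(ddof=1)))  # True
```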
import pandas as pd
import numpy as np
# Load the data
df = pd.read_csv('/Users/srgupta/Desktop/Academany/Data_Science/sahil-gupta/datasets/chatgpt1.csv')
# Select only numerical columns (to avoid errors with strings)
numerical_cols = df.select_dtypes(include=[np.number]).columns
df_num = df[numerical_cols]
# Handle missing values (fill with 0 for simplicity; note that pandas would
# otherwise skip NaNs, so filling with 0 pulls the means toward 0)
df_num = df_num.fillna(0)
# Compute and print mean
print("Mean of numerical columns:")
print(df_num.mean())
print("\n")
# Compute and print variance
print("Variance of numerical columns:")
print(df_num.var())
print("\n")
# Compute and print covariance matrix
print("Covariance matrix of numerical columns:")
print(df_num.cov())
Mean of numerical columns:
Tweet Id 1.617493e+18
ReplyCount 9.291414e-01
RetweetCount 1.498510e+00
LikeCount 9.696326e+00
QuoteCount 2.195356e-01
ConversationId 1.617205e+18
hastag_counts 7.833043e-01
dtype: float64
Variance of numerical columns:
Tweet Id 2.977979e+28
ReplyCount 5.406420e+02
RetweetCount 2.118766e+03
LikeCount 9.829743e+04
QuoteCount 1.072535e+02
ConversationId 1.010176e+32
hastag_counts 3.900782e+00
dtype: float64
Covariance matrix of numerical columns:
                     Tweet Id     ReplyCount   RetweetCount      LikeCount     QuoteCount  ConversationId  hastag_counts
Tweet Id         2.977979e+28  -5.664534e+13  -1.168430e+14  -9.298656e+14  -2.289893e+13    4.617246e+28   2.828175e+12
ReplyCount      -5.664534e+13   5.406420e+02   5.169555e+02   3.179718e+03   9.377046e+01    9.014163e+13  -4.276951e-01
RetweetCount    -1.168430e+14   5.169555e+02   2.118766e+03   1.361703e+04   4.228227e+02   -1.858106e+14  -3.974529e-01
LikeCount       -9.298656e+14   3.179718e+03   1.361703e+04   9.829743e+04   2.989490e+03   -6.882488e+14  -4.867926e+00
QuoteCount      -2.289893e+13   9.377046e+01   4.228227e+02   2.989490e+03   1.072535e+02   -4.110187e+13  -1.023866e-01
ConversationId   4.617246e+28   9.014163e+13  -1.858106e+14  -6.882488e+14  -4.110187e+13    1.010176e+32   1.982705e+13
hastag_counts    2.828175e+12  -4.276951e-01  -3.974529e-01  -4.867926e+00  -1.023866e-01    1.982705e+13   3.900782e+00
Visualizing the output:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming df_num is already loaded and preprocessed from the previous cell
# If running standalone, include the loading code here
# Visualize distributions: Histograms for each numerical column
df_num.hist(bins=50, figsize=(12, 8), edgecolor='black')
plt.suptitle('Histograms of Numerical Columns (Distributions)')
plt.tight_layout()
plt.show()
# Covariance heatmap
plt.figure(figsize=(10, 8))
cov_matrix = df_num.cov()
sns.heatmap(cov_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Covariance Heatmap of Numerical Columns')
plt.show()
Most of the distributions are heavily right-skewed, with means that are small relative to their very large variances (as expected for engagement counts). Note that the diagonal of the covariance matrix is just each column's variance.
The covariance heatmap is somewhat interesting, as it shows which variables vary together. For example, RetweetCount and LikeCount have a strong positive covariance (13617.03). Looking at negative covariances, it's interesting that LikeCount has a somewhat negative covariance with hastag_counts.
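One caveat: covariance is scale-dependent, so columns with huge raw values (like Tweet Id) dominate the matrix. Dividing each covariance by the two standard deviations gives the unitless correlation coefficient, which is easier to compare across pairs. A minimal sketch, assuming df_num from the cells above:

```python
# Correlation rescales covariance to [-1, 1]:
#   corr(X, Y) = cov(X, Y) / (std(X) * std(Y))
corr_matrix = df_num.corr()
print(corr_matrix.loc['RetweetCount', 'LikeCount'])

# Cross-check against the covariance matrix computed earlier
cov = df_num.cov()
manual = cov.loc['RetweetCount', 'LikeCount'] / (
    df_num['RetweetCount'].std() * df_num['LikeCount'].std()
)
print(manual)  # matches the corr_matrix entry
```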
Exploring Entropy
I asked Copilot to "compute Shannon entropy on categorical columns." This is how it explained it:
"To compute Shannon entropy on the categorical columns in chatgpt1.csv, we'll use the formula for entropy: ( H(X) = -\sum p(x_i) \log_2 p(x_i) ), where ( p(x_i) ) is the probability of each category ( x_i ). This measures the "information" or unpredictability in the data—higher entropy means more diversity (e.g., many languages vs. mostly one)."
import pandas as pd
import numpy as np
from scipy.stats import entropy
# Load the data (assuming not already loaded)
df = pd.read_csv('/Users/srgupta/Desktop/Academany/Data_Science/sahil-gupta/datasets/chatgpt1.csv')
# Select categorical columns (object types, excluding lists if needed)
categorical_cols = df.select_dtypes(include=['object']).columns
# Filter to simple string columns (exclude complex ones like 'hashtag' or 'MentionedUsers' which are lists)
simple_categorical = [col for col in categorical_cols if col not in ['hashtag', 'MentionedUsers', 'Outlinks', 'Media', 'QuotedTweet']]
print("Shannon Entropy for Categorical Columns:")
for col in simple_categorical:
    if df[col].notna().any():  # Skip columns that are entirely NaN
        # Get value counts and normalize to probabilities
        counts = df[col].value_counts(normalize=True, dropna=True)
        # Compute entropy (base 2 for bits)
        ent = entropy(counts, base=2)
        print(f"{col}: {ent:.4f} bits")
        # Optional: show top categories for context
        print(f"  Top categories: {counts.head(3).to_dict()}")
    else:
        print(f"{col}: No valid data")
    print()
Shannon Entropy for Categorical Columns:
Datetime: 15.2512 bits
Top categories: {'2023-01-23 17:11:13+00:00': 0.00015999680006399873, '2023-01-23 14:53:16+00:00': 0.0001399972000559989, '2023-01-23 23:42:50+00:00': 0.00011999760004799904}
Text: 15.5714 bits
Top categories: {'@chatgpt_issac AI': 0.0032799344013119736, 'JUST IN: ChatGPT creator OpenAI signs multi-year, multi-billion investment deal with Microsoft $MSFT.': 0.0002599948001039979, '@chatgpt_issac Ai': 0.00021999560008799824}
Username: 14.9598 bits
Top categories: {'translation_ja': 0.0011999760004799903, 'SaveToNotion': 0.0009399812003759925, 'trandanhmmo': 0.000879982400351993}
Permalink: 15.6097 bits
Top categories: {'https://twitter.com/mochico0123/status/1617156270871699456': 1.999960000799984e-05, 'https://twitter.com/sale_15x/status/1617586686455578627': 1.999960000799984e-05, 'https://twitter.com/dailyrotation/status/1617586488383836160': 1.999960000799984e-05}
User: 14.9598 bits
Top categories: {'https://twitter.com/translation_ja': 0.0011999760004799903, 'https://twitter.com/SaveToNotion': 0.0009399812003759925, 'https://twitter.com/trandanhmmo': 0.000879982400351993}
CountLinks: 14.2237 bits
Top categories: {"['https://t.co/xegxY3jDom']": 0.0009527630127369371, "['https://t.co/rlyimpQw40']": 0.000802326747567947, "['https://t.co/6Hc8vpS0L6']": 0.000651890482398957}
Language: 2.2249 bits
Top categories: {'en': 0.6415071698566028, 'ja': 0.1009179816403672, 'es': 0.06629867402651947}
Source: 3.3021 bits
Top categories: {'<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>': 0.35627287454250917, '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>': 0.24561508769824603, '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>': 0.17943641127177457}
Visualizing Entropy
import pandas as pd
import numpy as np
from scipy.stats import entropy
import matplotlib.pyplot as plt
# Assuming df is loaded and simple_categorical is defined from the previous cell
# If running standalone, include the loading and filtering code here
# Compute entropy as before, but store in a dict for plotting
entropy_dict = {}
for col in simple_categorical:
    if df[col].notna().any():
        counts = df[col].value_counts(normalize=True, dropna=True)
        ent = entropy(counts, base=2)
        entropy_dict[col] = ent
    else:
        entropy_dict[col] = 0  # Or np.nan
# Visualize: Bar chart of entropy values
plt.figure(figsize=(10, 6))
plt.bar(entropy_dict.keys(), entropy_dict.values(), color='skyblue', edgecolor='black')
plt.xlabel('Categorical Columns')
plt.ylabel('Shannon Entropy (bits)')
plt.title('Entropy of Categorical Columns')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Interpretation?
Language has the lowest entropy, which makes sense since most of the posts are in English, so a random tweet's language is fairly predictable. Text and Permalink have the highest. Permalink's 15.6097 bits is log2 of the number of rows (about 50,000), which means essentially every permalink is unique; that's the maximum entropy a column in this dataset can reach. So the highest-entropy columns really are the most unpredictable: the overall distribution tells you almost nothing about any single row's value.
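To check that reading, compare each column's entropy against the hard ceiling of log2(number of rows), which is reached only when every value in a column is distinct. A quick sketch, assuming df and entropy_dict are still defined from the cells above:

```python
import numpy as np

# Maximum possible entropy for any column in this dataset: log2(number of rows)
max_ent = np.log2(len(df))
print(f"Ceiling: {max_ent:.4f} bits")

# Fraction of the ceiling each column reaches (100% = all values distinct)
for col, ent in sorted(entropy_dict.items(), key=lambda kv: -kv[1]):
    print(f"{col}: {ent / max_ent:.2%}")
```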