Week 3: Fitting¶
Goal: Fit a function to our data
Based on my initial data visualizations from Week 2, most of the data plots are heavily skewed (with the exception of Frequency of posts by time-of-day), which I guess fits with the long-tail pattern of much online behavior.
Based on a quick Google search ("jupyter python fit function to long-tail data"), there seem to be several methods for fitting a function to long-tail data. I didn't know where to start, so I asked ChatGPT to assist me:
Find the best-fitting function for tweet volume vs time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
# Load data
df = pd.read_csv('datasets/chatgpt1.csv')
df['Datetime'] = pd.to_datetime(df['Datetime'], utc=True)
# Extract hour of day
df['hour'] = df['Datetime'].dt.hour
# Aggregate tweet volume by hour of day
hourly = df.groupby('hour').size().reset_index(name='tweet_count')
x = hourly['hour'].values
y = hourly['tweet_count'].values
# Fit sinusoidal model: y = A*sin(2πx/24 + phi) + C
omega = 2 * np.pi / 24
# Linear least squares using sin and cos
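# Identity: A*sin(omega*x + phi) = A*cos(phi)*sin(omega*x) + A*sin(phi)*cos(omega*x),
# so solving a linear system for the sin/cos coefficients lets us recover A and phi below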
X = np.column_stack([np.sin(omega * x), np.cos(omega * x), np.ones_like(x)])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
A_sin, A_cos, C = coef
A = np.sqrt(A_sin**2 + A_cos**2)
phi = np.arctan2(A_cos, A_sin)
y_fit = A * np.sin(omega * x + phi) + C
r2 = r2_score(y, y_fit)
# Plot
plt.figure()
plt.scatter(x, y)
plt.plot(x, y_fit)
plt.xlabel("Hour of day")
plt.ylabel("Tweet volume")
plt.title("Tweet Volume vs Hour of Day (Sinusoidal Fit)")
plt.xticks(range(0,24))
plt.show()
A, phi, C, r2
(np.float64(681.7544889776459), np.float64(2.9309947026830927), np.float64(2083.375000000001), 0.510131656299354)
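As a sanity check on those numbers (my own addition, not part of ChatGPT's answer), the same model could also be refit with `scipy.optimize.curve_fit`. This is just a sketch; it assumes `x`, `y`, `omega`, `np`, and `r2_score` from the cell above are still in scope:
from scipy.optimize import curve_fit
# Same model as above, fitted by nonlinear least squares instead of the sin/cos trick
def sinusoid(h, A, phi, C):
    return A * np.sin(omega * h + phi) + C
# Initial guesses roughly matching the linear least-squares result
popt, _ = curve_fit(sinusoid, x, y, p0=[700, 3.0, 2100])
A_nl, phi_nl, C_nl = popt
print(f"A={A_nl:.1f}, phi={phi_nl:.2f}, C={C_nl:.1f}, R^2={r2_score(y, sinusoid(x, *popt)):.3f}")
If both approaches land on roughly A ≈ 682, φ ≈ 2.93, C ≈ 2083, that would suggest the linear sin/cos trick was set up correctly.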
ChatGPT's explanation: Here’s the best-fitting function for tweet volume vs hour of day based on your data.
Data setup¶
- Each tweet contributes 1 count
- Aggregated total tweets for each hour of day (0–23) across all dates
- Goal: capture daily activity pattern
Best model: Sinusoidal (daily cycle)¶
A periodic model is the natural choice here:
$$ \boxed{ y(h) = A \sin\!\left(\frac{2\pi}{24}h + \phi\right) + C } $$
Fitted parameters¶
- Amplitude $A \approx 682$
- Phase shift $\phi \approx 2.93$
- Mean level $C \approx 2083$
Final fitted function¶
$$ \boxed{ y(h) \approx 682 \sin\!\left(\frac{2\pi}{24}h + 2.93\right) + 2083 } $$
Fit quality¶
- R² ≈ 0.51
That’s a strong fit for social media behavior and far better than linear or exponential models for this task.
Interpretation¶
- Tweet activity follows a clear daily rhythm
- Lowest activity: early morning hours
- Peak activity: late afternoon / evening
- Mean baseline: ~ 2,100 tweets per hour-of-day bin
This shape strongly reflects human sleep–wake cycles across time zones.
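To sanity-check that reading, the peak and trough hours can be computed directly from the fitted sinusoid: y(h) is largest where 2πh/24 + φ = π/2 and smallest half a day later. A quick sketch, plugging in the fitted values reported above (A ≈ 682, φ ≈ 2.93, C ≈ 2083):
import numpy as np
# Fitted parameters copied from the output above
A_fit, phi_fit, C_fit = 682.0, 2.93, 2083.0
omega = 2 * np.pi / 24
# Peak where omega*h + phi = pi/2; trough 12 hours later
h_peak = ((np.pi / 2 - phi_fit) % (2 * np.pi)) / omega
h_trough = (h_peak + 12) % 24
print(f"peak ~ hour {h_peak:.1f}, trough ~ hour {h_trough:.1f} (UTC)")
print(f"predicted range: {C_fit - A_fit:.0f} to {C_fit + A_fit:.0f} tweets per hour-of-day bin")
This should come out to a peak around hour 19 and a trough around hour 7 (in UTC, since the timestamps were parsed with utc=True), which lines up with ChatGPT's "late afternoon/evening" and "early morning" description.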
Now I'll try to find a best-fit function for a long-tail distribution. ChatGPT's free tier didn't allow me to run this task again, so I asked VS Code's built-in Copilot to try:
Find a best fit function for tweet volume vs RetweetCount in chatgpt1.csv
# Fits for tweet volume vs RetweetCount (long-tail)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Load data
df = pd.read_csv('datasets/chatgpt1.csv', parse_dates=['Datetime'], low_memory=False)
# extract retweet counts (integers)
rt = pd.to_numeric(df['RetweetCount'], errors='coerce').dropna().astype(int).values
# frequency: number of tweets with given retweet count
vals, counts = np.unique(rt, return_counts=True)
# consider only non-negative counts; common for long-tail: focus on tail > 0
tail_mask = vals > 0
x = vals[tail_mask]
y = counts[tail_mask]
# For fitting, expand counts into sample of retweet counts (one per tweet)
# (this creates the sample distribution of RetweetCount)
sample = np.repeat(x, y)
# choose xmin candidates (try a few quantiles)
xmin_candidates = np.unique(np.quantile(sample, [0.5, 0.6, 0.7, 0.8, 0.9]).astype(int))
xmin_candidates = xmin_candidates[xmin_candidates>=1]
def fit_power_law_mle(data, xmin):
tail = data[data >= xmin]
n = len(tail)
if n <= 1:
return None
# continuous MLE approx
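    # alpha_hat = 1 + n / sum(ln(x_i / xmin)), the Clauset-Shalizi-Newman estimator for a continuous power law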
alpha = 1.0 + n / np.sum(np.log(tail / xmin))
# log-likelihood (continuous power law)
ll = n * np.log(alpha - 1) - n * np.log(xmin) - alpha * np.sum(np.log(tail / xmin))
return {'alpha': alpha, 'xmin': xmin, 'n': n, 'll': ll}
def fit_lognormal(data, xmin):
tail = data[data >= xmin]
if len(tail) == 0:
return None
# fit lognormal (fit returns shape, loc, scale)
shape, loc, scale = stats.lognorm.fit(tail, floc=0)
# log-likelihood for lognorm
ll = np.sum(stats.lognorm.logpdf(tail, shape, loc=loc, scale=scale))
return {'shape': shape, 'loc': loc, 'scale': scale, 'xmin': xmin, 'n': len(tail), 'll': ll}
def fit_exponential(data, xmin):
tail = data[data >= xmin]
if len(tail) == 0:
return None
# shift by xmin for better fit: fit exponential for x-xmin >=0
shifted = tail - xmin
loc, scale = stats.expon.fit(shifted, floc=0)
ll = np.sum(stats.expon.logpdf(shifted, loc=loc, scale=scale))
return {'loc': loc, 'scale': scale, 'xmin': xmin, 'n': len(tail), 'll': ll}
# run fits for candidates
pl_fits = []
ln_fits = []
exp_fits = []
for xmin in xmin_candidates:
pl = fit_power_law_mle(sample, xmin)
ln = fit_lognormal(sample, xmin)
ex = fit_exponential(sample, xmin)
if pl: pl_fits.append(pl)
if ln: ln_fits.append(ln)
if ex: exp_fits.append(ex)
# choose best by AIC (AIC = 2k - 2LL). model complexities: power-law k=1 (alpha), lognormal k=2 (shape, scale), expon k=1 (scale)
def aic(ll, k):
return 2*k - 2*ll
best_models = []
if pl_fits:
best_pl = max(pl_fits, key=lambda f: f['ll'])
best_models.append(('powerlaw', best_pl, aic(best_pl['ll'], 1)))
if ln_fits:
best_ln = max(ln_fits, key=lambda f: f['ll'])
best_models.append(('lognormal', best_ln, aic(best_ln['ll'], 2)))
if exp_fits:
best_ex = max(exp_fits, key=lambda f: f['ll'])
best_models.append(('exponential', best_ex, aic(best_ex['ll'], 1)))
# sort by AIC (lower better)
best_models.sort(key=lambda t: t[2])
# Print comparison
print("Candidates ranked by AIC (lower better):")
for name, fit, a in best_models:
print(f"- {name}: xmin={fit['xmin']}, n_tail={fit['n']}, AIC={a:.1f}, ll={fit['ll']:.1f}")
if name == 'powerlaw':
print(f" alpha={fit['alpha']:.3f}")
if name == 'lognormal':
print(f" shape={fit['shape']:.3f}, scale={fit['scale']:.3f}")
if name == 'exponential':
print(f" scale={fit['scale']:.3f}")
# Plot empirical tail (log-log) and fitted curves for best model
plt.figure(figsize=(8,5))
# empirical complementary CDF (CCDF) for discrete retweet counts
sorted_sample = np.sort(sample)
unique_vals, counts_vals = np.unique(sorted_sample, return_counts=True)
cdf = np.cumsum(counts_vals) / counts_vals.sum()
ccdf = 1 - cdf + counts_vals/counts_vals.sum()  # CCDF as P(X >= x) rather than P(X > x), so the largest value stays positive on the log-log plot
plt.loglog(unique_vals, ccdf, marker='.', linestyle='none', label='empirical CCDF')
# overlay best fit curves (use parameters from best_models)
xs = np.linspace(max(1, unique_vals.min()), unique_vals.max(), 200)
for name, fit, a in best_models[:2]: # overlay top 2 models
xmin = fit['xmin']
xs_tail = xs[xs>=xmin]
if name == 'powerlaw':
alpha = fit['alpha']
# CCDF ~ (x/xmin)^(1-alpha)
ccdf_fit = (xs_tail / xmin)**(1 - alpha)
plt.loglog(xs_tail, ccdf_fit * ccdf[np.searchsorted(unique_vals, xmin)], label=f'powerlaw α={alpha:.2f}')
elif name == 'lognormal':
shape, loc, scale = fit['shape'], fit['loc'], fit['scale']
cdf_fit = stats.lognorm.cdf(xs_tail, shape, loc=loc, scale=scale)
ccdf_fit = 1 - cdf_fit
# scale to empirical at xmin
emp_at_xmin = ccdf[np.searchsorted(unique_vals, xmin)]
plt.loglog(xs_tail, ccdf_fit * emp_at_xmin / ccdf_fit[0], label=f'lognorm s={shape:.2f}')
elif name == 'exponential':
loc, scale = fit['loc'], fit['scale']
cdf_fit = stats.expon.cdf(xs_tail - xmin, loc=loc, scale=scale)
ccdf_fit = 1 - cdf_fit
emp_at_xmin = ccdf[np.searchsorted(unique_vals, xmin)]
plt.loglog(xs_tail, ccdf_fit * emp_at_xmin / ccdf_fit[0], label=f'expon scale={scale:.2f}')
plt.xlabel('RetweetCount')
plt.ylabel('CCDF')
plt.title('RetweetCount tail: empirical CCDF (log-log) and fitted models')
plt.legend()
plt.grid(True, which='both', ls='--', alpha=0.4)
plt.show()
Candidates ranked by AIC (lower better):
- powerlaw: xmin=10, n_tail=763, AIC=6266.9, ll=-3132.5
  alpha=2.106
- lognormal: xmin=10, n_tail=763, AIC=7090.5, ll=-3543.3
  shape=1.019, scale=24.692
- exponential: xmin=10, n_tail=763, AIC=8001.7, ll=-3999.8
  scale=69.565
Not sure how to interpret this?!
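One possible next step would be to cross-check these numbers with the `powerlaw` package, which implements the Clauset-Shalizi-Newman procedure for exactly this kind of fit-and-compare question. A minimal sketch, assuming the `sample` array from the cell above is still in scope and that `powerlaw` is installed (pip install powerlaw):
import powerlaw
# Fit a discrete power law; xmin is chosen by minimizing the KS distance to the data
fit = powerlaw.Fit(sample, discrete=True)
print("alpha =", fit.power_law.alpha, "xmin =", fit.power_law.xmin)
# Likelihood-ratio test: R > 0 favors the first model, p is the significance of that preference
R, p = fit.distribution_compare('power_law', 'lognormal')
print("power law vs lognormal: R =", R, "p =", p)
If the alpha and xmin it picks are close to the AIC winner above, and the likelihood-ratio test points the same way, that would give me more confidence in Copilot's ranking.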