Sahil Gupta - Fab Futures - Data Science

Week 3: Fitting

Goal: Fit a function to our data

Based on my initial data visualizations from Week 2, most of the distributions are heavily skewed (with the exception of the frequency-of-posts-by-time-of-day plot), which fits the long-tail pattern typical of much online behavior.
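To put a rough number on that skew before fitting anything, here's a quick sketch using the same datasets/chatgpt1.csv file as in the cells below (RetweetCount is one of the skewed columns):

import pandas as pd

df = pd.read_csv('datasets/chatgpt1.csv')
rt = pd.to_numeric(df['RetweetCount'], errors='coerce').dropna()
# Moment skewness is ~0 for symmetric data and large/positive for a heavy right tail
print(f"RetweetCount skewness: {rt.skew():.1f}")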

Based on a quick Google search ("jupyter python fit function to long-tail data"), there seem to be several methods for fitting a function to long-tail data. I didn't know where to start, so I asked ChatGPT to assist me:
Find the best-fitting function for tweet volume vs time

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# Load data
df = pd.read_csv('datasets/chatgpt1.csv')
df['Datetime'] = pd.to_datetime(df['Datetime'], utc=True)

# Extract hour of day
df['hour'] = df['Datetime'].dt.hour

# Aggregate tweet volume by hour of day
hourly = df.groupby('hour').size().reset_index(name='tweet_count')

x = hourly['hour'].values
y = hourly['tweet_count'].values

# Fit sinusoidal model: y = A*sin(2πx/24 + phi) + C
omega = 2 * np.pi / 24

# Linear least squares using sin and cos
X = np.column_stack([np.sin(omega * x), np.cos(omega * x), np.ones_like(x)])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

A_sin, A_cos, C = coef
A = np.sqrt(A_sin**2 + A_cos**2)
phi = np.arctan2(A_cos, A_sin)

y_fit = A * np.sin(omega * x + phi) + C
r2 = r2_score(y, y_fit)

# Plot
plt.figure()
plt.scatter(x, y)
plt.plot(x, y_fit)
plt.xlabel("Hour of day")
plt.ylabel("Tweet volume")
plt.title("Tweet Volume vs Hour of Day (Sinusoidal Fit)")
plt.xticks(range(0,24))
plt.show()

A, phi, C, r2
[Figure: Tweet Volume vs Hour of Day (Sinusoidal Fit): hourly tweet counts with the fitted sinusoid overlaid]
Out[2]:
(np.float64(681.7544889776459),
 np.float64(2.9309947026830927),
 np.float64(2083.375000000001),
 0.510131656299354)

ChatGPT's explanation: Here’s the best-fitting function for tweet volume vs hour of day based on your data.

Data setup

  • Each tweet contributes 1 count
  • Aggregated total tweets for each hour of day (0–23) across all dates
  • Goal: capture daily activity pattern

Best model: Sinusoidal (daily cycle)

A periodic model is the natural choice here:

$$ \boxed{ y(h) = A \sin\!\left(\frac{2\pi}{24}h + \phi\right) + C } $$
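(Side note on why the code above can get away with plain linear least squares for this nonlinear-looking model: by the angle-addition identity,

$$ A \sin\!\left(\omega h + \phi\right) = \underbrace{A\cos\phi}_{A_{\sin}}\,\sin(\omega h) + \underbrace{A\sin\phi}_{A_{\cos}}\,\cos(\omega h), $$

so the fit is linear in the coefficients on the $\sin$ and $\cos$ columns, and the code recovers $A = \sqrt{A_{\sin}^2 + A_{\cos}^2}$ and $\phi = \operatorname{atan2}(A_{\cos}, A_{\sin})$ afterwards.)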

Fitted parameters

  • Amplitude $A \approx 682$
  • Phase shift $\phi \approx 2.93$ rad
  • Mean level $C \approx 2083$ tweets per hour-of-day bin

Final fitted function

$$ \boxed{ y(h) \approx 682 \sin\!\left(\frac{2\pi}{24}h + 2.93\right) + 2083 } $$

Fit quality

  • R² ≈ 0.51

That's a reasonable fit for noisy social media behavior (the sinusoid explains about half the variance) and far better than linear or exponential models for this task.

Interpretation

  • Tweet activity follows a clear daily rhythm
  • Lowest activity: early morning hours
  • Peak activity: late afternoon / evening
  • Mean baseline: ~ 2,100 tweets per hour-of-day bin

This shape strongly reflects human sleep–wake cycles across time zones.
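As a quick check of that claim against the fitted parameters, here is a small sketch (it assumes omega and phi from the In [2] cell are still in scope):

# The sine peaks where its argument equals pi/2 (mod 2*pi) and bottoms out at 3*pi/2
peak_hour = ((np.pi / 2 - phi) / omega) % 24
trough_hour = ((3 * np.pi / 2 - phi) / omega) % 24
print(f"peak ≈ hour {peak_hour:.1f}, trough ≈ hour {trough_hour:.1f}")
# With phi ≈ 2.93 this gives a peak near hour 18.8 and a trough near hour 6.8,
# consistent with the late-afternoon peak and early-morning low described above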

Now I'll try to find a best-fit function for a long-tail distribution. ChatGPT's free tier didn't let me run this task again, so I asked VSCode's built-in Copilot to try:
Find a best fit function for tweet volume vs RetweetCount in chatgpt1.csv

In [4]:
# Fits for tweet volume vs RetweetCount (long-tail)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Load data
df = pd.read_csv('datasets/chatgpt1.csv', parse_dates=['Datetime'], low_memory=False)

# extract retweet counts (integers)
rt = pd.to_numeric(df['RetweetCount'], errors='coerce').dropna().astype(int).values

# frequency: number of tweets with given retweet count
vals, counts = np.unique(rt, return_counts=True)
# the long-tail analysis focuses on the tail, so keep only tweets with at least one retweet
tail_mask = vals > 0
x = vals[tail_mask]
y = counts[tail_mask]

# For fitting, expand counts into sample of retweet counts (one per tweet)
# (this creates the sample distribution of RetweetCount)
sample = np.repeat(x, y)

# choose xmin candidates (try a few quantiles)
xmin_candidates = np.unique(np.quantile(sample, [0.5, 0.6, 0.7, 0.8, 0.9]).astype(int))
xmin_candidates = xmin_candidates[xmin_candidates>=1]

def fit_power_law_mle(data, xmin):
    tail = data[data >= xmin]
    n = len(tail)
    if n <= 1:
        return None
    # continuous MLE approximation for the exponent (the Hill estimator)
    alpha = 1.0 + n / np.sum(np.log(tail / xmin))
    # log-likelihood (continuous power law)
    ll = n * np.log(alpha - 1) - n * np.log(xmin) - alpha * np.sum(np.log(tail / xmin))
    return {'alpha': alpha, 'xmin': xmin, 'n': n, 'll': ll}

def fit_lognormal(data, xmin):
    tail = data[data >= xmin]
    if len(tail) == 0:
        return None
    # fit lognormal (fit returns shape, loc, scale)
    shape, loc, scale = stats.lognorm.fit(tail, floc=0)
    # log-likelihood for lognorm
    ll = np.sum(stats.lognorm.logpdf(tail, shape, loc=loc, scale=scale))
    return {'shape': shape, 'loc': loc, 'scale': scale, 'xmin': xmin, 'n': len(tail), 'll': ll}

def fit_exponential(data, xmin):
    tail = data[data >= xmin]
    if len(tail) == 0:
        return None
    # shift so the support starts at 0: fit an exponential to (x - xmin) >= 0
    shifted = tail - xmin
    loc, scale = stats.expon.fit(shifted, floc=0)
    ll = np.sum(stats.expon.logpdf(shifted, loc=loc, scale=scale))
    return {'loc': loc, 'scale': scale, 'xmin': xmin, 'n': len(tail), 'll': ll}

# run fits for candidates
pl_fits = []
ln_fits = []
exp_fits = []
for xmin in xmin_candidates:
    pl = fit_power_law_mle(sample, xmin)
    ln = fit_lognormal(sample, xmin)
    ex = fit_exponential(sample, xmin)
    if pl: pl_fits.append(pl)
    if ln: ln_fits.append(ln)
    if ex: exp_fits.append(ex)

# choose best by AIC (AIC = 2k - 2LL). model complexities: power-law k=1 (alpha), lognormal k=2 (shape, scale), expon k=1 (scale)
def aic(ll, k):
    return 2*k - 2*ll

best_models = []
if pl_fits:
    best_pl = max(pl_fits, key=lambda f: f['ll'])
    best_models.append(('powerlaw', best_pl, aic(best_pl['ll'], 1)))
if ln_fits:
    best_ln = max(ln_fits, key=lambda f: f['ll'])
    best_models.append(('lognormal', best_ln, aic(best_ln['ll'], 2)))
if exp_fits:
    best_ex = max(exp_fits, key=lambda f: f['ll'])
    best_models.append(('exponential', best_ex, aic(best_ex['ll'], 1)))

# sort by AIC (lower better)
best_models.sort(key=lambda t: t[2])

# Print comparison
print("Candidates ranked by AIC (lower better):")
for name, fit, a in best_models:
    print(f"- {name}: xmin={fit['xmin']}, n_tail={fit['n']}, AIC={a:.1f}, ll={fit['ll']:.1f}")
    if name == 'powerlaw':
        print(f"  alpha={fit['alpha']:.3f}")
    if name == 'lognormal':
        print(f"  shape={fit['shape']:.3f}, scale={fit['scale']:.3f}")
    if name == 'exponential':
        print(f"  scale={fit['scale']:.3f}")

# Plot empirical tail (log-log) and fitted curves for best model
plt.figure(figsize=(8,5))
# empirical complementary CDF (CCDF) for discrete retweet counts
sorted_sample = np.sort(sample)
unique_vals, counts_vals = np.unique(sorted_sample, return_counts=True)
cdf = np.cumsum(counts_vals) / counts_vals.sum()
ccdf = 1 - cdf + counts_vals/counts_vals.sum()  # P(X >= x): include the mass at x itself
plt.loglog(unique_vals, ccdf, marker='.', linestyle='none', label='empirical CCDF')

# overlay best fit curves (use parameters from best_models)
xs = np.linspace(max(1, unique_vals.min()), unique_vals.max(), 200)
for name, fit, a in best_models[:2]:  # overlay top 2 models
    xmin = fit['xmin']
    xs_tail = xs[xs>=xmin]
    if name == 'powerlaw':
        alpha = fit['alpha']
        # CCDF ~ (x/xmin)^(1-alpha)
        ccdf_fit = (xs_tail / xmin)**(1 - alpha)
        plt.loglog(xs_tail, ccdf_fit * ccdf[np.searchsorted(unique_vals, xmin)], label=f'powerlaw α={alpha:.2f}')
    elif name == 'lognormal':
        shape, loc, scale = fit['shape'], fit['loc'], fit['scale']
        cdf_fit = stats.lognorm.cdf(xs_tail, shape, loc=loc, scale=scale)
        ccdf_fit = 1 - cdf_fit
        # scale to empirical at xmin
        emp_at_xmin = ccdf[np.searchsorted(unique_vals, xmin)]
        plt.loglog(xs_tail, ccdf_fit * emp_at_xmin / ccdf_fit[0], label=f'lognorm s={shape:.2f}')
    elif name == 'exponential':
        loc, scale = fit['loc'], fit['scale']
        cdf_fit = stats.expon.cdf(xs_tail - xmin, loc=loc, scale=scale)
        ccdf_fit = 1 - cdf_fit
        emp_at_xmin = ccdf[np.searchsorted(unique_vals, xmin)]
        plt.loglog(xs_tail, ccdf_fit * emp_at_xmin / ccdf_fit[0], label=f'expon scale={scale:.2f}')

plt.xlabel('RetweetCount')
plt.ylabel('CCDF')
plt.title('RetweetCount tail: empirical CCDF (log-log) and fitted models')
plt.legend()
plt.grid(True, which='both', ls='--', alpha=0.4)
plt.show()
Candidates ranked by AIC (lower better):
- powerlaw: xmin=10, n_tail=763, AIC=6266.9, ll=-3132.5
  alpha=2.106
- lognormal: xmin=10, n_tail=763, AIC=7090.5, ll=-3543.3
  shape=1.019, scale=24.692
- exponential: xmin=10, n_tail=763, AIC=8001.7, ll=-3999.8
  scale=69.565
[Figure: RetweetCount tail: empirical CCDF (log-log) with the fitted power-law and lognormal curves overlaid]
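As a quick arithmetic check on the AIC column ($\mathrm{AIC} = 2k - 2\,\mathrm{LL}$, lower is better): the power law has $k = 1$ free parameter, so

$$ \mathrm{AIC} = 2(1) - 2(-3132.5) \approx 6267, $$

matching the printed 6266.9 up to rounding of the log-likelihood. The gap of roughly 800 AIC points to the lognormal means the ranking is not a close call.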

Not sure how to interpret this?!
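One possible next step is to sanity-check these hand-rolled fits with the powerlaw package (Alstott et al.), which automates the xmin choice and runs a likelihood-ratio test between candidate distributions. This is a sketch I haven't run here; it assumes pip install powerlaw and the sample array from the cell above:

import powerlaw

# discrete=True because retweet counts are integers; xmin is chosen
# automatically by minimizing the Kolmogorov-Smirnov distance to the data
fit = powerlaw.Fit(sample, discrete=True)
print(fit.power_law.alpha, fit.power_law.xmin)

# Likelihood-ratio test: R > 0 favors the power law over the lognormal,
# and p says whether that preference is statistically meaningful
R, p = fit.distribution_compare('power_law', 'lognormal')
print(R, p)

If R came out positive with a small p, that would back up the AIC ranking above: retweet counts follow a power law with exponent around 2.1, i.e. a handful of tweets get enormous reach while most get almost none.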