[Kelzang Wangdi] - Fab Futures - Data Science

Probability¶

Probability in data analysis¶

Probability is the fundamental mathematical framework that allows data analysts and data scientists to quantify uncertainty and make informed decisions based on data. In data analysis, it is used to model random phenomena, to calculate the likelihood of future events (such as a customer reordering or a machine failing), and to estimate the uncertainty of findings derived from a sample so that conclusions can be drawn about an entire population (a process known as inferential statistics). Key concepts such as probability distributions (e.g., Normal, Binomial) and conditional probability are essential for building and interpreting predictive models, including those used for risk assessment, anomaly detection, and trend forecasting, ultimately transforming raw data into actionable, confidence-backed insights.
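As a small optional check of these ideas, the cell below estimates a plain probability and a conditional probability directly from the StudentsPerformance.csv file used throughout this notebook; the "test preparation course" column and its "completed" value are assumptions about that file and may need to be adjusted.

In [ ]:
import pandas as pd

# Load the same dataset used in the cells below (path taken from those cells)
data = pd.read_csv("~/work/kelzang-wangdi/datasets/StudentsPerformance.csv")

# Event A: a student scores 70 or above in reading
high_reading = data["reading score"] >= 70
p_a = high_reading.mean()                      # P(A) as a simple fraction

# Event B: the student completed the test preparation course
# (column name and value are assumed; adjust them to match the actual file)
completed = data["test preparation course"] == "completed"

# Conditional probability P(A | B) = P(A and B) / P(B)
p_a_given_b = (high_reading & completed).mean() / completed.mean()

print(f"P(reading score >= 70) = {p_a:.2f}")
print(f"P(reading score >= 70 | prep completed) = {p_a_given_b:.2f}")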

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# -------------------------
# 1. Load your data
# -------------------------
# Change the filename to your uploaded CSV
data = pd.read_csv("~/work/kelzang-wangdi/datasets/StudentsPerformance.csv")  

# Change the column name to the one you want to visualize
column_name = "reading score"

# Extract the data as a NumPy array
x = data[column_name].dropna().values

# -------------------------
# 2. Compute statistics
# -------------------------
mean = np.mean(x)
stddev = np.std(x)

# -------------------------
# 3. Plot histogram and points
# -------------------------
plt.hist(x, bins=30, density=True, alpha=0.6)
plt.plot(x, np.zeros_like(x), '|', ms=10)

# -------------------------
# 4. Plot Gaussian curve
# -------------------------
xi = np.linspace(mean - 3*stddev, mean + 3*stddev, 100)
yi = np.exp(-(xi - mean)**2 / (2*stddev**2)) / np.sqrt(2*np.pi*stddev**2)
plt.plot(xi, yi, 'r')
plt.show()
(Figure: histogram of the reading scores overlaid with the fitted Gaussian curve.)

Averaging¶

Averaging Gaussian samples reduces the error by a factor of $\sqrt{N}$ because, when we take multiple independent measurements from a normal distribution, the random fluctuations in each sample tend to cancel each other out. Although each individual measurement has a standard deviation equal to the true width of the distribution, the average of $N$ such measurements becomes increasingly stable. Mathematically, the standard deviation of the mean (also called the standard error) decreases as $\frac{\sigma}{\sqrt{N}}$, where $\sigma$ is the original standard deviation. This means that doubling the number of samples does not cut the error in half; it reduces it more slowly, by a factor of $\sqrt{2}$. As a result, even though taking more samples always improves the accuracy of the estimated mean, the improvement becomes gradually smaller, illustrating the diminishing returns of averaging in statistical estimation.
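Before applying this to the dataset, here is a minimal sketch of the $\sqrt{N}$ scaling using synthetic Gaussian samples; the mean of 0, the standard deviation of 1, and the 5000 repeated trials are arbitrary choices for illustration.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
sigma = 1.0                      # width of the underlying Gaussian (arbitrary)

for N in [10, 100, 1000]:
    # Average N samples, repeated over 5000 trials, and measure the spread
    sample_means = rng.normal(0, sigma, size=(5000, N)).mean(axis=1)
    print(f"N = {N:4d}: spread of the mean = {sample_means.std():.3f}, "
          f"sigma/sqrt(N) = {sigma/np.sqrt(N):.3f}")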

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# -----------------------------
# 1. Load your uploaded data
# -----------------------------
data = pd.read_csv("~/work/kelzang-wangdi/datasets/StudentsPerformance.csv")   # ← change this
column = "math score"               # ← change this

# Extract the data and drop missing values
x = data[column].dropna().values

# True mean and width (std dev) from your data
true_mean = np.mean(x)
true_std = np.std(x)

# -----------------------------
# 2. Sampling settings
# -----------------------------
trials = 100
points = np.arange(10, 500, 25)
means = np.zeros((trials, len(points)))

# -----------------------------
# 3. Sampling from YOUR DATA
# -----------------------------
for p in range(len(points)):
    N = points[p]
    for t in range(trials):
        sample = np.random.choice(x, size=N, replace=True)
        means[t, p] = np.mean(sample)

# -----------------------------
# 4. Theoretical curve
# -----------------------------
plt.plot(points, true_mean + true_std / np.sqrt(points), 'r', label='calculated')
plt.plot(points, true_mean - true_std / np.sqrt(points), 'r')

# -----------------------------
# 5. Estimated mean & stddev
# -----------------------------
estimated_mean = np.mean(means, axis=0)
estimated_std = np.std(means, axis=0)

plt.errorbar(points, estimated_mean, yerr=estimated_std,
             fmt='k-o', capsize=7, label='estimated')

# -----------------------------
# 6. Scatter points for each trial
# -----------------------------
for p in range(len(points)):
    plt.plot(np.full(trials, points[p]), means[:, p], 'o', markersize=2)

plt.xlabel('number of samples averaged')
plt.ylabel('mean estimates')
plt.legend()
plt.show()
(Figure: mean estimates versus number of samples averaged, with the theoretical ±σ/√n band in red and the estimated means with error bars in black.)

Explanation¶

This graph illustrates the Central Limit Theorem and the reduction of uncertainty through averaging. The horizontal axis shows the number of samples averaged (the sample size, $n$), and the vertical axis shows the resulting mean estimates from repeated trials. The individual colored dots are the many different mean estimates obtained at each sample size. The black line with error bars (labeled "estimated") shows the average of these estimates and their standard deviation at each $n$. The red lines (labeled "calculated") show the theoretical population mean plus and minus the theoretical standard error ($\sigma / \sqrt{n}$), indicating how the expected range of sample means shrinks as $n$ increases. As $n$ grows, the spread of the individual mean estimates decreases and both the estimated and calculated uncertainty bands narrow, demonstrating that a larger sample size leads to more consistent and precise estimates that converge toward the true population mean.
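A quick optional numerical check of this behaviour is to compare the spread of the mean estimates with the theoretical $\sigma / \sqrt{n}$ curve; the sketch below assumes the variables points, estimated_std, and true_std from the sampling cell above are still defined in the notebook session.

In [ ]:
import numpy as np

# Assumes `points`, `estimated_std`, and `true_std` from the cell above
for n, emp, theo in zip(points, estimated_std, true_std / np.sqrt(points)):
    print(f"n = {n:3d}: spread of mean estimates = {emp:.2f}, "
          f"sigma/sqrt(n) = {theo:.2f}")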

Entropy¶

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# -------------------------------------------------------
# 1. Load your data
# -------------------------------------------------------
data = pd.read_csv("~/work/kelzang-wangdi/datasets/StudentsPerformance.csv")        # ← change this
column = "reading score"                    # ← change this

# Extract the column values and drop NaN
vals = data[column].dropna().values

# -------------------------------------------------------
# 2. Define histogram parameters
# -------------------------------------------------------
nbins = 256
xmin, xmax = np.min(vals), np.max(vals)
x = np.linspace(xmin, xmax, nbins)

print(f"{nbins} bins = {np.log2(nbins):.0f} bits")

# -------------------------------------------------------
# 3. Entropy function
# -------------------------------------------------------
def entropy(dist):
    positives = dist[dist > 0]     # avoid 0·log(0)
    return -np.sum(positives * np.log2(positives))

# -------------------------------------------------------
# 4. Distributions
# -------------------------------------------------------

# Uniform distribution
uniform = np.ones(nbins) / nbins

# Histogram of your data → convert to probability distribution
hist, edges = np.histogram(vals, bins=nbins, range=(xmin, xmax), density=False)
data_dist = hist / np.sum(hist)   # normalize

# One-hot distribution (peak at middle)
onehot = np.zeros(nbins)
onehot[nbins // 2] = 1

# -------------------------------------------------------
# 5. Plotting
# -------------------------------------------------------
fig, axs = plt.subplots(3, 1, figsize=(8, 10))
fig.canvas.header_visible = False

# Uniform
axs[0].bar(x, uniform, width=(xmax-xmin)/nbins)
axs[0].set_title(f"Uniform entropy: {entropy(uniform):.1f} bits")

# Your data
axs[1].bar(x, data_dist, width=(xmax-xmin)/nbins)
axs[1].set_title(f"Data entropy ({column}): {entropy(data_dist):.1f} bits")

# One-hot
axs[2].bar(x, onehot, width=(xmax-xmin)/nbins)
axs[2].set_title(f"One-hot entropy: {entropy(onehot):.1f} bits")

plt.tight_layout()
plt.show()
256 bins = 8 bits
(Figure: bar plots of the uniform, data, and one-hot distributions with their entropies in bits.)

Explanation on Entropy¶

The probability entropy, often called Shannon entropy $H$, quantifies the average uncertainty or information content inherent in a discrete random variable, $X$, with a probability distribution $p(x_i)$. The formula $H = - \sum_i p(x_i)\log_2 p(x_i)$ is the expected value of the self-information (or "surprisal") of each outcome, where $-\log_2 p(x_i)$ is the information gained upon observing event $x_i$. A high entropy value indicates that the outcomes are nearly equally probable (maximum uncertainty and thus, high information gained when an event occurs), like a fair coin flip, while a low entropy value indicates that some outcomes are highly probable (low uncertainty, less new information is gained from the observation). Essentially, $H$ measures the theoretical minimum number of bits required, on average, to encode or transmit the information generated by the random variable $X$.
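As a minimal illustration of the formula (separate from the analysis above), a fair coin gives 1 bit of entropy while a biased coin gives noticeably less; the entropy helper is redefined here so the cell stands on its own.

In [ ]:
import numpy as np

def entropy(dist):
    positives = dist[dist > 0]               # avoid 0·log(0)
    return -np.sum(positives * np.log2(positives))

fair_coin = np.array([0.5, 0.5])             # maximum uncertainty for two outcomes
biased_coin = np.array([0.9, 0.1])           # one outcome is much more likely

print(f"Fair coin:   {entropy(fair_coin):.2f} bits")    # 1.00 bit
print(f"Biased coin: {entropy(biased_coin):.2f} bits")  # about 0.47 bits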

Reference¶

I used ChatGPT as a supplementary learning tool to deepen my understanding of the Python code, statistical concepts, and machine-learning procedures applied in this work. ChatGPT provided step-by-step explanations of the code structure, clarified the purpose of key functions, and helped interpret outputs such as distributions, entropy, sampling behavior, and convergence patterns. The tool also assisted in adapting existing code examples to my own dataset, making the analytical process clearer and more accessible. Its explanations supported my learning, but all final analysis, interpretation, and implementation decisions were made independently.
