[Drukdra Dorji] - Fab Futures - Data Science

Week 7: Transform Datasets (09 December 2025)

The goal of data representation is to uncover the most informative structure within a dataset so that patterns and relationships can be better understood. Preprocessing helps with this. Standardization subtracts the mean of each variable and divides by its standard deviation, so every variable has zero mean and unit variance; this prevents models from placing undue importance on features with large numeric ranges. Sphering (whitening) goes a step further and also removes the correlations between variables, so the covariance matrix becomes the identity.

Principal Components Analysis (PCA) transforms data into uncorrelated principal components ordered by the variance they explain, which makes it useful for dimensionality reduction, feature extraction, and visualizing high-dimensional data. Independent Components Analysis (ICA) complements PCA by separating unknown mixed signals, as in the cocktail-party problem, by maximizing the statistical independence of the components.

Time-series analysis focuses on data that evolves over time, such as DTMF phone signals, and offers additional tools for uncovering trends and changes. Sonification lets users listen to data, turning patterns into audible cues.

Filtering cleans and refines signals. Sampling at more than twice the highest frequency present in the signal (the Nyquist rate) prevents aliasing, while digital filters such as low-pass, high-pass, and band-pass filters help remove noise, isolate meaningful components, or eliminate unwanted baselines. Filter banks apply multiple band-pass filters in parallel to detect features across frequency ranges, and Butterworth filters are valued for their maximally flat response in the pass band, which suits applications that need minimal distortion of pass-band amplitudes.
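As a rough illustration of the preprocessing ideas above, the following sketch standardizes a small synthetic dataset, runs PCA through an SVD, and then spheres the result. The data, the random seed, and all variable names are assumptions made up for this example; they are not part of the ALD dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features on very different scales
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
X[:, 1] += 30.0 * X[:, 0]  # make feature 1 correlate with feature 0

# Standardization: zero mean and unit variance for each feature
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized data
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                  # data expressed in principal components
explained = S**2 / np.sum(S**2)    # fraction of total variance per component

# Sphering (whitening): rescale each principal component to unit variance
W = scores / scores.std(axis=0)

print("explained variance ratio:", np.round(explained, 3))
print("covariance of whitened data:")
print(np.round(np.cov(W, rowvar=False, ddof=0), 3))
```

After standardization the features still correlate with each other; only after the whitening step does the covariance matrix come out as the identity, which is the practical difference between the two operations.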

Assignment: Transform the Datasets

Compiled Dataset: Alcohol-Related Deaths / Burden in Bhutan

Introduction to the Dataset

This dataset presents a compiled summary of alcohol-related deaths and alcohol-attributable health indicators in Bhutan, drawn from publicly available national and international sources. The data combines information from the Ministry of Health’s Annual Health Bulletins, the National Statistics Bureau’s Vital Statistics Reports, WHO country profiles, and published research such as the Bhutan Health Journal. It includes annual figures on alcohol-related liver disease (ALD) deaths, the proportion of deaths attributed to alcohol in health facilities, trends across multiple years, and population-level alcohol-consumption indicators. The dataset is designed to provide a clear picture of how alcohol contributes to mortality and public health challenges in Bhutan, enabling further analysis, comparison, and interpretation for academic or policy-related purposes.

Transform Dataset

In [5]:
import pandas as pd
import numpy as np
import re
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

# ------------------------------------------
# 1. LOAD YOUR CSV FILE
# ------------------------------------------
df = pd.read_csv("datasets/ALD_Data_Big.csv")

# ------------------------------------------
# 2. EXTRACT NUMERIC VALUE FROM STRING
# ------------------------------------------
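def extract_numeric(val):
    """Return the first number found in a string, or NaN if there is none."""
    if isinstance(val, str):
        # Group the alternatives so the optional sign also applies to integers
        nums = re.findall(r"[-+]?(?:\d*\.\d+|\d+)", val)
        return float(nums[0]) if nums else np.nan
    return val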

df["Num"] = df["Value"].apply(extract_numeric)

# ------------------------------------------
# 3. CLEAN YEAR COLUMN (HANDLE "2012 → 2016")
# ------------------------------------------
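def extract_year(y):
    """Return the year as a number; for ranges like "2012 → 2016" use the end year."""
    if isinstance(y, str) and "→" in y:
        nums = re.findall(r"\d+", y)
        return int(nums[-1]) if nums else np.nan  # take ending year
    try:
        return int(y)
    except (TypeError, ValueError):
        # Not convertible to an integer year (e.g. missing or free text)
        return np.nan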

df["YearClean"] = df["Year"].apply(extract_year)

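# Remove rows without year or numeric value
df = df.dropna(subset=["YearClean", "Num"])
# NaN entries turn the column into floats; range() in step 5 needs integers
df["YearClean"] = df["YearClean"].astype(int)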

# ------------------------------------------
# 4. GROUP DUPLICATE YEARS BY AVERAGING
# ------------------------------------------
df = df.groupby("YearClean")["Num"].mean().reset_index()

# ------------------------------------------
# 5. INTERPOLATE MISSING YEARS
# ------------------------------------------
year_range = pd.DataFrame({
    "YearClean": range(df["YearClean"].min(), df["YearClean"].max() + 1)
})

df = year_range.merge(df, on="YearClean", how="left")
df["Num"] = df["Num"].interpolate()

# ------------------------------------------
# 6. NORMALIZE VALUES (IMPORTANT)
# ------------------------------------------
signal = df["Num"].values
signal = (signal - signal.mean()) / signal.std()

# ------------------------------------------
# 7. CREATE SPECTROGRAM
# ------------------------------------------
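# Auto-adjust window size for small dataset:
# at least 4 samples per window, but never longer than the signal itself
nperseg = min(len(signal), max(4, len(signal) // 3))

# fs=1 means one sample per year, so frequencies are in cycles per year
f, t, Sxx = spectrogram(signal, fs=1, nperseg=nperseg)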

# ------------------------------------------
# 8. PLOT WITH LEGEND (COLORBAR)
# ------------------------------------------
plt.figure(figsize=(9, 4))
plt.pcolormesh(t, f, Sxx, shading="gouraud")

plt.title("Spectrogram of ALD Trend Dataset")
plt.xlabel("Time Window Center (years from start of series)")
plt.ylabel("Frequency (cycles per year)")

# ⭐ LEGEND (SPECTROGRAM COLORBAR)
cbar = plt.colorbar()
cbar.set_label("Intensity (Power)", rotation=90)

plt.tight_layout()
plt.show()

print("Spectrogram shape:", Sxx.shape)
[Figure: Spectrogram of ALD Trend Dataset, with time window on the x-axis, frequency on the y-axis, and a colorbar for intensity (power)]
Spectrogram shape: (3, 2)
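Steps 4 and 5 of the cell above (averaging duplicate years, then interpolating missing ones) can be checked on a tiny table. The numbers below are hypothetical values chosen only for illustration and are not taken from the ALD dataset.

```python
import pandas as pd

# Hypothetical values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "YearClean": [2012, 2012, 2014, 2016],
    "Num": [10.0, 20.0, 30.0, 50.0],
})

# Step 4: average rows that share the same year
toy = toy.groupby("YearClean")["Num"].mean().reset_index()

# Step 5: build the full year range and fill the gaps linearly
years = pd.DataFrame({
    "YearClean": range(toy["YearClean"].min(), toy["YearClean"].max() + 1)
})
filled = years.merge(toy, on="YearClean", how="left")
filled["Num"] = filled["Num"].interpolate()

print(filled)
```

The two 2012 rows collapse to their mean of 15.0, and the missing years 2013 and 2015 receive the midpoints 22.5 and 40.0. Interpolated years are estimates rather than observations, which is worth keeping in mind when reading the spectrogram.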

Explanation

The spectrogram shows how the standardized yearly ALD-related values (the per-year averages computed in step 4, covering indicators such as deaths and incidence changes) fluctuate over time, displaying both how strong the fluctuations are and how fast they occur. The horizontal axis shows the time windows in years, the vertical axis shows frequency in cycles per year (how quickly the values change), and the color shows the power at each frequency within each window. With the default colormap, darker cells mark low power, meaning little variation at that rate, while brighter cells mark stronger fluctuations. Because the output shape is (3, 2), the plot has only three frequency bins and two time windows, so it gives a coarse view of stable versus changing periods rather than a detailed picture of year-to-year patterns.
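To see where the frequency bins and time windows of such a small spectrogram come from, here is a minimal sketch on a synthetic series. The 10-sample sine wave is an assumption for illustration and is not the ALD data; only the `nperseg` rule and the `spectrogram` call mirror the cell above.

```python
import numpy as np
from scipy.signal import spectrogram

# Hypothetical 10-year series, one sample per year (not the ALD data)
years = np.arange(10)
signal = np.sin(2 * np.pi * 0.25 * years)  # one full cycle every 4 years

nperseg = max(4, len(signal) // 3)  # same rule as the cell above; gives 4 here
f, t, Sxx = spectrogram(signal, fs=1, nperseg=nperseg)

print("frequency bins (cycles/year):", f)
print("window centers (years from start):", t)
print("Sxx shape:", Sxx.shape)
```

With a 4-sample window the only available frequencies are 0, 0.25, and 0.5 cycles per year, and a 10-sample series fits just two non-overlapping windows. A short yearly series therefore cannot resolve fine detail in either frequency or time, whatever the underlying data look like.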
