[Pieter van der Hijden] - Fab Futures - Data Science

Session 5: Probability¶

Synopsis¶

TBD

Resources¶

TBD

Assignment¶

  • Investigate the probability distribution of your data
    • See below
  • Set up template notebooks and slides for your data set analysis
    • Done!

Review¶

A. Research ideas¶

Investigate the probability distribution of my data:

  • given an actual list of fablabs, each located in a certain country
  • given a list of countries, each with population size, land area in km2, etc.
  • investigate the distribution of countries by number of fablabs per million inhabitants, same by number of fablabs per 10,000 km2
  • how about linking it to other country categories

B. Research planning and design¶

  • read list of fablabs
  • read list of countries
  • calculate country statistics
  • create histograms
  • analyze
In [14]:
# import python modules
import fabmodules as fm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
In [7]:
# set parameters
    
# colors and more
traffic_strong="grey,red,orange,yellow,green".split(",")
traffic_soft=np.array([(231,231,242),(247,149,148),(247,180,149),(247,220,149),(144,240,156)]) / 255 # grey, red, orange, yellow, green
blue_scale=np.array([(231,231,242),(230,245,255),(189,227,251),(148,209,247),(81,172,230),(16,106,166)]) / 255 # grey, blue 20, 40, 60, 80, 100
green_scale=np.array([(231,231,242),(228,252,231),(196,245,202),(144,240,156),(84,209,95),(18,179,47)]) /255 # grey, green 20, 40, 60, 80, 100

C. Data collection¶

In [8]:
# Read the fablab register
url = "https://fablabs.io/labs.json?class=btn+btn-primary"
fablabs = pd.read_json(url)
fm.log("fablabs ",url,fablabs) 
2025-12-18T13:25Z fablabs  https://fablabs.io/labs.json?class=btn+btn-primary (2696, 23)
In [10]:
# Clean fablabs dataset

# set missing activity status to "unknown"
fablabs['activity_status'] = (fablabs['activity_status'].replace('', pd.NA).fillna('unknown'))

# ensure that country_code always is uppercase
fablabs["country_code"] = fablabs["country_code"].str.upper()

# skip fablabs with activity_status == "closed"
fablabs = fablabs[~(fablabs["activity_status"] == "closed")]

fm.log("fablabs ","cleaning",fablabs) 
2025-12-18T13:25Z fablabs  cleaning (2603, 23)
In [11]:
# Read the country list
url = "http://api.geonames.org/countryInfo?username=fab23workshop"
countries = pd.read_xml(url, parser="etree")
fm.log("countries ",url,countries) 
2025-12-18T13:26Z countries  http://api.geonames.org/countryInfo?username=fab23workshop (250, 18)
In [12]:
# Clean country list

# Set countryCode = "NA" where countryName == "Namibia"
countries.loc[countries["countryName"] == "Namibia", "countryCode"] = "NA"

# Set continent = "NA" where continentName == "North America"
countries.loc[countries["continentName"] == "North America", "continent"] = "NA"

fm.log("countries ","cleaning",countries) 
2025-12-18T13:26Z countries  cleaning (250, 18)
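
The Namibia fix above is presumably needed because pandas treats the string "NA" as a missing value when it parses text input. A minimal sketch of the same effect, using `pd.read_csv` on a made-up two-row table rather than the GeoNames XML itself; `keep_default_na=False` is one way to keep such codes as plain text:

```python
import io

import pandas as pd

# Hypothetical two-row sample, not the GeoNames response
sample = "countryCode,countryName\nNA,Namibia\nNL,Netherlands\n"

# Default parsing: the string "NA" is read as a missing value (NaN)
default = pd.read_csv(io.StringIO(sample))

# keep_default_na=False keeps "NA" as an ordinary string
as_text = pd.read_csv(io.StringIO(sample), keep_default_na=False)

print(default["countryCode"].isna().tolist())  # [True, False]
print(as_text["countryCode"].tolist())         # ['NA', 'NL']
```

Repairing the value after loading, as done in the cell above, gives the same end result for this one country code.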

D. Data processing¶

In [13]:
# Count fablabs per country and add as fablabs_c to countries dataset

aux = fablabs.groupby('country_code').size() 
aux.name='fablabs_c'
countries = countries.merge(aux,left_on = 'countryCode',right_index = True, how = 'left')
countries['fablabs_c'] = countries['fablabs_c'].fillna(0)

df = countries
# Fablabs per million inhabitants
df['fablabs_cpop'] = df['fablabs_c'] * 1_000_000 / df['population']

# Fablabs per 10,000 km2
df['fablabs_carea'] = df['fablabs_c'] * 10_000 / df['areaInSqKm']

countries = df
fm.log("countries ","processed",countries) 
2025-12-18T13:26Z countries  processed (250, 21)
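
To make the processing step easier to check by hand, here is a minimal sketch of the same count, merge, and ratio logic on two tiny made-up tables. All values, and the lab code "XX", are hypothetical. It also lists lab country codes that match no country, because the left merge drops those without warning.

```python
import pandas as pd

# Hypothetical mini data sets; all values are made up for illustration
labs = pd.DataFrame({"country_code": ["NL", "NL", "IS", "XX"]})
ctry = pd.DataFrame({
    "countryCode": ["NL", "IS", "BT"],
    "population": [17_000_000, 350_000, 750_000],
    "areaInSqKm": [41_500.0, 103_000.0, 38_400.0],
})

# Count labs per country code and attach the count to the country table
counts = labs.groupby("country_code").size().rename("fablabs_c")
ctry = ctry.merge(counts, left_on="countryCode", right_index=True, how="left")
ctry["fablabs_c"] = ctry["fablabs_c"].fillna(0).astype(int)

# Fablabs per million inhabitants and per 10,000 km2
ctry["fablabs_cpop"] = ctry["fablabs_c"] * 1_000_000 / ctry["population"]
ctry["fablabs_carea"] = ctry["fablabs_c"] * 10_000 / ctry["areaInSqKm"]

# Lab codes without a matching country do not appear in the merged table
unmatched = sorted(set(labs["country_code"]) - set(ctry["countryCode"]))

print(ctry)
print("unmatched codes:", unmatched)
```

In this toy example the IS row gets 1,000,000 / 350,000 ≈ 2.86 fablabs per million inhabitants from a single lab, which already hints at how small populations inflate the ratio.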

E. Data Study and Analysis¶

ChatGPT prompt: pandas: for the countries where fablabs_c > 0 I would like to create a histogram of fablabs_cpop

In [15]:
df = countries
# Filter
df_filtered = df[df["fablabs_c"] > 0]

# Histogram of fablabs_cpop
plt.hist(df_filtered["fablabs_cpop"].dropna())
plt.xlabel("fablabs_cpop")
plt.ylabel("Frequency")
plt.title("Histogram of fablabs_cpop (countries with fablabs_c > 0)")
plt.show()
[Figure: Histogram of fablabs_cpop (countries with fablabs_c > 0)]

ChatGPT prompt: the plot is rather asymmetric and not very informative; maybe I should calculate the mean value first, taking into account that a fablabs_cpop value of 1.0 (or more) would be great, but maybe the standard deviation gives us an idea of how we should define the pockets of the histogram (if that is the correct word in English)

ChatGPT: You're thinking in the right direction. If a histogram is very asymmetric, summary statistics (mean, median, std) can help you understand the distribution and choose more meaningful bins (that’s the word you want, not “pockets”).

In [16]:
df_filtered = df[df["fablabs_c"] > 0]

mean_val = df_filtered["fablabs_cpop"].mean()
median_val = df_filtered["fablabs_cpop"].median()
std_val = df_filtered["fablabs_cpop"].std()

mean_val, median_val, std_val
Out[16]:
(np.float64(1.7844279691128933), 0.6484243288808196, 4.904440202652281)
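
A mean of about 1.78 against a median of about 0.65 points to a strong right skew, so equal-width bins tend to put most countries into the first bin. One option, sketched below under assumptions, is to use bin edges that grow by a constant factor. The sample is synthetic (a seeded lognormal draw standing in for fablabs_cpop), not the real data.

```python
import numpy as np

# Synthetic right-skewed sample standing in for fablabs_cpop (not real data)
rng = np.random.default_rng(0)
values = rng.lognormal(mean=-0.5, sigma=1.2, size=130)

# Equal-width bins over the full range
linear_counts, linear_edges = np.histogram(values, bins=10)

# Log-spaced bins: 11 edges that grow by a constant factor from min to max
log_edges = np.geomspace(values.min(), values.max(), 11)
log_counts, _ = np.histogram(values, bins=log_edges)

print("linear:", linear_counts)
print("log:   ", log_counts)
```

The same edges could be passed to `plt.hist(values, bins=log_edges)` together with `plt.xscale("log")`. This only works for strictly positive values, which holds here because countries with fablabs_c == 0 are filtered out.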
In [18]:
df = countries['countryCode,fablabs_c,fablabs_cpop,population'.split(',')]
df.to_csv("outputs/countries.csv", index=False)
fm.log("countries","selection",df)
2025-12-18T13:27Z countries selection (250, 4)

Conclusion¶

ChatGPT was blocked for days. Therefore I tried Copilot. It took over my problem, instead of helping me. Here is the result:

[Image from Copilot: BCEI.f909a884-58ef-42c6-84e7-868bfa2baaf3.png]

From CoPilot: Here’s the full analysis of your dataset: The distribution of fablabs_cpop (for countries with at least one fablab) is highly skewed, with mean ≈ 1.80, median ≈ 0.65, standard deviation ≈ 4.92, and skewness ≈ 6.44. This indicates a long right tail: most countries cluster at low values, but a few outliers dominate the upper range.

🔍 Key Findings

  • Mean vs. Median: The mean is much higher than the median, confirming the skew.
  • Standard Deviation: At nearly 5, the spread is wide compared to the central tendency.
  • Skewness: A value above 6 shows extreme asymmetry, driven by countries like FO (41.2), IS (36.8), GF (15.3), and BT (9.3).

📊 Histogram Insights

  • Normal-scale histogram: Most countries fall below 5.0, with a dense cluster under 1.0.
  • Log-scale histogram: Reveals the long tail more clearly, showing how rare but impactful the high values are.
  • Std-based bins: Using bins aligned with standard deviation highlights how few countries exceed even 1–2 std above the mean.

📈 KDE & Boxplot

  • KDE plot: Shows a sharp peak near 0.5–1.0, then a long tapering tail.
  • Boxplot: Outliers dominate the visualization, reinforcing the skewness.

🧩 Interpretation

  • Most countries have modest fablabs_cpop values (<1.0).
  • A handful of territories (small populations with multiple fablabs) inflate the distribution, e.g., Faroe Islands, Iceland, French Guiana.
  • Bin suggestion: The Freedman–Diaconis rule recommends ~22 bins, balancing detail and readability.
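
The Freedman–Diaconis rule mentioned in the Copilot answer sets the bin width to 2 · IQR / n^(1/3). To take some of the magic out of it, the sketch below computes that width by hand and compares the resulting bin count with NumPy's built-in `bins="fd"` option. The sample is again synthetic, so the count will not match the ~22 bins reported for the real data.

```python
import numpy as np

# Synthetic right-skewed sample standing in for fablabs_cpop (not real data)
rng = np.random.default_rng(1)
values = rng.lognormal(mean=-0.5, sigma=1.2, size=130)

# Freedman-Diaconis: bin width = 2 * IQR / n^(1/3)
q75, q25 = np.percentile(values, [75, 25])
iqr = q75 - q25
width = 2.0 * iqr * values.size ** (-1.0 / 3.0)

# Number of equal-width bins needed to cover the data range
n_bins = int(np.ceil((values.max() - values.min()) / width))

# NumPy's own implementation of the same rule
edges = np.histogram_bin_edges(values, bins="fd")

print(f"bin width {width:.3f}: {n_bins} bins by hand, {len(edges) - 1} from NumPy")
```

Because the width depends on the interquartile range and not on the standard deviation, a few extreme countries change it far less than they change std-based bins.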

CoPilot prompt: Maybe it is better to skip any country with less than 1 million population. Could you suggest the Python code for me?

In [12]:
# Filter: keep only countries with >= 1,000,000 population AND fablabs_c > 0
df_filtered = df[(df["population"] >= 1_000_000) & (df["fablabs_c"] > 0)]

# Now you can analyze fablabs_cpop distribution
stats = {
    "mean": df_filtered["fablabs_cpop"].mean(),
    "median": df_filtered["fablabs_cpop"].median(),
    "std": df_filtered["fablabs_cpop"].std(),
    "skew": df_filtered["fablabs_cpop"].skew()
}

print(stats)
{'mean': np.float64(0.7967723969002516), 'median': 0.497799527975779, 'std': 0.8733868149300991, 'skew': np.float64(1.8342123054356458)}
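
After the population filter the mean dropped from about 1.78 to about 0.80, while the median moved only from about 0.65 to about 0.50. A small worked example with made-up numbers shows why: a single extreme value pulls the mean strongly but barely shifts the median.

```python
import statistics

# Hypothetical fablabs-per-million values for nine larger countries
large = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.5]

# The same list plus one small territory with a very high ratio
with_small = large + [40.0]

for label, data in [("without small territory", large),
                    ("with small territory", with_small)]:
    mean_val = statistics.mean(data)
    median_val = statistics.median(data)
    print(f"{label}: mean={mean_val:.2f}, median={median_val:.2f}")
```

Here the mean goes from 0.60 to 4.54 and the median from 0.50 to 0.55, the same pattern as in the real data, only more extreme.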

CoPilot: Once you’ve filtered your dataset (e.g. excluding countries under 1M population), you can visualize the distribution with histograms, KDE, and boxplots using matplotlib and seaborn. Here’s a complete example:

In [19]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Note: only the fablabs_c > 0 filter is applied here;
# the population >= 1,000,000 filter from the previous cell is not used
df_filtered = df[df["fablabs_c"] > 0]

# Extract the series
values = df_filtered["fablabs_cpop"].dropna()

# --- Histogram (normal scale) ---
plt.figure(figsize=(8,5))
sns.histplot(values, bins=30, kde=False)
plt.title("Histogram of fablabs_cpop (normal scale)")
plt.xlabel("fablabs_cpop")
plt.ylabel("Count")
plt.show()

# --- Histogram (log scale) ---
plt.figure(figsize=(8,5))
sns.histplot(values, bins=30, kde=False)
plt.yscale("log")
plt.title("Histogram of fablabs_cpop (log scale)")
plt.xlabel("fablabs_cpop")
plt.ylabel("Count (log scale)")
plt.show()

# --- Histogram with std-based bins ---
import numpy as np
mean = values.mean()
std = values.std()
bin_edges = np.arange(0, values.max() + std, std)

plt.figure(figsize=(8,5))
sns.histplot(values, bins=bin_edges, kde=False)
plt.title("Histogram of fablabs_cpop (std-based bins)")
plt.xlabel("fablabs_cpop")
plt.ylabel("Count")
plt.show()

# --- KDE plot ---
plt.figure(figsize=(8,5))
sns.kdeplot(values, fill=True)
plt.title("KDE plot of fablabs_cpop")
plt.xlabel("fablabs_cpop")
plt.show()

# --- Boxplot ---
plt.figure(figsize=(6,4))
sns.boxplot(x=values)
plt.title("Boxplot of fablabs_cpop")
plt.xlabel("fablabs_cpop")
plt.show()
[Figure: Histogram of fablabs_cpop (normal scale)]
[Figure: Histogram of fablabs_cpop (log scale)]
[Figure: Histogram of fablabs_cpop (std-based bins)]
[Figure: KDE plot of fablabs_cpop]
[Figure: Boxplot of fablabs_cpop]

What would be interesting to learn more from the SDG profile?¶

  • See the 17 SDGs as nodes and the profiles as the links between the nodes?
  • See the fablabs as the nodes and the profiles as the link?
  • Just look at the distribution over the countries (like now)?
  • Compare with the global SDG index and look for matches and mismatches?

F. Data Publishing and Access¶


G. Data Preservation¶


H. Data Re-use¶

Evaluation and Follow-up¶

I did manage to generate some interesting plots of the probability distribution. However, I find the result unsatisfactory and will have to delve deeper into this first. Right now it involves too much magic that I cannot properly explain or defend.

Follow-up¶

Write my own basic set of Python functions for statistics.
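
As a possible starting point, here is a minimal sketch of four such functions in plain Python. The function names are my own choice. The standard deviation divides by n - 1 and the skewness uses the bias-adjusted Fisher-Pearson formula, which as far as I know are the defaults behind pandas `.std()` and `.skew()`, so the results should be comparable with the cells above.

```python
import math


def mean(xs):
    """Arithmetic mean."""
    return sum(xs) / len(xs)


def median(xs):
    """Middle value, or the average of the two middle values."""
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2


def std(xs):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def skew(xs):
    """Bias-adjusted Fisher-Pearson skewness; needs at least 3 values."""
    n, m = len(xs), mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n  # third central moment
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5


# Made-up sample with one large value on the right
data = [1.0, 2.0, 3.0, 4.0, 10.0]
print(mean(data), median(data), round(std(data), 3), round(skew(data), 3))
```

For this sample the mean (4.0) lies above the median (3.0) and the skewness is positive, the same signature of a right tail as in the fablabs_cpop data.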

Review¶