[Pieter van der Hijden] - Fab Futures - Data Science

Data Science Session 2: Tools¶

Synopsis¶

Neil introduced tools, with a focus on open source tools and especially programming tools (see resources).

Resources¶

  • Tools
    • Programming
      • Javascript
      • Rust; designed to fix security and stability issues found in other languages
      • Python tutorial
      • Jupyter
      • Git; distributed version control
    • Mathematics
      • Jax; Python, NumPy acceleration on CPUs, GPUs, TPUs
      • NumPy; tutorial; data types
      • Numba; Python compilation for speed
      • PyTorch; deep learning
      • scikit-learn
      • SciPy
    • Visualization
      • D3; wide range of browser-based visualizations
      • Matplotlib; tutorial; gallery
      • Plotly; low code, browser applications
      • Seaborn; statistical plotting
      • Vega-Altair; declarative rather than imperative visualization
    • Data
      • files
        • CSV, XML, YAML, binary, ...
      • spreadsheets
        • JupyterLab spreadsheet editor
      • pandas
      • MySQL

My addition:

  • The Data Workbench and private collection of tools

Assignment¶

  • Visualize your datasets
    1. See below for timelines of fablabs
    2. See also the visualizations we used in the notebook for session 01.

A. Research ideas¶

The purpose of this notebook is to:

  • build time series of fablab data
  • visualize these as timelines

B. Research planning and design¶

  • collect archived copies of fablab data covering the last six years (save them for use in session 3)
  • create a combined dataset
  • produce timelines at world level, for a single country, and for all countries
In [1]:
# import python modules
import fabmodules as fm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from datetime import datetime
In [2]:
# set parameters
output_path = fm.output_path = "outputs/"
today = fm.today = datetime.today().strftime("%Y-%m-%d")

# colors and more
traffic_strong="grey,red,orange,yellow,green".split(",")
traffic_soft=np.array([(231,231,242),(247,149,148),(247,180,149),(247,220,149),(144,240,156)]) / 255 # grey, red, orange, yellow, green
blue_scale=np.array([(231,231,242),(230,245,255),(189,227,251),(148,209,247),(81,172,230),(16,106,166)]) / 255 # grey, blue 20, 40, 60, 80, 100
green_scale=np.array([(231,231,242),(228,252,231),(196,245,202),(144,240,156),(84,209,95),(18,179,47)]) /255 # grey, green 20, 40, 60, 80, 100

C. Data collection¶

Fablabs.io only produces a snapshot (i.e. no timelines) of the fablab register, so we need to combine snapshots from the past to compile timelines.

  1. We manually copied the backup collection labs.json (starting December 2018) to ../datasets. We only picked the "medio" files (close to or on Jun 30) and the "ultimo" files (close to or on Dec 31) of each calendar year.
  2. Once we have read these files, we will refer to their dates as 18u, 19m, 19u, ..., 24m, 24u, 25m (24u = 2024 ultimo, 25m = 2025 medio).
  3. We will limit the fields we read to:
    • id, slug, name for identification
    • activity_status (plan, active, corona, closed) for detecting inflow and outflow (empty or NaN values will be replaced by "unknown"); activity_status did not exist before 19u.
    • kind_name (mini, mobile, fab_lab) for detecting possible growth (empty or NaN values will be replaced by "unknown")
    • country_code (ISO 3166 two-character code) for grouping
  4. We will compose a single dataframe with all these data, with an extra column "date_code" indicating their date, e.g. 24u or 25m.
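The filename-to-date-code mapping described in step 2 can be sketched as a small stand-alone helper (a hypothetical extraction of the logic used inside read_single_labs below):

```python
def date_code(filename: str) -> str:
    """Map an archive filename like '2024.12.31_labs.json' to a date code like '24u'.

    'm' (medio) = mid-year snapshot (June/July); 'u' (ultimo) = end-of-year snapshot.
    """
    year = filename[2:4]    # last two digits of the year
    month = filename[5:7]
    return year + ("m" if month in ("06", "07") else "u")

print(date_code("2024.12.31_labs.json"))  # → 24u
print(date_code("2025.06.30_labs.json"))  # → 25m
```

Note that the mid-year snapshots were not always taken exactly on Jun 30 (e.g. 2019.07.04), which is why both "06" and "07" map to "m".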

List the archived fablab files¶

In [3]:
# Cesar García helped to figure this out a couple of years ago
# list the datasets directory
history = sorted(os.listdir("datasets"))
history
Out[3]:
['.gitignore',
 '.ipynb_checkpoints',
 '2019.07.04_labs.json',
 '2019.12.27_labs.json',
 '2020.07.01_labs.json',
 '2020.12.31_labs.json',
 '2021.07.06_labs.json',
 '2021.12.31_labs.json',
 '2022.07.03_labs.json',
 '2022.12.31_labs.json',
 '2023.06.30_labs.json',
 '2023.12.31_labs.json',
 '2024.06.30_labs.json',
 '2024.12.31_labs.json',
 '2025.06.30_labs.json']

Read the fablab files and combine into a single dataset (add a date_code to each record)¶

In [4]:
# if a labs.json file is not well-formed and causes errors,
# it will be skipped (shown as (0, 0) in the log messages)

# read and clean a single labs.json file
def read_single_labs(filename):
    date_code = filename[2:4] + ("m" if filename[5:7] in ("06", "07") else "u")
    # read the fablab register snapshot
    url = "datasets/" + filename
    try:
        df = pd.read_json(url)  # read all fields; activity_status did not exist before 19u
        if 'activity_status' not in df.columns:
            df['activity_status'] = "unknown"
        df['activity_status'] = df['activity_status'].replace('', pd.NA)
        df['activity_status'] = df['activity_status'].fillna('unknown')
        df['kind_name'] = df['kind_name'].replace('', pd.NA)
        df['kind_name'] = df['kind_name'].fillna('unknown')
        df["country_code"] = df["country_code"].str.upper()
        df['date_code'] = date_code
    except Exception:  # not a bare except, so SystemExit/KeyboardInterrupt still propagate
        df = pd.DataFrame()
    fm.log(date_code, url, df)
    return df
In [5]:
# we restrict the fields to the most relevant ones
# we always can extend this fieldlist later

fieldlist = "date_code,id,slug,name,activity_status,kind_name,country_code".split(",")
combined = pd.DataFrame()
for each in history:
    if "labs.json" in each:
        df = read_single_labs(each)
        if df.shape[0] > 0:
            aux = df[fieldlist]
            combined = pd.concat([combined, aux])
fm.log("combined","dataframe",combined)
2025-12-18T12:32Z 19m datasets/2019.07.04_labs.json (1729, 24)
2025-12-18T12:32Z 19u datasets/2019.12.27_labs.json (1849, 24)
2025-12-18T12:32Z 20m datasets/2020.07.01_labs.json (1933, 24)
2025-12-18T12:32Z 20u datasets/2020.12.31_labs.json (1998, 24)
2025-12-18T12:32Z 21m datasets/2021.07.06_labs.json (2029, 24)
2025-12-18T12:32Z 21u datasets/2021.12.31_labs.json (2039, 24)
2025-12-18T12:32Z 22m datasets/2022.07.03_labs.json (2069, 24)
2025-12-18T12:32Z 22u datasets/2022.12.31_labs.json (2117, 24)
2025-12-18T12:32Z 23m datasets/2023.06.30_labs.json (2144, 24)
2025-12-18T12:32Z 23u datasets/2023.12.31_labs.json (2221, 24)
2025-12-18T12:32Z 24m datasets/2024.06.30_labs.json (2548, 24)
2025-12-18T12:32Z 24u datasets/2024.12.31_labs.json (2597, 24)
2025-12-18T12:32Z 25m datasets/2025.06.30_labs.json (2636, 24)
2025-12-18T12:32Z combined dataframe (27909, 7)
In [6]:
combined
Out[6]:
date_code id slug name activity_status kind_name country_code
0 19m 530 kaasfabriek Kaasfabriek | FabLab regio Alkmaar unknown fab_lab NL
1 19m 1001 vmssfablab Valley Middle School of STEM FAB LAB unknown fab_lab US
2 19m 1193 fablab276valdereuil FabLab 276 Val-de-Reuil unknown fab_lab FR
3 19m 17 ping PiNG unknown fab_lab FR
4 19m 20 fablabinsastrasbourg FabLab INSA Strasbourg unknown fab_lab FR
... ... ... ... ... ... ... ...
2631 25m 13 fablabegypt Fab Lab Egypt active fab_lab EG
2632 25m 2348 martiguesFabLab Martigues'FabLab active fab_lab FR
2633 25m 1730 biscastmanfablab BISCAST ManFabLab active fab_lab PH
2634 25m 1587 fablabjordan The Makerspace (previously TechWorks Amman) active fab_lab JO
2635 25m 2210 fablabaq FabLab L'Aquila active fab_lab IT

27909 rows × 7 columns

Read the countries from geonames.org¶

For the moment, we assume that the list of countries maintained by geonames.org offers sufficient data.
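The two "NA" fixes in the cell below are needed because pandas text readers treat the literal string "NA" as a missing value by default, which clashes with Namibia's country code and North America's continent code. A minimal illustration of the pitfall with read_csv and hypothetical data:

```python
import io
import pandas as pd

csv_data = "countryCode,countryName\nNA,Namibia\nNL,Netherlands\n"

# by default, the string "NA" is parsed as missing (NaN) ...
df_default = pd.read_csv(io.StringIO(csv_data))

# ... unless the default missing-value markers are switched off
df_literal = pd.read_csv(io.StringIO(csv_data), keep_default_na=False)

print(df_default.loc[0, "countryCode"])   # nan
print(df_literal.loc[0, "countryCode"])   # NA
```

The cell below takes the other route: it reads the XML as-is and restores the two "NA" values afterwards.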

In [7]:
# Read the country list
url = "http://api.geonames.org/countryInfo?username=fab23workshop"
countries = pd.read_xml(url, parser="etree")

# Set country_code = "NA" where countryName == "Namibia"
countries.loc[countries["countryName"] == "Namibia", "countryCode"] = "NA"

# Set continent = "NA" where continentName == "North America"
countries.loc[countries["continentName"] == "North America", "continent"] = "NA"

fm.log("countries ",url,countries) 
2025-12-18T12:33Z countries  http://api.geonames.org/countryInfo?username=fab23workshop (250, 18)
In [8]:
countries.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   countryCode       250 non-null    object 
 1   countryName       250 non-null    object 
 2   isoNumeric        250 non-null    int64  
 3   isoAlpha3         250 non-null    object 
 4   fipsCode          247 non-null    object 
 5   continent         250 non-null    object 
 6   continentName     250 non-null    object 
 7   capital           241 non-null    object 
 8   areaInSqKm        250 non-null    float64
 9   population        250 non-null    int64  
 10  currencyCode      249 non-null    object 
 11  languages         247 non-null    object 
 12  geonameId         250 non-null    int64  
 13  west              250 non-null    float64
 14  north             250 non-null    float64
 15  east              250 non-null    float64
 16  south             250 non-null    float64
 17  postalCodeFormat  177 non-null    object 
dtypes: float64(5), int64(3), object(10)
memory usage: 35.3+ KB

D. Data processing¶

Count operational fablabs by country¶

In [9]:
# Count operational fablabs (all activity status except "closed") 
df = combined
df_filtered = df[df["activity_status"] != "closed"]

result = (
    df_filtered
    .groupby(["country_code", "date_code"])
    .size()
    .reset_index(name="count")
)

fm.log("result","",result) 
2025-12-18T12:33Z result  (1689, 3)

Check result for single country (use iso-2 code, uppercase)¶

In [10]:
country_code = "FR" # change if you like
result[result["country_code"]==country_code]
Out[10]:
country_code date_code count
501 FR 19m 213
502 FR 19u 217
503 FR 20m 225
504 FR 20u 234
505 FR 21m 234
506 FR 21u 235
507 FR 22m 236
508 FR 22u 242
509 FR 23m 242
510 FR 23u 245
511 FR 24m 275
512 FR 24u 279
513 FR 25m 281

E. Data Study and Analysis¶

In [11]:
# make countries available in fabmodules
fm.countries = countries

Line chart world fablabs¶

In [12]:
# Create line chart for World 
fm.plot_counts(result)
(image: line chart of operational fablab counts, world level)

Line chart single country fablab¶

In [13]:
# create line chart for single country
country_code = "FR" # change if you like
df = result
print("Plotting:", country_code)  
fm.plot_counts(df, country=country_code)
Plotting: FR
(image: line chart of operational fablab counts for the selected country)

Line charts of all countries (remove cell output before committing)¶

In [ ]:
# create line charts for all countries
# BE CAREFUL: REMOVE CELL OUTPUT BEFORE UPDATING REPOSITORY!!!
df = result
for c in sorted(df["country_code"].dropna().unique()):
    print("Plotting:", c)   # optional
    fm.plot_counts(df, country=c)

F. Data Publishing and Access¶

See the images added to the outputs folder (not in repository).

G. Data Preservation¶

Save result file for later use¶

In [14]:
# save result for use in session 3
with pd.ExcelWriter(output_path + "fablabcounts.xlsx", engine="openpyxl") as writer:
    result.to_excel(writer, index=False)

H. Data Re-use¶

See session 03.

Evaluation and Follow-up¶

So far we had only noted the end-of-year number of fablabs and updated a simple spreadsheet. Now a long-time wish could be fulfilled: combining different "freeze" copies of the labs.json files and calculating the end-of-year counts as a time series automatically. (Note: the old numbers were correct nonetheless.) Many more analyses are now possible. Above, we showed the timelines per country; further possibilities follow below.

Follow-up¶

  • Calculate the inflow and outflow of fablabs for every 6-month period during the last six years; visualize and analyze these figures. How are our "birth rate" and "death rate" developing (by country?)? Have any fablabs been added over the past six years that have since closed?
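The inflow/outflow calculation can be sketched by comparing the sets of lab ids present in consecutive snapshots. A minimal sketch, assuming a combined dataframe with the id and date_code columns built above (the miniature example data are hypothetical):

```python
import pandas as pd

# hypothetical miniature version of the combined snapshot dataframe
combined = pd.DataFrame({
    "id":        [1, 2, 1, 2, 3, 2, 3, 4],
    "date_code": ["24m", "24m", "24u", "24u", "24u", "25m", "25m", "25m"],
})

periods = ["24m", "24u", "25m"]  # chronological order of the snapshots
ids = {p: set(combined.loc[combined["date_code"] == p, "id"]) for p in periods}

flows = pd.DataFrame([
    {
        "period": f"{prev}->{curr}",
        "inflow": len(ids[curr] - ids[prev]),    # labs that appeared
        "outflow": len(ids[prev] - ids[curr]),   # labs that disappeared
    }
    for prev, curr in zip(periods, periods[1:])
])
print(flows)
```

This defines outflow as disappearance of an id between snapshots; an alternative definition could use the activity_status field (e.g. a transition to "closed") instead.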

Review¶

In [ ]: