[Pieter van der Hijden] - Fab Futures - Data Science

Data Science Session 3: Fitting¶

Synopsis¶

Neil presented "fitting", in short: adjusting a function to match data. We quickly went through variables (scalar, vector, matrix), functions (linear, nonlinear), arithmetic (sum, integral, derivative), and the errors involved, to arrive at fitting (linear, polynomial) and the right tool for it: the numpy function polyfit.
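
As a minimal sketch of what polyfit does (with made-up numbers, not the fablab data): np.polyfit(x, y, 1) returns the slope and intercept of the least-squares line through the points.

import numpy as np

# made-up points scattered around the line y = 2x + 1
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 11.0])

# degree-1 (linear) fit; coefficients come back highest degree first
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # roughly 2 and 1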

Resources¶

  • polyfit
  • The Nature of Mathematical Modeling
  • My additions:
    • The Bubonic Plague, Falling Roof Beams, Dead Fish and Fablabs; system dynamics to the fablabs impact

Assignment¶

  • Fit a function to your data

A. Research ideas¶

The purpose of this notebook is to fit a linear function to the fablab counts over time.

B. Research planning and design¶

  • read the fablab timelines (the result dataset from session 02)
  • convert the date_codes (e.g. 24u, 25m) to numeric
  • take the global fablab counts over time
  • try to fit various functions
  • extend the results to an arbitrary country and to all countries
In [1]:
# import python modules
import fabmodules as fm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime
from matplotlib.ticker import FuncFormatter
In [2]:
# set parameters
output_path = fm.output_path = "outputs/"
today = fm.today = datetime.today().strftime("%Y-%m-%d")

# colors and more
traffic_strong="grey,red,orange,yellow,green".split(",")
traffic_soft=np.array([(231,231,242),(247,149,148),(247,180,149),(247,220,149),(144,240,156)]) / 255 # grey, red, orange, yellow, green
blue_scale=np.array([(231,231,242),(230,245,255),(189,227,251),(148,209,247),(81,172,230),(16,106,166)]) / 255 # grey, blue 20, 40, 60, 80, 100
green_scale=np.array([(231,231,242),(228,252,231),(196,245,202),(144,240,156),(84,209,95),(18,179,47)]) /255 # grey, green 20, 40, 60, 80, 100

C. Data collection¶

Read the fablab timelines¶

In [3]:
# read results from session 2
url = output_path + "fablabcounts.xlsx" 
result = pd.read_excel(url)
# country_code "NA" (Namibia) is read as missing (NaN); restore it
result["country_code"]=result["country_code"].fillna("NA")
fm.log("fablabcounts",url,result)
2025-12-17T15:21Z fablabcounts outputs/fablabcounts.xlsx (1689, 3)
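
An alternative to the fillna repair, sketched here only as an option, is to stop pandas from treating the string "NA" as missing when reading the file; keep_default_na and na_values are standard read_excel parameters.

# sketch: keep "NA" (Namibia) as a string while still treating empty cells as missing
result = pd.read_excel(url, keep_default_na=False, na_values=[""])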
In [4]:
# check dataframe structure
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1689 entries, 0 to 1688
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   country_code  1689 non-null   object
 1   date_code     1689 non-null   object
 2   count         1689 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 39.7+ KB
In [5]:
# check result for single country
country_code = "FR"
result[result["country_code"]==country_code]
Out[5]:
country_code date_code count
501 FR 19m 213
502 FR 19u 217
503 FR 20m 225
504 FR 20u 234
505 FR 21m 234
506 FR 21u 235
507 FR 22m 236
508 FR 22u 242
509 FR 23m 242
510 FR 23u 245
511 FR 24m 275
512 FR 24u 279
513 FR 25m 281

Read countries dataset¶

In [6]:
# Read the country list
url = "http://api.geonames.org/countryInfo?username=fab23workshop"
countries = pd.read_xml(url, parser="etree")

# Set country_code = "NA" where countryName == "Namibia"
countries.loc[countries["countryName"] == "Namibia", "countryCode"] = "NA"

# Set continent = "NA" where continentName == "North America"
countries.loc[countries["continentName"] == "North America", "continent"] = "NA"

fm.countries = countries
fm.log("countries ",url,countries)
2025-12-17T15:21Z countries  http://api.geonames.org/countryInfo?username=fab23workshop (250, 18)
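
Because geonames is an external service, a cached local copy makes the notebook reproducible offline. A minimal sketch, where the cache filename is my own choice:

import os
import urllib.request

cache = output_path + "countryInfo.xml"
if not os.path.exists(cache):
    urllib.request.urlretrieve(url, cache)      # download once
countries = pd.read_xml(cache, parser="etree")  # subsequent runs read the local copy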

D. Data processing¶

Convert the date_code to numeric, add as extra field "x"¶

In [7]:
# function to convert date_code into numeric
# function created by chatGPT
def convert_dc(date_code):
    # first two characters hold the year, e.g. "19" -> 19
    base = int(date_code[0:2])
    # third character: "m" (medio) -> .0, "u" (ultimo) -> .5
    frac = .0 if date_code[2] == "m" else .5
    return base + frac

convert_dc("19u")
Out[7]:
19.5
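
The same conversion can be done without apply, using vectorized string accessors; a sketch that should give identical values as long as date codes end in "m" or "u":

# sketch: vectorized equivalent of convert_dc
# first two characters are the year; "m" (medio) maps to .0 and "u" (ultimo) to .5
x_vec = result["date_code"].str[:2].astype(int) \
        + np.where(result["date_code"].str[2] == "u", 0.5, 0.0)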
In [8]:
# add numeric date codes to result
result['x']=result['date_code'].apply(convert_dc)
In [9]:
# check result for single country
result[result["country_code"]==country_code]
Out[9]:
country_code date_code count x
501 FR 19m 213 19.0
502 FR 19u 217 19.5
503 FR 20m 225 20.0
504 FR 20u 234 20.5
505 FR 21m 234 21.0
506 FR 21u 235 21.5
507 FR 22m 236 22.0
508 FR 22u 242 22.5
509 FR 23m 242 23.0
510 FR 23u 245 23.5
511 FR 24m 275 24.0
512 FR 24u 279 24.5
513 FR 25m 281 25.0

E. Data Study and Analysis¶

Calculate linear fit coefficients for all countries¶

In [10]:
# function generated by ChatGPT via interactive process
def compute_trend_coeffs(df):
    """
    df must contain:
      - country_code
      - x (numeric axis value)
      - count (y-value)
    Returns a dataframe with:
      country_code | slope | intercept
    """
    rows = []

    for c in sorted(df["country_code"].dropna().unique()):
        data = df[df["country_code"] == c]

        # Need at least 2 points for a linear fit
        if len(data) >= 2:
            m, b = np.polyfit(data["x"], data["count"], 1)
        else:
            m, b = np.nan, np.nan

        rows.append({"country_code": c, "slope": m, "intercept": b})

    return pd.DataFrame(rows)
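
The same coefficients can also be computed with a groupby; this sketch is meant as a cross-check of compute_trend_coeffs, not a replacement:

# sketch: per-country linear fit via groupby (recent pandas may warn about grouping columns)
def _fit_one(g):
    if len(g) >= 2:
        m, b = np.polyfit(g["x"], g["count"], 1)
    else:
        m, b = np.nan, np.nan
    return pd.Series({"slope": m, "intercept": b})

check_df = result.groupby("country_code").apply(_fit_one).reset_index()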
In [11]:
# calculate coefficients and show
coeff_df = compute_trend_coeffs(result)
coeff_df
Out[11]:
country_code slope intercept
0 AE 3.846154e-01 -0.846154
1 AF 3.822909e-17 1.000000
2 AL 2.743091e-16 2.000000
3 AM 7.692308e-02 -0.628205
4 AR 1.197802e+00 -9.274725
... ... ... ...
137 VE 6.593407e-02 1.472527
138 VN 1.120879e+00 -6.582418
139 XK NaN NaN
140 YT 1.371545e-16 1.000000
141 ZA -2.527473e-01 13.021978

142 rows × 3 columns
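
Since x is expressed in years, a slope reads directly as fablabs added per year. Two quick sketches: ranking countries by fitted growth, and evaluating a fitted line with np.polyval (the choice of "FR" and x = 25.0 is just an example):

# sketch: ten fastest-growing countries by fitted slope
coeff_df.sort_values("slope", ascending=False).head(10)

# sketch: predicted count for one country at x = 25.0 (i.e. "25m")
row = coeff_df[coeff_df["country_code"] == "FR"].iloc[0]
np.polyval([row["slope"], row["intercept"]], 25.0)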

F. Data Publishing and Access¶

Create function for plotting with trend lines¶

In [12]:
# plot counts plus trend line
def plot_counts(df, trend_df, country=None):
    """
    df: dataframe with ['country_code', 'date_code', 'x', 'count']
    trend_df: dataframe with ['country_code', 'slope', 'intercept']
    country: optional filter. If None, counts are summed across all countries.
    """

    # ---- Select data ----
    if country:
        data = df[df["country_code"] == country].copy()
    else:
        # Sum counts across countries for each date
        data = (
            df.groupby(["date_code", "x"], as_index=False)["count"]
              .sum()
        )

    # ---- Sort by date ----
    data = data.sort_values("x")

    # ---- Retrieve trend coefficients ----
    m, b = np.nan, np.nan  # defaults, so a missing trend never breaks the plot
    if country:
        row = trend_df[trend_df["country_code"] == country]
    else:
        # global trend: compute on the fly
        if len(data) >= 2:
            m, b = np.polyfit(data["x"], data["count"], 1)
        row = None

    if row is not None and not row.empty:
        m = row["slope"].iloc[0]
        b = row["intercept"].iloc[0]

    # ---- Plot figure ----
    plt.figure(figsize=(10, 5))

    # Main line
    plt.plot(data["date_code"], data["count"], marker="o")

    # Annotate integer values
    for x_pos, y in zip(data["date_code"], data["count"]):
        plt.text(x_pos, y, str(int(y)), ha='center', va='bottom')

    # ---- Trend line ----
    if not np.isnan(m):
        x_vals = np.array(data["x"])
        y_trend = m * x_vals + b
        plt.plot(data["date_code"], y_trend, linestyle="--", label="Trend line")

    # ---- Axis formatting ----
    plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, _: f"{int(x)}"))

    plt.xlabel(f"year (m = medio, u = ultimo)\n(chart prepared {today}) ")
    plt.ylabel('Fablab counts \n(any activity_status except "closed")')

    countryName = fm.get_country(country)
    title = f"{countryName if country else 'World'}: Fablab counts over time"
    plt.title(title)
    plt.xticks(rotation=45)

    plt.legend()
    plt.grid(True)
    plt.tight_layout()

    # ---- Save & show ----
    aux = "ct."+ country.lower() if country else "world"
    plt.savefig(output_path + aux + "_flcounts_over_time.png", dpi=300, bbox_inches="tight")
    plt.show()

Line chart world fablabs with trend¶

In [13]:
# Create world fablabs count over time plus trend 
result_df = result
plot_counts(result_df, coeff_df, country=None)
[Figure: World fablab counts over time, with trend line]

Line chart single country fablabs with trend¶

In [14]:
# Create single country fablabs count over time plus trend 
country_code = "FR"
result_df = result
plot_counts(result_df, coeff_df, country=country_code)
[Figure: France (FR) fablab counts over time, with trend line]

Line charts with trend for all countries (remove cell output before committing)¶

In [ ]:
# Create fablabs count over time plus trend for all countries
df = result
for c in sorted(df["country_code"].dropna().unique()):
    print("Plotting:", c)   # optional
    plot_counts(df, coeff_df, country=c)
    #plot_counts_with_trend(df, coeff_df, country=c)

G. Data Preservation¶

It's important that we archive a copy of labs.json every six months.
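
A minimal sketch of such an archiving step; the labs.json endpoint below is an assumption and should be replaced by the URL used in the earlier sessions:

# sketch: store a dated copy of labs.json (endpoint URL is an assumption, verify before use)
import urllib.request

labs_url = "https://api.fablabs.io/0/labs.json"
urllib.request.urlretrieve(labs_url, output_path + f"labs_{today}.json")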

H. Data Re-use¶

Based on the archived copies of labs.json, we can automatically generate up-to-date overviews with a notebook like this one.

Evaluation and Follow-up¶

It would be interesting to experiment with higher-degree polynomial fits. That could lead to a clearer indicator of each country's fablab growth speed.
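
A quadratic fit would be a natural first experiment; a minimal sketch on the global counts:

# sketch: second-degree polynomial fit on the summed world counts
world = result.groupby("x", as_index=False)["count"].sum()
coeffs = np.polyfit(world["x"], world["count"], 2)  # a, b, c for a*x**2 + b*x + c
fitted = np.polyval(coeffs, world["x"])             # evaluate the parabola at the data points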

Follow-up¶

  • Experiment with higher-order (polynomial) fits
  • Experiment with omitting the line chart and only displaying the data points and the trend line
  • Go deeper into the data and derive inflow and outflow functions; apply fitting to those as well
  • Calculate overall growth by country and visualize it, e.g. on a world map
  • Repeat the country-level activities at continent level as well (see the sketch after this list)
  • Repeat the continent-level activities for other country groupings as well (e.g. World Bank income categories such as UMIC, Upper Middle Income Countries)
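
A starting point for the continent-level item, as a sketch: join the fitted slopes with the geonames country table and sum the growth per continent (column names follow the countries dataframe read in section C).

# sketch: aggregate fitted growth (fablabs per year) per continent
merged = coeff_df.merge(
    countries[["countryCode", "continentName"]],
    left_on="country_code", right_on="countryCode", how="left"
)
merged.groupby("continentName")["slope"].sum().sort_values(ascending=False)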

Review¶