DataScience Session 4: Machine Learning¶
Synopsis¶
After an introduction to neural networks, a metamodel for "deep learning", the focus was on machine learning, automated learning from data.
Resources¶
- huge model collections
- models targeting (edge) embedded devices
- large language models that can write machine learning models
- ONNX (Open Neural Network Exchange), an interchange format for models
- frameworks
Assignment¶
- Fit a machine learning model to your data
A. Research ideas¶
The purpose of this notebook is to select a dataset and demonstrate how to fit a machine learning model to it.
B. Research planning and design¶
- identify a use case
- construct the data set
- apply the machine learning software
# import python modules
import numpy as np
import pandas as pd
C. Data collection¶
Three options:
- Explore the near future of the fab network
- Construct missing logos for fablabs
- Keep it Simple
Explore the near future of the fab network¶
We have detailed data on the number of fab labs over the past six years. On average, this number increased by 125 per year (net). Can we learn from these figures how things will develop in the future? Or better yet, can we use part of the data to train a model and then use another part to test the model's accuracy?
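The train-then-test idea can be sketched as follows. This is a minimal illustration, not the method used in the session: the six yearly counts are hypothetical placeholders (chosen only so that the net increase averages 125 per year), and the split is chronological so that the model is tested on years it has not seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# HYPOTHETICAL year-end fab lab counts for six consecutive years.
# These are placeholders, not the real figures from the fab network.
years = np.array([[0], [1], [2], [3], [4], [5]])
counts = np.array([1000, 1130, 1245, 1380, 1500, 1625])

# Chronological split: train on the first four years, test on the last two
X_train, y_train = years[:4], counts[:4]
X_test, y_test = years[4:], counts[4:]

model = LinearRegression()
model.fit(X_train, y_train)

# Compare the predictions for the held-out years with the held-out counts
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean absolute error on the held-out years: {mae:.2f}")
```

With only six data points, two of them held out, the error estimate itself is very uncertain, which is one reason the data volume matters here.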
The number of fablabs at the end of year n is the number at the end of year n-1 plus the inflow minus the outflow. That's a conservation law. If we want to predict anything, it will have to be the inflow and the outflow. Are these absolute or relative, depending on the context, like country, continent, other group of countries? But with a fablab presence in 142 countries and a net increase of 125 per year worldwide, the data per country becomes very sparse.
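The conservation law can be written down as a small bookkeeping function. The sketch below is only an illustration: the function name `next_count` and all numbers are hypothetical, chosen so that the net increase is 125 per year.

```python
def next_count(previous_count: int, inflow: int, outflow: int) -> int:
    """Year-end count = previous year-end count + new labs - closed labs."""
    return previous_count + inflow - outflow


# HYPOTHETICAL numbers, for illustration only
count = 1000
flows = [(150, 25), (140, 15), (160, 35)]  # (inflow, outflow) per year

history = [count]
for inflow, outflow in flows:
    count = next_count(count, inflow, outflow)
    history.append(count)

print(history)
```

The same net increase can come from very different inflow and outflow figures, which is why a prediction would have to target those two flows rather than the total.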
Conclusion: regardless of WHETHER this case is suitable for machine learning and HOW we should approach it, the DATA VOLUME is currently too limited.
Construct missing logos for fablabs¶
In session 01, we saw that 2,179 of the 2,696 fablabs (81%) (including closed labs) had an avatar_URL, i.e., a logo. This logo might be related to the country, the fablab name and other details. Can we use those 81% of logos to come up with a suitable logo for the missing 19% of fablabs?
I don't know IF this is possible, but according to ChatGPT, it should be. HOW wasn't clear. I had to submit 200 MB of logos for analysis. After that, it went quiet for days because I had exceeded some limit.
Conclusion: regardless of WHETHER this case is suitable for machine learning and HOW we should approach it, the DATA VOLUME is currently too large.
Keep it simple¶
We had to get back to basics. For a change, I gave Google Gemini the following prompt: "Give me a very simple Python program in English to demonstrate the essence of machine learning." Its answer is presented below.
D. Data processing¶
Google Gemini: To demonstrate the essence of machine learning in its simplest form, we use Linear Regression.
The goal is to teach the computer to recognize a pattern: if x increases, what happens to y? In this example, we will "teach" the model that y=2x. We provide the data, and the model figures out the rule.
Simple Linear Regression in Python: We will use scikit-learn, the industry-standard library for basic machine learning.
Code from Google Gemini¶
# code from Google Gemini:
# 1. Import the necessary libraries
from sklearn.linear_model import LinearRegression
import numpy as np
# 2. Provide Data (Input: X, Output: y)
# We are telling the computer: when X is 1, y is 2. When X is 2, y is 4...
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
# 3. Initialize the Model
# Think of this as creating an "empty brain"
model = LinearRegression()
# 4. Training (The "Learning" phase)
# The model looks at the data and finds the relationship (y = 2 * x)
model.fit(X, y)
# 5. Prediction
# Now we ask the model to predict y for a value it has never seen before
unknown_x = np.array([[10]])
prediction = model.predict(unknown_x)
print(f"The model predicts that for X=10, Y is: {prediction[0]}")
The model predicts that for X=10, Y is: 20.0
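To check which rule the model actually learned, the fitted slope and intercept can be read from the model after training. This is a small sketch on the same data; `coef_` and `intercept_` are the attributes that scikit-learn's LinearRegression sets in `fit`, and the variable names `slope` and `intercept` are my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression()
model.fit(X, y)

# The fitted line is y = slope * x + intercept
slope = model.coef_[0]
intercept = model.intercept_
print(f"Learned rule: y = {slope:.2f} * x + {intercept:.2f}")
```

For this data the slope should come out at 2 and the intercept at 0, up to floating-point rounding.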
Variant 1: Non-linear relationship¶
# own variant
# we work with another relationship: y=x**2
# 1. Import the necessary libraries
from sklearn.linear_model import LinearRegression
import numpy as np
# 2. Provide Data (Input: X, Output: y)
# We are telling the computer: when X is 1, y is 1. When X is 2, y is 4...
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])
# 3. Initialize the Model
# Think of this as creating an "empty brain"
model = LinearRegression()
# 4. Training (The "Learning" phase)
# The model looks at the data and finds the relationship (y = x**2)??
model.fit(X, y)
# 5. Prediction
# Now we ask the model to predict y for a value it has never seen before
unknown_x = np.array([[10]])
prediction = model.predict(unknown_x)
print(f"The model predicts that for X=10, Y is: {prediction[0]}")
The model predicts that for X=10, Y is: 53.00000000000001
Conclusion¶
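This is wrong, but why? LinearRegression can only fit a straight line. The least-squares line through these five points is y = 6x - 7, which gives 53 for x = 10, whereas y = x**2 gives 100.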
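A linear model can still capture a curved relationship when its inputs are expanded. The sketch below compares the plain straight-line fit with a fit on the features x and x**2, using scikit-learn's PolynomialFeatures in a pipeline. Treating this as the right remedy is an assumption on my part; other non-linear models would also work.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])

# Plain straight-line fit: the best line through points that lie on a curve
line = LinearRegression().fit(X, y)
print(f"Straight line: y = {line.coef_[0]:.2f} * x + {line.intercept_:.2f}")

# Same estimator, but the input is first expanded to the features [x, x**2]
curve = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
curve.fit(X, y)

unknown_x = np.array([[10]])
line_prediction = line.predict(unknown_x)[0]
curve_prediction = curve.predict(unknown_x)[0]
print(f"Straight line predicts {line_prediction:.1f}, "
      f"quadratic features predict {curve_prediction:.1f}")
```

The model is still linear in its coefficients. Only the input changed, so the "learning" step is the same least-squares fit as before.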
Variant 2: Linear model, missing data¶
# own variant
# we work with the original relationship: y=2*x
# we offer our data with a gap (no value for X=4)
# 1. Import the necessary libraries
from sklearn.linear_model import LinearRegression
import numpy as np
# 2. Provide Data (Input: X, Output: y)
# Intended: when X is 1, y is 2. When X is 2, y is 4... (X=4 is left out)
X = np.array([[1], [2], [3], [5]])
y = np.array([1, 4, 6, 10])  # note: the first value is 1, not the intended 2
# 3. Initialize the Model
# Think of this as creating an "empty brain"
model = LinearRegression()
# 4. Training (The "Learning" phase)
# The model looks at the data and finds the relationship (y = x*2)?
model.fit(X, y)
# 5. Prediction
# Now we ask the model to predict y for a value it has never seen before
unknown_x = np.array([[10]])
prediction = model.predict(unknown_x)
print(f"The model predicts that for X=10, Y is: {prediction[0]}")
The model predicts that for X=10, Y is: 21.199999999999996
Conclusion¶
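Again wrong, but why? The gap at X=4 is not the cause. The first y value in the data is 1, whereas y = 2*x gives 2 for X=1. That single deviating point tilts the fitted line to y = 2.2x - 0.8, which gives 21.2 for x = 10.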
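The effect of each change can be checked separately. The sketch below fits the model twice: once on the data as entered above, and once with the first y value set to 2, so that every point follows y = 2*x while X=4 is still missing. The residuals show which training points the fitted line does not pass through. The names `y_entered`, `y_exact` and `predictions` are my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [5]])      # X=4 is left out in both cases
y_entered = np.array([1, 4, 6, 10])     # data as entered above
y_exact = np.array([2, 4, 6, 10])       # every point follows y = 2*x

unknown_x = np.array([[10]])
predictions = {}
for label, y in [("as entered", y_entered), ("exact y = 2*x", y_exact)]:
    model = LinearRegression().fit(X, y)
    # Residual = actual value minus fitted value; zero where the line hits the point
    residuals = y - model.predict(X)
    predictions[label] = model.predict(unknown_x)[0]
    print(f"{label}: prediction for X=10 is {predictions[label]:.1f}, "
          f"residuals {np.round(residuals, 2)}")
```

If the second fit returns 20 for X=10, the missing point at X=4 does no harm as long as the remaining points lie exactly on the line.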
F. Data Publishing and Access¶
G. Data Preservation¶
H. Data Re-use¶
Evaluation and Follow-up¶
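The literal copy from Gemini works fine. My own variants gave results I did not expect. In Variant 1, a straight line cannot follow y = x**2, and in Variant 2, the first data point (y = 1 for X = 1) does not follow y = 2*x. For me, the topic of machine learning requires more time to study.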
Follow-up¶
- Start by reviewing the theory and practice that Hendra Kusumah taught us at FabFest 2022 in Bali: TinyML for Wild Nature Conservation, using the Seeed SenseCAP K1100.
Review¶