Week 1: Playground

The goal for this week is to start using the environment and to select the datasets to analyze over the coming weeks.

The class environment

There are 4 elements:

Jupyter Notebook, hosted in a browser: an IDE connected to Jupyter Hub/Lab and to GitLab
Jupyter Hub and Jupyter Lab: a multi-user platform that serves interactive Jupyter notebooks to groups of users, such as students in a class or data scientists in a company. It provides a shared, pre-configured environment for coding, data analysis, and visualization, with no individual setup. It runs a separate notebook server for each user, while system administrators control environments and access through customizable authentication
GitLab: manages the source code and the pipeline used to publish students' notebooks to the student web site
FabLabs.io: manages accounts and hosts content (the class and student web sites)

The workflow:

Start the IDE (see the link at https://class.academany.org/futures/data-science/2025/people.html; it is on the line with your name in the list)
Write your code/content in a Jupyter notebook (.ipynb file)
Run it. It will run on the GitLab server, but you will see the output in the IDE
Stage, commit and publish the file to GitLab
It will automatically be published to the student web site (https://class.academany.org/futures/data-science/2025/labs/global/students//...)

Limitations:

Since Jupyter Hub/Lab and GitLab are hosted on the Internet, we depend on that external infrastructure. The good news is that we don't have to manage it. In the future, we could deploy something similar on our lab's cloud infrastructure or even locally.
Jupyter Lab is instantiated on demand for each student, and its configuration is not saved to permanent storage. If we add a package or library, it will not persist.
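
Because added packages do not persist, one option is to check for them at the top of each notebook. The following is a minimal sketch, assuming `pip` is available in the notebook image. The helper names `missing_packages` and `ensure_packages` are hypothetical and not part of the class environment.

```python
import importlib.util
import subprocess
import sys


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ensure_packages(names):
    """Install any missing packages into the current kernel's environment."""
    missing = missing_packages(names)
    if missing:
        # sys.executable targets the interpreter that runs this notebook
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
    return missing


# Report only; call ensure_packages(...) to actually install.
print(missing_packages(["numpy", "matplotlib"]))
```

Note that the import name and the name used by `pip` can differ for some packages, so this sketch only works as written when the two match.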

Demo: draw a red box

In [1]:

import matplotlib.pyplot as plt
import numpy as np

# Create a blank 100x100 white RGB image
img = np.full((100, 100, 3), 255, dtype=np.uint8)

# Draw a red rectangle (rows 20-49, columns 30-69)
img[20:50, 30:70] = [255, 0, 0]

# Display the image
plt.imshow(img)
plt.axis('off')  # Hide axes
plt.show()

Demo: write something

In [2]:

import time
from IPython.display import clear_output

# A simple list of "frames"
animation = [
    "This",
    "This is",
    "This is something"
]

for i in range(21):  # 21 frames, so the last one shown is the full sentence
    print(animation[i % len(animation)])
    clear_output(wait=True)  # Defer clearing until the next output arrives
    time.sleep(0.3)  # Pause so each frame stays visible

This is something

Selected datasets

Here are the datasets I would like to use for this course:

AI impact jobs by 2030

Source: Kaggle
Description: this dataset simulates the future of work in the age of artificial intelligence. It models how various professions, skills, and education levels might be impacted by AI-driven automation by the year 2030. The output is a score, i.e., the probability of automation by AI for each specific case.
My strategy: I need this because I need to know when I have to start another career :-)

Loan approval analysis and prediction

Source: Kaggle
Description: complete dataset of 50,000 loan applications across Credit Cards, Personal Loans, and Lines of Credit. Includes customer demographics, financial profiles, credit behavior, and approval decisions based on real US & Canadian banking criteria. The output is binary (loan granted or denied).
My strategy: use machine learning to predict whether a loan is granted or denied

Restaurant tips

Source: Kaggle
Description: this dataset contains information collected from a restaurant’s bills and tips. It helps analyze how different factors (such as the total bill, gender of the customer, day of the week, meal time, and group size) influence the amount of tip given to the server.
My strategy: help my daughter figure out which day of the week brings in the most money :-)
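
The "which day" question can be sketched as a per-day average. This is only an illustration: the column names `total_bill`, `tip` and `day` are assumptions about the file layout, the sample rows are made up rather than taken from the Kaggle data, and the helper `average_tip_by_day` is hypothetical. It uses only the standard library, so it runs without installing anything.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample rows for illustration only; not taken from the Kaggle file.
SAMPLE_CSV = """total_bill,tip,day
20.00,3.00,Sat
10.00,1.00,Thur
30.00,5.00,Sat
15.00,2.00,Sun
25.00,4.00,Sun
"""


def average_tip_by_day(csv_file):
    """Return {day: mean tip} from a CSV with 'tip' and 'day' columns."""
    tips = defaultdict(list)
    for row in csv.DictReader(csv_file):
        tips[row["day"]].append(float(row["tip"]))
    return {day: mean(values) for day, values in tips.items()}


averages = average_tip_by_day(io.StringIO(SAMPLE_CSV))
best_day = max(averages, key=averages.get)
print(averages, best_day)
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle. Comparing total tips per day instead of the average would answer a slightly different question (busiest day rather than most generous customers).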

House Property Sales

Source: Kaggle
Description: property sales data for the 2007-2019 period for one specific region. The data contains sales prices for houses and units with 1 to 5 bedrooms. These are the cross-dependent variables.
My strategy: this one is a time series, so I could make good use of it

Anomaly detection in network traffic

Source: Kaggle
Description: this dataset is designed for the development and evaluation of anomaly detection models in embedded system network security. It includes network traffic features that simulate both normal and malicious behavior, making it suitable for supervised learning tasks focused on identifying security breaches in networked embedded systems. The data in this file is structured with various network-related features, including packet size, inter-arrival time, protocol type, source and destination IPs, TCP flags, and frequency-domain features extracted using Wavelet Transform. The dataset also includes a target column that marks each data entry as either normal (0) or anomalous (1).
My strategy: I don't know yet

Noisy and faulty sensors

Source: Kaggle
Description: time series of measurements from sensors, each uniquely identified by a Sensor Id. During the series of measurements, a sensor may be disconnected or failing.
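
One way a disconnection could show up in such data is as a gap between consecutive readings. The sketch below is an assumption about how that might be detected, not a description of the dataset: the one-minute sampling interval, the made-up timestamps, and the helper `find_gaps` are all hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive readings are too far apart."""
    limit = expected_interval * tolerance
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > limit
    ]


# Made-up readings for one sensor, nominally one per minute.
start = datetime(2025, 1, 1, 12, 0)
readings = [start + timedelta(minutes=m) for m in (0, 1, 2, 7, 8)]
gaps = find_gaps(readings, expected_interval=timedelta(minutes=1))
print(gaps)
```

The `tolerance` factor allows for small timing jitter, so a reading that arrives slightly late is not reported as a gap.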
