Philippe Libioulle - Fab Futures - Data Science


Week 7: transforms

The goal for this week is to find the most informative data representations.

Experiment 1 - "listen" to the housing market

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("datasets/House_Property_Sales_Time_Series.csv", usecols=["datesold", "price"], parse_dates=["datesold"])
# Group sales by year and sum the prices
yearly_grouped = df.groupby(pd.Grouper(key='datesold', freq='YE')).sum()
print(yearly_grouped)
# Map to 0-9 values
original_min = yearly_grouped['price'].min()
original_max = yearly_grouped['price'].max()
target_min = 0
target_max = 9
mapped = (((yearly_grouped['price'] - original_min) / (original_max - original_min)) * (target_max - target_min) + target_min).astype(int)
print(mapped.values)
                 price
datesold              
2007-12-31    76789450
2008-12-31   315547250
2009-12-31   707427239
2010-12-31   870123280
2011-12-31   925445775
2012-12-31  1026547544
2013-12-31  1172689133
2014-12-31  1696767719
2015-12-31  2284017698
2016-12-31  2482304198
2017-12-31  3051009584
2018-12-31  2548984623
2019-12-31   878345143
[0 0 1 2 2 2 3 4 6 7 9 7 2]
In [2]:
# Display basic chart
plt.plot(mapped.index, mapped.values, 'o')
plt.xlabel('Year')
plt.ylabel('Mapped yearly sales (0-9)')
plt.show()
[Plot: mapped yearly sales values (0-9) by year]
In [3]:
# Play a different tone when the market grows, is stable, or declines

import numpy as np
from IPython.display import Audio, display

digits = mapped.values
rate = 44100
def custom_tone(digit, duration, rate, amplitude):
    # DTMF keypad frequency pairs; the high tones for '2', '5' and '8'
    # (2222/3333/6666 Hz) deviate from standard DTMF, presumably to make
    # the three tones actually used easier to tell apart by ear
    frequencies = {
        '1':(697,1209),'2':(697,2222),'3':(697,1477),
        '4':(770,1209),'5':(770,3333),'6':(770,1477),
        '7':(852,1209),'8':(852,6666),'9':(852,1477),
        '*':(941,1209),'0':(941,1336),'#':(941,1477)}
    low, high = frequencies[digit]
    t = np.linspace(0, duration, int(rate*duration), endpoint=False)
    tone = amplitude*(np.sin(2*np.pi*low*t) + np.sin(2*np.pi*high*t))
    return tone

def generate_tone(digits, duration=0.3, silence=0.3, rate=44100, amplitude=0.5):
    data = np.array([])
    previous = 0
    for digit in digits:
        # '8' when the market grows, '5' when it is stable, '2' when it declines
        if digit == previous:
            tone = custom_tone('5', duration, rate, amplitude)
        elif digit > previous:
            tone = custom_tone('8', duration, rate, amplitude)
        else:
            tone = custom_tone('2', duration, rate, amplitude)
        previous = digit
        gap = np.zeros(int(rate*silence))  # pause between tones
        data = np.concatenate((data, tone, gap))
    return data
data = generate_tone(digits,rate=rate)
data = data/np.max(np.abs(data))*(2**15-1)
data = data.astype(np.int16)
display(Audio(data,rate=rate))
[Audio player: generated tone sequence]
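Optionally (not part of the original notebook), the same int16 buffer can be written to a WAV file with SciPy, to listen outside the browser:

In [ ]:
# Save the generated tones to a WAV file (data is already int16)
from scipy.io import wavfile
wavfile.write("market_tones.wav", rate, data)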

Experiment 2 - Improve raw sensor data

In [30]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("datasets/sensor-fault-detection.csv", sep=';', parse_dates=["Timestamp"])
df.head()
Out[30]:
Timestamp SensorId Value
0 2017-03-01 23:20:00+03:00 1 18.479807
1 2017-03-02 04:00:00+03:00 1 19.539112
2 2017-03-23 06:25:00+03:00 1 19.250198
3 2017-03-23 19:35:00+03:00 1 18.961285
4 2017-04-04 15:10:00+03:00 1 25.321623
In [31]:
df.shape
Out[31]:
(62629, 3)
In [32]:
print(df['SensorId'].unique())
df.describe()
[1]
Out[32]:
SensorId Value
count 62629.0 62629.000000
mean 1.0 24.203861
std 0.0 5.411599
min 1.0 6.886155
25% 1.0 21.369419
50% 1.0 24.550188
75% 1.0 27.443794
max 1.0 149.601822
In [33]:
plt.figure(figsize=(10, 6)) 
plt.plot(df['Timestamp'], df['Value'], '.')
plt.title('Sensor Data')
plt.xlabel('Time')
plt.ylabel('Measurement')
plt.grid(True)
plt.xticks(rotation=45) 
plt.tight_layout() 

# we have outliers and some missing data! 
# 8 outliers detected - all showing the exact same suspicious value of 149.6° (likely a sensor fault), occurring in March 2017. 
[Plot: raw sensor values over time, with the 149.6° outlier cluster in March 2017]
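A quick sanity check (not in the original notebook) to confirm the outlier count; the 100° cutoff is an arbitrary threshold, well above all normal readings:

In [ ]:
# Count the suspicious high readings and the dates they occur on
suspect = df[df['Value'] > 100]
print(suspect['Value'].value_counts())        # expect one repeated value near 149.6
print(suspect['Timestamp'].dt.date.unique())  # expect dates in March 2017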
In [38]:
# Check for missing values
print(f"\nMissing values:")
print(df.isnull().sum())

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")
if duplicates > 0:
    df = df.drop_duplicates()
    print(f"Removed {duplicates} duplicate rows")

# Sort by Timestamp and SensorId
df = df.sort_values('Timestamp').reset_index(drop=True)

# Remove any rows with missing values (if any)
df = df.dropna()
# Calculate Q1, Q3, and IQR
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1

# Define lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print('Lower: ', lower_bound, ' Upper: ', upper_bound)

# Remove outliers
df = df[(df['Value'] >= lower_bound) & (df['Value'] <= upper_bound)]

plt.figure(figsize=(16, 6) )
plt.plot(df['Timestamp'], df['Value'], '.')
plt.title('Sensor Data - without outliers')
plt.xlabel('Time')
plt.ylabel('Measurement')
plt.grid(True)
plt.xticks(rotation=45) 
plt.tight_layout() 

# Now we can see there are gaps with no readings!

# Define the start and end dates for filtering
start_date = '2017-04-15'
end_date = '2017-05-15'

# Filter the DataFrame
filtered_df = df.loc[(df['Timestamp'] >= start_date) & (df['Timestamp'] <= end_date)]
plt.figure(figsize=(16, 6) )
plt.plot(filtered_df['Timestamp'], filtered_df['Value'], '.')
plt.title('Sensor Data - one month zoom')
plt.xlabel('Time')
plt.ylabel('Measurement')
plt.grid(True)
plt.xticks(rotation=45) 
plt.tight_layout()

# Define the start and end dates for filtering
start_date = '2017-04-29'
end_date = '2017-04-30'

# Filter the DataFrame
filtered_df = df.loc[(df['Timestamp'] >= start_date) & (df['Timestamp'] <= end_date)]
plt.figure(figsize=(16, 6) )
plt.plot(filtered_df['Timestamp'], filtered_df['Value'], '.')
plt.title('Sensor Data - one day zoom')
plt.xlabel('Time')
plt.ylabel('Measurement')
plt.grid(True)
plt.xticks(rotation=45) 
plt.tight_layout()
Missing values:
Timestamp    0
SensorId     0
Value        0
dtype: int64

Duplicate rows: 0
Lower:  12.835803025000002  Upper:  36.362971305
[Plots: sensor data without outliers; one-month zoom (2017-04-15 to 2017-05-15); one-day zoom (2017-04-29)]

AI prompt: "Using the uploaded file, are there missing data for some days across the time range?"

AI answer: Yes, there are missing data for some days.


  • 50 days completely missing (12.6% of the 396-day period)
  • 347 days have data (87.4% coverage)
  • 2 days with sparse data (only 5 readings each: Nov 19 & 21, 2016)

Key patterns:

  • Average: 180.5 readings/day (~every 8 minutes)
  • Most days have 276 readings (every 5 minutes)
  • Maximum: 288 readings/day (every 5 minutes continuously)

Notable gaps:

  • Early September 2016 (2 days)
  • Mid-December 2016 (5 consecutive days)
  • Late January to early February 2017 (15+ consecutive days) - largest gap

The sensor appears to have been offline or experiencing connectivity issues during these periods. This is common in IoT sensor deployments due to maintenance, network issues, or power outages.
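To double-check the AI's coverage claims directly in pandas, a minimal sketch (run on the dataframe as cleaned above, so the exact counts may differ from the AI's figures, which used the raw file):

In [ ]:
# Count readings per calendar day; days with zero readings are the gaps
per_day = df.set_index('Timestamp')['Value'].resample('D').count()
missing_days = per_day[per_day == 0]
print(f"{len(missing_days)} days with no readings out of {len(per_day)}")
print(f"Average on days with data: {per_day[per_day > 0].mean():.1f} readings/day")
print(missing_days.index.date)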

So, what do we do from here ...

  1. Interpolation: it would help for small gaps, not for days or weeks without data. There is no way to figure out what happened exactly during those days (a sketch of gap-aware interpolation follows this list).
  2. Cut more data: readings < 16 look suspicious... The sensor went down just after, and there is no other period with similar values.
  3. Cut & Glue: ignore the periods with no data... I don't like it, it could break the model. Maybe there is some rhythm in the data over long periods of time.
  4. That's all noise, don't worry: maybe that sensor is just a door sensor (binary) with a lot of noise...
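For option 1, a minimal sketch of gap-aware interpolation: resample to the sensor's nominal 5-minute grid and fill only short gaps, leaving the long outages as NaN. The one-hour limit is an assumption, not something derived from the data:

In [ ]:
# Resample to the nominal 5-minute grid, then fill only short gaps
regular = df.set_index('Timestamp')['Value'].resample('5min').mean()
# limit=12 fills at most one hour (12 x 5 min); longer outages stay NaN
filled = regular.interpolate(method='time', limit=12)
print(f"Slots still missing after interpolation: {filled.isna().sum()}")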

I discovered later that the sensor was a temperature sensor, a PT100. Here is the datasheet:

[Image: PT100 datasheet excerpt]

It brings "prior knowledge":

  • Depending on the model, there are limits (low and high)
  • The mysterious 149.6 degrees appears to be the highest temperature in the range for some PT100 models (a sketch using this as a fault flag follows this list)
  • Response time is not immediate, as we saw in the "one day zoom" above. It could help to reconstruct some missing readings
  • ...
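One way to use that prior knowledge, sketched below on the raw file (before the IQR filter, which already removed these readings): flag values at the apparent saturation point as sensor faults rather than measurements. The 149.6 cutoff is read off the plots above; the exact limit depends on the PT100 model.

In [ ]:
# Re-read the raw data and flag saturated readings instead of dropping them
raw = pd.read_csv("datasets/sensor-fault-detection.csv", sep=';', parse_dates=["Timestamp"])
raw['fault'] = np.isclose(raw['Value'], 149.6, atol=0.1)  # apparent PT100 saturation value (assumption)
print(f"Flagged {raw['fault'].sum()} saturated readings")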