Yosuke Tsuchiya - Fab Futures - Data Science
Home About

Week 04-01: Transform¶

Assignment: Analyze your data set

Read the Printer Log¶

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import ReadPrusaLog as rpl
import pandas as pd
In [2]:
## benchy
df_temp_benchy = rpl.get_printer_temp_data('./datasets/printer_data_temperature_benchy.txt')
df_pos_benchy = rpl.get_printer_pos_data('./datasets/printer_data_position_benchy.txt')
df_benchy = pd.merge(df_pos_benchy,df_temp_benchy,on='timestamp',how="inner")
df_benchy = df_benchy.query('timestamp >= "2025/11/24 21:28:06" & timestamp <= "2025/11/24 22:14:24"')
df_benchy.loc[:,'filament'] = 1
df_benchy.loc[:,'lastdry'] = 7
df_benchy.loc[:,'model'] = 0

timestamp = df_benchy['ts_nano_x'].to_numpy()
ts_diff = np.diff(timestamp)
ts_diff = np.insert(ts_diff,0,0)
ts_sum = np.cumsum(ts_diff)
df_benchy.loc[:,'ts_sum'] = ts_sum

## dimensions
df_temp_dimensions = rpl.get_printer_temp_data('./experiments/printer_data_temp_dimensions.txt')
df_pos_dimensions = rpl.get_printer_pos_data('./experiments/printer_data_position_dimensions.txt')
df_dimensions = pd.merge(df_pos_dimensions,df_temp_dimensions,on='timestamp',how='inner')
df_dimensions = df_dimensions.query('timestamp >= "2025/12/01 10:33:50" & timestamp <= "2025/12/01 10:50:46"')
df_dimensions.loc[:,'filament'] = 1
df_dimensions.loc[:,'lastdry'] = 14
df_dimensions.loc[:,'model'] = 1

timestamp = df_dimensions['ts_nano_x'].to_numpy()
ts_diff = np.diff(timestamp)
ts_diff = np.insert(ts_diff,0,0)
ts_sum = np.cumsum(ts_diff)
df_dimensions.loc[:,'ts_sum'] = ts_sum

## finish
df_temp_finish = rpl.get_printer_temp_data('./experiments/printer_data_temp_finish.txt')
df_pos_finish = rpl.get_printer_pos_data('./experiments/printer_data_position_finish.txt')
df_finish = pd.merge(df_pos_finish,df_temp_finish,on='timestamp',how='inner')
df_finish = df_finish.query('timestamp <= "2025/12/01 11:22:00"')
df_finish.loc[:,'filament'] = 1
df_finish.loc[:,'lastdry'] = 14
df_finish.loc[:,'model'] = 2

timestamp = df_finish['ts_nano_x'].to_numpy()
ts_diff = np.diff(timestamp)
ts_diff = np.insert(ts_diff,0,0)
ts_sum = np.cumsum(ts_diff)
df_finish.loc[:,'ts_sum'] = ts_sum

df_integrate = pd.concat([df_benchy,df_dimensions,df_finish],ignore_index=True)
In [3]:
df_integrate.columns
Out[3]:
Index(['logtime_x', 'X_pos', 'Y_pos', 'travel_distance', 'Z_pos', 'E_pos',
       'e_move', 'e_total', 'count_a_pos', 'count_b_pos', 'count_z_pos',
       'timestamp', 'ts_nano_x', 'logtime_y', 'hotend_temp_current',
       'hotend_temp_setting', 'bed_temp_current', 'bed_temp_setting',
       'heatbreak_temp_current', 'heatbreak_temp_setting', 'hotend_power',
       'bed_heater_power', 'hotend_fan_power', 'ts_nano_y', 'filament',
       'lastdry', 'model', 'ts_sum'],
      dtype='object')

PCA¶

First, I tried to do Principal Components Analysis. The purpos of PCA would be:

  • to visualize understand which parameters are important
  • to reduce demensions for model creation

The following code picked up 11 variables and reduced to 5 dimensions by PCA.

In [4]:
import sklearn
#df_integrate = df_integrate.drop(columns=['logtime_x','logtime_y','timestamp'])
print_id = df_integrate["model"].values
htc = df_integrate['heatbreak_temp_current']
df_integrate = df_integrate.loc[:,['X_pos','Y_pos','Z_pos','E_pos','e_total','hotend_temp_current','bed_temp_current','heatbreak_temp_current','hotend_power','bed_heater_power','hotend_fan_power']]
X = df_integrate.to_numpy()
#X = df_integrate.loc[:,['heatbreak_temp_current','e_total']].to_numpy()
y = df_integrate['e_total'].to_numpy()
scaler = sklearn.preprocessing.StandardScaler()
Xscale = scaler.fit_transform(X)

pca = sklearn.decomposition.PCA(n_components=5)
pca.fit(Xscale)
print(f"explained variance: {pca.explained_variance_}")
Xpca = pca.transform(Xscale)
explained variance: [3.30476453 1.5819547  1.33216565 1.16649031 0.72861201]

"PCA.explained_variance_ratio" show...

  • 1st component explains about 33% of the data
  • 2nd component explains about 15% of the data
  • 3rd component explains about 13% of the data
In [5]:
contribution_ratios = pd.DataFrame(pca.explained_variance_ratio_)
contribution_ratios
Out[5]:
0
0 0.330415
1 0.158166
2 0.133192
3 0.116627
4 0.072848

With using "contribution_ratios.cumsum()", we can find out unti which components cover the explanation of total data. (cumlative sum of explained variance ratio). Here, until component5 could explain over 80%

In [6]:
cumulative_contribution_ratios = contribution_ratios.cumsum()
cumulative_contribution_ratios
Out[6]:
0
0 0.330415
1 0.488581
2 0.621772
3 0.738400
4 0.811247

The following graph plot the contribution_ratio and cuml;ative of contribution ratio.

In [7]:
cont_cumcont_ratios = pd.concat([contribution_ratios, cumulative_contribution_ratios], axis=1).T
cont_cumcont_ratios.index = ['contribution_ratio', 'cumulative_contribution_ratio']  

x_axis = range(1, contribution_ratios.shape[0] + 1) 
plt.figure(figsize=(10,8))
plt.rcParams['font.size'] = 10
plt.bar(x_axis, contribution_ratios.iloc[:, 0], align='center')  
plt.plot(x_axis, cumulative_contribution_ratios.iloc[:, 0], 'r.-')  
plt.xlabel('Number of principal components')  
plt.ylabel('Contribution ratio(blue),\nCumulative contribution ratio(red)') 

plt.tight_layout()
plt.savefig('./images/pca-brief.png')
plt.show()
No description has been provided for this image

We can find out what variables affects in whch component by using "pca.components_.T".

From the result, the most contributed variable for component 1 would be "e_total"(0.498), "Z_pos"(0.493), "heatbreak_temp_current"(0.397), "bed_temp_current"(0.261).

Also, the most contributed variable for component 2 would be "hotend_temp_current"(0.649),"X_pos"(0.248)

In [8]:
loadings = pd.DataFrame(pca.components_.T,index=df_integrate.columns)
loading = loadings.round(3)
loading.columns = ['comp1','comp2','comp3','comp4','comp5']
loading = loading.sort_values('comp1',ascending=False)
loading.to_csv('./notupload/pca-comps.csv')
loading
Out[8]:
comp1 comp2 comp3 comp4 comp5
e_total 0.498 -0.069 0.038 -0.165 0.302
Z_pos 0.493 -0.051 0.081 -0.194 0.258
heatbreak_temp_current 0.397 -0.192 0.013 -0.107 -0.102
bed_temp_current 0.261 -0.114 -0.335 0.508 -0.580
X_pos 0.068 0.248 -0.564 -0.336 0.117
Y_pos 0.063 -0.327 0.570 0.227 0.079
hotend_fan_power 0.000 0.000 0.000 -0.000 -0.000
hotend_temp_current -0.114 0.649 0.235 0.019 0.021
E_pos -0.187 -0.142 -0.303 0.551 0.685
hotend_power -0.209 -0.517 -0.265 -0.308 -0.025
bed_heater_power -0.428 -0.255 0.138 -0.320 -0.062

Now, I draw the scatter plot of component 1 and component 2 as folliwng. The marker coloer represent the model type (purple is label "0" (benchy), yellow is label "1" (dimensions) and green is label "2" (surface_finish).

In [9]:
plt.figure(figsize=(10,8))
a = plt.scatter(Xpca[:,0],Xpca[:,1],s=5,alpha=0.4,c=htc,cmap="viridis")
plt.colorbar(a)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.title("PCA of PrinterLog")
plt.savefig('./images/pca-plot.png')
plt.show()
No description has been provided for this image

Clustering of PCA result¶

I did the clustering of PCA result with using Gaussian Mixture Model by Scikit-learn.

First, I create the cluster label of "heatbreak_temp_current" and "e_total" again (as did in week03-02: Density Estimation Assignment)

In [10]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3,covariance_type='full')

data = df_integrate.loc[:,['heatbreak_temp_current','e_total']].to_numpy()
gmm.fit(data)
hbk_e_clusters = gmm.predict(data)
probabilities = gmm.predict_proba(data)

#plt.figure(figsize=(6,4))
#plt.scatter(data[:,0],data[:,1],c=cluster_labels,cmap='viridis',s=10,alpha=0.7)
#plt.show()
df_integrate['hbk_e_clusters'] = hbk_e_clusters

Then, I did Gaussian Mixture Model clustering of PCA result. The following graph

In [11]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3)
gmm.fit(Xpca)
cluster_labels = gmm.predict(Xpca)
fig = plt.figure(figsize=(6,6))
#fig.patch.set_facecolor('white')
#plt.rcParams['axes.facecolor'] = 'black'
#plt.rcParams['axes.edgecolor'] = 'white'
plt.scatter(Xpca[:,0],Xpca[:,1],c=cluster_labels,cmap="viridis",s=5)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.tight_layout()
#plt.savefig('./notupload/pca.png')
plt.show()

plt.scatter(Xpca[:,0],Xpca[:,1],c=hbk_e_clusters,cmap="viridis",s=5)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.tight_layout()
#plt.savefig('./notupload/pca.png')
plt.show()
No description has been provided for this image
No description has been provided for this image

The purpose of creating these two scatter plots was to verify the degree of match between the GMM clustering results of the PCA results and the GMM clustering results of “heatbreak_temp_current” and ‘e_total’. In other words, I wanted to confirm what the clusters formed by the PCA results signify. As shown in “dataframe.crosstab”, the match rate was approximately 16%.

In [12]:
pd.crosstab(df_integrate.hbk_e_clusters, cluster_labels)
Out[12]:
col_0 0 1 2
hbk_e_clusters
0 152 0 983
1 613 0 1519
2 0 1605 485
In [13]:
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(df_integrate.hbk_e_clusters, cluster_labels)
Out[13]:
0.3193859057405077

Machine Learning by PCA dimension reduction data¶

Now, when building a model using only valid variables from existing printer logs, how effective can the resulting model be? Here, we'll use PCA results to build the regression prediction model shown in Week-02-02 using Scikit-learn. We'll use MLPRegressor for the regression prediction model.

First, we perform PCA again. This time, we exclude “e_total”—which we use as the independent variable (Y)—from the PCA (since using PCA results including e_total for model building is inappropriate).

In [50]:
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection
import numpy as np
from datetime import datetime as dt

df = df_integrate.loc[:,['Z_pos','hotend_temp_current','bed_temp_current','heatbreak_temp_current','bed_heater_power']]

X = df.to_numpy()

y = df_integrate['e_total'].to_numpy()
scaler = sklearn.preprocessing.StandardScaler()
Xscale = scaler.fit_transform(X)

pca = sklearn.decomposition.PCA(n_components=2)
pca.fit(Xscale)
print(f"explained variance: {pca.explained_variance_}")
Xpca = pca.transform(Xscale)
explained variance: [2.35068022 1.06854097]

The result of PCA is mostly the same with above

In [51]:
loadings = pd.DataFrame(pca.components_.T,index=df.columns)
loadings.round(3)
Out[51]:
0 1
Z_pos 0.523 0.102
hotend_temp_current -0.237 0.846
bed_temp_current 0.425 -0.003
heatbreak_temp_current 0.488 -0.196
bed_heater_power -0.502 -0.486

Now, let's proceed to build the model.

Another important point here is the use of standardization, which we learned in this lesson. When performing machine learning, normalizing the scale of the explanatory variables (X) is essential for maintaining model accuracy. Here, we will perform standardization using Scikit-learn's StandardScaler.

In [52]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)


X_train,X_test,y_train,y_test = model_selection.train_test_split(Xscale,y,test_size=0.2)
print(X_train.shape, X_test.shape,y_train.shape,y_test.shape)
(4285, 5) (1072, 5) (4285,) (1072,)
In [53]:
model = MLPRegressor(hidden_layer_sizes=(50,50), max_iter=10000)

starttime = dt.now()
model.fit(X_train,y_train)
endtime = dt.now()
print("Predict:",model.score(X_test,y_test)," time:", (endtime.timestamp() - starttime.timestamp()))

fig,ax = plt.subplots()
fig.figsize=(10,7)
#fig.patch.set_facecolor('black')

ax.set_title("Loss Curve")
ax.plot(model.loss_curve_)
ax.set_xlabel("Iteration")
ax.set_ylabel("Loss")
#ax.tick_params(axis="both",color="white",labelcolor="white")
plt.grid()
plt.savefig("./images/ml-2nd.png")
plt.show()
Predict: 0.9988269668826948  time: 4.227211952209473
No description has been provided for this image

Well, we could get a nice loss_curve model. The following graph show how does the predict match the real data with using test data. It represent a beautiful liner regressions.

In [54]:
y_predict = model.predict(X_test)
y_real = y_test

plt.scatter(y_predict,y_real)
plt.xlabel("E Total Predict")
plt.ylabel("E Total Real")
plt.show()
No description has been provided for this image

I also check the model performance by using some metrics as follow:

  • MAE(Mean Absolute Error)
  • RMSE(Root Mean Squared Error)
  • R-Squared

And, I could found out each metrics show the model are good performance.

In [55]:
from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error

mse = mean_absolute_error(y_real,y_predict)
rmse = np.sqrt(mean_squared_error(y_real,y_predict))

r2 = r2_score(y_real,y_predict)
print(f"MAE: {mse},RMSE:{rmse}, R^2:{r2}")
MAE: 29.50602971961079,RMSE:39.03881498921886, R^2:0.9988269668826948

The following show the permutation importance of each features.

In [56]:
from sklearn.inspection import permutation_importance

columns = df.columns
result = permutation_importance(model,X_test,y_test,n_repeats=15)
print(X_test.shape,y_test.shape)

importances = result.importances_mean
imp2 = pd.DataFrame()
imp2['feature_name'] = columns
imp2.set_index('feature_name')
imp2['feature_value'] = importances
imp2 = imp2.sort_values('feature_value',ascending=False)
imp2 = imp2.round(4)
imp2.to_csv('./notupload/permutation-importance-2nd.csv')
imp2
(1072, 5) (1072,)
Out[56]:
feature_name feature_value
0 Z_pos 1.3734
4 bed_heater_power 0.6992
3 heatbreak_temp_current 0.1617
2 bed_temp_current 0.0910
1 hotend_temp_current 0.0021

These results show that we can build a model with comparable performance using the dimension reduction results from PCA (reduced to 5 dimensions). Although not done here, we could likely build a model predicting a good e_total value by constructing a new model using variables showing positive contribution up to the first and second principal components of PCA.

Outcome¶

In the transforms session, we learned various techniques for transforming data. For this assignment, we specifically tried PCA as a dimensionality reduction transformation technique.

Using PCA allowed us to understand which elements of the dataset were important. By applying PCA after identifying correlations through mutual information and understanding the data structure via clustering in previous sessions, we were able to pinpoint the truly critical key parameters.

In [ ]: