Week 04-01: Transform¶
Assignment: Analyze your data set
Read the Printer Log¶
import numpy as np
import matplotlib.pyplot as plt
import ReadPrusaLog as rpl
import pandas as pd
## benchy
df_temp_benchy = rpl.get_printer_temp_data('./datasets/printer_data_temperature_benchy.txt')
df_pos_benchy = rpl.get_printer_pos_data('./datasets/printer_data_position_benchy.txt')
df_benchy = pd.merge(df_pos_benchy,df_temp_benchy,on='timestamp',how='inner')
df_benchy = df_benchy.query('timestamp >= "2025/11/24 21:28:06" & timestamp <= "2025/11/24 22:14:24"')
# add print metadata columns (filament, last drying, model label)
df_benchy.loc[:,'filament'] = 1
df_benchy.loc[:,'lastdry'] = 7
df_benchy.loc[:,'model'] = 0
# elapsed time since the start of the print, computed from ts_nano_x
timestamp = df_benchy['ts_nano_x'].to_numpy()
ts_diff = np.diff(timestamp)
ts_diff = np.insert(ts_diff,0,0)
ts_sum = np.cumsum(ts_diff)
df_benchy.loc[:,'ts_sum'] = ts_sum
## dimensions
df_temp_dimensions = rpl.get_printer_temp_data('./experiments/printer_data_temp_dimensions.txt')
df_pos_dimensions = rpl.get_printer_pos_data('./experiments/printer_data_position_dimensions.txt')
df_dimensions = pd.merge(df_pos_dimensions,df_temp_dimensions,on='timestamp',how='inner')
df_dimensions = df_dimensions.query('timestamp >= "2025/12/01 10:33:50" & timestamp <= "2025/12/01 10:50:46"')
df_dimensions.loc[:,'filament'] = 1
df_dimensions.loc[:,'lastdry'] = 14
df_dimensions.loc[:,'model'] = 1
timestamp = df_dimensions['ts_nano_x'].to_numpy()
ts_diff = np.diff(timestamp)
ts_diff = np.insert(ts_diff,0,0)
ts_sum = np.cumsum(ts_diff)
df_dimensions.loc[:,'ts_sum'] = ts_sum
## finish
df_temp_finish = rpl.get_printer_temp_data('./experiments/printer_data_temp_finish.txt')
df_pos_finish = rpl.get_printer_pos_data('./experiments/printer_data_position_finish.txt')
df_finish = pd.merge(df_pos_finish,df_temp_finish,on='timestamp',how='inner')
df_finish = df_finish.query('timestamp <= "2025/12/01 11:22:00"')
df_finish.loc[:,'filament'] = 1
df_finish.loc[:,'lastdry'] = 14
df_finish.loc[:,'model'] = 2
timestamp = df_finish['ts_nano_x'].to_numpy()
ts_diff = np.diff(timestamp)
ts_diff = np.insert(ts_diff,0,0)
ts_sum = np.cumsum(ts_diff)
df_finish.loc[:,'ts_sum'] = ts_sum
df_integrate = pd.concat([df_benchy,df_dimensions,df_finish],ignore_index=True)
df_integrate.columns
Index(['logtime_x', 'X_pos', 'Y_pos', 'travel_distance', 'Z_pos', 'E_pos',
'e_move', 'e_total', 'count_a_pos', 'count_b_pos', 'count_z_pos',
'timestamp', 'ts_nano_x', 'logtime_y', 'hotend_temp_current',
'hotend_temp_setting', 'bed_temp_current', 'bed_temp_setting',
'heatbreak_temp_current', 'heatbreak_temp_setting', 'hotend_power',
'bed_heater_power', 'hotend_fan_power', 'ts_nano_y', 'filament',
'lastdry', 'model', 'ts_sum'],
dtype='object')
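The three per-print blocks above repeat the same metadata and elapsed-time computation. A small helper like the following could factor that pattern out (a hypothetical sketch, not part of the original code; it assumes the same column names as above):
# Hypothetical helper (sketch): add metadata and elapsed time to one print's dataframe.
def prepare_print_df(df, filament, lastdry, model):
    df = df.copy()
    df['filament'] = filament
    df['lastdry'] = lastdry
    df['model'] = model
    ts = df['ts_nano_x'].to_numpy()
    df['ts_sum'] = np.cumsum(np.insert(np.diff(ts), 0, 0))
    return df

# e.g. df_benchy = prepare_print_df(df_benchy, filament=1, lastdry=7, model=0)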
PCA¶
First, I tried Principal Component Analysis. The purpose of PCA here is:
- to visualize and understand which parameters are important
- to reduce the number of dimensions for model creation
The following code selects 11 variables and reduces them to 5 dimensions with PCA.
import sklearn.preprocessing
import sklearn.decomposition

print_id = df_integrate["model"].values             # keep the model labels for later plots
htc = df_integrate['heatbreak_temp_current']        # keep heatbreak temperature for coloring plots
df_integrate = df_integrate.loc[:,['X_pos','Y_pos','Z_pos','E_pos','e_total','hotend_temp_current','bed_temp_current','heatbreak_temp_current','hotend_power','bed_heater_power','hotend_fan_power']]
X = df_integrate.to_numpy()
y = df_integrate['e_total'].to_numpy()
scaler = sklearn.preprocessing.StandardScaler()     # standardize each variable to zero mean, unit variance
Xscale = scaler.fit_transform(X)
pca = sklearn.decomposition.PCA(n_components=5)
pca.fit(Xscale)
print(f"explained variance: {pca.explained_variance_}")
Xpca = pca.transform(Xscale)
explained variance: [3.30476453 1.5819547 1.33216565 1.16649031 0.72861201]
"PCA.explained_variance_ratio" show...
- 1st component explains about 33% of the data
- 2nd component explains about 15% of the data
- 3rd component explains about 13% of the data
contribution_ratios = pd.DataFrame(pca.explained_variance_ratio_)
contribution_ratios
| | explained variance ratio |
|---|---|
| 0 | 0.330415 |
| 1 | 0.158166 |
| 2 | 0.133192 |
| 3 | 0.116627 |
| 4 | 0.072848 |
Using "contribution_ratios.cumsum()" (the cumulative sum of the explained variance ratios), we can see how many components are needed to explain most of the total variance. Here, the first five components together explain just over 81%.
cumulative_contribution_ratios = contribution_ratios.cumsum()
cumulative_contribution_ratios
| | cumulative contribution ratio |
|---|---|
| 0 | 0.330415 |
| 1 | 0.488581 |
| 2 | 0.621772 |
| 3 | 0.738400 |
| 4 | 0.811247 |
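As an aside (not used in the original code), scikit-learn can also pick the number of components for a target explained-variance fraction directly, by passing a float to "n_components"; a minimal sketch using the Xscale array above:
# Hypothetical alternative: keep the smallest number of components that explains >= 80% of the variance.
pca80 = sklearn.decomposition.PCA(n_components=0.8)
Xpca80 = pca80.fit_transform(Xscale)
print(pca80.n_components_, pca80.explained_variance_ratio_.sum())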
The following graph plots the contribution ratio (bars) and the cumulative contribution ratio (red line).
cont_cumcont_ratios = pd.concat([contribution_ratios, cumulative_contribution_ratios], axis=1).T
cont_cumcont_ratios.index = ['contribution_ratio', 'cumulative_contribution_ratio']
x_axis = range(1, contribution_ratios.shape[0] + 1)
plt.figure(figsize=(10,8))
plt.rcParams['font.size'] = 10
plt.bar(x_axis, contribution_ratios.iloc[:, 0], align='center')
plt.plot(x_axis, cumulative_contribution_ratios.iloc[:, 0], 'r.-')
plt.xlabel('Number of principal components')
plt.ylabel('Contribution ratio(blue),\nCumulative contribution ratio(red)')
plt.tight_layout()
plt.savefig('./images/pca-brief.png')
plt.show()
We can find out which variables contribute to which component by looking at "pca.components_.T".
From the result, the variables contributing most to component 1 are "e_total" (0.498), "Z_pos" (0.493), "heatbreak_temp_current" (0.397) and "bed_temp_current" (0.261).
The variables contributing most to component 2 (by positive loading) are "hotend_temp_current" (0.649) and "X_pos" (0.248).
loadings = pd.DataFrame(pca.components_.T,index=df_integrate.columns)
loading = loadings.round(3)
loading.columns = ['comp1','comp2','comp3','comp4','comp5']
loading = loading.sort_values('comp1',ascending=False)
loading.to_csv('./notupload/pca-comps.csv')
loading
| | comp1 | comp2 | comp3 | comp4 | comp5 |
|---|---|---|---|---|---|
| e_total | 0.498 | -0.069 | 0.038 | -0.165 | 0.302 |
| Z_pos | 0.493 | -0.051 | 0.081 | -0.194 | 0.258 |
| heatbreak_temp_current | 0.397 | -0.192 | 0.013 | -0.107 | -0.102 |
| bed_temp_current | 0.261 | -0.114 | -0.335 | 0.508 | -0.580 |
| X_pos | 0.068 | 0.248 | -0.564 | -0.336 | 0.117 |
| Y_pos | 0.063 | -0.327 | 0.570 | 0.227 | 0.079 |
| hotend_fan_power | 0.000 | 0.000 | 0.000 | -0.000 | -0.000 |
| hotend_temp_current | -0.114 | 0.649 | 0.235 | 0.019 | 0.021 |
| E_pos | -0.187 | -0.142 | -0.303 | 0.551 | 0.685 |
| hotend_power | -0.209 | -0.517 | -0.265 | -0.308 | -0.025 |
| bed_heater_power | -0.428 | -0.255 | 0.138 | -0.320 | -0.062 |
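Since loadings can be negative, it can also help to rank variables by the absolute value of their loading when judging contributions; a small aside (not in the original code):
# Hypothetical aside: rank the variables by the magnitude of their component-2 loading.
print(loading['comp2'].abs().sort_values(ascending=False))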
Now I draw the scatter plot of component 1 versus component 2. In the code below the markers are colored by "heatbreak_temp_current" (the htc series saved earlier); coloring by the print_id array instead (c=print_id) would show the model type (0 = benchy, 1 = dimensions, 2 = surface_finish).
plt.figure(figsize=(10,8))
a = plt.scatter(Xpca[:,0],Xpca[:,1],s=5,alpha=0.4,c=htc,cmap="viridis")
plt.colorbar(a)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.title("PCA of PrinterLog")
plt.savefig('./images/pca-plot.png')
plt.show()
Clustering of PCA result¶
I clustered the PCA result using a Gaussian Mixture Model from Scikit-learn.
First, I recreate the cluster labels for "heatbreak_temp_current" and "e_total" (as done in the Week 03-02 Density Estimation assignment).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3,covariance_type='full')
data = df_integrate.loc[:,['heatbreak_temp_current','e_total']].to_numpy()
gmm.fit(data)
hbk_e_clusters = gmm.predict(data)
probabilities = gmm.predict_proba(data)
#plt.figure(figsize=(6,4))
#plt.scatter(data[:,0],data[:,1],c=cluster_labels,cmap='viridis',s=10,alpha=0.7)
#plt.show()
df_integrate['hbk_e_clusters'] = hbk_e_clusters
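The number of mixture components (3) is fixed by hand here; as a side check (not in the original code), the Bayesian information criterion could be compared across candidate values:
# Hypothetical side check: compare BIC for different numbers of mixture components.
for k in range(1, 6):
    bic = GaussianMixture(n_components=k, covariance_type='full').fit(data).bic(data)
    print(k, bic)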
Then I applied Gaussian Mixture Model clustering to the PCA result. The first plot below colors the points by the clusters found in PCA space; the second colors the same points by the "heatbreak_temp_current"/"e_total" clusters created above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3)
gmm.fit(Xpca)
cluster_labels = gmm.predict(Xpca)
fig = plt.figure(figsize=(6,6))
#fig.patch.set_facecolor('white')
#plt.rcParams['axes.facecolor'] = 'black'
#plt.rcParams['axes.edgecolor'] = 'white'
plt.scatter(Xpca[:,0],Xpca[:,1],c=cluster_labels,cmap="viridis",s=5)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.tight_layout()
#plt.savefig('./notupload/pca.png')
plt.show()
plt.scatter(Xpca[:,0],Xpca[:,1],c=hbk_e_clusters,cmap="viridis",s=5)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.tight_layout()
#plt.savefig('./notupload/pca.png')
plt.show()
The purpose of creating these two scatter plots was to check how well the GMM clusters found in PCA space match the GMM clusters of "heatbreak_temp_current" and "e_total", i.e. to see what the clusters in the PCA space actually signify. As shown by "pd.crosstab", the raw label match rate was only about 16%; since GMM cluster labels are assigned arbitrarily, the adjusted Rand index below is a fairer measure of agreement.
pd.crosstab(df_integrate.hbk_e_clusters, cluster_labels)
| hbk_e_clusters \ cluster label (col_0) | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 152 | 0 | 983 |
| 1 | 613 | 0 | 1519 |
| 2 | 0 | 1605 | 485 |
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(df_integrate.hbk_e_clusters, cluster_labels)
0.3193859057405077
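Because GMM labels are assigned arbitrarily, the raw diagonal of the crosstab understates the overlap between the two clusterings. As a side check (not in the original analysis), the best-case agreement can be computed by matching the labels with the Hungarian algorithm; a minimal sketch using the crosstab above:
# Hypothetical side check: permutation-optimal agreement between the two label sets.
from scipy.optimize import linear_sum_assignment

ct = pd.crosstab(df_integrate.hbk_e_clusters, cluster_labels).to_numpy()
row, col = linear_sum_assignment(-ct)                 # maximize the matched counts
print(f"best-case label agreement: {ct[row, col].sum() / ct.sum():.2%}")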
Machine Learning with PCA-based dimension reduction¶
Now, how effective a model can we build using only the informative variables from the existing printer logs? Here we use the PCA results to help build the regression model from Week 02-02 with Scikit-learn, again using MLPRegressor.
First, we perform PCA again. This time we exclude "e_total", which is the target variable (y), from the PCA inputs, since feeding the model components that already contain e_total would leak the target; we also restrict the inputs to five of the remaining variables.
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import model_selection
import numpy as np
from datetime import datetime as dt

# five input variables, with the target e_total excluded this time
df = df_integrate.loc[:,['Z_pos','hotend_temp_current','bed_temp_current','heatbreak_temp_current','bed_heater_power']]
X = df.to_numpy()
y = df_integrate['e_total'].to_numpy()
scaler = StandardScaler()
Xscale = scaler.fit_transform(X)
pca = PCA(n_components=2)
pca.fit(Xscale)
print(f"explained variance: {pca.explained_variance_}")
Xpca = pca.transform(Xscale)
explained variance: [2.35068022 1.06854097]
The PCA result is mostly the same as above.
loadings = pd.DataFrame(pca.components_.T,index=df.columns)
loadings.round(3)
| | 0 | 1 |
|---|---|---|
| Z_pos | 0.523 | 0.102 |
| hotend_temp_current | -0.237 | 0.846 |
| bed_temp_current | 0.425 | -0.003 |
| heatbreak_temp_current | 0.488 | -0.196 |
| bed_heater_power | -0.502 | -0.486 |
Now, let's proceed to build the model.
Another important point here is standardization, which we learned in this lesson. When training a machine-learning model, putting the explanatory variables (X) on a common scale is important for model accuracy. Here we standardize with Scikit-learn's StandardScaler (this produces the same values as the Xscale array computed before the PCA above).
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)   # identical values to Xscale computed above
# note: the split below uses the five standardized variables (Xscale), not the two PCA components
X_train,X_test,y_train,y_test = model_selection.train_test_split(Xscale,y,test_size=0.2)
print(X_train.shape, X_test.shape,y_train.shape,y_test.shape)
(4285, 5) (1072, 5) (4285,) (1072,)
model = MLPRegressor(hidden_layer_sizes=(50,50), max_iter=10000)
starttime = dt.now()
model.fit(X_train,y_train)
endtime = dt.now()
print("Predict:",model.score(X_test,y_test)," time:", (endtime.timestamp() - starttime.timestamp()))
fig,ax = plt.subplots(figsize=(10,7))   # figsize must be passed to plt.subplots; assigning fig.figsize has no effect
#fig.patch.set_facecolor('black')
ax.set_title("Loss Curve")
ax.plot(model.loss_curve_)
ax.set_xlabel("Iteration")
ax.set_ylabel("Loss")
#ax.tick_params(axis="both",color="white",labelcolor="white")
plt.grid()
plt.savefig("./images/ml-2nd.png")
plt.show()
Predict: 0.9988269668826948 time: 4.227211952209473
The loss curve converges nicely. The following graph shows how well the predictions match the real values on the test data: the points lie close to a straight line, indicating a near-perfect fit.
y_predict = model.predict(X_test)
y_real = y_test
plt.scatter(y_predict,y_real)
plt.xlabel("E Total Predict")
plt.ylabel("E Total Real")
plt.show()
I also checked the model performance using the following metrics:
- MAE(Mean Absolute Error)
- RMSE(Root Mean Squared Error)
- R-Squared
Each of these metrics shows that the model performs well.
from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error
mae = mean_absolute_error(y_real,y_predict)            # mean absolute error
rmse = np.sqrt(mean_squared_error(y_real,y_predict))   # root mean squared error
r2 = r2_score(y_real,y_predict)
print(f"MAE: {mae},RMSE:{rmse}, R^2:{r2}")
MAE: 29.50602971961079,RMSE:39.03881498921886, R^2:0.9988269668826948
The following shows the permutation importance of each feature.
from sklearn.inspection import permutation_importance
columns = df.columns
result = permutation_importance(model,X_test,y_test,n_repeats=15)
print(X_test.shape,y_test.shape)
importances = result.importances_mean
imp2 = pd.DataFrame({'feature_name': columns, 'feature_value': importances})   # feature_name kept as a column
imp2 = imp2.sort_values('feature_value',ascending=False)
imp2 = imp2.round(4)
imp2.to_csv('./notupload/permutation-importance-2nd.csv')
imp2
(1072, 5) (1072,)
| | feature_name | feature_value |
|---|---|---|
| 0 | Z_pos | 1.3734 |
| 4 | bed_heater_power | 0.6992 |
| 3 | heatbreak_temp_current | 0.1617 |
| 2 | bed_temp_current | 0.0910 |
| 1 | hotend_temp_current | 0.0021 |
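As a small aside (not in the original code), the spread of the importance over the 15 permutation repeats is also available via "importances_std" and can be shown as error bars:
# Hypothetical aside: plot mean permutation importance with the spread over repeats.
imp = pd.DataFrame({'mean': result.importances_mean,
                    'std': result.importances_std}, index=columns).sort_values('mean')
imp['mean'].plot.barh(xerr=imp['std'])
plt.xlabel('permutation importance (mean and std over 15 repeats)')
plt.tight_layout()
plt.show()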
These results show that we can build a model with comparable performance from a reduced set of five variables chosen with the help of the PCA loadings. Although not done here, we could likely also predict e_total well by training on the variables that contribute most strongly to the first and second principal components, or even on the principal components themselves; a sketch of the latter follows.
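A minimal sketch of that idea, assuming the Xpca array (two components) and y from the PCA cell above (its accuracy has not been checked here):
# Hypothetical follow-up: train the same regressor on the two principal components only.
Xp_train, Xp_test, yp_train, yp_test = model_selection.train_test_split(Xpca, y, test_size=0.2)
model_pca = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=10000)
model_pca.fit(Xp_train, yp_train)
print("R^2 on PCA components:", model_pca.score(Xp_test, yp_test))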
Outcome¶
In the transforms session, we learned various techniques for transforming data. For this assignment, we specifically tried PCA as a dimensionality reduction transformation technique.
Using PCA allowed us to understand which elements of the dataset were important. By applying PCA after identifying correlations through mutual information and exploring the data structure via clustering in previous sessions, we were able to pinpoint the truly critical parameters.