Yosuke Tsuchiya - Fab Futures - Data Science

Week 03-02: Density Estimation¶

Assignment: Fit a probability distribution to your data

Read the log data¶

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import ReadPrusaLog as rpl
import pandas as pd
In [2]:
## benchy
df_temp_benchy = rpl.get_printer_temp_data('./datasets/printer_data_temperature_benchy.txt')
df_pos_benchy = rpl.get_printer_pos_data('./datasets/printer_data_position_benchy.txt')
df_benchy = pd.merge(df_pos_benchy,df_temp_benchy,on='timestamp',how="inner")
df_benchy = df_benchy.query('timestamp >= "2025/11/24 21:28:06" & timestamp <= "2025/11/24 22:14:24"')
df_benchy.loc[:,'filament'] = 1
df_benchy.loc[:,'lastdry'] = 7
df_benchy.loc[:,'model'] = 1

timestamp = df_benchy['ts_nano_x'].to_numpy()
ts_diff = np.diff(timestamp)
ts_diff = np.insert(ts_diff,0,0)
ts_sum = np.cumsum(ts_diff)
df_benchy.loc[:,'ts_sum'] = ts_sum

## dimensions
df_temp_dimensions = rpl.get_printer_temp_data('./experiments/printer_data_temp_dimensions.txt')
df_pos_dimensions = rpl.get_printer_pos_data('./experiments/printer_data_position_dimensions.txt')
df_dimensions = pd.merge(df_pos_dimensions,df_temp_dimensions,on='timestamp',how='inner')
df_dimensions = df_dimensions.query('timestamp >= "2025/12/01 10:33:50" & timestamp <= "2025/12/01 10:50:46"')
df_dimensions.loc[:,'filament'] = 1
df_dimensions.loc[:,'lastdry'] = 14
df_dimensions.loc[:,'model'] = 2

timestamp = df_dimensions['ts_nano_x'].to_numpy()
ts_diff = np.diff(timestamp)
ts_diff = np.insert(ts_diff,0,0)
ts_sum = np.cumsum(ts_diff)
df_dimensions.loc[:,'ts_sum'] = ts_sum

## finish
df_temp_finish = rpl.get_printer_temp_data('./experiments/printer_data_temp_finish.txt')
df_pos_finish = rpl.get_printer_pos_data('./experiments/printer_data_position_finish.txt')
df_finish = pd.merge(df_pos_finish,df_temp_finish,on='timestamp',how='inner')
df_finish = df_finish.query('timestamp <= "2025/12/01 11:22:00"')
df_finish.loc[:,'filament'] = 1
df_finish.loc[:,'lastdry'] = 14
df_finish.loc[:,'model'] = 3

timestamp = df_finish['ts_nano_x'].to_numpy()
ts_diff = np.diff(timestamp)
ts_diff = np.insert(ts_diff,0,0)
ts_sum = np.cumsum(ts_diff)
df_finish.loc[:,'ts_sum'] = ts_sum

df_integrate = pd.concat([df_benchy,df_dimensions,df_finish],ignore_index=True)

Filter the variables¶

In [3]:
df_fil = df_integrate.loc[:,['X_pos','Y_pos','Z_pos','travel_distance','E_pos','e_move','e_total','hotend_temp_current','bed_temp_current','heatbreak_temp_current','hotend_power','bed_heater_power','model']]
fil_col = df_fil.columns
fil_col
Out[3]:
Index(['X_pos', 'Y_pos', 'Z_pos', 'travel_distance', 'E_pos', 'e_move',
       'e_total', 'hotend_temp_current', 'bed_temp_current',
       'heatbreak_temp_current', 'hotend_power', 'bed_heater_power', 'model'],
      dtype='object')

Review: Probability¶

In last week's assignment, I calculated the mutual information for all variable pairs and identified the top 15 variable pairs. The results showed that many of the top pairs involved the extruder's total operation volume (e_total) and temperature-related metrics (such as heatbreak_temp and bed_temp). In particular, heatbreak_temp emerged as a strong candidate for a hidden parameter.
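For reference, the mutual information computed below uses 256-bin histogram estimates of the entropies and the identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$, where $H = -\sum_i p_i \log_2 p_i$, so the values are reported in bits.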

In [4]:
nbins = 256
# 1 dimension entropy
def entropy(dist):
    index = np.where(dist > 0) # 0 log(0) = 0
    positives = dist[index]
    return -np.sum(positives*np.log2(positives))

# 2 dimension entropy
def entropy2(dist):
    indexx,indexy = np.where(dist > 0) # 0 log(0) = 0
    positives = dist[indexx,indexy]
    return -np.sum(positives*np.log2(positives))

# information
def information(x,y):
    xhist,xedges = np.histogram(x,nbins)
    xdist = xhist/np.sum(xhist)
    yhist,yedges = np.histogram(y,nbins)
    ydist = yhist/np.sum(yhist)
    xyhist,xedges,yedges = np.histogram2d(x,y,[nbins,nbins])
    xydist = xyhist/np.sum(xyhist)
    Hx = entropy(xdist)
    Hy = entropy(ydist)
    Hxy = entropy2(xydist)

    return Hx+Hy-Hxy
In [5]:
cols = df_fil.columns

check_conb = []
for colx in cols:
    for coly in cols:
        if colx == coly:
            continue
            
        #p = p_val_matrix.at[colx,coly]
        
        #if p < 0.05:
        check_conb.append([colx,coly])
        for r in check_conb:
            if r[0] == coly and r[1] == colx:
                check_conb.remove(r)

position = 1
np.set_printoptions(precision=4)

t = df_integrate['ts_sum'].to_numpy()
results = []
#plt.figure(figsize=(30,120))
for r in check_conb:
    
    x = df_fil[r[0]].to_numpy()
    y = df_fil[r[1]].to_numpy()
    
    samples = np.column_stack((x,y))
    covar = np.cov(samples,rowvar=False)
    #print(f"{npts:.0e} points")
    #print(f"covariance:\n{covar}")

    #plt.subplot(18,3,position)
    I = information(x,y)
    #plt.scatter(x,y,c=t,cmap="viridis")
    #plt.title(f"{r[0]},{r[1]} mutual information: {I:.1f} bits, coeff: {r[3]:.3f}",fontsize=14)

    results.append([r[0],r[1],I])


pd.set_option('display.float_format', '{:.8f}'.format)
im_df = pd.DataFrame(results)
im_df.columns = ['val1','val2','MI']
im_df = im_df.sort_values('MI',ascending=False)
#im_df.to_csv('./notupload/imrank.csv')

im_df = im_df.head(15)

#position = 1
#plt.figure(figsize=(30,120))
#for i,row in im_df.iterrows():
#    val1 = row['val1']
#    val2 = row['val2']
#    im = row['IM']
#    covar = row['covar']
#    
#    x = df_fil[val1].to_numpy()
#    y = df_fil[val2].to_numpy()
#    plt.subplot(18,3,position)
#    plt.scatter(x,y,c=t,cmap="viridis")
#    plt.title(f"{val1},{val2},MI:{im:.1f} bits, covar:{covar}")

#    position = position + 1

#plt.tight_layout()
#plt.savefig('./notupload/mi.png')
#plt.show() 

print(im_df)
                      val1                    val2         MI
17                 e_total                   Z_pos 5.69411601
42  heatbreak_temp_current                 e_total 4.46009771
38  heatbreak_temp_current                   Z_pos 3.91004411
61        bed_heater_power                 e_total 3.61903158
64        bed_heater_power  heatbreak_temp_current 3.06378718
34        bed_temp_current                 e_total 3.04824054
57        bed_heater_power                   Z_pos 3.02876757
44  heatbreak_temp_current        bed_temp_current 2.68947032
15                 e_total                   X_pos 2.54290969
19                 e_total                   E_pos 2.51502211
30        bed_temp_current                   Z_pos 2.47586471
51            hotend_power                 e_total 2.39772923
63        bed_heater_power        bed_temp_current 2.37586247
36  heatbreak_temp_current                   X_pos 2.30328650
40  heatbreak_temp_current                   E_pos 2.29541352

K-means Clustering with scikit-learn¶

Based on the results of the 03-03 Probability assignment, I will focus here on the pair "heatbreak_temp_current" and "e_total".

First, I tried K-means clustering with scikit-learn.

In [6]:
from sklearn.cluster import KMeans 

x = df_fil.loc[:,['heatbreak_temp_current','e_total']].to_numpy()
kmeans_model = KMeans(n_clusters = 3).fit(x)
kmean_labels = kmeans_model.labels_

x = df_fil['heatbreak_temp_current']
y = df_fil['e_total']

I obtained the cluster labels and plotted them:

In [7]:
plt.figure(figsize=(8,6))
plt.scatter(x,y,c=kmean_labels,cmap="viridis",s=20)
plt.show()
[Figure: heatbreak_temp_current vs. e_total scatter, colored by K-means cluster label]
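As a quick cross-check on the choice of three clusters (my own addition, not part of the class code), scikit-learn's KMeans also exposes inertia_, the within-cluster sum of squares, which can be scanned over the number of clusters in a few lines. This parallels the hand-rolled elbow plot in the next section; here the features are standardized first, as in the later sections.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# scan k and record the within-cluster sum of squares (inertia)
X_scaled = StandardScaler().fit_transform(
    df_fil[['heatbreak_temp_current', 'e_total']].to_numpy())
inertias = [KMeans(n_clusters=k, n_init=10).fit(X_scaled).inertia_ for k in range(1, 6)]

plt.plot(range(1, 6), inertias, 'o-')
plt.xlabel('number of clusters')
plt.ylabel('inertia (within-cluster sum of squares)')
plt.show()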

Voronoi Tessellation¶

Next, I tried a Voronoi tessellation, adapting the class sample code. Here is the result for "heatbreak_temp_current" and "e_total", colored with the K-means cluster labels.

In [8]:
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi,voronoi_plot_2d
import scipy.stats
import numpy as np
import time
#
# k-means parameters
#
npts = 1000
nsteps = 1000
momentum = 0.
xs = [0,5,10]
ys = [0,10,5]
np.random.seed(10)
#
# set dataset
# need normalization using scipy.stats.zscore()
#
x = df_fil['heatbreak_temp_current'].to_numpy()
x = scipy.stats.zscore(x)
y = df_fil['e_total'].to_numpy()
y = scipy.stats.zscore(y)
#for i in range(len(xs)):
#    x = np.append(x,np.random.normal(loc=xs[i],scale=1,size=npts))
#    y = np.append(y,np.random.normal(loc=ys[i],scale=1,size=npts))

def kmeans(x,y,momentum,nclusters):
    #
    # choose starting points
    #
    indices = np.random.uniform(low=0,high=len(x),size=nclusters).astype(int)
    mux = x[indices]
    muy = y[indices]
    #
    # do k-means iteration
    #
    for i in range(nsteps):
        #
        # find closest points
        #
        X = np.outer(x,np.ones(len(mux)))
        Y = np.outer(y,np.ones(len(muy)))
        Mux = np.outer(np.ones(len(x)),mux)
        Muy = np.outer(np.ones(len(x)),muy)
        distances = np.sqrt((X-Mux)**2+(Y-Muy)**2)
        mins = np.argmin(distances,axis=1)
        #
        # update means
        #
        for i in range(len(mux)):
            index = np.where(mins == i)
            mux[i] = np.sum(x[index])/len(index[0])
            muy[i] = np.sum(y[index])/len(index[0])
    #
    # find distances
    #
    distances = 0
    for i in range(len(mux)):
        index = np.where(mins == i)
        distances += np.sum(np.sqrt((x[index]-mux[i])**2+(y[index]-muy[i])**2))
    return mux,muy,distances

def plot_kmeans(x,y,mux,muy):
    xmin = np.min(x)
    xmax = np.max(x)
    ymin = np.min(y)
    ymax = np.max(y)
    fig,ax = plt.subplots()
    plt.scatter(x,y,c=kmean_labels,cmap='viridis')
    plt.plot(mux,muy,'r.',markersize=20)
    plt.xlim(xmin,xmax)
    plt.ylim(ymin,ymax)
    plt.title(f"{len(mux)} clusters")
    plt.show()

def plot_Voronoi(x,y,mux,muy):
    xmin = np.min(x)
    xmax = np.max(x)
    ymin = np.min(y)
    ymax = np.max(y)
    fig,ax = plt.subplots()
    plt.scatter(x,y,c=kmean_labels,cmap='viridis')
    vor = Voronoi(np.stack((mux,muy),axis=1))
    voronoi_plot_2d(vor,ax=ax,show_points=True,show_vertices=False,point_size=20)
    plt.xlim(xmin,xmax)
    plt.ylim(ymin,ymax)
    plt.title(f"{len(mux)} clusters")
    plt.show()

distances = np.zeros(5)

mux,muy,distances[0] = kmeans(x,y,momentum,1)
plot_kmeans(x,y,mux,muy)

mux,muy,distances[1] = kmeans(x,y,momentum,2)
plot_kmeans(x,y,mux,muy)

mux,muy,distances[2] = kmeans(x,y,momentum,3)
plot_Voronoi(x,y,mux,muy)

mux,muy,distances[3] = kmeans(x,y,momentum,4)
plot_Voronoi(x,y,mux,muy)

mux,muy,distances[4] = kmeans(x,y,momentum,5)
plot_Voronoi(x,y,mux,muy)

plt.plot(np.arange(1,6),distances,'o')
plt.xlabel('number of clusters')
plt.ylabel('total distances to clusters')
#plt.xaxis.get_major_locator().set_params(integer=True)
plt.show()
[Figures: k-means results for 1 and 2 clusters, Voronoi tessellations for 3, 4, and 5 clusters, and total distances to clusters vs. number of clusters]

Gaussian Mixture Model with scikit-learn¶

I found a Gaussian Mixture Model implementation in scikit-learn, so I tried it.

As a result, I obtained GMM cluster labels and added them to the log dataframe.

The clusters are similar to those from K-means clustering.

In [20]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from matplotlib.colors import LogNorm

gmm = GaussianMixture(n_components=3,covariance_type='full')

data = df_fil.loc[:,['heatbreak_temp_current','e_total']].to_numpy()
scaler = StandardScaler()
sdata = scaler.fit_transform(data)
gmm.fit(sdata)
cluster_labels = gmm.predict(sdata)

x = np.linspace(-3, 2)
y = np.linspace(-2, 2)
X, Y = np.meshgrid(x, y)
XX = np.array([X.ravel(), Y.ravel()]).T
Z =  - gmm.score_samples(XX)
Z = Z.reshape(X.shape)

fig,ax = plt.subplots(figsize=(6,4))

ax.scatter(sdata[:,0],sdata[:,1],c=cluster_labels,cmap='viridis',s=0.5)
cont = ax.contourf(X,Y,Z,norm=LogNorm(vmin=1.0,vmax=100.0),levels=np.logspace(-1, 3, 20), alpha=0.2, linestyles='dashed')
ax.scatter(sdata[:,0],sdata[:,1],c=cluster_labels,cmap='viridis',s=1)

plt.savefig('./notupload/sklearn-gmm.png')
plt.show()

df_fil['clusters'] = cluster_labels
[Figure: GMM clusters on standardized heatbreak_temp_current vs. e_total, with log-scaled negative log-likelihood contours]
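To put a number on how similar the GMM clusters are to the K-means clusters, a minimal sketch (my addition) compares the two label sets with the adjusted Rand index, assuming kmean_labels and cluster_labels from the cells above are still in scope:

from sklearn.metrics import adjusted_rand_score

# 1.0 = identical partitions (up to label permutation), ~0 = chance-level agreement
print(adjusted_rand_score(kmean_labels, cluster_labels))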

Kernel Density Estimation - Gaussian Mixture Model¶

Then, I tried the GMM sample code provided in class.
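For reference, the model fitted by the E-M loop below is an axis-aligned Gaussian mixture, $p(x,y) = \sum_{c=1}^{3} \pi_c \,\mathcal{N}(x;\mu_{x,c},\sigma_{x,c}^2)\,\mathcal{N}(y;\mu_{y,c},\sigma_{y,c}^2)$: the E-step computes the responsibilities $p(c \mid x,y)$ (pcgv in the code), and the M-step re-estimates the weights pc, the means mux/muy, and the variances varx/vary (with a 0.1 floor added to keep the variances from collapsing).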

In [17]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import scipy.stats
from sklearn import mixture
In [18]:
x = df_fil['heatbreak_temp_current'].to_numpy()
x = scipy.stats.zscore(x)
y = df_fil['e_total'].to_numpy()
y = scipy.stats.zscore(y)

#
# Gaussian mixture model parameters
#
npts = len(x)
nclusters = 3
nsteps = 25
nplot = 100
xs = [0,5,10]
ys = [0,10,5]
np.random.seed(0)

indices = np.random.uniform(low=0,high=len(x),size=nclusters).astype(int)
mux = x[indices]
muy = y[indices]
varx = (np.max(x)-np.min(x))**2
vary = (np.max(y)-np.min(y))**2
pc = np.ones(nclusters)/nclusters
#
# plot before iteration
#
fig,ax = plt.subplots()
plt.plot(x,y,'.')
plt.errorbar(mux,muy,xerr=np.sqrt(varx),yerr=np.sqrt(vary),fmt='r.',markersize=20)
plt.autoscale()
plt.title('before iteration')
plt.savefig('./notupload/gm_before.png')
plt.show()

#
# do E-M iterations
#
for i in range(nsteps):
    #
    # construct matrices
    #
    xm = np.outer(x,np.ones(nclusters))
    ym = np.outer(y,np.ones(nclusters))
    muxm = np.outer(np.ones(npts),mux)
    muym = np.outer(np.ones(npts),muy)
    varxm = np.outer(np.ones(npts),varx)
    varym = np.outer(np.ones(npts),vary)
    pcm = np.outer(np.ones(npts),pc)
    #
    # use model to update probabilities
    #
    pvgc = (1/np.sqrt(2*np.pi*varxm))*\
       np.exp(-(xm-muxm)**2/(2*varxm))*\
       (1/np.sqrt(2*np.pi*varym))*\
       np.exp(-(ym-muym)**2/(2*varym))
    pvc = pvgc*np.outer(np.ones(npts),pc)
    pcgv = pvc/np.outer(np.sum(pvc,1),np.ones(nclusters))
    #
    # use probabilities to update model
    #
    pc = np.sum(pcgv,0)/npts
    mux = np.sum(xm*pcgv,0)/(npts*pc)
    muy = np.sum(ym*pcgv,0)/(npts*pc)
    varx = 0.1+np.sum((xm-muxm)**2*pcgv,0)/(npts*pc)
    vary = 0.1+np.sum((ym-muym)**2*pcgv,0)/(npts*pc)
#
# plot after iteration
#
fig,ax = plt.subplots()
plt.plot(x,y,'.')
plt.errorbar(mux,muy,xerr=np.sqrt(varx),yerr=np.sqrt(vary),fmt='r.',markersize=20)
plt.autoscale()
plt.title('after iteration')
plt.savefig('./notupload/gm_after.png')
plt.show()
#
# plot distribution
#
xplot = np.linspace(np.min(x),np.max(x),nplot)
yplot = np.linspace(np.min(y),np.max(y),nplot)
(X,Y) = np.meshgrid(xplot,yplot)
p = np.zeros((nplot,nplot))
for c in range(nclusters):
    p += np.exp(-(X-mux[c])**2/(2*varx[c]))/np.sqrt(2*np.pi*varx[c])\
       *np.exp(-(Y-muy[c])**2/(2*vary[c]))/np.sqrt(2*np.pi*vary[c])\
       *pc[c]
fig, ax = plt.subplots(subplot_kw={"projection":"3d"})
ax.plot_surface(X,Y,p,cmap='viridis')
plt.title('probability distribution')
plt.savefig('./notupload/gm_3d.png')
plt.show()
[Figures: "before iteration", "after iteration", and "probability distribution" (3D surface)]

Cluster Weighted Modeling¶

Cluster Weighted Modeling is very interesting, but the class sample code handles only two clusters in two dimensions. As seen above, "heatbreak_temp_current" and "e_total" appear to form three clusters, so the code needed to be revised.
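For orientation, my understanding of the class formulation is that CWM models the joint density of the observed state $s$ and the inputs $x$ as a mixture over clusters, $p(s, x) = \sum_c p(s \mid x, c)\, p(x \mid c)\, p(c)$, with $p(s \mid x, c)$ kept as a per-cluster table. Going from two to three clusters (and to the three GMM-derived states used below) mainly means enlarging the arrays that hold $p(c)$, $p(x \mid c)$, and $p(s \mid x, c)$, which is what the revised code does.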

So, I asked ChatGPT as follows:

This sample code for "Cluster Weighted Modeling" handles only two clusters in two dimensions. Please revise it to treat three clusters.

<showing the sample code of Cluster Weighted Modeling (variance ver)>

ChatGPT produced the following code. I modified "print_ids" (which defines the states) to use the labels created by the scikit-learn GMM.

In [17]:
import numpy as np
import matplotlib.pyplot as plt

# ===========================
# 1. Data preparation
# ===========================
# Example: build the state labels from the print IDs
#    (assumes there are 3 kinds of print_id)
print_ids = df_fil["clusters"].values
unique_ids = np.unique(print_ids)
id_to_state = {pid: i for i, pid in enumerate(unique_ids)}
states = np.array([id_to_state[pid] for pid in print_ids])   # integer labels 0,1,2,...

# features: heatbreak_temp_current vs e_total
x = df_fil["heatbreak_temp_current"].values
y = df_fil["e_total"].values

# z-score so the scales roughly match
x = (x - x.mean()) / x.std()
y = (y - y.mean()) / y.std()

points = np.vstack((x, y))     # 2×N
dim = 2
nstates = len(unique_ids)
npts = points.shape[1]

# ===========================
# 2. Model hyperparameters
# ===========================
nclusters = 3          # number of clusters: feel free to try other values
momentum = 0.9
rate = 0.1
min_var = 0.1
niterations = 100

np.random.seed(10)

# ===========================
# 3. Parameter initialization
# ===========================
indices = np.random.randint(0, npts, nclusters)
means = points[:, indices]                          # (2, K)
dmean = np.zeros((dim, nclusters))
variances = np.outer(np.var(points, axis=1),        # (2, K)
                     np.ones(nclusters))
pc = np.ones(nclusters) / nclusters                 # p(c)
pxgc = np.zeros((nclusters, npts))                  # p(x|c)
psgxc = np.ones((nclusters, nstates)) / nstates     # p(s|x,c), initially uniform

# ===========================
# 4. Visualization helper
# ===========================
def plot_clusters(points, states, means, variances, psgxc, title):
    """Draw the cluster centers plus their variances (±1σ crosses)."""
    # use a different color/marker per state (assuming 3 states for now)
    markers = ['o', 's', '^', 'x']
    colors  = ['C0', 'C1', 'C2', 'C3']
    for s in range(nstates):
        idx = (states == s)
        plt.plot(points[0, idx],
                 points[1, idx],
                 ls='',
                 marker=markers[s % len(markers)],
                 markeredgecolor=colors[s % len(colors)],
                 alpha=0.4,
                 label=f'state {s} (print_id={unique_ids[s]})')

    # clusters
    for c in range(means.shape[1]):
        mx, my = means[0, c], means[1, c]
        # the state this cluster is most responsible for
        s_hat = np.argmax(psgxc[c, :])
        dx = np.sqrt(variances[0, c])
        dy = np.sqrt(variances[1, c])
        plt.plot([mx - dx, mx + dx], [my, my], 'r-')
        plt.plot([mx, mx], [my - dy, my + dy], 'r-')
        plt.plot([mx], [my],
                 'r',
                 marker=f'${s_hat}$',
                 markersize=14)

    plt.title(title)
    plt.xlabel("heatbreak_temp_current (z-score)")
    plt.ylabel("e_total (z-score)")
    plt.legend(loc='best', fontsize=8)
    plt.tight_layout()

# ===========================
# 5. Plot before the EM iterations
# ===========================
plt.figure(figsize=(6, 5))
plot_clusters(points, states, means, variances, psgxc,
              "clusters before EM")
plt.show()

# ===========================
# 6. EM iterations
# ===========================
for it in range(niterations):
    # --- E-step: compute p(x|c) ---
    for c in range(nclusters):
        mean = np.outer(means[:, c], np.ones(npts))
        var  = np.outer(variances[:, c], np.ones(npts))
        # product of independent 1-D Gaussians over the dimensions
        pxgc[c, :] = np.prod(
            np.exp(-(points - mean) ** 2 / (2 * var)) /
            np.sqrt(2 * np.pi * var),
            axis=0
        )

    # p(x, c) = p(x|c) p(c)
    pxc = pxgc * np.outer(pc, np.ones(npts))

    # p(c|x)
    pcgx = pxc / np.outer(np.ones(nclusters), np.sum(pxc, axis=0) + 1e-12)

    # --- weighting via the observed states ---
    # states: the given labels 0..nstates-1
    psxc = psgxc[:, states] * pxc              # p(s,x,c)
    pcgsx = psxc / np.outer(np.ones(nclusters),
                            np.sum(psxc, axis=0) + 1e-12)  # p(c|s,x)

    # --- M-step: update p(c) ---
    pc = np.sum(pcgsx, axis=1) / npts         # (K,)

    # --- M-step: update each cluster's mean and variance ---
    for c in range(nclusters):
        w = pcgsx[c, :]                       # (N,)
        w_sum = np.sum(w) + 1e-12
        newmean = momentum * dmean[:, c] + \
                  np.sum(points * np.outer(np.ones(dim), w), axis=1) / w_sum
        dmean[:, c] = newmean - means[:, c]
        means[:, c] += rate * dmean[:, c]

        m = np.outer(means[:, c], np.ones(npts))
        variances[:, c] = (np.sum((points - m) ** 2 *
                                  np.outer(np.ones(dim), w), axis=1) /
                           w_sum + min_var)

    # --- update the cluster-to-state responsibilities p(s|x,c) ---
    for s in range(nstates):
        idx = np.where(states == s)[0]
        if len(idx) == 0:
            continue
        psgxc[:, s] = (np.sum(pcgsx[:, idx], axis=1) /
                       (np.sum(pcgsx, axis=1) + 1e-12))

# ===========================
# 7. Clusters and decision boundary after EM
# ===========================
plt.figure(figsize=(6, 5))
plot_clusters(points, states, means, variances, psgxc,
              "clusters after EM")
plt.show()

# --- draw the decision boundary (example: the probability of state 0) ---
ngrid = 100
xmin, xmax = np.min(points[0, :]), np.max(points[0, :])
ymin, ymax = np.min(points[1, :]), np.max(points[1, :])
xx = np.linspace(xmin, xmax, ngrid)
yy = np.linspace(ymin, ymax, ngrid)
mx, my = np.meshgrid(xx, yy)
grid_pts = np.vstack((mx.ravel(), my.ravel()))     # 2×(ngrid^2)

pxgc_grid = np.zeros((nclusters, ngrid * ngrid))
for c in range(nclusters):
    mean = np.outer(means[:, c], np.ones(ngrid * ngrid))
    var  = np.outer(variances[:, c], np.ones(ngrid * ngrid))
    pxgc_grid[c, :] = np.prod(
        np.exp(-(grid_pts - mean) ** 2 / (2 * var)) /
        np.sqrt(2 * np.pi * var),
        axis=0
    )

pxc_grid = pxgc_grid * np.outer(pc, np.ones(ngrid * ngrid))  # p(x,c)

# probability of state 0, P(state=0 | x)
state_idx = 0
p_state0 = np.sum(np.outer(psgxc[:, state_idx], np.ones(ngrid * ngrid)) *
                  pxc_grid, axis=0) / (np.sum(pxc_grid, axis=0) + 1e-12)
p_state0 = p_state0.reshape(ngrid, ngrid)

plt.figure(figsize=(6, 5))
plot_clusters(points, states, means, variances, psgxc,
              "clusters & decision boundary")
# draw the P(state0)=0.5 contour as the decision boundary
cs = plt.contour(mx, my, p_state0, levels=[0.5], colors='k')
plt.clabel(cs, fmt='state0=0.5', fontsize=8)
plt.show()

# 3D probability surface (optional)
#from mpl_toolkits.mplot3d import Axes3D  # noqa: F401
#fig, ax = plt.subplots(subplot_kw={"projection": "3d"}, figsize=(7, 5))
#ax.plot_wireframe(mx, my, p_state0, rstride=5, cstride=5, color='gray')

#ax.set_xlabel("heatbreak_temp_current (z)")
#ax.set_ylabel("e_total (z)")
#ax.set_zlabel("P(state 0 | x)")
#plt.title("probability surface for state 0")
#plt.show()

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# surface
ax.plot_wireframe(mx, my, p_state0, rstride=10, cstride=10, cmap='viridis', alpha=0.8)  # state-0 probability surface (p_state0 from the grid above)

# scatter (bottom, z=0)
ax.scatter(points[0,:], points[1,:], zs=0, zdir='z',
           s=3, c=states, cmap='viridis', alpha=0.6)

#ax.xaxis.pane.set_facecolor("black")
#ax.yaxis.pane.set_facecolor("black")
#ax.zaxis.pane.set_facecolor("black")
#ax.patch.set_facecolor("black")
ax.set_xlabel('heatbreak_temp_current (z)')
ax.set_ylabel('e_total (z)')
ax.set_zlabel('probability state 0')

plt.show()
[Figures: "clusters before EM", "clusters after EM", "clusters & decision boundary", and the 3D probability surface with the data plotted at z=0]

Also, to plot the covariance of the clusters, I asked ChatGPT as follows:

This sample code is the covariance version of "Cluster Weighted Modeling". Please revise it to treat three clusters.

<showing the sample code of Cluster Weighted Modeling (covariance ver)>

ChatGPT returned the following code. I modified "states_raw" to use cluster_labels, the labels created by the scikit-learn GMM.

In [16]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa

# =========================
# 1. Data preparation
# =========================
# Adjust this to your own DataFrame / column names
# Example: assumes df_fil has heatbreak_temp_current, e_total, print_id
x_raw = df_fil["heatbreak_temp_current"].to_numpy()
y_raw = df_fil["e_total"].to_numpy()
states_raw = df_fil["clusters"].to_numpy()    # assumed to be the 3 states 0, 1, 2

# z-score normalization (zero mean, unit variance)
x = (x_raw - x_raw.mean()) / x_raw.std()
y = (y_raw - y_raw.mean()) / y_raw.std()

points = np.vstack((x, y))       # shape = (2, N)
states = states_raw.astype(int)  # shape = (N,)
npts = points.shape[1]

# count the unique states and set nstates automatically (assumes they are 0, 1, 2)
unique_states = np.unique(states)
nstates = len(unique_states)

# =========================
# 2. Model hyperparameters
# =========================
dim = 2
nclusters = 3       # changed here to 3 clusters
momentum = 0.5
rate = 0.1
min_var = 0.1
niterations = 100
np.random.seed(10)

# =========================
# 3. Parameter initialization
# =========================
#indices = np.random.randint(0, npts, nclusters)
indices = [1200,2400,3600]
print(indices)
means = points[:, indices]                      # (2, nclusters)
dmean = np.zeros((dim, nclusters))              # velocity term (for the momentum update)

covariances = np.zeros((nclusters, 2, 2))
# initialize every cluster with the same isotropic covariance
var_x = np.var(points[0, :])
var_y = np.var(points[1, :])
for c in range(nclusters):
    covariances[c, 0, 0] = var_x
    covariances[c, 1, 1] = var_y

pc = np.ones(nclusters) / nclusters            # p(c)
pxgc = np.zeros((nclusters, npts))             # p(x | c)
psgxc = np.ones((nclusters, nstates)) / nstates  # initial p(s | x,c): uniform over states

# =========================
# 4. Plotting helpers
# =========================
markers = ['o', 's', '^']   # markers for states 0, 1, 2

def plot_data_with_states():
    for s in range(nstates):
        idx = np.where(states == unique_states[s])[0]
        plt.scatter(points[0, idx], points[1, idx],
                    s=5, alpha=0.5,
                    label=f"state {s} (id={unique_states[s]})",
                    marker=markers[s % len(markers)])
    plt.xlabel("heatbreak_temp_current (z-score)")
    plt.ylabel("e_total (z-score)")

def plot_covar(scale=1.0):
    """Draw the cluster centers and the principal axes of their covariances."""
    for cluster in range(means.shape[1]):
        x0 = means[0, cluster]
        y0 = means[1, cluster]
        # label of the most likely state for this cluster
        state_type = np.argmax(psgxc[cluster, :])

        # eigenvalues and eigenvectors of the covariance
        w, v = np.linalg.eig(covariances[cluster])
        w = np.sqrt(w)  # standard-deviation scale
        w *= scale

        # first principal direction
        plt.plot([x0 - w[0] * v[0, 0], x0 + w[0] * v[0, 0]],
                 [y0 - w[0] * v[1, 0], y0 + w[0] * v[1, 0]],
                 'r',lw=2)
        # second principal direction
        plt.plot([x0 - w[1] * v[0, 1], x0 + w[1] * v[0, 1]],
                 [y0 - w[1] * v[1, 1], y0 + w[1] * v[1, 1]],
                 'r',lw=2)

        #plt.plot([x0], [y0], 'r', marker=f'${state_type}$', markersize=15)
        plt.plot([x0], [y0], 'ro', markersize=10)
        plt.text(x0, y0, f'{state_type}', color='r', fontsize=12,ha='center', va='center')

# =========================
# 5. Clusters before EM
# =========================
plt.figure()
plot_data_with_states()
plot_covar(scale=1.5)
plt.title("covariance clusters BEFORE EM")
plt.legend()
plt.show()

# =========================
# 6. EM iterations
# =========================
for it in range(niterations):
    # ---- E step ----
    for cluster in range(nclusters):
        mean = np.outer(means[:, cluster], np.ones(npts))   # (2, N)
        cinv = np.linalg.pinv(covariances[cluster])         # (2, 2)
        dif = points - mean                                 # (2, N)
        # Gaussian probability density p(x | c)
        pxgc[cluster, :] = (
            np.sqrt(np.linalg.det(cinv)) / (2 * np.pi) ** (dim / 2)
            * np.exp(-0.5 * np.sum(dif * (cinv @ dif), axis=0))
        )

    # p(x,c) = p(x | c) p(c)
    pxc = pxgc * np.outer(pc, np.ones(npts))                # (C, N)

    # p(c | x)
    pcgx = pxc / np.outer(np.ones(nclusters), np.sum(pxc, axis=0))

    # p(s,x,c) ∝ p(s | x,c) p(x,c)  (select columns by the state labels)
    psxc = psgxc[:, states] * pxc                           # (C, N)

    # p(c | s,x)
    pcgsx = psxc / np.outer(np.ones(nclusters), np.sum(psxc, axis=0))

    # p(c)
    pc = np.sum(pcgsx, axis=1) / npts

    # ---- M step ----
    for cluster in range(nclusters):
        weights = pcgsx[cluster, :]                         # (N,)
        W = np.outer(np.ones(dim), weights)                 # (2, N)

        # new mean
        newmean = momentum * dmean[:, cluster] + \
                  np.sum(points * W, axis=1) / np.sum(W, axis=1)

        dmean[:, cluster] = newmean - means[:, cluster]
        means[:, cluster] += rate * dmean[:, cluster]

        # covariance update
        m = np.outer(means[:, cluster], np.ones(npts))      # (2, N)
        dif = points - m                                    # (2, N)
        # covariance matrix: Σ = Σ_i w_i (x_i - μ)(x_i - μ)^T / Σ_i w_i
        cov = (dif * W) @ dif.T / np.sum(weights)
        covariances[cluster] = cov + min_var * np.eye(dim)

    # update p(s | x,c)
    for si, state_val in enumerate(unique_states):
        idx = np.where(states == state_val)[0]
        # take only the points belonging to state s and sum the per-cluster weights
        psgxc[:, si] = np.sum(pcgsx[:, idx], axis=1) / np.sum(pcgsx, axis=1)

# =========================
# 7. Clusters after EM
# =========================
plt.figure()
plot_data_with_states()
plot_covar(scale=1.0)
plt.title("covariance clusters AFTER EM")
plt.legend()
plt.show()

# =========================
# 8. Decision boundary (the line where the probability of state 0 is 0.5)
# =========================
ngrid = 100
xmin, xmax = x.min(), x.max()
ymin, ymax = y.min(), y.max()
gx = np.linspace(xmin, xmax, ngrid)
gy = np.linspace(ymin, ymax, ngrid)
mx, my = np.meshgrid(gx, gy)
grid_points = np.vstack((mx.ravel(), my.ravel()))          # (2, G)

pxgc_grid = np.zeros((nclusters, grid_points.shape[1]))
for cluster in range(nclusters):
    mean = np.outer(means[:, cluster], np.ones(grid_points.shape[1]))
    cinv = np.linalg.pinv(covariances[cluster])
    dif = grid_points - mean
    pxgc_grid[cluster, :] = (
        np.sqrt(np.linalg.det(cinv)) / (2 * np.pi) ** (dim / 2)
        * np.exp(-0.5 * np.sum(dif * (cinv @ dif), axis=0))
    )

pxc_grid = pxgc_grid * np.outer(pc, np.ones(grid_points.shape[1]))
denom = np.sum(pxc_grid, axis=0)

# probability of state 0, p(state=0 | x)
p_state0 = np.sum(np.outer(psgxc[:, 0], np.ones(grid_points.shape[1])) * pxc_grid,
                  axis=0) / denom
p_state0 = p_state0.reshape((ngrid, ngrid))

plt.figure()
plot_data_with_states()
plot_covar(scale=1.0)
plt.contour(mx, my, p_state0, levels=[0.5], colors='k')
plt.title("covariance clusters & decision boundary (p(state0)=0.5)")
plt.legend()
plt.show()

# =========================
# 9. 3D probability surface with the log trajectory on the base plane
# =========================
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111, projection='3d')

ax.plot_wireframe(mx, my, p_state0,color="gray")

# plot the original data trajectory on the base plane (z=0)
ax.scatter(x, y, np.zeros_like(x),c=states, cmap='viridis', alpha=0.3)

#ax.xaxis.pane.set_facecolor("black")
#ax.yaxis.pane.set_facecolor("black")
#ax.zaxis.pane.set_facecolor("black")
#ax.patch.set_facecolor("black")
ax.set_xlabel("heatbreak_temp_current (z)")
ax.set_ylabel("e_total (z)")
ax.set_zlabel("p(state 0)")
plt.title("probability surface for state 0")
plt.show()
[1200, 2400, 3600]
[Figures: "covariance clusters BEFORE EM", "covariance clusters AFTER EM", "covariance clusters & decision boundary (p(state0)=0.5)", and "probability surface for state 0"]

Outcome¶

Regarding the relationship between heatbreak_temp and e_total, we tested clustering using three methods: K-means clustering, Gaussian mixture modeling, and Cluster Weighted Modeling. The results clearly showed that, regardless of the model used, this pair of variables cleanly separated into three distinct clusters. This likely indicates that the printing process involves several distinct phases, each with different total extruder operation levels.
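One quick way to test the "distinct phases" interpretation would be to plot the GMM cluster label against elapsed time within each print; if the clusters really are phases, the label should change in contiguous blocks rather than scattering. A minimal sketch (my addition, assuming df_fil and df_integrate still share the same row order as in the cells above):

plt.figure(figsize=(8, 3))
# color by print model so the three prints can be told apart
plt.scatter(df_integrate['ts_sum'], df_fil['clusters'],
            c=df_integrate['model'], cmap='viridis', s=5)
plt.xlabel('elapsed time within print (ts_sum)')
plt.ylabel('GMM cluster label')
plt.show()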

In machine learning, these hidden parameters can be incorporated as features, and irrelevant ones removed, before retraining the model to improve prediction accuracy. Beyond this, what other applications could this clustering yield?