[Maki TANAKA] - Fab Futures - Data Science

Final Presentation¶


1. Dataset¶

I chose the "Wine dataset" from Kaggle.
This dataset contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.
The reasons I chose it are:

  1. I like wine (so the subject is easy for me to understand)
  2. The data are interesting because we can analyse the differences between the growing origins (cultivars).

2. My goal¶

My goals for this course are:

  • To learn the basics of data science
  • To learn how to analyse data
  • To learn how to use Python

I tried to find the character of the dataset by going through many steps of the analysis process.

3. Analysing the dataset¶

(My Python code was mostly generated by asking ChatGPT.)

(1) Basic data check¶

First, I loaded the dataset and displayed each attribute.
Then I displayed the table.
(This was really the first step in working with this dataset.)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# === 1. Load dataset ===
df = pd.read_csv('./datasets/Wine_dataset.csv')   
In [2]:
print(df.columns)
df
Index(['class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
       'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
       'Proanthocyanins', 'Color intensity', 'Hue',
       'OD280/OD315 of diluted wines', 'Proline '],
      dtype='object')
Out[2]:
class Alcohol Malic acid Ash Alcalinity of ash Magnesium Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue OD280/OD315 of diluted wines Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740
174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750
175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835
176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840
177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560

178 rows × 14 columns
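
As an additional basic check, a minimal sketch like the following (assuming the `df` loaded above) could confirm the data types, the class balance, and that no values are missing:

In [ ]:
# Quick sanity check of the loaded dataframe (sketch; assumes `df` from the cells above)
print(df.shape)                     # expected: (178, 14)
print(df.dtypes)                    # data type of each column
print(df.isna().sum())              # missing values per column
print(df['class'].value_counts())   # number of samples per class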

(2) Correlation among attributes¶

Then I checked the correlations among the attributes using heatmaps.

In [3]:
from scipy.stats import spearmanr
import numpy as np
import seaborn as sns

cols = df.columns

cormatrix = pd.DataFrame(index=cols,columns=cols,dtype=float)
p_val_matrix = pd.DataFrame(index=cols,columns=cols,dtype=float)

# Fill the symmetric Spearman correlation and p-value matrices pairwise
for colx in cols:
    for coly in cols:
        if colx == coly:
            cormatrix.loc[colx,coly] = 1.0
            p_val_matrix.loc[colx,coly] = 1.0
        elif(pd.isna(cormatrix.loc[colx,coly])):
            corr_coef,p_value = spearmanr(df[colx],df[coly])

            cormatrix.loc[colx,coly] = corr_coef
            cormatrix.loc[coly,colx] = corr_coef
            p_val_matrix.loc[colx,coly] = p_value
            p_val_matrix.loc[coly,colx] = p_value

plt.figure(figsize=(10,8))
sns.heatmap(p_val_matrix,annot=True,annot_kws={"size": 6},cmap="binary_r",vmin=0.0,vmax=0.05)
plt.title("p-value of spearman correlation in all variables")
plt.tight_layout()
#plt.savefig('./notupload/sp-pval.png')
plt.show()

plt.figure(figsize=(10,8))
sns.heatmap(cormatrix,annot=True,annot_kws={"size": 6}, cmap="coolwarm",vmin=-1.0,vmax=1.0)
plt.title("co-efficience of spearman correlation in all variables")
plt.tight_layout()
#plt.savefig('./notupload/spearman.png')
plt.show()
[Figure: heatmap of Spearman correlation p-values across all variables]
[Figure: heatmap of Spearman correlation coefficients across all variables]
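
To read the heatmaps more easily, a small sketch like the following (assuming the `cormatrix` computed above) could list the strongest pairwise correlations:

In [ ]:
# Sketch: list the strongest Spearman correlations (assumes `cormatrix` from the cell above)
mask = np.triu(np.ones(cormatrix.shape, dtype=bool), k=1)   # upper triangle, no diagonal
pairs = cormatrix.where(mask).stack().sort_values(key=abs, ascending=False)
print(pairs.head(10))                                       # ten strongest pairs by |rho|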

(3) Check the correlation by k-means and PCA¶

Next, the data are divided into three groups using k-means; visualising them with PCA then shows that the data clearly form three clusters.
I asked ChatGPT to make those graphs.
At first, I checked the relationship between the class and each feature.

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# First attribute:class
target_col = df.columns[0]
print("Target column:", target_col)

# ---------------------------------------------
# ④ pairplot (pairwise scatter plots)
# * Only the first 6 features are used, because using all of them takes too long
# ---------------------------------------------
sns.pairplot(df.iloc[:, :7], hue=target_col)
plt.show()
Target column: class
[Figure: pairplot of class and the first six features]

Then I checked the cluster structure using k-means and PCA.

In [6]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Drop the 'class' column and use only the 13 features
X = df.drop(columns=["class"])

# -----------------------------
# 2. Standardization (important!)
# -----------------------------
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# -----------------------------
# 3. Compress to 2 dimensions with PCA
# -----------------------------
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# -----------------------------
# 4. Clustering with K-means
# -----------------------------
kmeans = KMeans(n_clusters=3, random_state=0)
clusters = kmeans.fit_predict(X_pca)

# -----------------------------
# 5. Plot the results
# -----------------------------
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=clusters, cmap="viridis", s=40)

# Also draw the cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:,0], centers[:,1], c="red", s=200, marker="X")

# Overlay the true class labels as colours
scatter = plt.scatter(X_pca[:,0], X_pca[:,1], c=df[target_col], cmap='viridis')
plt.title("K-means Clustering on Wine Dataset (PCA 2D)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.colorbar(scatter, label='class')
plt.show()
[Figure: k-means clusters of the wine data in PCA 2D space, with cluster centers]

It became clear that the data form three clusters that correspond to the classes. So I wanted to know which features drive this separation, and I checked it with a bar graph.

In [8]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# ----------------------------------------
# 1. Load the data
# ----------------------------------------
df = pd.read_csv('./datasets/Wine_dataset.csv')

# Features excluding the target variable
X = df.drop(columns=['class'])

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ----------------------------------------
# 2. k-means clustering
# ----------------------------------------
kmeans = KMeans(n_clusters=3, random_state=0)
clusters = kmeans.fit_predict(X_scaled)

# Cluster centers
centers = kmeans.cluster_centers_

# ----------------------------------------
# 3. Compute the per-cluster differences
# ----------------------------------------
# Standard deviation of the cluster centers for each feature (size of the variation)
importance = np.std(centers, axis=0)

# Put into a dataframe
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': importance
}).sort_values(by='importance', ascending=False)

# ----------------------------------------
# 4. Visualize with a bar graph
# ----------------------------------------
plt.figure(figsize=(12, 6))
plt.bar(importance_df['feature'], importance_df['importance'])
plt.xticks(rotation=45, ha='right')
plt.title('Feature Importance Based on Cluster Center Variations (K-Means)')
plt.ylabel('Importance (std of cluster centers)')
plt.tight_layout()
plt.show()
[Figure: feature importance based on cluster-center variation (k-means)]
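
To see how well the k-means clusters actually match the true classes, a cross-tabulation could be added; a sketch, assuming `clusters` and `df` from the cell above:

In [ ]:
# Sketch: compare k-means cluster labels with the true class labels
# (the cluster numbers are arbitrary, so only the grouping pattern matters)
print(pd.crosstab(df['class'], clusters, rownames=['class'], colnames=['cluster']))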

(4) Machine Learning¶

I tried to classify the wine data into three classes with machine learning, using a neural network in JAX. The heatmap suggests that "Alcohol" and "Color intensity" are correlated, so I first tried machine learning with those two parameters.
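
The claimed correlation can also be confirmed numerically before training; a minimal sketch, assuming the same `df`:

In [ ]:
# Sketch: Spearman correlation between the two chosen features
from scipy.stats import spearmanr

rho, p = spearmanr(df['Alcohol'], df['Color intensity'])
print(f"Spearman rho = {rho:.3f}, p-value = {p:.3g}")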

In [10]:
import jax
import jax.numpy as jnp
from jax import random, grad, jit
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Alcohol → X, Color intensity → y
X_np = df[['Alcohol']].to_numpy()
y_np = df[['Color intensity']].to_numpy()

# -----------------------------
# ② Normalization (Min-Max)
# -----------------------------
scaler_x = MinMaxScaler()
scaler_y = MinMaxScaler()

X_np = scaler_x.fit_transform(X_np)
y_np = scaler_y.fit_transform(y_np)

X = jnp.array(X_np)
y = jnp.array(y_np)

# -----------------------------
# ③ Train/test split
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# -----------------------------
# ④ Parameter initialization (Xavier)
# -----------------------------
def init_params(key):
    key1, key2 = random.split(key)

    in_dim = 1
    hidden_dim = 4
    out_dim = 1

    # Xavier/He initialization
    W1 = random.normal(key1, (in_dim, hidden_dim)) * jnp.sqrt(2/in_dim)
    b1 = jnp.zeros(hidden_dim)

    W2 = random.normal(key2, (hidden_dim, out_dim)) * jnp.sqrt(2/hidden_dim)
    b2 = jnp.zeros(out_dim)

    return (W1, b1, W2, b2)

key = random.PRNGKey(0)
params = init_params(key)

# -----------------------------
# ⑤ forward
# -----------------------------
@jit
def forward(params, x):
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    out = h @ W2 + b2  # linear output, since this is regression
    return out

# -----------------------------
# ⑥ loss (MSE)
# -----------------------------
@jit
def loss(params, X, y):
    ypred = forward(params, X)
    return jnp.mean((ypred - y) ** 2)

# -----------------------------
# ⑦ update
# -----------------------------
@jit
def update(params, X, y, lr=0.01):
    grads = grad(loss)(params, X, y)
    return jax.tree.map(lambda p, g: p - lr * g, params, grads)

# -----------------------------
# ⑧ Training loop
# -----------------------------
loss_list = []

for step in range(2000):
    params = update(params, X_train, y_train, lr=0.01)
    if step % 100 == 0:
        L = loss(params, X_train, y_train)
        loss_list.append(float(L))
        print(f"step {step} / loss={L:.6f}")

# -----------------------------
# ⑨ Evaluate on the test data
# -----------------------------
test_loss = loss(params, X_test, y_test)
print("\n=== Test Loss ===")
print(float(test_loss))

# -----------------------------
# ⑩ Plot the training curve
# -----------------------------
plt.figure(figsize=(6,4))
plt.plot(loss_list)
plt.title("Training Loss Curve (every 100 steps)")
plt.xlabel("Checkpoint")
plt.ylabel("Loss")
plt.show()

# -----------------------------
# ⑪ Plot the regression curve
# -----------------------------
plt.figure(figsize=(7,5))

# Actual data
plt.scatter(X_train, y_train, label="train data", alpha=0.6)
plt.scatter(X_test, y_test, label="test data", alpha=0.6)

# Model prediction (drawn as a curve)
X_line = jnp.linspace(0, 1, 200).reshape(-1, 1)
y_line = forward(params, X_line)

plt.plot(X_line, y_line, linewidth=3, label="NN prediction")

plt.title("Alcohol → Color intensity (Neural Network Regression)")
plt.xlabel("Alcohol (normalized)")
plt.ylabel("Color intensity (normalized)")
plt.legend()
plt.show()
step 0 / loss=0.221567
step 100 / loss=0.034887
step 200 / loss=0.029292
step 300 / loss=0.027851
step 400 / loss=0.027481
step 500 / loss=0.027387
step 600 / loss=0.027362
step 700 / loss=0.027356
step 800 / loss=0.027354
step 900 / loss=0.027353
step 1000 / loss=0.027353
step 1100 / loss=0.027353
step 1200 / loss=0.027352
step 1300 / loss=0.027352
step 1400 / loss=0.027352
step 1500 / loss=0.027351
step 1600 / loss=0.027351
step 1700 / loss=0.027351
step 1800 / loss=0.027350
step 1900 / loss=0.027350

=== Test Loss ===
0.024792863056063652
[Figure: training loss curve (every 100 steps)]
[Figure: neural-network regression of Color intensity against Alcohol]
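
To judge the fit more quantitatively than by eye, an R² score on the test set could be computed; a sketch, reusing `forward`, `params`, `X_test` and `y_test` from the cell above:

In [ ]:
# Sketch: R^2 of the neural-network regression on the test set
y_pred_test = forward(params, X_test)
ss_res = jnp.sum((y_test - y_pred_test) ** 2)        # residual sum of squares
ss_tot = jnp.sum((y_test - jnp.mean(y_test)) ** 2)   # total sum of squares
print(f"Test R^2: {float(1.0 - ss_res / ss_tot):.3f}")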

It seems OK. So I then trained a minimal neural network on the whole wine dataset (13 features and 3 classes).

In [11]:
import jax
import jax.numpy as jnp
from jax import random, grad, jit
import numpy as np
import matplotlib.pyplot as plt


# All 13 features
X_np = df.drop(columns=['class']).to_numpy()

# Convert class labels (1,2,3) → (0,1,2)
y_np = df['class'].to_numpy() - 1   

# One-hot encoding
y_onehot = np.eye(3)[y_np]   # shape (178, 3)

# Convert to jax arrays
X = jnp.array(X_np, dtype=jnp.float32)
y = jnp.array(y_onehot, dtype=jnp.float32)

# --------------------------------------
# 2. Train/Test split (80/20)
# --------------------------------------
num_data = X.shape[0]
num_train = int(num_data * 0.8)

key = random.PRNGKey(0)
perm = random.permutation(key, num_data)

X_train = X[perm[:num_train]]
y_train = y[perm[:num_train]]

X_test = X[perm[num_train:]]
y_test = y[perm[num_train:]]

# --------------------------------------
# 3. Neural network architecture
# --------------------------------------
in_dim = 13        # input features
hidden_dim = 8     # hidden layer size
out_dim = 3        # wine classes

# Xavier initialization
def init_params(key):
    k1, k2 = random.split(key)
    W1 = random.normal(k1, (in_dim, hidden_dim)) * jnp.sqrt(2/in_dim)
    b1 = jnp.zeros(hidden_dim)

    W2 = random.normal(k2, (hidden_dim, out_dim)) * jnp.sqrt(2/hidden_dim)
    b2 = jnp.zeros(out_dim)

    return (W1, b1, W2, b2)

params = init_params(key)

# --------------------------------------
# 4. Forward pass
# --------------------------------------
@jit
def forward(params, x):
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    return logits

# --------------------------------------
# 5. Loss (softmax cross entropy)
# --------------------------------------
@jit
def loss(params, x, t):
    logits = forward(params, x)
    log_probs = logits - jax.nn.logsumexp(logits, axis=1, keepdims=True)
    return -jnp.mean(jnp.sum(t * log_probs, axis=1))

# --------------------------------------
# 6. Update function (SGD)
# --------------------------------------
@jit
def update(params, x, t, lr=0.01):
    grads = grad(loss)(params, x, t)
    return jax.tree.map(lambda p, g: p - lr * g, params, grads)

# --------------------------------------
# 7. Training loop
# --------------------------------------
loss_history = []

for step in range(1001):
    params = update(params, X_train, y_train, lr=0.01)

    if step % 20 == 0:
        L = loss(params, X_train, y_train)
        loss_history.append(L)
    if step % 200 == 0:
        print(f"step {step}  loss = {float(L):.4f}")

# --------------------------------------
# 8. Plot training loss curve
# --------------------------------------
plt.plot(loss_history)
plt.title("Training Loss Curve")
plt.xlabel("Iteration (x20)")
plt.ylabel("Loss")
plt.show()

# --------------------------------------
# 9. Evaluate test accuracy
# --------------------------------------
@jit
def predict(params, x):
    return jnp.argmax(forward(params, x), axis=1)

y_pred = predict(params, X_test)
y_true = jnp.argmax(y_test, axis=1)

accuracy = jnp.mean(y_pred == y_true)
print(f"\nTest Accuracy: {float(accuracy)*100:.2f}%")
step 0  loss = 1.1826
step 200  loss = 1.0912
step 400  loss = 1.0912
step 600  loss = 1.0912
step 800  loss = 1.0912
step 1000  loss = 1.0912
[Figure: training loss curve for the unstandardized 13-feature model]
Test Accuracy: 47.22%
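
Before changing the model, it is worth checking what the network is actually predicting; a sketch, reusing `y_pred` and `y_true` from the cell above:

In [ ]:
# Sketch: does the model collapse to one class?
import numpy as np

print("Predicted class counts:", np.bincount(np.array(y_pred), minlength=3))
print("True class counts:     ", np.bincount(np.array(y_true), minlength=3))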

The result is not as good as expected. So I asked ChatGPT to modify the code as follows (standardising the inputs, enlarging the hidden layer, switching tanh to ReLU, and raising the learning rate) so that the learning effect becomes observable.

In [12]:
import jax
import jax.numpy as jnp
from jax import random, grad, jit
import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

# -----------------------------
# ① Data preparation
# -----------------------------
df = pd.read_csv('./datasets/Wine_dataset.csv')
X_np = df.drop(columns=['class']).to_numpy()
y_np = df['class'].to_numpy() - 1
y_onehot = np.eye(3)[y_np]

# Standardization
X_np = StandardScaler().fit_transform(X_np)

# Convert to JAX arrays
X = jnp.array(X_np, dtype=jnp.float32)
y = jnp.array(y_onehot, dtype=jnp.float32)

# -----------------------------
# ② train/test split
# -----------------------------
num_data = X.shape[0]
num_train = int(num_data * 0.8)
key = random.PRNGKey(0)
perm = random.permutation(key, num_data)

X_train = X[perm[:num_train]]
y_train = y[perm[:num_train]]
X_test = X[perm[num_train:]]
y_test = y[perm[num_train:]]

# -----------------------------
# ③ NN architecture
# -----------------------------
in_dim = 13
hidden_dim = 16   # increased from 8
out_dim = 3

def init_params(key):
    k1, k2 = random.split(key)
    W1 = random.normal(k1, (in_dim, hidden_dim)) * jnp.sqrt(2/in_dim)
    b1 = jnp.zeros(hidden_dim)
    W2 = random.normal(k2, (hidden_dim, out_dim)) * jnp.sqrt(2/hidden_dim)
    b2 = jnp.zeros(out_dim)
    return (W1, b1, W2, b2)

params = init_params(key)

# -----------------------------
# ④ forward
# -----------------------------
@jit
def forward(params, x):
    W1, b1, W2, b2 = params
    h = jax.nn.relu(x @ W1 + b1)   # tanh → relu
    logits = h @ W2 + b2
    return logits

# -----------------------------
# ⑤ loss
# -----------------------------
@jit
def loss(params, x, t):
    logits = forward(params, x)
    log_probs = logits - jax.nn.logsumexp(logits, axis=1, keepdims=True)
    return -jnp.mean(jnp.sum(t * log_probs, axis=1))

# -----------------------------
# ⑥ update
# -----------------------------
@jit
def update(params, x, t, lr=0.05):   # higher learning rate
    grads = grad(loss)(params, x, t)
    return jax.tree.map(lambda p, g: p - lr * g, params, grads)

# -----------------------------
# ⑦ Training loop
# -----------------------------
loss_history = []

for step in range(1001):
    params = update(params, X_train, y_train, lr=0.05)
    if step % 20 == 0:
        L = loss(params, X_train, y_train)
        loss_history.append(float(L))
    if step % 200 == 0:
        print(f"step {step}  loss = {float(L):.4f}")

# -----------------------------
# ⑧ Training curve
# -----------------------------
plt.plot(loss_history)
plt.title("Training Loss Curve")
plt.xlabel("Iteration (x20)")
plt.ylabel("Loss")
plt.show()

# -----------------------------
# ⑨ Test accuracy
# -----------------------------
@jit
def predict(params, x):
    return jnp.argmax(forward(params, x), axis=1)

y_pred = predict(params, X_test)
y_true = jnp.argmax(y_test, axis=1)
accuracy = jnp.mean(y_pred == y_true)
print(f"\nTest Accuracy: {float(accuracy)*100:.2f}%")
step 0  loss = 1.4355
step 200  loss = 0.0519
step 400  loss = 0.0250
step 600  loss = 0.0160
step 800  loss = 0.0115
step 1000  loss = 0.0089
[Figure: training loss curve for the standardized 13-feature model]
Test Accuracy: 100.00%
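
100% on a single 36-sample test split can be optimistic, so a cross-validation with an off-the-shelf classifier could serve as a sanity check; a sketch using scikit-learn rather than the JAX model:

In [ ]:
# Sketch: 5-fold cross-validation as a sanity check of the 100% result
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_all = df.drop(columns=['class'])
y_all = df['class']

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X_all, y_all, cv=5)
print("CV accuracy per fold:", np.round(scores, 3))
print("Mean CV accuracy:", scores.mean())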

4. Finding the Characteristics¶

Since each class clearly has different characteristics, I wanted to understand what they are. So I asked ChatGPT how to identify the characteristics of each class.

(1) Workflow¶

ChatGPT suggested the following workflow for this analysis:

  1. Separate the data by class
  2. Examine basic statistics (mean and variance)
  3. Visualize differences between classes
  4. Analyze correlations and feature contributions
  5. Use dimensionality reduction (PCA) to observe the overall structure
  6. Identify important features from a model-based perspective (see the sketch after this list)
  7. Summarize class characteristics in words
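
Step 6 is not covered explicitly by the cells below; one possible sketch for it, using scikit-learn's RandomForestClassifier rather than the methods used elsewhere in this notebook:

In [ ]:
# Sketch for step 6: model-based feature importance with a random forest
from sklearn.ensemble import RandomForestClassifier

X_all = df.drop(columns=['class'])
y_all = df['class']

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X_all, y_all)

importances = pd.Series(rf.feature_importances_, index=X_all.columns)
print(importances.sort_values(ascending=False))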

(2) Analyse and see¶

  Steps 1 & 2: separate the data by class and check the means and standard deviations.
In [13]:
# Check the means by class
class_mean = df.groupby('class').mean()
print(class_mean)
         Alcohol  Malic acid       Ash  Alcalinity of ash   Magnesium  \
class                                                                   
1      13.744746    2.010678  2.455593          17.037288  106.338983   
2      12.278732    1.932676  2.244789          20.238028   94.549296   
3      13.153750    3.333750  2.437083          21.416667   99.312500   

       Total phenols  Flavanoids  Nonflavanoid phenols  Proanthocyanins  \
class                                                                     
1           2.840169    2.982373              0.290000         1.899322   
2           2.258873    2.080845              0.363662         1.630282   
3           1.678750    0.781458              0.447500         1.153542   

       Color intensity       Hue  OD280/OD315 of diluted wines     Proline   
class                                                                        
1             5.528305  1.062034                      3.157797  1115.711864  
2             3.086620  1.056282                      2.785352   519.507042  
3             7.396250  0.682708                      1.683542   629.895833  
In [14]:
# クラスごとの平均と標準偏差
class_stats = df.groupby('class').agg(['mean', 'std'])
print(class_stats)
         Alcohol           Malic acid                 Ash            \
            mean       std       mean       std      mean       std   
class                                                                 
1      13.744746  0.462125   2.010678  0.688549  2.455593  0.227166   
2      12.278732  0.537964   1.932676  1.015569  2.244789  0.315467   
3      13.153750  0.530241   3.333750  1.087906  2.437083  0.184690   

      Alcalinity of ash             Magnesium             ... Proanthocyanins  \
                   mean       std        mean        std  ...            mean   
class                                                     ...                   
1             17.037288  2.546322  106.338983  10.498949  ...        1.899322   
2             20.238028  3.349770   94.549296  16.753497  ...        1.630282   
3             21.416667  2.258161   99.312500  10.890473  ...        1.153542   

                Color intensity                 Hue            \
            std            mean       std      mean       std   
class                                                           
1      0.412109        5.528305  1.238573  1.062034  0.116483   
2      0.602068        3.086620  0.924929  1.056282  0.202937   
3      0.408836        7.396250  2.310942  0.682708  0.114441   

      OD280/OD315 of diluted wines               Proline               
                              mean       std         mean         std  
class                                                                  
1                         3.157797  0.357077  1115.711864  221.520767  
2                         2.785352  0.496573   519.507042  157.211220  
3                         1.683542  0.272111   629.895833  115.097043  

[3 rows x 26 columns]

Then I checked step 3 (visualising the differences between classes) using the Alcohol feature.

In [15]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x='class', y='Alcohol', data=df)
plt.title("Alcohol by Class")
plt.show()
[Figure: boxplot of Alcohol by class]

Next, boxplots of all the features are displayed.

In [ ]:
target_col = 'class'
feature_cols = df.drop(columns=target_col).columns
classes = sorted(df[target_col].unique())  # [1,2,3]
In [30]:
import matplotlib.pyplot as plt
import seaborn as sns
import math

n_features = len(feature_cols)
ncols = 3
nrows = math.ceil(n_features / ncols)

plt.figure(figsize=(15, 4 * nrows))

for i, col in enumerate(feature_cols):
    ax = plt.subplot(nrows, ncols, i + 1)

    sns.boxplot(
        x=df[target_col],
        y=df[col],
        ax=ax
    )

    ax.set_title(col)
    ax.set_xlabel("class")
    ax.set_ylabel("")

plt.tight_layout()
plt.show()
[Figure: boxplots of all features by class]
  Step 4: check the correlations.
In [32]:
corr = df.corr(method='spearman') 
print(corr['class'].sort_values(ascending=False))
class                           1.000000
Alcalinity of ash               0.569792
Nonflavanoid phenols            0.474205
Malic acid                      0.346913
Color intensity                 0.131170
Ash                            -0.053988
Magnesium                      -0.250498
Alcohol                        -0.354167
Proanthocyanins                -0.570648
Proline                        -0.576383
Hue                            -0.616570
Total phenols                  -0.726544
OD280/OD315 of diluted wines   -0.743787
Flavanoids                     -0.854908
Name: class, dtype: float64

I learnt that features with a large positive coefficient, such as "Alcalinity of ash", tend to be higher in class 3,
while features with a strongly negative coefficient, such as "Flavanoids", tend to be higher in class 1.
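
This reading can be checked directly against the class means; a minimal sketch, reusing `df`:

In [ ]:
# Sketch: check the sign interpretation against the class means
# 'Alcalinity of ash' (positive coefficient) should rise from class 1 to 3,
# while 'Flavanoids' (negative coefficient) should fall.
print(df.groupby('class')[['Alcalinity of ash', 'Flavanoids']].mean())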

I asked ChatGPT to make PCA and neural network graphs to help me understand these characteristics.

In [33]:
# ============================================================
# Relating PCA and a JAX neural network (one script)
# ============================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import jax
import jax.numpy as jnp
from jax import random, grad, jit

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = df.drop(columns="class")
y = df["class"].values - 1   # 0,1,2 に変換

# Standardization
X_std = StandardScaler().fit_transform(X)

# ------------------------------------------------------------
# 2. PCA on input features (for visualization)
# ------------------------------------------------------------
X_pca = PCA(n_components=2).fit_transform(X_std)

plt.figure(figsize=(5,4))
plt.scatter(X_pca[:,0], X_pca[:,1], c=y, cmap="viridis")
plt.title("PCA of input features")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar(label="class")
plt.tight_layout()
plt.show()

# ------------------------------------------------------------
# 3. Prepare JAX arrays
# ------------------------------------------------------------
X_jax = jnp.array(X_std, dtype=jnp.float32)
y_onehot = jnp.array(np.eye(3)[y], dtype=jnp.float32)

# ------------------------------------------------------------
# 4. Define a simple JAX neural network
# ------------------------------------------------------------
key = random.PRNGKey(0)

in_dim = X_jax.shape[1]
hidden_dim = 8
out_dim = 3

def init_params(key):
    k1, k2 = random.split(key)
    W1 = random.normal(k1, (in_dim, hidden_dim)) * jnp.sqrt(2 / in_dim)
    b1 = jnp.zeros(hidden_dim)
    W2 = random.normal(k2, (hidden_dim, out_dim)) * jnp.sqrt(2 / hidden_dim)
    b2 = jnp.zeros(out_dim)
    return W1, b1, W2, b2

params = init_params(key)

@jit
def forward(params, x):
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    return logits

@jit
def loss(params, x, t):
    logits = forward(params, x)
    log_probs = logits - jax.nn.logsumexp(logits, axis=1, keepdims=True)
    return -jnp.mean(jnp.sum(t * log_probs, axis=1))

@jit
def update(params, x, t, lr=0.01):
    grads = grad(loss)(params, x, t)
    return jax.tree.map(lambda p, g: p - lr * g, params, grads)

# ------------------------------------------------------------
# 5. Train the neural network
# ------------------------------------------------------------
for step in range(1000):
    params = update(params, X_jax, y_onehot)

# ------------------------------------------------------------
# 6. Extract hidden layer representation
# ------------------------------------------------------------
@jit
def hidden_representation(params, x):
    W1, b1, _, _ = params
    return jnp.tanh(x @ W1 + b1)

H = hidden_representation(params, X_jax)
H_np = np.array(H)

# ------------------------------------------------------------
# 7. PCA on hidden layer output
# ------------------------------------------------------------
H_pca = PCA(n_components=2).fit_transform(H_np)

plt.figure(figsize=(5,4))
plt.scatter(H_pca[:,0], H_pca[:,1], c=y, cmap="viridis")
plt.title("PCA of hidden layer representation")
plt.xlabel("Hidden PC1")
plt.ylabel("Hidden PC2")
plt.colorbar(label="class")
plt.tight_layout()
plt.show()
[Figure: PCA of input features]
[Figure: PCA of hidden layer representation]

My takeaway is that the first PCA graph shows the real data points of the dataset, while the second graph (PCA of the hidden layer representation) visualises the internal representation: through training, the neural network transforms the input features so that the trends useful for class discrimination are emphasised.
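
One possible way to quantify this impression (a sketch assuming `X_pca`, `H_pca` and `y` from the cell above, and using scikit-learn's silhouette score) is to compare how well the classes separate in the two 2-D spaces:

In [ ]:
# Sketch: class separation in input-PCA space vs hidden-layer-PCA space
from sklearn.metrics import silhouette_score

print("Silhouette (input PCA):  ", silhouette_score(X_pca, y))
print("Silhouette (hidden PCA): ", silhouette_score(H_pca, y))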

5. What I've learnt¶

I have learnt the basics of how to take raw data, extract "information" from it, and then analyse that information. I also realised that it is crucial to understand each characteristic of the data and to consider how to analyse it. The journey has only just begun, but I want to deepen my understanding of data science as I continue learning.
