7. Transforms¶
Preprocessing¶
What I've learnt:
- Preprocessing: steps we need to apply to the data before analysing it.
- Standardization: rescaling every feature to the same scale (mean 0, variance 1).
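As a minimal sketch of what standardization does (using a made-up toy array, not the Wine data):
In [ ]:
import numpy as np
# Hypothetical toy data: two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])
# Standardize: subtract each column's mean, divide by its standard deviation
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # [1. 1.]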
PCA¶
My "classmate" Rico introduced us one YouTube video about PCA, and I learnt.
As sample code is shown in that YouTube, I copied that code and modify it for using my dataset.
I also asked chatGPT to change dataset from sample data to my dataset(Wine).
In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv('./datasets/Wine_dataset.csv')
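A quick check of what was loaded (assuming, as the later cells confirm, a 'class' column plus 13 feature columns):
In [ ]:
print(df.shape)             # expected: (178, 14) -- 'class' + 13 features
print(df.columns.tolist())
df.head()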
Display main factors¶
Five principal components were extracted from all the items (with Tsuchiya-san's advice).
In [6]:
import sklearn.preprocessing
import sklearn.decomposition
X = df.to_numpy()  # make an array of the wine data (note: this still includes the 'class' column)
# Rescale all features to the same scale
# mean: 0, variance: 1
scaler = sklearn.preprocessing.StandardScaler()
Xscale = scaler.fit_transform(X)
pca = sklearn.decomposition.PCA(n_components=5)
pca.fit(Xscale)
print(f"explained variance: {pca.explained_variance_}")
explained variance: [5.56722458 2.51118402 1.45424413 0.9331603 0.88246016]
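These eigenvalues become the ratios in the next cell when divided by the total variance of the standardized data. As a sanity check (note that Xscale above still includes the standardized 'class' column, so there are 14 features in total):
In [ ]:
import numpy as np
# Total variance = sum of the per-feature variances (ddof=1, as scikit-learn uses internally)
total_var = np.var(Xscale, axis=0, ddof=1).sum()
print(pca.explained_variance_ / total_var)
# ~[0.395 0.178 0.103 0.066 0.063], matching explained_variance_ratio_ below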
Next, I check what percentage of the overall information each principal component explains.
In [7]:
contribution_ratios = pd.DataFrame(pca.explained_variance_ratio_)
contribution_ratios
Out[7]:
| | 0 |
|---|---|
| 0 | 0.395425 |
| 1 | 0.178363 |
| 2 | 0.103291 |
| 3 | 0.066280 |
| 4 | 0.062679 |
The cumulative sum is shown next; it indicates what percentage of the whole is covered as the principal components are added up one by one.
In [8]:
cumulative_contribution_ratios = contribution_ratios.cumsum()
cumulative_contribution_ratios
Out[8]:
| | 0 |
|---|---|
| 0 | 0.395425 |
| 1 | 0.573787 |
| 2 | 0.677078 |
| 3 | 0.743358 |
| 4 | 0.806037 |
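Since five components cover about 80% of the variance, scikit-learn can also be asked to choose the number of components from a target ratio directly. A minimal sketch, assuming the same Xscale as above (passing a float between 0 and 1 as n_components is standard PCA behaviour):
In [ ]:
from sklearn.decomposition import PCA
# Keep just enough components to explain at least 80% of the variance
pca80 = PCA(n_components=0.80, svd_solver='full')
pca80.fit(Xscale)
print(pca80.n_components_)  # chosen automatically from the cumulative ratio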
Then the loadings are shown, indicating which items weigh most heavily on each principal component.
In [11]:
loadings = pd.DataFrame(pca.components_.T, index=df.columns)
loadings.round(3)
Out[11]:
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| class | 0.394 | 0.006 | 0.001 | 0.122 | 0.158 |
| Alcohol | -0.136 | 0.484 | -0.207 | -0.082 | -0.251 |
| Malic acid | 0.223 | 0.224 | 0.089 | 0.470 | -0.189 |
| Ash | -0.002 | 0.316 | 0.626 | -0.250 | -0.094 |
| Alcalinity of ash | 0.224 | -0.012 | 0.612 | 0.072 | 0.047 |
| Magnesium | -0.125 | 0.301 | 0.131 | -0.163 | 0.778 |
| Total phenols | -0.359 | 0.067 | 0.147 | 0.191 | -0.145 |
| Flavanoids | -0.391 | -0.001 | 0.151 | 0.145 | -0.112 |
| Nonflavanoid phenols | 0.267 | 0.027 | 0.170 | -0.328 | -0.433 |
| Proanthocyanins | -0.279 | 0.041 | 0.150 | 0.463 | 0.092 |
| Color intensity | 0.089 | 0.530 | -0.137 | 0.072 | -0.046 |
| Hue | -0.277 | -0.278 | 0.085 | -0.435 | -0.030 |
| OD280/OD315 of diluted wines | -0.351 | -0.163 | 0.166 | 0.157 | -0.144 |
| Proline | -0.270 | 0.366 | -0.127 | -0.256 | -0.084 |
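To read this table more easily, each component's loadings can be sorted by absolute value. A small sketch, assuming the loadings DataFrame from the cell above:
In [ ]:
# For each principal component, list the three items with the largest
# absolute loadings, i.e. the features that drive that component most
for pc in loadings.columns:
    top = loadings[pc].abs().sort_values(ascending=False).head(3)
    print(f"PC{pc + 1}: {list(top.index)}")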
PCA plotting¶
I make a scree plot of the percentage of variance that each component explains, and a scatter plot of PC1 against PC2.
In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt
# ① Load the CSV (only this part changed)
df = pd.read_csv('./datasets/Wine_dataset.csv')
# ② Extract only the features (only this part changed)
# NB: the first column is 'class', so drop it; the remaining 13 columns are the features
data = df.drop(columns=[df.columns[0]])
# ③ Use the sample names as the row index (only this part changed)
# The Wine data has no sample names, so numbered names are fine
sample_names = [f"sample{i}" for i in range(len(data))]
# ④ Preprocessing for PCA (as in the original code)
scaled_data = preprocessing.scale(data.values)
pca = PCA()
pca.fit(scaled_data)
pca_data = pca.transform(scaled_data)
per_var = np.round(pca.explained_variance_ratio_ * 100, 1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]
# Scree plot
plt.bar(x=range(1, len(per_var)+1), height=per_var, tick_label=labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot (Wine Data)')
plt.show()
# Put the PCA results into a DataFrame
pca_df = pd.DataFrame(pca_data, index=sample_names, columns=labels)
# Scatter plot (PC1 vs PC2)
plt.scatter(pca_df.PC1, pca_df.PC2, alpha=0.6)
plt.title('PCA of Wine Dataset')
plt.xlabel(f'PC1 - {per_var[0]}%')
plt.ylabel(f'PC2 - {per_var[1]}%')
# Show the sample names (left commented out)
#for sample in pca_df.index:
# plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]))
plt.show()
Then I worked through Neil's sample code, applying it to my dataset.
In [13]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.decomposition
np.set_printoptions(precision=1)
#
# load wine dataset data
#
# use 'class' as the label and everything else as the features
y = df['class'].to_numpy()
X = df.drop(columns=['class']).to_numpy()
print(f"Wine data shape (records,features): {X.shape}")
#
# plot the classes against the first two original features
#
plt.scatter(X[:,0], X[:,1], c=y)
plt.xlabel(df.columns[1])
plt.ylabel(df.columns[2])
plt.title("Wine vs two original features")
plt.colorbar(label="class")
plt.show()
#
# standardize (zero mean, unit variance) to eliminate dependence on data scaling
#
print(f"data mean: {np.mean(X):.2f}, variance: {np.var(X):.2f}")
X = X-np.mean(X,axis=0)
std = np.std(X,axis=0)
Xscale = X/np.where(std > 0,std,1)
print(f"standardized data mean: {np.mean(Xscale):.2f}, variance: {np.var(Xscale):.2f}")
#
# do 5-component PCA
#
pca = sklearn.decomposition.PCA(n_components=5)
pca.fit(Xscale)
Xpca = pca.transform(Xscale)
plt.plot(pca.explained_variance_,'o')
plt.xlabel('PCA component')
plt.ylabel('explained variance')
plt.title('Wine PCA')
plt.show()
#
# plot vs first two PCA components
#
plt.scatter(Xpca[:,0],Xpca[:,1],c=y,s=3)
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.title("MNIST vs two principal components")
plt.colorbar(label="class")
plt.show()
Wine data shape (records,features): (178, 13)
data mean: 69.13, variance: 46546.42
standardized data mean: -0.00, variance: 1.00
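As a sanity check, the manual centering and scaling above should agree with StandardScaler, since both divide by the population standard deviation (ddof=0):
In [ ]:
import numpy as np
from sklearn.preprocessing import StandardScaler
X_raw = df.drop(columns=['class']).to_numpy()
manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)  # fine here: no zero-variance columns
sk = StandardScaler().fit_transform(X_raw)
print(np.allclose(manual, sk))  # expected: True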