[Maki TANAKA] - Fab Futures - Data Science

7. Transforms

Preprocessing

What I've learnt:

  • Preprocessing -> transformations we need to apply to the data before analysing it.
  • Standardization -> all features are rescaled to the same scale (mean 0, variance 1); see the sketch below.
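
A minimal sketch of what standardization does, using toy numbers rather than the Wine data:

In [ ]:
import numpy as np

# Standardize by hand: shift each column to mean 0, then divide by its
# standard deviation so every feature ends up with variance 1.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
Xscale = (X - X.mean(axis=0)) / X.std(axis=0)
print(Xscale.mean(axis=0))  # [0. 0.]
print(Xscale.std(axis=0))   # [1. 1.]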

PCA

My "classmate" Rico introduced us one YouTube video about PCA, and I learnt. As sample code is shown in that YouTube, I copied that code and modify it for using my dataset.
I also asked chatGPT to change dataset from sample data to my dataset(Wine).

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv('./datasets/Wine_dataset.csv')

Display main factors

Five principal components were extracted from all the features (with Tsuchiya-san's advice).

In [6]:
import sklearn.preprocessing
import sklearn.decomposition

X = df.to_numpy()  # make an array of the wine data

# Put all features on the same scale:
# mean 0, variance 1
scaler = sklearn.preprocessing.StandardScaler()
Xscale = scaler.fit_transform(X)
pca = sklearn.decomposition.PCA(n_components=5)
pca.fit(Xscale)
print(f"explained variance: {pca.explained_variance_}")
explained variance: [5.56722458 2.51118402 1.45424413 0.9331603  0.88246016]
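
These explained variances are the leading eigenvalues of the covariance matrix of the standardized data, which can be checked by hand. A quick sketch, reusing Xscale from the cell above:

In [ ]:
import numpy as np

# Eigenvalues of the sample covariance matrix, largest first;
# the first 5 should match pca.explained_variance_ above.
cov = np.cov(Xscale, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]
print(eigvals[:5])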

Next, I check what percentage of the overall information each principal component explains.

In [7]:
contribution_ratios = pd.DataFrame(pca.explained_variance_ratio_)
contribution_ratios
Out[7]:
          0
0  0.395425
1  0.178363
2  0.103291
3  0.066280
4  0.062679
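
Each ratio is that component's variance divided by the total variance of the data (the trace of the covariance matrix), so it can also be computed directly. A sketch, assuming Xscale and pca from the cells above:

In [ ]:
import numpy as np

# explained_variance_ratio_ = explained_variance_ / total variance
total_var = np.cov(Xscale, rowvar=False).trace()
print(pca.explained_variance_ / total_var)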

The cumulative sum is shown next; it indicates what percentage of the whole is explained as the principal components are added one by one.

In [8]:
cumulative_contribution_ratios = contribution_ratios.cumsum()
cumulative_contribution_ratios
Out[8]:
          0
0  0.395425
1  0.573787
2  0.677078
3  0.743358
4  0.806037
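
A common rule of thumb is to keep enough components to cover about 80% of the variance. A small sketch of picking that number automatically, assuming Xscale from above:

In [ ]:
import numpy as np
import sklearn.decomposition

# Fit the full decomposition, then take the first index where the
# cumulative explained-variance ratio reaches 80%.
pca_full = sklearn.decomposition.PCA().fit(Xscale)
cum = np.cumsum(pca_full.explained_variance_ratio_)
n_keep = int(np.argmax(cum >= 0.80)) + 1
print(f"components needed for 80% of the variance: {n_keep}")  # 5, matching the table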

Then the loadings are shown: they indicate how strongly each original feature contributes to each principal component.

In [11]:
loadings = pd.DataFrame(pca.components_.T, index=df.columns)
loadings.round(3)
Out[11]:
                                  0      1      2      3      4
class                         0.394  0.006  0.001  0.122  0.158
Alcohol                      -0.136  0.484 -0.207 -0.082 -0.251
Malic acid                    0.223  0.224  0.089  0.470 -0.189
Ash                          -0.002  0.316  0.626 -0.250 -0.094
Alcalinity of ash             0.224 -0.012  0.612  0.072  0.047
Magnesium                    -0.125  0.301  0.131 -0.163  0.778
Total phenols                -0.359  0.067  0.147  0.191 -0.145
Flavanoids                   -0.391 -0.001  0.151  0.145 -0.112
Nonflavanoid phenols          0.267  0.027  0.170 -0.328 -0.433
Proanthocyanins              -0.279  0.041  0.150  0.463  0.092
Color intensity               0.089  0.530 -0.137  0.072 -0.046
Hue                          -0.277 -0.278  0.085 -0.435 -0.030
OD280/OD315 of diluted wines -0.351 -0.163  0.166  0.157 -0.144
Proline                      -0.270  0.366 -0.127 -0.256 -0.084
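
To pick out the major items for each component, the loadings can be sorted by absolute value. A sketch, assuming loadings from the cell above:

In [ ]:
# The features with the largest absolute loadings drive each component.
for pc in range(3):
    top = loadings[pc].abs().sort_values(ascending=False).head(3)
    print(f"PC{pc + 1} strongest features:\n{top}\n")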

PCA plotting

I make a bar chart of the percentage of variance each component explains (a scree plot) and a scatter plot of PC1 vs PC2.

In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt

# ① Load the CSV (only this part was changed)
df = pd.read_csv('./datasets/Wine_dataset.csv')

# ② Keep only the feature columns (only this part was changed)
# The first column is "Class", so drop it; the remaining 13 columns are features
data = df.drop(columns=[df.columns[0]])

# ③ Use sample names as the row index (only this part was changed)
# The Wine data has no sample names, so numbered labels are fine
sample_names = [f"sample{i}" for i in range(len(data))]

# ④ Preprocessing for the PCA (as in the original code)
scaled_data = preprocessing.scale(data.values)

pca = PCA()
pca.fit(scaled_data)
pca_data = pca.transform(scaled_data)

per_var = np.round(pca.explained_variance_ratio_ * 100, 1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]

# Scree plot
plt.bar(x=range(1, len(per_var)+1), height=per_var, tick_label=labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot (Wine Data)')
plt.show()

# Put the PCA results into a DataFrame
pca_df = pd.DataFrame(pca_data, index=sample_names, columns=labels)

# Scatter plot (PC1 vs PC2)
plt.scatter(pca_df.PC1, pca_df.PC2, alpha=0.6)
plt.title('PCA of Wine Dataset')
plt.xlabel(f'PC1 - {per_var[0]}%')
plt.ylabel(f'PC2 - {per_var[1]}%')

# Annotate points with sample names (optional)
#for sample in pca_df.index:
#    plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]))

plt.show()
[Figure: scree plot of the percentage of explained variance per principal component]
[Figure: scatter plot of PC1 vs PC2 for the Wine samples]
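
The scatter above draws every sample in one color; coloring the points by the class column that was dropped before the PCA makes the three wine classes visible. A sketch, assuming df, pca_df, and per_var from the cell above:

In [ ]:
import matplotlib.pyplot as plt

# Reuse the PCA coordinates, but color each point by its class label.
plt.scatter(pca_df.PC1, pca_df.PC2, c=df[df.columns[0]], alpha=0.6)
plt.xlabel(f'PC1 - {per_var[0]}%')
plt.ylabel(f'PC2 - {per_var[1]}%')
plt.title('PCA of Wine Dataset, colored by class')
plt.colorbar(label='class')
plt.show()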

Then I tried Neil's sample code with my dataset.

In [13]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.decomposition

np.set_printoptions(precision=1)
#
# load wine dataset data
#
# class is the label, everything else is the features
y = df['class'].to_numpy()
X = df.drop(columns=['class']).to_numpy()
print(f"Wine data shape (records,features): {X.shape}")

#
# plot vs the first two original features
#
plt.scatter(X[:,0], X[:,1], c=y)
plt.xlabel(df.columns[1])
plt.ylabel(df.columns[2])
plt.title("Wine vs two original features")
plt.colorbar(label="digit")
plt.show()

#
# standardize (zero mean, unit variance) to eliminate dependence on data scaling
#
print(f"data mean: {np.mean(X):.2f}, variance: {np.var(X):.2f}")
X = X-np.mean(X,axis=0)
std = np.std(X,axis=0)
Xscale = X/np.where(std > 0,std,1)
print(f"standardized data mean: {np.mean(Xscale):.2f}, variance: {np.var(Xscale):.2f}")
#
# do a 5-component PCA
#
pca = sklearn.decomposition.PCA(n_components=5)
pca.fit(Xscale)
Xpca = pca.transform(Xscale)
plt.plot(pca.explained_variance_, 'o')
plt.xlabel('PCA component')
plt.ylabel('explained variance')
plt.title('Wine PCA')
plt.show()
#
# plot vs first two PCA components
#
plt.scatter(Xpca[:,0],Xpca[:,1],c=y,s=3)
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.title("MNIST vs two principal components")
plt.colorbar(label="digit")
plt.show()
Wine data shape (records,features): (178, 13)
[Figure: scatter of the first two original features, colored by class]
data mean: 69.13, variance: 46546.42
standardized data mean: -0.00, variance: 1.00
[Figure: explained variance per PCA component]
[Figure: scatter of the first two principal components, colored by class]