
Iris Flower Classification

Student: Sedat Yalcin

Institution: Hisar IdeaLab FabLab, Istanbul

Course: Fab Academy Data Science 2025

Date: December 11, 2025

Instructor: Prof. Neil Gershenfeld



Introduction

Three Iris species: Setosa, Versicolor, Virginica

Sedat Yalcin is co-author of "Yapay Zekaya Giris" (Introduction to AI), a textbook for Turkish high schools. This project revisits the classic Iris dataset through Fab Academy Data Science 2025 with modern data science tools.

Dataset

150 iris flowers, 4 features (sepal and petal length/width in cm), 3 species (Setosa, Versicolor, Virginica) with 50 samples each. Collected by Anderson (1935), analyzed by Fisher (1936).

Approach

Eight-session curriculum (Introduction, Tools, Fitting, ML, Probability, Density, Transforms, Presentations) with 23 visualizations covering supervised learning, clustering, and dimensionality reduction.


In [26]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import (accuracy_score, confusion_matrix, roc_curve, auc,
                             silhouette_score, adjusted_rand_score)
from scipy import stats
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.stats import gaussian_kde
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features, {len(target_names)} species")
Dataset: 150 samples, 4 features, 3 species
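
As a quick sanity check (a minimal sketch reusing the variables loaded above), the class balance can be confirmed directly; the dataset is expected to hold exactly 50 samples per species:

In [ ]:
# Confirm the 50/50/50 class balance
unique, counts = np.unique(y, return_counts=True)
for cls, n in zip(target_names[unique], counts):
    print(f"{cls:12s} {n}")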

1. Data Exploration

Pair plot shows feature relationships and species separation.

In [27]:
df = pd.DataFrame(X, columns=feature_names)
df['species'] = [target_names[i] for i in y]

sns.pairplot(df, hue='species', diag_kind='kde', plot_kws={'alpha': 0.6, 's': 80, 'edgecolor': 'k'})
plt.suptitle('Pair Plot', y=1.01, fontsize=14, fontweight='bold')
plt.savefig('graph_01.png', dpi=150, bbox_inches='tight')
plt.show()

2. Correlation

Petal length and petal width are highly correlated (r = 0.96).

In [28]:
corr = df.iloc[:, :-1].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', square=True, linewidths=1)
plt.title('Correlation Matrix', fontsize=14, fontweight='bold', pad=15)
plt.tight_layout()
plt.savefig('graph_02.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"Max correlation: {corr.loc['petal length (cm)', 'petal width (cm)']:.3f}")
Max correlation: 0.963
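
If a significance check is wanted alongside the coefficient, `scipy.stats.pearsonr` (available through the `scipy` import above) returns both; a small optional sketch:

In [ ]:
# Pearson r with its p-value for the strongest feature pair
r, p = stats.pearsonr(df['petal length (cm)'], df['petal width (cm)'])
print(f"r = {r:.3f}, p = {p:.2e}")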

3. Distributions

Histograms reveal petal measurements as most discriminative.

In [29]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for idx, feature in enumerate(feature_names):
    for i, species in enumerate(target_names):
        axes[idx].hist(X[y == i, idx], bins=20, alpha=0.6, label=species, edgecolor='black')
    axes[idx].set_xlabel(feature, fontsize=11, fontweight='bold')
    axes[idx].set_ylabel('Frequency', fontsize=11)
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Feature Distributions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('graph_03.png', dpi=150, bbox_inches='tight')
plt.show()

4. Box Plots

Setosa clearly separated; Versicolor and Virginica overlap.

In [30]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for idx, feature in enumerate(feature_names):
    data = [X[y == i, idx] for i in range(3)]
    bp = axes[idx].boxplot(data, labels=target_names, patch_artist=True)
    for patch, color in zip(bp['boxes'], sns.color_palette("husl", 3)):
        patch.set_facecolor(color)
    axes[idx].set_ylabel(feature, fontsize=11, fontweight='bold')
    axes[idx].grid(True, alpha=0.3, axis='y')

plt.suptitle('Box Plots', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('graph_04.png', dpi=150, bbox_inches='tight')
plt.show()

5. Violin Plots

Violin plots show the full distribution shape of each feature for each species.

In [31]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for idx, feature in enumerate(feature_names):
    for i, species in enumerate(target_names):
        parts = axes[idx].violinplot([X[y == i, idx]], positions=[i], showmeans=True, showextrema=True)
    axes[idx].set_ylabel(feature, fontsize=11, fontweight='bold')
    axes[idx].set_xticks([0, 1, 2])
    axes[idx].set_xticklabels(target_names)
    axes[idx].grid(True, alpha=0.3, axis='y')

plt.suptitle('Violin Plots', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('graph_05.png', dpi=150, bbox_inches='tight')
plt.show()

6. PCA 2D

Two principal components reduce the 4D feature space to 2D while retaining 97.8% of the variance.

In [32]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(10, 7))
for i, species in enumerate(target_names):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], label=species, s=100, alpha=0.6, edgecolors='k')

plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)', fontsize=12, fontweight='bold')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)', fontsize=12, fontweight='bold')
plt.title('PCA 2D Projection', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('graph_06.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"Variance explained: {sum(pca.explained_variance_ratio_)*100:.1f}%")
Variance explained: 97.8%
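
To see which original features drive each component, the loadings (rows of `pca.components_`) can be tabulated; a brief sketch, with the usual result on this dataset being that petal length dominates PC1:

In [ ]:
# Loadings: weight of each original feature in PC1 and PC2
loadings = pd.DataFrame(pca.components_.T, index=feature_names, columns=['PC1', 'PC2'])
print(loadings.round(3))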

7. PCA 3D

Three components separate Setosa completely, with slight overlap remaining between Versicolor and Virginica.

In [33]:
from mpl_toolkits.mplot3d import Axes3D

pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X)

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

colors = ['red', 'blue', 'green']
for i, (species, color) in enumerate(zip(target_names, colors)):
    ax.scatter(X_pca_3d[y == i, 0], X_pca_3d[y == i, 1], X_pca_3d[y == i, 2],
              c=color, label=species, s=60, alpha=0.6, edgecolors='k')

ax.set_xlabel('PC1', fontsize=11, fontweight='bold')
ax.set_ylabel('PC2', fontsize=11, fontweight='bold')
ax.set_zlabel('PC3', fontsize=11, fontweight='bold')
ax.set_title('PCA 3D', fontsize=14, fontweight='bold', pad=20)
ax.legend()
plt.savefig('graph_07.png', dpi=150, bbox_inches='tight')
plt.show()

8. Variance Explained

The first two components capture 97.8% of the variance.

In [34]:
pca_full = PCA()
pca_full.fit(X)
variance_ratio = pca_full.explained_variance_ratio_
cumulative = np.cumsum(variance_ratio)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.bar(range(1, 5), variance_ratio, color='steelblue', edgecolor='black')
ax1.set_xlabel('Principal Component', fontsize=12, fontweight='bold')
ax1.set_ylabel('Variance Explained', fontsize=12, fontweight='bold')
ax1.set_title('Scree Plot', fontsize=14, fontweight='bold')
ax1.set_xticks(range(1, 5))
ax1.grid(True, alpha=0.3, axis='y')
for i, v in enumerate(variance_ratio):
    ax1.text(i+1, v + 0.01, f'{v*100:.1f}%', ha='center', fontweight='bold')

ax2.plot(range(1, 5), cumulative, 'o-', linewidth=2, markersize=10, color='green')
ax2.set_xlabel('Number of Components', fontsize=12, fontweight='bold')
ax2.set_ylabel('Cumulative Variance', fontsize=12, fontweight='bold')
ax2.set_title('Cumulative Variance', fontsize=14, fontweight='bold')
ax2.set_xticks(range(1, 5))
ax2.axhline(y=0.95, color='red', linestyle='--', alpha=0.5)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('graph_08.png', dpi=150, bbox_inches='tight')
plt.show()

9. Density Estimation

Kernel density estimates overlay smooth density curves on the histograms.

In [35]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for idx, feature in enumerate(feature_names):
    for i, species in enumerate(target_names):
        data = X[y == i, idx]
        axes[idx].hist(data, bins=20, alpha=0.3, density=True, label=species, edgecolor='black')
        data_sorted = np.sort(data)
        kde = gaussian_kde(data)
        axes[idx].plot(data_sorted, kde(data_sorted), linewidth=2)
    axes[idx].set_xlabel(feature, fontsize=11, fontweight='bold')
    axes[idx].set_ylabel('Density', fontsize=11)
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Kernel Density Estimation', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('graph_09.png', dpi=150, bbox_inches='tight')
plt.show()

10. Statistical Tests

One-way ANOVA confirms the species means differ significantly on every feature (p < 0.001).

In [36]:
print("ANOVA Results:")
print("-" * 50)
for idx, feature in enumerate(feature_names):
    groups = [X[y == i, idx] for i in range(3)]
    f_stat, p_val = stats.f_oneway(*groups)
    print(f"{feature:25s} F={f_stat:7.2f}  p={p_val:.2e}")
ANOVA Results:
--------------------------------------------------
sepal length (cm)         F= 119.26  p=1.67e-31
sepal width (cm)          F=  49.16  p=4.49e-17
petal length (cm)         F=1180.16  p=2.86e-91
petal width (cm)          F= 960.01  p=4.17e-85

11. Covariance

The covariance matrix shows how features vary together, in the original units (cm²).

In [37]:
cov_matrix = np.cov(X.T)

plt.figure(figsize=(8, 6))
sns.heatmap(cov_matrix, annot=True, fmt='.2f', cmap='YlOrRd',
           xticklabels=feature_names, yticklabels=feature_names,
           square=True, linewidths=1)
plt.title('Covariance Matrix', fontsize=14, fontweight='bold', pad=15)
plt.tight_layout()
plt.savefig('graph_10.png', dpi=150, bbox_inches='tight')
plt.show()

12. Train-Test Split

The dataset is split 80/20 into training and test sets.

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training: {X_train.shape[0]} samples")
print(f"Test: {X_test.shape[0]} samples")
Training: 120 samples
Test: 30 samples
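
With only 30 test samples, an unstratified split can leave the species unevenly represented. A hedged alternative (not used in the cells below) is to pass `stratify=y` so each class keeps its one-third share:

In [ ]:
# Stratified variant: preserves the 50/50/50 class proportions
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(np.bincount(y_te))  # expected: [10 10 10]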

13. Feature Importance

Random Forest ranks the petal measurements as most important (petal length 0.44, petal width 0.42).

In [39]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_

plt.figure(figsize=(10, 6))
plt.barh(feature_names, importances, color='steelblue', edgecolor='black')
plt.xlabel('Importance', fontsize=12, fontweight='bold')
plt.title('Feature Importance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.savefig('graph_11.png', dpi=150, bbox_inches='tight')
plt.show()

for name, imp in zip(feature_names, importances):
    print(f"{name:25s} {imp:.3f}")
sepal length (cm)         0.108
sepal width (cm)          0.030
petal length (cm)         0.440
petal width (cm)          0.422

14. Algorithm Comparison

Eight classifiers tested; every one reaches 100% accuracy on this 30-sample test split, so the hold-out score alone cannot rank them.

In [40]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'LDA': LinearDiscriminantAnalysis(),
    'k-NN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=4),
    'Naive Bayes': GaussianNB(),
    'SVM': SVC(kernel='rbf'),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100)
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    results.append({'Algorithm': name, 'Accuracy': score})

df_results = pd.DataFrame(results).sort_values('Accuracy', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(df_results['Algorithm'], df_results['Accuracy'], color='steelblue', edgecolor='black')
plt.xlabel('Accuracy', fontsize=12, fontweight='bold')
plt.title('Algorithm Comparison', fontsize=14, fontweight='bold')
plt.xlim([0.9, 1.0])
for i, v in enumerate(df_results['Accuracy']):
    plt.text(v + 0.002, i, f'{v:.1%}', va='center', fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.savefig('graph_12.png', dpi=150, bbox_inches='tight')
plt.show()

print(df_results.to_string(index=False))
          Algorithm  Accuracy
Logistic Regression       1.0
                LDA       1.0
               k-NN       1.0
      Decision Tree       1.0
        Naive Bayes       1.0
                SVM       1.0
      Random Forest       1.0
  Gradient Boosting       1.0
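
Since every model saturates this particular split, a cross-validated comparison gives a less noisy ordering; a sketch reusing the `models` dict above:

In [ ]:
# 10-fold CV over the full dataset distinguishes the models better
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name:20s} {scores.mean():.3f} +/- {scores.std():.3f}")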

15. k-NN Tuning

Accuracy swept over k = 1 to 20; several values of k tie at 100% on this split, and argmax simply returns the first (k = 1).

In [41]:
k_range = range(1, 21)
train_scores = []
test_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

plt.figure(figsize=(10, 6))
plt.plot(k_range, train_scores, 'o-', label='Training', linewidth=2, markersize=8)
plt.plot(k_range, test_scores, 's-', label='Test', linewidth=2, markersize=8)
plt.xlabel('k (neighbors)', fontsize=12, fontweight='bold')
plt.ylabel('Accuracy', fontsize=12, fontweight='bold')
plt.title('k-NN: Effect of k', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('graph_13.png', dpi=150, bbox_inches='tight')
plt.show()

best_k = k_range[np.argmax(test_scores)]
print(f"Best k={best_k}, Accuracy={max(test_scores):.1%}")
Best k=1, Accuracy=100.0%
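
A best k of 1 on a single split is fragile (1-NN memorizes the training set), so selecting k by cross-validation is safer; a minimal sketch:

In [ ]:
# Choose k by 10-fold CV rather than one hold-out split
cv_means = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
            for k in k_range]
print(f"Best k by CV: {k_range[int(np.argmax(cv_means))]} (mean accuracy {max(cv_means):.3f})")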

16. Confusion Matrix

k-NN (k = 5) classifies all 30 test samples correctly.

In [42]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
           xticklabels=target_names, yticklabels=target_names,
           square=True, linewidths=1)
plt.xlabel('Predicted', fontsize=12, fontweight='bold')
plt.ylabel('Actual', fontsize=12, fontweight='bold')
plt.title('Confusion Matrix', fontsize=14, fontweight='bold', pad=15)
plt.tight_layout()
plt.savefig('graph_14.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"Accuracy: {accuracy_score(y_test, y_pred):.1%}")
Accuracy: 100.0%

17. ROC Curves

Multi-class ROC-AUC scores show excellent discrimination (AUC > 0.99).

In [43]:
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
y_score = knn.predict_proba(X_test)

fpr, tpr, roc_auc = {}, {}, {}
for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

plt.figure(figsize=(10, 7))
colors = ['blue', 'red', 'green']
for i, color in zip(range(3), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2,
            label=f'{target_names[i]} (AUC = {roc_auc[i]:.3f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
plt.title('ROC Curves', fontsize=14, fontweight='bold')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.savefig('graph_15.png', dpi=150, bbox_inches='tight')
plt.show()

18. Decision Boundaries

k-NN decision regions visualized in 2D PCA space.

In [44]:
pca_viz = PCA(n_components=2)
X_viz = pca_viz.fit_transform(X)
knn_viz = KNeighborsClassifier(n_neighbors=5)
knn_viz.fit(X_viz, y)

h = 0.02
x_min, x_max = X_viz[:, 0].min() - 1, X_viz[:, 0].max() + 1
y_min, y_max = X_viz[:, 1].min() - 1, X_viz[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = knn_viz.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 7))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
for i, species in enumerate(target_names):
    plt.scatter(X_viz[y == i, 0], X_viz[y == i, 1], label=species, s=100, alpha=0.8, edgecolors='k')
plt.xlabel('PC1', fontsize=12, fontweight='bold')
plt.ylabel('PC2', fontsize=12, fontweight='bold')
plt.title('Decision Boundaries', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('graph_16.png', dpi=150, bbox_inches='tight')
plt.show()

19. Learning Curves

Performance vs training set size.

In [45]:
train_sizes, train_scores, test_scores = learning_curve(
    knn, X, y, cv=5, n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, 'o-', label='Training', linewidth=2)
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2)
plt.plot(train_sizes, test_mean, 's-', label='Cross-validation', linewidth=2)
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.2)
plt.xlabel('Training Set Size', fontsize=12, fontweight='bold')
plt.ylabel('Accuracy', fontsize=12, fontweight='bold')
plt.title('Learning Curves', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('graph_17.png', dpi=150, bbox_inches='tight')
plt.show()

20. Overfitting

Training vs. test accuracy across tree depths; a growing train-test gap at larger depths indicates overfitting (reference line at depth 4).

In [46]:
depths = range(1, 21)
train_acc, test_acc = [], []

for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    train_acc.append(dt.score(X_train, y_train))
    test_acc.append(dt.score(X_test, y_test))

plt.figure(figsize=(10, 6))
plt.plot(depths, train_acc, 'o-', label='Training', linewidth=2, markersize=8)
plt.plot(depths, test_acc, 's-', label='Test', linewidth=2, markersize=8)
plt.xlabel('Tree Depth', fontsize=12, fontweight='bold')
plt.ylabel('Accuracy', fontsize=12, fontweight='bold')
plt.title('Overfitting Analysis', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axvline(x=4, color='red', linestyle='--', alpha=0.5)
plt.savefig('graph_18.png', dpi=150, bbox_inches='tight')
plt.show()

21. Cross-Validation

10-fold CV confirms model stability (mean = 96.7%, std = 0.045).

In [47]:
cv_scores = cross_val_score(knn, X, y, cv=10)

print(f"CV Scores: {cv_scores}")
print(f"Mean: {cv_scores.mean():.1%}")
print(f"Std: {cv_scores.std():.3f}")

plt.figure(figsize=(8, 6))
plt.boxplot([cv_scores], labels=['k-NN'])
plt.ylabel('Accuracy', fontsize=12, fontweight='bold')
plt.title('Cross-Validation Stability', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.savefig('graph_19.png', dpi=150, bbox_inches='tight')
plt.show()
CV Scores: [1.         0.93333333 1.         1.         0.86666667 0.93333333
 0.93333333 1.         1.         1.        ]
Mean: 96.7%
Std: 0.045

22. K-Means

Unsupervised K-Means with k = 3 on standardized features largely recovers the species (ARI = 0.62).

In [48]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)

ari = adjusted_rand_score(y, labels)
sil = silhouette_score(X_scaled, labels)

pca_2d = PCA(n_components=2)
X_pca = pca_2d.fit_transform(X_scaled)
centroids = pca_2d.transform(kmeans.cluster_centers_)

plt.figure(figsize=(10, 7))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=100, alpha=0.6, edgecolors='k')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=300, edgecolors='black', linewidths=2)
plt.xlabel('PC1', fontsize=12, fontweight='bold')
plt.ylabel('PC2', fontsize=12, fontweight='bold')
plt.title(f'K-Means (ARI={ari:.3f})', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.savefig('graph_20.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"ARI: {ari:.3f}, Silhouette: {sil:.3f}")
ARI: 0.620, Silhouette: 0.460
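
Here k = 3 is assumed from the known labels; in a genuinely unsupervised setting the cluster count would be chosen from the data. A minimal silhouette sweep (scanning k = 2 to 6 as an assumption) shows how; on standardized Iris data the silhouette is often reported to peak at k = 2, because Versicolor and Virginica overlap:

In [ ]:
# Silhouette score for candidate cluster counts
for k in range(2, 7):
    lbl = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(f"k={k}: silhouette = {silhouette_score(X_scaled, lbl):.3f}")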

23. Hierarchical Clustering

Dendrogram reveals tree structure with three main branches.

In [49]:
linkage_matrix = linkage(X_scaled, method='ward')

plt.figure(figsize=(14, 6))
dendrogram(linkage_matrix, labels=y, leaf_font_size=8)
plt.xlabel('Sample Index', fontsize=12, fontweight='bold')
plt.ylabel('Distance', fontsize=12, fontweight='bold')
plt.title('Hierarchical Clustering', fontsize=14, fontweight='bold', pad=20)
plt.axhline(y=7.5, color='red', linestyle='--', linewidth=2)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.savefig('graph_21.png', dpi=150, bbox_inches='tight')
plt.show()

hierarchical = AgglomerativeClustering(n_clusters=3)
hier_labels = hierarchical.fit_predict(X_scaled)
print(f"Hierarchical ARI: {adjusted_rand_score(y, hier_labels):.3f}")
Hierarchical ARI: 0.615

Conclusions

Best Model: k-NN (k = 5) scored 100% on the held-out test split and 96.7% mean accuracy under 10-fold CV
Key Insight: Simple models match complex ones on small, well-separated data (150 samples)
Most Discriminative: Petal length and petal width (r = 0.96)
Validation: Unsupervised clustering largely recovered the supervised labels (K-Means ARI = 0.62, hierarchical ARI = 0.615)

This classic dataset demonstrates that careful feature selection and appropriate model complexity matter more than algorithmic sophistication.

References

  1. Yalcin, S. et al. "Yapay Zekaya Giris" (Introduction to AI), Ch. 6.37 (Decision Trees).
  2. Brownlee, J. "Machine Learning Mastery", Ch. 19.
  3. Brownlee, J. "Deep Learning with Python", Ch. 10.
  4. Scikit-learn documentation.
  5. Anderson, E. (1935). "Irises of the Gaspe Peninsula."
  6. Fisher, R.A. (1936). "The use of multiple measurements in taxonomic problems."

Dataset: UCI ML Repository
