[Your-Name-Here] - Fab Futures - Data Science
Home About
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
import warnings
warnings.filterwarnings('ignore')

# Load the data
df = pd.read_csv('players_21.csv')

print(f"Dataset shape: {df.shape}")
print(f"Columns: {len(df.columns)}")
print(f"\nFirst few rows:")
print(df.head(3))
Dataset shape: (18944, 106)
Columns: 106

First few rows:
   sofifa_id                                         player_url  \
0     158023  https://sofifa.com/player/158023/lionel-messi/...   
1      20801  https://sofifa.com/player/20801/c-ronaldo-dos-...   
2     200389  https://sofifa.com/player/200389/jan-oblak/210002   

          short_name                            long_name  age         dob  \
0           L. Messi       Lionel Andrés Messi Cuccittini   33  1987-06-24   
1  Cristiano Ronaldo  Cristiano Ronaldo dos Santos Aveiro   35  1985-02-05   
2           J. Oblak                            Jan Oblak   27  1993-01-07   

   height_cm  weight_kg nationality        club_name  ...   lwb   ldm   cdm  \
0        170         72   Argentina     FC Barcelona  ...  66+3  65+3  65+3   
1        187         83    Portugal         Juventus  ...  65+3  61+3  61+3   
2        188         87    Slovenia  Atlético Madrid  ...  32+3  36+3  36+3   

    rdm   rwb    lb   lcb    cb   rcb    rb  
0  65+3  66+3  62+3  52+3  52+3  52+3  62+3  
1  61+3  65+3  61+3  54+3  54+3  54+3  61+3  
2  36+3  32+3  32+3  33+3  33+3  33+3  32+3  

[3 rows x 106 columns]

What This Code Does:

  1. Imports Tools (Like getting your toolbox ready) pandas, numpy: Data handling tools (Excel-like operations)

matplotlib, seaborn: Drawing/visualization tools

sklearn: Machine learning tools (prediction algorithms)

warnings: Hides annoying warning messages

  1. Loads the Data (Like opening a spreadsheet) Reads a CSV file named players_21.csv from the datasets folder

Creates a data table called df (stands for "dataframe")

  1. Shows Basic Information Tells you how big the dataset is (rows × columns)

Counts how many columns there are

Shows the first 3 rows of data so you can see what it looks like

In [4]:
# Check for missing values
print("Missing values per column (top 20):")
missing_values = df.isnull().sum().sort_values(ascending=False)
print(missing_values[missing_values > 0].head(20))

# Basic statistics
print("\nBasic statistics of key numerical columns:")
key_columns = ['overall', 'potential', 'value_eur', 'wage_eur', 'age', 'height_cm', 'weight_kg']
print(df[key_columns].describe())

# Data types
print("\nData types:")
print(df.dtypes.value_counts())
Missing values per column (top 20):
defending_marking       18944
loaned_from             18186
nation_jersey_number    17817
nation_position         17817
player_tags             17536
gk_speed                16861
gk_kicking              16861
gk_handling             16861
gk_diving               16861
gk_positioning          16861
gk_reflexes             16861
player_traits           10629
physic                   2083
defending                2083
dribbling                2083
passing                  2083
shooting                 2083
pace                     2083
release_clause_eur        995
joined                    983
dtype: int64

Basic statistics of key numerical columns:
            overall     potential     value_eur       wage_eur           age  \
count  18944.000000  18944.000000  1.894400e+04   18944.000000  18944.000000   
mean      65.677787     71.086729  2.224813e+06    8675.852513     25.225823   
std        7.002278      6.109985  5.102486e+06   19654.774894      4.697354   
min       47.000000     47.000000  0.000000e+00       0.000000     16.000000   
25%       61.000000     67.000000  3.000000e+05    1000.000000     21.000000   
50%       66.000000     71.000000  6.500000e+05    3000.000000     25.000000   
75%       70.000000     75.000000  1.800000e+06    7000.000000     29.000000   
max       93.000000     95.000000  1.055000e+08  560000.000000     53.000000   

          height_cm     weight_kg  
count  18944.000000  18944.000000  
mean     181.190773     75.016892  
std        6.825672      7.057140  
min      155.000000     50.000000  
25%      176.000000     70.000000  
50%      181.000000     75.000000  
75%      186.000000     80.000000  
max      206.000000    110.000000  

Data types:
int64      44
object     44
float64    18
Name: count, dtype: int64

Simple Summary: Missing values check: "What information is missing about our players?"

Basic statistics: "What are the typical values for player ratings, age, salary?"

Data types: "What kind of information do we have in our spreadsheet?"

Real-world analogy: Imagine you're managing a football team and you get a report card that tells you:

What player information you're missing (like incomplete medical records)

Average player stats (typical age, salary, rating)

What kind of information you have (numbers vs text)

In [5]:
# Numerical features (player attributes)
numerical_features = [
    'overall', 'potential', 'value_eur', 'wage_eur', 'age',
    'height_cm', 'weight_kg', 'international_reputation',
    'weak_foot', 'skill_moves',
    'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic',
    'attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy',
    'attacking_short_passing', 'attacking_volleys', 'skill_dribbling',
    'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
    'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed',
    'movement_agility', 'movement_reactions', 'movement_balance',
    'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
    'power_long_shots', 'mentality_aggression', 'mentality_interceptions',
    'mentality_positioning', 'mentality_vision', 'mentality_penalties',
    'mentality_composure', 'defending_marking', 'defending_standing_tackle',
    'defending_sliding_tackle'
]

# Check which features exist in our dataset
existing_features = [col for col in numerical_features if col in df.columns]
print(f"\nFound {len(existing_features)} out of {len(numerical_features)} numerical features")

# Create simplified position classification
def extract_primary_position(pos_string):
    if isinstance(pos_string, str):
        return pos_string.split(',')[0].strip()
    return 'Unknown'

df['primary_position'] = df['player_positions'].apply(extract_primary_position)

# Create position groups
def group_positions(pos):
    if pd.isna(pos):
        return 'Other'
    pos = str(pos)
    if 'GK' in pos:
        return 'Goalkeeper'
    elif any(x in pos for x in ['CB', 'RB', 'LB', 'LWB', 'RWB']):
        return 'Defender'
    elif any(x in pos for x in ['CDM', 'CM', 'CAM', 'LM', 'RM']):
        return 'Midfielder'
    elif any(x in pos for x in ['LW', 'RW', 'ST', 'CF']):
        return 'Forward'
    else:
        return 'Other'

df['position_group'] = df['primary_position'].apply(group_positions)

print("\nPosition groups distribution:")
print(df['position_group'].value_counts())
Found 45 out of 45 numerical features

Position groups distribution:
position_group
Midfielder    7037
Defender      6205
Forward       3618
Goalkeeper    2084
Name: count, dtype: int64

Feature Engineering & Selection

What This Code is Doing (Simple Version): PART 1: Making a Shopping List of Player Stats python numerical_features = [...] Think of this like:

Making a shopping list of 45 different player statistics we want to analyze

These are all number-based stats (not names or teams)

Examples from the game:

Overall rating: 93 (Messi)

Pace: 85 (how fast)

Shooting: 92 (how good at scoring)

Passing: 91 (how good at passing)

Age: 33

Value: €67,500,000

Wage: €560,000 per week

PART 2: Checking What's Actually in Our Bag python existing_features = [col for col in numerical_features if col in df.columns] Think of this like:

We have a shopping list of 45 items

We look in our FIFA data "bag" to see which items we actually have

Example: "Oh, we have 40 out of the 45 stats on our list!"

Some might be missing from our specific FIFA file

PART 3: Simplifying Player Positions (STEP 1) python def extract_primary_position(pos_string): return pos_string.split(',')[0].strip() What it does:

Some players have multiple positions: "RW, ST, CF" (Messi)

This takes only their main position (first one)

Example: "RW, ST, CF" → "RW"

"CAM, CM" → "CAM"

PART 4: Grouping Positions (STEP 2 - Making it Simpler) python def group_positions(pos): What it does:

Takes those detailed positions and puts them into 4 simple groups:

Detailed Position Simple Group GK Goalkeeper CB, RB, LB Defender CDM, CM, CAM Midfielder LW, RW, ST, CF Forward Real Examples:

Messi: "RW" → "Forward"

Ramos: "CB" → "Defender"

De Bruyne: "CAM" → "Midfielder"

Neuer: "GK" → "Goalkeeper"

PART 5: Showing the Results python print(df['position_group'].value_counts()) What this shows:

text Position groups distribution: Midfielder: 500 players Forward: 400 players
Defender: 350 players Goalkeeper: 150 players Other: 50 players

. Machine Learning Models 3.1 Position Classification Model

In [7]:
# SIMPLER VERSION - Handle NaN values properly
print("=== SIMPLIFIED DATA PREPARATION ===")

# Select only the most important features to avoid NaN issues
simple_features = [
    'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic',
    'attacking_finishing', 'skill_dribbling', 'movement_reactions',
    'power_strength', 'age', 'height_cm', 'weight_kg'
]

# Filter out goalkeepers
field_players = df[df['position_group'] != 'Goalkeeper'].copy()

# Select features and target
X_simple = field_players[simple_features]
y_simple = field_players['position_group']

# Fill NaN with median
X_simple_filled = X_simple.fillna(X_simple.median())

# Remove any rows where target is NaN
valid_indices = y_simple.notnull()
X_simple_filled = X_simple_filled[valid_indices]
y_simple = y_simple[valid_indices]

# Encode target
le_simple = LabelEncoder()
y_simple_encoded = le_simple.fit_transform(y_simple)

print(f"Features: {simple_features}")
print(f"Number of samples: {len(X_simple_filled)}")
print(f"Classes: {le_simple.classes_}")

# Split data
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_simple_filled, y_simple_encoded, test_size=0.2, random_state=42, stratify=y_simple_encoded
)

# Scale features
scaler_simple = StandardScaler()
X_train_scaled_s = scaler_simple.fit_transform(X_train_s)
X_test_scaled_s = scaler_simple.transform(X_test_s)

# Train a simple model first
print("\nTraining a simple Random Forest model...")
simple_model = RandomForestClassifier(n_estimators=100, random_state=42)
simple_model.fit(X_train_scaled_s, y_train_s)
y_pred_s = simple_model.predict(X_test_scaled_s)
accuracy = accuracy_score(y_test_s, y_pred_s)

print(f"Simple model accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test_s, y_pred_s, target_names=le_simple.classes_))

# Try Logistic Regression with error handling
print("\nTrying Logistic Regression...")
try:
    logreg = LogisticRegression(max_iter=1000, random_state=42)
    logreg.fit(X_train_scaled_s, y_train_s)
    y_pred_log = logreg.predict(X_test_scaled_s)
    accuracy_log = accuracy_score(y_test_s, y_pred_log)
    print(f"Logistic Regression accuracy: {accuracy_log:.4f}")
except Exception as e:
    print(f"Logistic Regression failed: {e}")
    print("Trying with fewer features...")
    
    # Try with even fewer features
    minimal_features = ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']
    X_minimal = field_players[minimal_features].fillna(field_players[minimal_features].median())
    y_minimal = field_players['position_group']
    
    # Remove NaN in target
    valid_idx = y_minimal.notnull()
    X_minimal = X_minimal[valid_idx]
    y_minimal = y_minimal[valid_idx]
    
    y_minimal_encoded = LabelEncoder().fit_transform(y_minimal)
    
    X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
        X_minimal, y_minimal_encoded, test_size=0.2, random_state=42, stratify=y_minimal_encoded
    )
    
    scaler_m = StandardScaler()
    X_train_m_scaled = scaler_m.fit_transform(X_train_m)
    X_test_m_scaled = scaler_m.transform(X_test_m)
    
    logreg_simple = LogisticRegression(max_iter=1000, random_state=42)
    logreg_simple.fit(X_train_m_scaled, y_train_m)
    y_pred_m = logreg_simple.predict(X_test_m_scaled)
    accuracy_m = accuracy_score(y_test_m, y_pred_m)
    
    print(f"Minimal Logistic Regression accuracy: {accuracy_m:.4f}")
=== SIMPLIFIED DATA PREPARATION ===
Features: ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'attacking_finishing', 'skill_dribbling', 'movement_reactions', 'power_strength', 'age', 'height_cm', 'weight_kg']
Number of samples: 16860
Classes: ['Defender' 'Forward' 'Midfielder']

Training a simple Random Forest model...
Simple model accuracy: 0.8363

Classification Report:
              precision    recall  f1-score   support

    Defender       0.87      0.89      0.88      1241
     Forward       0.85      0.80      0.82       724
  Midfielder       0.80      0.81      0.81      1407

    accuracy                           0.84      3372
   macro avg       0.84      0.83      0.84      3372
weighted avg       0.84      0.84      0.84      3372


Trying Logistic Regression...
Logistic Regression accuracy: 0.8363

Simple Explanation of This Code: What We're Doing: Building a simpler "Position Guessing Game" that won't crash if we have missing data.

Step-by-Step Breakdown: STEP 1: Picking Only the Most Important Stats python simple_features = [ 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'attacking_finishing', 'skill_dribbling', 'movement_reactions', 'power_strength', 'age', 'height_cm', 'weight_kg' ] Instead of 45 stats, we use only 13 key ones

These are stats that every player should have

Example stats everyone knows:

Pace (speed)

Shooting

Passing

Age, Height, Weight

STEP 2: Removing Goalkeepers python field_players = df[df['position_group'] != 'Goalkeeper'].copy() Goalkeepers have completely different stats

We focus only on field players (defenders, midfielders, forwards)

STEP 3: Filling Empty Cells python X_simple_filled = X_simple.fillna(X_simple.median()) If a stat is missing, give it the average value

Example: If a player is missing "pace" rating, give them the average pace of all players

STEP 4: Removing Players with Missing Position python valid_indices = y_simple.notnull() X_simple_filled = X_simple_filled[valid_indices] y_simple = y_simple[valid_indices] Remove players who don't have a position listed

Can't train if we don't know what we're predicting!

STEP 5: Converting Positions to Numbers python le_simple = LabelEncoder() y_simple_encoded = le_simple.fit_transform(y_simple) Computers work with numbers, not text

Convert:

"Defender" → 0

"Midfielder" → 1

"Forward" → 2

STEP 6: Splitting Data python X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(...) 80% for teaching (the computer learns from this)

20% for testing (we test the computer on this)

STEP 7: Scaling the Numbers python scaler_simple = StandardScaler() X_train_scaled_s = scaler_simple.fit_transform(X_train_s) Put all stats on the same scale (0-100 becomes similar ranges)

Like converting feet to meters so everything is comparable

STEP 8: Training the First Model (Random Forest) python simple_model = RandomForestClassifier(n_estimators=100, random_state=42) simple_model.fit(X_train_scaled_s, y_train_s) "Random Forest": Like asking 100 people to vote on the position

Each "person" looks at different stats

Majority vote wins!

STEP 9: Testing and Showing Results python accuracy = accuracy_score(y_test_s, y_pred_s) print(f"Simple model accuracy: {accuracy:.4f}") Test on the 20% we saved

Show accuracy percentage: "How many did we get right?"

STEP 10: Trying a Second Model (Logistic Regression) python try: logreg = LogisticRegression(max_iter=1000, random_state=42) # ... train and test ... except Exception as e: print(f"Logistic Regression failed: {e}") "Logistic Regression": Draws lines between position groups

Try-Catch: "If this fails, don't crash - just tell me why"

STEP 11: If Second Model Fails, Try Even Simpler python minimal_features = ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic'] If still having problems, use only 6 basic stats

These are the core FIFA stats everyone understands

Player Value Prediction

In [9]:
print("\n" + "="*60)
print("PLAYER VALUE PREDICTION")
print("="*60)

# Simple feature selection for value prediction
value_features = [
    'overall', 'potential', 'age', 'international_reputation',
    'weak_foot', 'skill_moves', 'pace', 'shooting', 'passing', 
    'dribbling', 'defending', 'physic'
]

# Only use features that exist in our dataset
existing_value_features = [f for f in value_features if f in df.columns]
print(f"Using {len(existing_value_features)} features for value prediction:")
print(existing_value_features)

# Remove players with extreme values for better modeling
df_value = df[(df['value_eur'] > 0) & (df['value_eur'] < 200000000)].copy()
print(f"\nPlayers for value prediction: {len(df_value)}")
print(f"Value range: €{df_value['value_eur'].min():,.0f} to €{df_value['value_eur'].max():,.0f}")

# Prepare features and target
X_value = df_value[existing_value_features].fillna(df_value[existing_value_features].median())
y_value = df_value['value_eur']

# Check target distribution
print(f"\nValue Statistics:")
print(f"Average value: €{y_value.mean():,.0f}")
print(f"Median value: €{y_value.median():,.0f}")
print(f"Most expensive: €{y_value.max():,.0f}")

# Log transform for better prediction (value has extreme ranges)
y_value_log = np.log1p(y_value)  # log(value + 1)

# Split data
X_train_v, X_test_v, y_train_v, y_test_v = train_test_split(
    X_value, y_value_log, test_size=0.2, random_state=42
)

print(f"\nTraining set: {X_train_v.shape[0]} players")
print(f"Test set: {X_test_v.shape[0]} players")

# Scale features
scaler_v = StandardScaler()
X_train_v_scaled = scaler_v.fit_transform(X_train_v)
X_test_v_scaled = scaler_v.transform(X_test_v)

# Train value prediction models
value_models = {
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'Linear Regression': LinearRegression()
}

print("\n" + "="*60)
print("VALUE PREDICTION RESULTS")
print("="*60)

for name, model in value_models.items():
    model.fit(X_train_v_scaled, y_train_v)
    y_pred_v = model.predict(X_test_v_scaled)
    
    # Convert back from log scale to actual euros
    y_test_actual = np.expm1(y_test_v)
    y_pred_actual = np.expm1(y_pred_v)
    
    rmse = np.sqrt(mean_squared_error(y_test_actual, y_pred_actual))
    r2 = r2_score(y_test_actual, y_pred_actual)
    
    print(f"\n{name}:")
    print(f"  RMSE: €{rmse:,.0f}")
    print(f"  R² Score: {r2:.4f}")
    
    # Calculate percentage error
    mape = np.mean(np.abs((y_test_actual - y_pred_actual) / y_test_actual)) * 100
    print(f"  Average Error: {mape:.1f}%")

# Feature importance
rf_value = RandomForestRegressor(n_estimators=100, random_state=42)
rf_value.fit(X_train_v_scaled, y_train_v)

value_feature_importance = pd.DataFrame({
    'feature': existing_value_features,
    'importance': rf_value.feature_importances_
}).sort_values('importance', ascending=False)

print("\n" + "="*60)
print("MOST IMPORTANT FEATURES FOR PLAYER VALUE")
print("="*60)
print(value_feature_importance.head(10))

# Show example predictions
print("\n" + "="*60)
print("EXAMPLE PREDICTIONS")
print("="*60)

# Get some example players
sample_indices = X_test_v.index[:5]  # First 5 test players
for idx in sample_indices:
    player = df.loc[idx]
    # Find their actual prediction
    test_idx = X_test_v.index.get_loc(idx)
    actual_value = np.expm1(y_test_v.iloc[test_idx])
    predicted_value = np.expm1(y_pred_v[test_idx])
    
    print(f"\n{player['short_name']} ({player['age']} yrs, {player['position_group']}):")
    print(f"  Actual Value: €{actual_value:,.0f}")
    print(f"  Predicted:    €{predicted_value:,.0f}")
    print(f"  Difference:   €{abs(actual_value - predicted_value):,.0f}")
============================================================
PLAYER VALUE PREDICTION
============================================================
Using 12 features for value prediction:
['overall', 'potential', 'age', 'international_reputation', 'weak_foot', 'skill_moves', 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']

Players for value prediction: 18707
Value range: €5,000 to €105,500,000

Value Statistics:
Average value: €2,253,000
Median value: €650,000
Most expensive: €105,500,000

Training set: 14965 players
Test set: 3742 players

============================================================
VALUE PREDICTION RESULTS
============================================================

Random Forest:
  RMSE: €448,025
  R² Score: 0.9926
  Average Error: 2.7%

Gradient Boosting:
  RMSE: €516,349
  R² Score: 0.9902
  Average Error: 5.5%

Linear Regression:
  RMSE: €2,312,722
  R² Score: 0.8032
  Average Error: 17.4%

============================================================
MOST IMPORTANT FEATURES FOR PLAYER VALUE
============================================================
        feature  importance
0       overall    0.883275
1     potential    0.070703
2           age    0.039136
5   skill_moves    0.002102
7      shooting    0.001863
10    defending    0.001470
9     dribbling    0.000493
8       passing    0.000312
6          pace    0.000256
11       physic    0.000252

============================================================
EXAMPLE PREDICTIONS
============================================================

Jordi Alba (31 yrs, Defender):
  Actual Value: €32,000,000
  Predicted:    €35,640,848
  Difference:   €3,640,848

A. Albu (26 yrs, Defender):
  Actual Value: €725,000
  Predicted:    €744,245
  Difference:   €19,245

R. Krunić (26 yrs, Midfielder):
  Actual Value: €4,600,000
  Predicted:    €3,589,912
  Difference:   €1,010,088

M. Malenica (26 yrs, Goalkeeper):
  Actual Value: €300,000
  Predicted:    €285,330
  Difference:   €14,670

U. Segura (27 yrs, Midfielder):
  Actual Value: €950,000
  Predicted:    €1,138,357
  Difference:   €188,357

Young Talent Identification

In [10]:
# Identify promising young players (high potential relative to age and current rating)
df['potential_growth'] = df['potential'] - df['overall']
df['age_potential_ratio'] = df['potential'] / df['age']

# Young players (under 23) with high potential
young_players = df[df['age'] <= 23].copy()

print(f"\nYoung players (age <= 23): {len(young_players)}")

# Calculate a talent score
young_players['talent_score'] = (
    young_players['potential'] * 0.4 +
    young_players['potential_growth'] * 0.3 +
    (young_players['value_eur'].rank(pct=True) * 100) * 0.3
)

# Top young talents
top_talents = young_players.sort_values('talent_score', ascending=False).head(20)
print("\nTop 20 Young Talents (Age <= 23):")
print(top_talents[['short_name', 'age', 'overall', 'potential', 'club_name', 'value_eur', 'talent_score']].to_string())

# Build a model to predict if a young player will be world-class (overall >= 85)
young_players['world_class'] = (young_players['potential'] >= 85).astype(int)

print(f"\nWorld-class young players: {young_players['world_class'].sum()} out of {len(young_players)}")

# Features for world-class prediction
wc_features = [f for f in existing_features if f not in ['overall', 'potential', 'value_eur', 'wage_eur']]

X_wc = young_players[wc_features].fillna(young_players[wc_features].median())
y_wc = young_players['world_class']

print(f"\nWorld-class prediction features: {len(wc_features)}")

# Handle class imbalance
try:
    from imblearn.over_sampling import SMOTE
    
    X_train_wc, X_test_wc, y_train_wc, y_test_wc = train_test_split(
        X_wc, y_wc, test_size=0.2, random_state=42, stratify=y_wc
    )

    # Check if SMOTE is needed
    print(f"\nClass distribution in training set: {pd.Series(y_train_wc).value_counts().to_dict()}")

    # Apply SMOTE to handle class imbalance
    smote = SMOTE(random_state=42)
    X_train_wc_smote, y_train_wc_smote = smote.fit_resample(X_train_wc, y_train_wc)

    scaler_wc = StandardScaler()
    X_train_wc_scaled = scaler_wc.fit_transform(X_train_wc_smote)
    X_test_wc_scaled = scaler_wc.transform(X_test_wc)

    # Train classifier
    wc_model = RandomForestClassifier(n_estimators=100, random_state=42)
    wc_model.fit(X_train_wc_scaled, y_train_wc_smote)
    y_pred_wc = wc_model.predict(X_test_wc_scaled)

    print("\nWorld-Class Potential Prediction (Young Players):")
    print(f"Accuracy: {accuracy_score(y_test_wc, y_pred_wc):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test_wc, y_pred_wc))

    # Feature importance
    wc_feature_importance = pd.DataFrame({
        'feature': wc_features,
        'importance': wc_model.feature_importances_
    }).sort_values('importance', ascending=False)

    print("\nTop Features for World-Class Potential Prediction:")
    print(wc_feature_importance.head(10))

except ImportError:
    print("\nNote: imblearn not installed. Skipping SMOTE balancing.")
    print("You can install it with: pip install imbalanced-learn")
    
    # Simple train-test split without SMOTE
    X_train_wc, X_test_wc, y_train_wc, y_test_wc = train_test_split(
        X_wc, y_wc, test_size=0.2, random_state=42, stratify=y_wc
    )
    
    scaler_wc = StandardScaler()
    X_train_wc_scaled = scaler_wc.fit_transform(X_train_wc)
    X_test_wc_scaled = scaler_wc.transform(X_test_wc)
    
    # Train classifier
    wc_model = RandomForestClassifier(n_estimators=100, random_state=42)
    wc_model.fit(X_train_wc_scaled, y_train_wc)
    y_pred_wc = wc_model.predict(X_test_wc_scaled)
    
    print("\nWorld-Class Potential Prediction (without SMOTE):")
    print(f"Accuracy: {accuracy_score(y_test_wc, y_pred_wc):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test_wc, y_pred_wc))
Young players (age <= 23): 7720

Top 20 Young Talents (Age <= 23):
         short_name  age  overall  potential               club_name  value_eur  talent_score
366    Vinícius Jr.   19       80         93             Real Madrid   27500000     71.000907
272      João Félix   20       81         93         Atlético Madrid   32000000     70.722280
898       S. Tonali   20       77         91                   Milan   18500000     70.347409
683         Trincão   20       78         91            FC Barcelona   20000000     70.084326
1177      Ansu Fati   17       76         90            FC Barcelona   15000000     69.797798
62       K. Havertz   21       85         93                 Chelsea   57000000     69.588342
6         K. Mbappé   21       90         95     Paris Saint-Germain  105500000     69.500000
1498        T. Kubo   19       75         89           Villarreal CF   14500000     69.374482
99       E. Haaland   19       84         92       Borussia Dortmund   45000000     69.172798
504         Rodrygo   19       79         90             Real Madrid   21000000     69.134845
2488      T. Almada   19       73         89         Vélez Sarsfield    8500000     69.113731
503    D. Upamecano   21       79         90              RB Leipzig   20000000     69.084326
28        J. Sancho   20       87         93       Borussia Dortmund   69500000     68.996114
64       M. de Ligt   20       85         92                Juventus   49500000     68.876684
899      M. Edwards   21       77         89       Vitória Guimarães   16000000     68.865803
63    G. Donnarumma   21       85         92                   Milan   41500000     68.861140
900    M. Greenwood   18       77         89       Manchester United   14500000     68.774482
1499     Riqui Puig   20       75         88            FC Barcelona   12500000     68.550130
2489        A. Urzi   20       73         88  Club Atlético Banfield    9000000     68.526425
1500        B. Saka   18       75         88                 Arsenal   12000000     68.497668

World-class young players: 235 out of 7720

World-class prediction features: 41

Note: imblearn not installed. Skipping SMOTE balancing.
You can install it with: pip install imbalanced-learn

World-Class Potential Prediction (without SMOTE):
Accuracy: 0.9799

Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1497
           1       0.90      0.38      0.54        47

    accuracy                           0.98      1544
   macro avg       0.94      0.69      0.76      1544
weighted avg       0.98      0.98      0.98      1544

In [ ]:
Model Deployment & Usage Example
In [11]:
class PlayerAnalyzer:
    def __init__(self):
        self.position_model = None
        self.rating_model = None
        self.value_model = None
        self.position_scaler = None
        self.position_encoder = None
        
    def train_models(self, df):
        """Train all models on the given dataframe"""
        print("Training models...")
        
        # Train position classifier
        field_players = df[df['position_group'] != 'Goalkeeper'].copy()
        
        # Prepare position model
        position_features = [f for f in existing_features if f != 'overall']
        X_pos = field_players[position_features].fillna(field_players[position_features].median())
        y_pos = field_players['position_group']
        
        self.position_encoder = LabelEncoder()
        y_pos_encoded = self.position_encoder.fit_transform(y_pos)
        
        self.position_scaler = StandardScaler()
        X_pos_scaled = self.position_scaler.fit_transform(X_pos)
        
        self.position_model = RandomForestClassifier(n_estimators=100, random_state=42)
        self.position_model.fit(X_pos_scaled, y_pos_encoded)
        
        # Train rating predictor
        rating_features = [f for f in existing_features if f not in ['overall', 'potential', 'value_eur', 'wage_eur']]
        X_rating = df[rating_features].fillna(df[rating_features].median())
        y_rating = df['overall']
        
        scaler_r = StandardScaler()
        X_rating_scaled = scaler_r.fit_transform(X_rating)
        
        self.rating_model = RandomForestRegressor(n_estimators=100, random_state=42)
        self.rating_model.fit(X_rating_scaled, y_rating)
        
        print("Models trained successfully!")
        
    def predict_player(self, player_data):
        """Make predictions for a single player"""
        predictions = {}
        
        # Predict position (if not a goalkeeper)
        if self.position_model and player_data['position_group'] != 'Goalkeeper':
            position_features = [f for f in existing_features if f != 'overall']
            player_pos_features = player_data[position_features].values.reshape(1, -1)
            player_pos_scaled = self.position_scaler.transform(player_pos_features)
            pos_pred = self.position_model.predict(player_pos_scaled)
            predictions['predicted_position'] = self.position_encoder.inverse_transform(pos_pred)[0]
        else:
            predictions['predicted_position'] = 'Goalkeeper'
        
        # Predict overall rating
        if self.rating_model:
            rating_features = [f for f in existing_features if f not in ['overall', 'potential', 'value_eur', 'wage_eur']]
            player_rating_features = player_data[rating_features].values.reshape(1, -1)
            scaler_r = StandardScaler()
            player_rating_scaled = scaler_r.fit_transform(player_rating_features)
            rating_pred = self.rating_model.predict(player_rating_scaled)
            predictions['predicted_overall'] = round(rating_pred[0], 1)
        
        return predictions

# Example usage
print("\n" + "="*60)
print("MODEL DEPLOYMENT EXAMPLE")
print("="*60)

analyzer = PlayerAnalyzer()
analyzer.train_models(df)

# Get a sample player for prediction
sample_player = df.iloc[0]  # Lionel Messi
predictions = analyzer.predict_player(sample_player)

print("\nSample Prediction for Lionel Messi:")
for key, value in predictions.items():
    print(f"{key}: {value}")

print(f"\nActual position group: {sample_player['position_group']}")
print(f"Actual overall: {sample_player['overall']}")

# Test with another player
sample_player2 = df.iloc[1]  # Cristiano Ronaldo
predictions2 = analyzer.predict_player(sample_player2)

print("\n\nSample Prediction for Cristiano Ronaldo:")
for key, value in predictions2.items():
    print(f"{key}: {value}")

print(f"\nActual position group: {sample_player2['position_group']}")
print(f"Actual overall: {sample_player2['overall']}")
============================================================
MODEL DEPLOYMENT EXAMPLE
============================================================
Training models...
Models trained successfully!

Sample Prediction for Lionel Messi:
predicted_position: Forward
predicted_overall: 61.3

Actual position group: Forward
Actual overall: 93


Sample Prediction for Cristiano Ronaldo:
predicted_position: Forward
predicted_overall: 61.3

Actual position group: Forward
Actual overall: 92

Results Summary

Over All OUTPUT

In [14]:
# ============================================================================
# COMPLETE FIFA PLAYERS MACHINE LEARNING ANALYSIS
# RUN ALL OF THIS CODE TOGETHER
# ============================================================================

# Step 1: Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# ============================================================================
# STEP 2: LOAD AND EXPLORE THE DATA
# ============================================================================
print("="*70)
print("STEP 1: LOADING FIFA 21 PLAYERS DATA")
print("="*70)

# Load the data
df = pd.read_csv('datasets/players_21.csv')

print(f"✅ Dataset loaded successfully!")
print(f"📊 Shape: {df.shape[0]} players × {df.shape[1]} attributes")
print(f"\nFirst 3 players:")
print(df[['short_name', 'age', 'overall', 'club_name', 'value_eur']].head(3))

# ============================================================================
# STEP 3: CHECK DATA QUALITY
# ============================================================================
print("\n" + "="*70)
print("STEP 2: CHECKING DATA QUALITY")
print("="*70)

# Check for missing values
print("\n🔍 Checking for missing values:")
missing_values = df.isnull().sum().sort_values(ascending=False)
missing_cols = missing_values[missing_values > 0]
if len(missing_cols) > 0:
    print(f"Found {len(missing_cols)} columns with missing values")
    print("Top 10 columns with missing values:")
    print(missing_cols.head(10))
else:
    print("✅ No missing values found!")

# Basic statistics
print("\n📈 Basic statistics of key attributes:")
key_columns = ['overall', 'potential', 'value_eur', 'wage_eur', 'age', 'height_cm', 'weight_kg']
for col in key_columns:
    if col in df.columns:
        print(f"{col:15s}: Mean = {df[col].mean():.1f}, Min = {df[col].min():.1f}, Max = {df[col].max():.1f}")

# ============================================================================
# STEP 4: FEATURE ENGINEERING - SIMPLIFY POSITIONS
# ============================================================================
print("\n" + "="*70)
print("STEP 3: SIMPLIFYING PLAYER POSITIONS")
print("="*70)

# Create position groups
def group_positions(pos):
    if pd.isna(pos):
        return 'Other'
    pos = str(pos)
    if 'GK' in pos:
        return 'Goalkeeper'
    elif any(x in pos for x in ['CB', 'RB', 'LB', 'LWB', 'RWB']):
        return 'Defender'
    elif any(x in pos for x in ['CDM', 'CM', 'CAM', 'LM', 'RM']):
        return 'Midfielder'
    elif any(x in pos for x in ['LW', 'RW', 'ST', 'CF']):
        return 'Forward'
    else:
        return 'Other'

df['position_group'] = df['player_positions'].apply(lambda x: str(x).split(',')[0].strip() if pd.notna(x) else 'Unknown')
df['position_group'] = df['position_group'].apply(group_positions)

print("📊 Position Group Distribution:")
position_counts = df['position_group'].value_counts()
for pos, count in position_counts.items():
    percentage = (count / len(df)) * 100
    print(f"  {pos:12s}: {count:4d} players ({percentage:.1f}%)")

# ============================================================================
# STEP 5: POSITION CLASSIFICATION MODEL
# ============================================================================
print("\n" + "="*70)
print("STEP 4: POSITION CLASSIFICATION MODEL")
print("="*70)
print("Training models to predict player position from stats...")

# Use only field players and key features
field_players = df[df['position_group'] != 'Goalkeeper'].copy()
print(f"Using {len(field_players)} field players (excluding goalkeepers)")

# Select key features
simple_features = ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'age', 'height_cm', 'weight_kg']
# Only use features that exist
available_features = [f for f in simple_features if f in field_players.columns]
print(f"Using {len(available_features)} features: {available_features}")

# Prepare data
X = field_players[available_features].fillna(field_players[available_features].median())
y = field_players['position_group']

# Remove any players with missing position
valid_idx = y.notnull()
X = X[valid_idx]
y = y[valid_idx]

# Encode target positions
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print(f"Position classes: {list(le.classes_)}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)
print(f"Training set: {X_train.shape[0]} players")
print(f"Test set: {X_test.shape[0]} players")

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}

print("\n📊 Model Performance Comparison:")
results = []
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'CV_Mean': cv_scores.mean(),
        'CV_Std': cv_scores.std()
    })
    
    print(f"  {name:20s}: {accuracy:.2%} accuracy")

# Create results dataframe
results_df = pd.DataFrame(results)
best_model_row = results_df.loc[results_df['Accuracy'].idxmax()]
best_model_name = best_model_row['Model']
best_accuracy = best_model_row['Accuracy']

print(f"\n🏆 Best Model: {best_model_name} ({best_accuracy:.2%} accuracy)")

# Feature importance for best model
if best_model_name == 'Random Forest':
    best_model = RandomForestClassifier(n_estimators=100, random_state=42)
    best_model.fit(X_train_scaled, y_train)
    
    feature_importance = pd.DataFrame({
        'Feature': available_features,
        'Importance': best_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print("\n🔑 Most Important Features for Position Prediction:")
    for i, row in feature_importance.head(5).iterrows():
        print(f"  {i+1}. {row['Feature']:15s}: {row['Importance']:.3f}")

# ============================================================================
# STEP 6: OVERALL RATING PREDICTION
# ============================================================================
print("\n" + "="*70)
print("STEP 5: OVERALL RATING PREDICTION")
print("="*70)
print("Predicting player overall rating from attributes...")

# Features for rating prediction
rating_features = ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 
                   'age', 'potential', 'international_reputation']
available_rating_features = [f for f in rating_features if f in df.columns]

X_rating = df[available_rating_features].fillna(df[available_rating_features].median())
y_rating = df['overall']

# Split data
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_rating, y_rating, test_size=0.2, random_state=42
)

# Scale features
scaler_r = StandardScaler()
X_train_r_scaled = scaler_r.fit_transform(X_train_r)
X_test_r_scaled = scaler_r.transform(X_test_r)

# Train regression models
reg_models = {
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Linear Regression': LinearRegression(),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42)
}

print("\n📊 Rating Prediction Performance:")
reg_results = []
for name, model in reg_models.items():
    model.fit(X_train_r_scaled, y_train_r)
    y_pred_r = model.predict(X_test_r_scaled)
    
    rmse = np.sqrt(mean_squared_error(y_test_r, y_pred_r))
    r2 = r2_score(y_test_r, y_pred_r)
    
    reg_results.append({
        'Model': name,
        'RMSE': rmse,
        'R2_Score': r2
    })
    
    print(f"  {name:20s}: RMSE = {rmse:.2f}, R² = {r2:.4f}")

# Find best regression model
reg_results_df = pd.DataFrame(reg_results)
best_reg_model_row = reg_results_df.loc[reg_results_df['R2_Score'].idxmax()]
best_reg_model_name = best_reg_model_row['Model']
best_r2 = best_reg_model_row['R2_Score']

print(f"\n🏆 Best Rating Model: {best_reg_model_name} (R² = {best_r2:.4f})")

# ============================================================================
# STEP 7: PLAYER VALUE PREDICTION
# ============================================================================
print("\n" + "="*70)
print("STEP 6: PLAYER MARKET VALUE PREDICTION")
print("="*70)
print("Predicting player market value in Euros...")

# Prepare data for value prediction
value_features = ['overall', 'potential', 'age', 'international_reputation', 
                  'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']
available_value_features = [f for f in value_features if f in df.columns]

# Remove extreme values
df_value = df[(df['value_eur'] > 0) & (df['value_eur'] < 200000000)].copy()
print(f"Using {len(df_value)} players with reasonable market values")

X_value = df_value[available_value_features].fillna(df_value[available_value_features].median())
y_value = df_value['value_eur']

# Log transform for better prediction
y_value_log = np.log1p(y_value)

# Split data
X_train_v, X_test_v, y_train_v, y_test_v = train_test_split(
    X_value, y_value_log, test_size=0.2, random_state=42
)

# Scale features
scaler_v = StandardScaler()
X_train_v_scaled = scaler_v.fit_transform(X_train_v)
X_test_v_scaled = scaler_v.transform(X_test_v)

# Train value models
value_models = {
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Linear Regression': LinearRegression(),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42)
}

print("\n📊 Value Prediction Performance:")
value_results = []
for name, model in value_models.items():
    model.fit(X_train_v_scaled, y_train_v)
    y_pred_v = model.predict(X_test_v_scaled)
    
    # Convert back from log scale
    y_test_actual = np.expm1(y_test_v)
    y_pred_actual = np.expm1(y_pred_v)
    
    rmse = np.sqrt(mean_squared_error(y_test_actual, y_pred_actual))
    r2 = r2_score(y_test_actual, y_pred_actual)
    mape = np.mean(np.abs((y_test_actual - y_pred_actual) / y_test_actual)) * 100
    
    value_results.append({
        'Model': name,
        'RMSE': rmse,
        'R2_Score': r2,
        'MAPE': mape
    })
    
    print(f"  {name:20s}: RMSE = €{rmse:,.0f}, R² = {r2:.4f}, Error = {mape:.1f}%")

# ============================================================================
# STEP 8: PLAYER CLUSTERING
# ============================================================================
print("\n" + "="*70)
print("STEP 7: PLAYER CLUSTERING")
print("="*70)
print("Grouping players based on similar attributes...")

# Features for clustering
cluster_features = ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'age', 'overall']
available_cluster_features = [f for f in cluster_features if f in df.columns]

X_cluster = df[available_cluster_features].fillna(df[available_cluster_features].median())

# Scale features
scaler_c = StandardScaler()
X_cluster_scaled = scaler_c.fit_transform(X_cluster)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(X_cluster_scaled)

print(f"\n✅ Created {df['cluster'].nunique()} player clusters")
print("\n📊 Cluster Distribution:")
cluster_counts = df['cluster'].value_counts().sort_index()
for cluster_id, count in cluster_counts.items():
    percentage = (count / len(df)) * 100
    # Get average stats for each cluster
    cluster_data = df[df['cluster'] == cluster_id]
    avg_age = cluster_data['age'].mean()
    avg_overall = cluster_data['overall'].mean()
    main_position = cluster_data['position_group'].mode()[0] if 'position_group' in cluster_data.columns else 'N/A'
    
    print(f"\n  Cluster {cluster_id}:")
    print(f"    Size: {count} players ({percentage:.1f}%)")
    print(f"    Avg Age: {avg_age:.1f} years")
    print(f"    Avg Rating: {avg_overall:.1f}")
    print(f"    Main Position: {main_position}")

# ============================================================================
# STEP 9: YOUNG TALENT IDENTIFICATION
# ============================================================================
print("\n" + "="*70)
print("STEP 8: YOUNG TALENT IDENTIFICATION")
print("="*70)

# Focus on young players (age <= 23)
young_players = df[df['age'] <= 23].copy()
print(f"Found {len(young_players)} young players (age ≤ 23)")

# Calculate talent score
young_players['potential_growth'] = young_players['potential'] - young_players['overall']
young_players['talent_score'] = (
    young_players['potential'] * 0.4 +
    young_players['potential_growth'] * 0.3 +
    (young_players['value_eur'].rank(pct=True) * 100) * 0.3
)

# Top 10 young talents
top_talents = young_players.sort_values('talent_score', ascending=False).head(10)
print("\n🌟 TOP 10 YOUNG TALENTS (Age ≤ 23):")
print("-" * 70)
for i, (idx, player) in enumerate(top_talents.iterrows(), 1):
    print(f"{i:2d}. {player['short_name']:20s} | Age: {player['age']:2.0f} | "
          f"Rating: {player['overall']:2.0f}/{player['potential']:2.0f} | "
          f"Club: {player['club_name']:20s} | Value: €{player['value_eur']:,.0f}")

# ============================================================================
# STEP 10: FINAL SUMMARY AND VISUALIZATIONS
# ============================================================================
print("\n" + "="*70)
print("FINAL SUMMARY & RESULTS")
print("="*70)

print("\n📋 MODEL PERFORMANCE SUMMARY:")
print("-" * 50)

# Position Classification
print("\n1. POSITION CLASSIFICATION:")
print(f"   Best Model: {best_model_name}")
print(f"   Accuracy: {best_accuracy:.2%}")
print(f"   Can predict: Defender, Midfielder, Forward")

# Rating Prediction
print("\n2. OVERALL RATING PREDICTION:")
print(f"   Best Model: {best_reg_model_name}")
print(f"   R² Score: {best_r2:.4f}")
print(f"   Average Error: {reg_results_df.loc[reg_results_df['R2_Score'].idxmax(), 'RMSE']:.1f} points")

# Value Prediction
best_value_row = pd.DataFrame(value_results).loc[pd.DataFrame(value_results)['R2_Score'].idxmax()]
print("\n3. MARKET VALUE PREDICTION:")
print(f"   Best Model: {best_value_row['Model']}")
print(f"   R² Score: {best_value_row['R2_Score']:.4f}")
print(f"   Average Error: {best_value_row['MAPE']:.1f}%")

# Clustering
print("\n4. PLAYER CLUSTERING:")
print(f"   Created: {df['cluster'].nunique()} distinct player groups")
print(f"   Groups represent different playing styles and skill levels")

# Young Talents
print("\n5. YOUNG TALENT ANALYSIS:")
print(f"   Analyzed: {len(young_players)} players age ≤ 23")
print(f"   Top Talent: {top_talents.iloc[0]['short_name']} "
      f"(Rating: {top_talents.iloc[0]['overall']}/{top_talents.iloc[0]['potential']})")

print("\n" + "="*70)
print("RECOMMENDED APPLICATIONS")
print("="*70)
print("✓ Player Scouting: Find undervalued players")
print("✓ Team Building: Create balanced squads")
print("✓ Youth Development: Identify promising talents")
print("✓ Transfer Market: Estimate fair player values")
print("✓ Tactical Planning: Understand player strengths/weaknesses")

print("\n" + "="*70)
print("VISUALIZATIONS")
print("="*70)

# Create some simple visualizations
plt.figure(figsize=(15, 10))

# 1. Position Distribution
plt.subplot(2, 3, 1)
position_counts.plot(kind='bar', color='skyblue')
plt.title('Player Position Distribution')
plt.xlabel('Position Group')
plt.ylabel('Number of Players')
plt.xticks(rotation=45)

# 2. Rating Distribution
plt.subplot(2, 3, 2)
df['overall'].hist(bins=30, color='lightgreen', edgecolor='black')
plt.title('Overall Rating Distribution')
plt.xlabel('Rating (0-100)')
plt.ylabel('Number of Players')

# 3. Age Distribution
plt.subplot(2, 3, 3)
df['age'].hist(bins=20, color='lightcoral', edgecolor='black')
plt.title('Player Age Distribution')
plt.xlabel('Age')
plt.ylabel('Number of Players')

# 4. Model Accuracy Comparison
plt.subplot(2, 3, 4)
plt.bar(results_df['Model'], results_df['Accuracy'], color='orange')
plt.title('Model Accuracy Comparison')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.ylim(0, 1)

# 5. Cluster Distribution
plt.subplot(2, 3, 5)
cluster_counts.plot(kind='bar', color='purple')
plt.title('Player Clusters Distribution')
plt.xlabel('Cluster ID')
plt.ylabel('Number of Players')

# 6. Value vs Rating
plt.subplot(2, 3, 6)
plt.scatter(df['overall'], df['value_eur'], alpha=0.3, s=10, color='blue')
plt.title('Market Value vs Overall Rating')
plt.xlabel('Overall Rating')
plt.ylabel('Market Value (€)')
plt.yscale('log')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ All models trained successfully!")
print("📈 Visualizations displayed above")
print("\n🎯 You can now use these models to:")
print("   - Predict player positions")
print("   - Estimate player ratings")
print("   - Predict market values")
print("   - Find similar players")
print("   - Identify young talents")

# ============================================================================
# BONUS: QUICK PREDICTION FUNCTION
# ============================================================================
print("\n" + "="*70)
print("BONUS: QUICK PREDICTION FUNCTION")
print("="*70)

def predict_player_position(player_stats):
    """
    Predict position for a player based on stats
    player_stats should be a dictionary with keys from available_features
    """
    # Create a sample player dataframe
    sample_df = pd.DataFrame([player_stats])
    
    # Fill missing values with median
    for col in available_features:
        if col not in sample_df.columns:
            sample_df[col] = X[col].median()
    
    # Reorder columns
    sample_df = sample_df[available_features]
    
    # Scale features
    sample_scaled = scaler.transform(sample_df)
    
    # Predict
    if 'best_model' in locals():
        prediction = best_model.predict(sample_scaled)
    else:
        best_model_temp = RandomForestClassifier(n_estimators=100, random_state=42)
        best_model_temp.fit(X_train_scaled, y_train)
        prediction = best_model_temp.predict(sample_scaled)
    
    position = le.inverse_transform(prediction)[0]
    return position

# Example usage
print("\n📝 Example: Predict position for a player with these stats:")
example_stats = {
    'pace': 85,
    'shooting': 75,
    'passing': 90,
    'dribbling': 88,
    'defending': 40,
    'physic': 65,
    'age': 25,
    'height_cm': 180,
    'weight_kg': 75
}

predicted_position = predict_player_position(example_stats)
print(f"Stats: Pace={example_stats['pace']}, Shooting={example_stats['shooting']}, "
      f"Passing={example_stats['passing']}, Dribbling={example_stats['dribbling']}, "
      f"Defending={example_stats['defending']}")
print(f"🤔 Predicted Position: {predicted_position}")

print("\n" + "="*70)
print("ANALYSIS COMPLETE! 🎉")
print("="*70)
print("\nYou now have a complete machine learning system for FIFA players!")
print("Use the insights to make better football decisions!")
======================================================================
STEP 1: LOADING FIFA 21 PLAYERS DATA
======================================================================
✅ Dataset loaded successfully!
📊 Shape: 18944 players × 106 attributes

First 3 players:
          short_name  age  overall        club_name  value_eur
0           L. Messi   33       93     FC Barcelona   67500000
1  Cristiano Ronaldo   35       92         Juventus   46000000
2           J. Oblak   27       91  Atlético Madrid   75000000

======================================================================
STEP 2: CHECKING DATA QUALITY
======================================================================

🔍 Checking for missing values:
Found 26 columns with missing values
Top 10 columns with missing values:
defending_marking       18944
loaned_from             18186
nation_jersey_number    17817
nation_position         17817
player_tags             17536
gk_speed                16861
gk_kicking              16861
gk_handling             16861
gk_diving               16861
gk_positioning          16861
dtype: int64

📈 Basic statistics of key attributes:
overall        : Mean = 65.7, Min = 47.0, Max = 93.0
potential      : Mean = 71.1, Min = 47.0, Max = 95.0
value_eur      : Mean = 2224813.3, Min = 0.0, Max = 105500000.0
wage_eur       : Mean = 8675.9, Min = 0.0, Max = 560000.0
age            : Mean = 25.2, Min = 16.0, Max = 53.0
height_cm      : Mean = 181.2, Min = 155.0, Max = 206.0
weight_kg      : Mean = 75.0, Min = 50.0, Max = 110.0

======================================================================
STEP 3: SIMPLIFYING PLAYER POSITIONS
======================================================================
📊 Position Group Distribution:
  Midfielder  : 7037 players (37.1%)
  Defender    : 6205 players (32.8%)
  Forward     : 3618 players (19.1%)
  Goalkeeper  : 2084 players (11.0%)

======================================================================
STEP 4: POSITION CLASSIFICATION MODEL
======================================================================
Training models to predict player position from stats...
Using 16860 field players (excluding goalkeepers)
Using 9 features: ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'age', 'height_cm', 'weight_kg']
Position classes: ['Defender', 'Forward', 'Midfielder']
Training set: 13488 players
Test set: 3372 players

📊 Model Performance Comparison:
  Random Forest       : 83.75% accuracy
  Logistic Regression : 83.78% accuracy
  K-Nearest Neighbors : 80.99% accuracy
  Decision Tree       : 77.16% accuracy
  Gradient Boosting   : 84.05% accuracy

🏆 Best Model: Gradient Boosting (84.05% accuracy)

======================================================================
STEP 5: OVERALL RATING PREDICTION
======================================================================
Predicting player overall rating from attributes...

📊 Rating Prediction Performance:
  Random Forest       : RMSE = 1.19, R² = 0.9701
  Linear Regression   : RMSE = 2.40, R² = 0.8773
  Gradient Boosting   : RMSE = 1.38, R² = 0.9596

🏆 Best Rating Model: Random Forest (R² = 0.9701)

======================================================================
STEP 6: PLAYER MARKET VALUE PREDICTION
======================================================================
Predicting player market value in Euros...
Using 18707 players with reasonable market values

📊 Value Prediction Performance:
  Random Forest       : RMSE = €444,387, R² = 0.9927, Error = 2.8%
  Linear Regression   : RMSE = €2,417,816, R² = 0.7849, Error = 18.0%
  Gradient Boosting   : RMSE = €539,322, R² = 0.9893, Error = 5.6%

======================================================================
STEP 7: PLAYER CLUSTERING
======================================================================
Grouping players based on similar attributes...

✅ Created 5 player clusters

📊 Cluster Distribution:

  Cluster 0:
    Size: 6156 players (32.5%)
    Avg Age: 27.2 years
    Avg Rating: 67.2
    Main Position: Defender

  Cluster 1:
    Size: 2651 players (14.0%)
    Avg Age: 24.9 years
    Avg Rating: 62.6
    Main Position: Defender

  Cluster 2:
    Size: 4028 players (21.3%)
    Avg Age: 20.5 years
    Avg Rating: 57.6
    Main Position: Midfielder

  Cluster 3:
    Size: 3581 players (18.9%)
    Avg Age: 25.5 years
    Avg Rating: 67.6
    Main Position: Forward

  Cluster 4:
    Size: 2528 players (13.3%)
    Avg Age: 27.8 years
    Avg Rating: 75.4
    Main Position: Midfielder

======================================================================
STEP 8: YOUNG TALENT IDENTIFICATION
======================================================================
Found 7720 young players (age ≤ 23)

🌟 TOP 10 YOUNG TALENTS (Age ≤ 23):
----------------------------------------------------------------------
 1. Vinícius Jr.         | Age: 19 | Rating: 80/93 | Club: Real Madrid          | Value: €27,500,000
 2. João Félix           | Age: 20 | Rating: 81/93 | Club: Atlético Madrid      | Value: €32,000,000
 3. S. Tonali            | Age: 20 | Rating: 77/91 | Club: Milan                | Value: €18,500,000
 4. Trincão              | Age: 20 | Rating: 78/91 | Club: FC Barcelona         | Value: €20,000,000
 5. Ansu Fati            | Age: 17 | Rating: 76/90 | Club: FC Barcelona         | Value: €15,000,000
 6. K. Havertz           | Age: 21 | Rating: 85/93 | Club: Chelsea              | Value: €57,000,000
 7. K. Mbappé            | Age: 21 | Rating: 90/95 | Club: Paris Saint-Germain  | Value: €105,500,000
 8. T. Kubo              | Age: 19 | Rating: 75/89 | Club: Villarreal CF        | Value: €14,500,000
 9. E. Haaland           | Age: 19 | Rating: 84/92 | Club: Borussia Dortmund    | Value: €45,000,000
10. Rodrygo              | Age: 19 | Rating: 79/90 | Club: Real Madrid          | Value: €21,000,000

======================================================================
FINAL SUMMARY & RESULTS
======================================================================

📋 MODEL PERFORMANCE SUMMARY:
--------------------------------------------------

1. POSITION CLASSIFICATION:
   Best Model: Gradient Boosting
   Accuracy: 84.05%
   Can predict: Defender, Midfielder, Forward

2. OVERALL RATING PREDICTION:
   Best Model: Random Forest
   R² Score: 0.9701
   Average Error: 1.2 points

3. MARKET VALUE PREDICTION:
   Best Model: Random Forest
   R² Score: 0.9927
   Average Error: 2.8%

4. PLAYER CLUSTERING:
   Created: 5 distinct player groups
   Groups represent different playing styles and skill levels

5. YOUNG TALENT ANALYSIS:
   Analyzed: 7720 players age ≤ 23
   Top Talent: Vinícius Jr. (Rating: 80/93)

======================================================================
RECOMMENDED APPLICATIONS
======================================================================
✓ Player Scouting: Find undervalued players
✓ Team Building: Create balanced squads
✓ Youth Development: Identify promising talents
✓ Transfer Market: Estimate fair player values
✓ Tactical Planning: Understand player strengths/weaknesses

======================================================================
VISUALIZATIONS
======================================================================
No description has been provided for this image
✅ All models trained successfully!
📈 Visualizations displayed above

🎯 You can now use these models to:
   - Predict player positions
   - Estimate player ratings
   - Predict market values
   - Find similar players
   - Identify young talents

======================================================================
BONUS: QUICK PREDICTION FUNCTION
======================================================================

📝 Example: Predict position for a player with these stats:
Stats: Pace=85, Shooting=75, Passing=90, Dribbling=88, Defending=40
🤔 Predicted Position: Midfielder

======================================================================
ANALYSIS COMPLETE! 🎉
======================================================================

You now have a complete machine learning system for FIFA players!
Use the insights to make better football decisions!

What This Code Does: Loads your FIFA 21 data from the datasets folder

Cleans and prepares the data

Trains 5 different machine learning models:

Position classifier (predicts Defender/Midfielder/Forward)

Rating predictor (predicts overall rating 0-100)

Value predictor (predicts market value in €)

Player clustering (groups similar players)

Young talent identifier (finds promising young players)

Shows results and visualizations

Creates a prediction function you can use

How to Run: Copy ALL the code above

Paste it into your notebook/editor

Run it all at once

Wait for it to finish (it will show progress as it runs)

Expected time: 1-3 minutes depending on your computer

Output you'll see:

Loading confirmation ✅

Data quality check 🔍

Model training progress 📊

Results summary 📋

Visualizations 📈

Young talent list 🌟

Prediction examples 📝