< Home
Lesson 2: Visualizing Dataset¶
The dataset contains agricultural production table containing information about various fruit crops across different regions or administrative divisions for the year 2021. Each fruit category—such as apple, areca nut, mandarin, watermelon, peach, plum, walnut, pineapple, and passion fruit—is represented with multiple associated attributes. These attributes typically include values like Total Trees, Bearing Trees, and Production (MT), indicating the scale of cultivation and productivity for each crop. Structurally, the dataset includes 76 columns and 22 rows, where several columns are labeled as Unnamed. This suggests that the dataset was likely extracted from a formatted table (possibly a PDF or report), and the merged or multi-level headers were separated into individual columns. The first two rows appear to contain header-like descriptors rather than actual data, showing labels for “Total Tree”, “Bearing Tree”, and “Production”. The following rows contain numeric values—often stored as text strings—representing agricultural statistics for each fruit type. Overall, the dataset provides a broad snapshot of fruit production in a specific context, likely at a national or regional level. It is valuable for analyzing patterns in crop distribution, productivity differences across fruit types, and overall agricultural output. However, due to its layout—especially the presence of many Unnamed columns—it would benefit from cleaning and restructuring before use in analysis, such as combining header rows, converting numeric strings to numbers, and properly labeling each variable.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load data with two header rows
df = pd.read_csv("datasets/Final_report_tables_2021AS.csv", header=[0, 1], na_values=[" - ", " -", "- ", "-", ""])
# Flatten multi-index columns
crop_level = df.columns.get_level_values(0)
metric_level = df.columns.get_level_values(1)
new_columns = []
for crop, metric in zip(crop_level, metric_level):
crop = crop.strip()
metric = metric.strip()
if crop == "Watermelon":
# Watermelon has area fields instead of tree counts
new_columns.append(f"{crop}_{metric}")
else:
new_columns.append(f"{crop}_{metric}")
df.columns = new_columns
df.set_index(df.columns[0], inplace=True)
df.index.name = 'Dzongkhag'
# Clean and convert numeric data
df = df.replace({',': ''}, regex=True)
df = df.apply(pd.to_numeric, errors='coerce')
# Melt to long format
df_reset = df.reset_index()
df_melted = df_reset.melt(id_vars='Dzongkhag', var_name='Crop_Metric', value_name='Value')
df_melted[['Crop', 'Metric']] = df_melted['Crop_Metric'].str.rsplit('_', n=1, expand=True)
# Filter for production values
df_prod = df_melted[df_melted['Metric'].str.contains('Production', na=False)].dropna(subset=['Value'])
df_prod = df_prod[df_prod['Value'] > 0].copy()
# --- Visualization 1: Top 20 crop productions ---
plt.figure(figsize=(14, 10))
top20 = df_prod.nlargest(20, 'Value')
sns.barplot(data=top20, y='Crop', x='Value', hue='Dzongkhag', dodge=False)
plt.title('Top 20 Crop Productions (MT) by Dzongkhag – 2021')
plt.xlabel('Production (Metric Tons)')
plt.ylabel('Crop')
plt.legend(title='Dzongkhag', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
# --- Visualization 2: Total production per Dzongkhag ---
df_total_by_dzongkhag = df_prod.groupby('Dzongkhag')['Value'].sum().sort_values(ascending=False)
plt.figure(figsize=(10, 8))
sns.barplot(
x=df_total_by_dzongkhag.values,
y=df_total_by_dzongkhag.index,
hue=df_total_by_dzongkhag.index, # required for palette in newer seaborn
palette='viridis',
legend=False
)
plt.title('Total Crop Production by Dzongkhag (MT) – 2021')
plt.xlabel('Total Production (Metric Tons)')
plt.ylabel('Dzongkhag')
plt.tight_layout()
plt.show()