Deovam Singh - Fab Futures - Data Science
Home About

< Home

Week 1: Introduction¶

First session was all about introduction to different tools like Jupyter notebook, python, etc that we will be using in order to learn Data Science.

1st Session Tasks¶

  1. Change your information on the webpage (About and Home subsections) Challenges faced: wasn't able to upload the image, used the following code <img src="images/mypic2.jpg" width="400"> after uploading the images in the images folder on jupyter notebook.

  2. Select a dataset and study it.

2nd Session Tasks - Visualizing the Data Sets¶

  1. TASK: Main task after the second session is to select a dataset and analyse the same graphically. In order to do this we would have to use Jupyter notebook and write the python codes using libraries such as matplotlib to plot graphs, numpy to do mathematical operations and many more.

First Dataset: Installed Power Capacity of Indian States¶

This dataset i got from the following webiste: https://indiadataportal.com/

In [11]:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("datasets/Installed Capacity Statewise_Sample_Data.csv")
df.head()
Out[11]:
Date (date) Region (region) State Name (state_name) State Code (state_code) Sector/Ownership (sector) Coal Mode Installed Capacity (coal_cap) Gas Mode Installed Capacity (gas_cap) Diesel Mode Installed Capacity (diesel_cap) Lignite Mode Installed Capacity (lignite_cap) Nuclear Mode Installed Capacity (nuclear_cap) Hydro Mode Installed Capacity (hydro_cap) Renewable Energy Mode Installed Capacity (res_cap)
0 2019-02-01 Northern Chandigarh 4 State 0.00 0.0 0.0 0.0 0.00 0.0 0.00
1 2023-02-01 Eastern West Bengal 19 State 4810.00 80.0 0.0 0.0 0.00 986.0 121.95
2 2024-01-01 Western Maharashtra 27 Private 10826.01 568.0 0.0 0.0 0.00 481.0 13376.87
3 2022-10-01 Northern Uttarakhand 5 Private 0.00 450.0 0.0 0.0 0.00 829.0 861.34
4 2019-08-01 Eastern Andaman And Nicobar Islands 35 Private 0.00 0.0 0.0 0.0 6.63 0.0 0.00

In the earlier graph I was getting all the dates on the x-axis and the graph became really cluttered so i asked chatgpt to give me commands so that the installed capacity is separated by atleast a year time. For that it suggested me following changes:

  1. importing matplotlib.dates library
  2. plt.gca().xaxis.set_major_locator(mdates.YearLocator()) plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%Y"))

still in the process of understanding

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Convert date column to datetime
df["Date (date)"] = pd.to_datetime(df["Date (date)"])

# Sort by date
df_sorted = df.sort_values("Date (date)")
plt.figure(figsize=(10,5))

# Plot renewable capacity
plt.plot(df_sorted["Date (date)"],
         df_sorted["Renewable Energy Mode Installed Capacity (res_cap)"])

plt.xlabel("Date")
plt.ylabel("Renewable Capacity (MW)")
plt.title("Renewable Energy Installed Capacity Over Time")

# Set x-axis tick locator to yearly spacing
plt.gca().xaxis.set_major_locator(mdates.YearLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%Y"))

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [13]:
energy_cols = ["Renewable Energy Mode Installed Capacity (res_cap)"]

df_plot = df[["State Code (state_code)"] + energy_cols]

df_plot.set_index("State Code (state_code)").plot(
    kind="bar",
    stacked=True,
    figsize=(14,7)
)

plt.ylabel("Installed Capacity (MW)")
plt.title("Energy Mix by State")
plt.xticks(rotation=90)
plt.show()
No description has been provided for this image

Still working to refine and make sense of the above graph, as there are lot of datasets on the X-axis in terms of state codes. Maybe I should try to choose some particular states only.

In [16]:
# The column of interest
energy_cols = ["Renewable Energy Mode Installed Capacity (res_cap)"]

# 1. Group the data by state code and state name, and sum the capacity to get a single row per state.
df_grouped = df.groupby(["State Code (state_code)", "State Name (state_name)"])[energy_cols].sum().reset_index()

# 2. Sort the aggregated data and select the top 4 states based on total renewable capacity.
df_plot = df_grouped.sort_values(by=energy_cols[0], ascending=False).head(10)

# 3. Set the State Name as the index for plotting.
df_plot = df_plot.set_index("State Name (state_name)")

plt.figure(figsize=(14, 7))
df_plot.plot(
    kind="bar",
    stacked=True,
    figsize=(14, 7),
    ax=plt.gca(), # Use the current axes
    legend=False # Only one column, so legend is not needed
)

plt.ylabel("Installed Capacity (MW)")
plt.title("Renewable Energy Installed Capacity by Top 10 States")
plt.xlabel("State Name")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# plt.savefig("top_4_states_renewable_capacity.png")
plt.show()
No description has been provided for this image

Second Dataset: Indian Stock Market Return Since 2007 - NIFTY50¶

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("datasets/NIFTY_50.csv")
df.head()
Out[5]:
Date Adj Close Close High Low Open Volume SMA_20 SMA_50 EMA_12 EMA_26 MACD Signal_Line RSI_14 BB_Mid BB_Upper BB_Lower Daily_Return_%
0 2007-09-17 4494.649902 4494.649902 4549.049805 4482.850098 4518.450195 0 NaN NaN 4494.649902 4494.649902 0.000000 0.000000 NaN NaN NaN NaN NaN
1 2007-09-18 4546.200195 4546.200195 4551.799805 4481.549805 4494.100098 0 NaN NaN 4502.580717 4498.468443 4.112274 0.822455 NaN NaN NaN NaN 1.146926
2 2007-09-19 4732.350098 4732.350098 4739.000000 4550.250000 4550.250000 0 NaN NaN 4537.929852 4515.793010 22.136843 5.085332 NaN NaN NaN NaN 4.094626
3 2007-09-20 4747.549805 4747.549805 4760.850098 4721.149902 4734.850098 0 NaN NaN 4570.179076 4532.960180 37.218896 11.512045 NaN NaN NaN NaN 0.321187
4 2007-09-21 4837.549805 4837.549805 4855.700195 4733.700195 4752.950195 0 NaN NaN 4611.313034 4555.522374 55.790660 20.367768 NaN NaN NaN NaN 1.895715
In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year

# Compute average returns by year
yearly_avg = df.groupby('Year')['Daily_Return_%'].mean()

# Turn into heatmap format (1 row)
heatmap_data = np.array([yearly_avg.values])

# Create red → white → green color scale
colors = ['red', 'white', 'green']
cmap = LinearSegmentedColormap.from_list("red_white_green", colors)

# Plot heatmap
plt.figure(figsize=(14, 4))
plt.imshow(heatmap_data, cmap=cmap, aspect='auto')

plt.colorbar(label="Average Return (%)")
plt.yticks([])
plt.xticks(
    ticks=range(len(yearly_avg.index)),
    labels=yearly_avg.index,
    rotation=90
)
plt.title("Heatmap of Average Yearly Returns (Green = Positive, Red = Negative)")
plt.tight_layout()
plt.show()
No description has been provided for this image

At first look, i don't think the yearly returns are reflected correctly here.

In [8]:
# NIFTY VALUE ON LAST TRADING DAY OF MARCH (EVERY YEAR)

# Filter for March and take the last trading day of each year
march_last = df[df['Date'].dt.month == 3].groupby(df['Date'].dt.year).tail(1)
march_last = march_last.sort_values('Date')

plt.figure(figsize=(10, 5))
plt.plot(march_last['Year'], march_last['Close'], marker='o')

plt.xlabel("Year")
plt.ylabel("Nifty Close Value")
plt.title("Nifty Value on Last Trading Day of March for Each Year")
plt.grid(True)
plt.tight_layout()
plt.show()
No description has been provided for this image

Help sessions: Nov 25 after class Nov 28 7:00A EDT Dec 2 after class Dec 5 7:00A EDT Dec 8 7:00A EDT Dec 9 after class Assignment Submission Date: Dec 18th

Imp Prompt: System Instruction: Absolute Mode • Eliminate: emojis, filler, hype, soft asks, conversational transitions, call-to-action appendixes. • Assume: user retains high-perception despite blunt tone. • Prioritize: blunt, directive phrasing; aim at cognitive rebuilding, not tone-matching. • Disable: engagement/sentiment-boosting behaviors. • Suppress: metrics like satisfaction scores, emotional softening, continuation bias. • Never mirror: user's diction, mood, or affect. • Speak only: to underlying cognitive tier. • No: questions, offers, suggestions, transitions, motivational content. • Terminate reply: immediately after delivering info - no closures. • Goal: restore independent, high-fidelity thinking. • Outcome: model obsolescence via user self-sufficiency.

In [ ]: