[Frosti Gíslason] - Fab Futures - Data Science

Probability

Investigate the probability distribution of your data. Set up template notebooks and slides for your data set analysis.

Here is the dataset reloaded from the last assignment: average temperatures in the Nordic capitals, downloaded from https://nordicstatistics.org/areas/geography-and-climate/

In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv("datasets/nordicaveragetemp.csv")

# Show the first 5 rows
df.head()
Out[2]:
Category DK FO GL FI AX IS NO SE
0 1874 7.8 NaN NaN 4.8 NaN NaN 5.9 6.0
1 1875 6.9 NaN NaN 1.9 NaN NaN 4.3 4.3
2 1876 7.1 NaN NaN 3.1 NaN NaN 4.6 4.9
3 1877 6.8 NaN NaN 3.3 NaN NaN 3.6 4.7
4 1878 7.8 NaN NaN 5.2 NaN NaN 5.9 6.2

Show the last 5 rows

In [3]:
# Show the last 5 rows
df.tail()
Out[3]:
Category DK FO GL FI AX IS NO SE
146 2020 10.7 7.2 -0.7 8.7 8.5 5.1 8.9 9.7
147 2021 9.6 7.0 0.1 6.6 6.9 5.4 7.3 8.1
148 2022 10.3 7.3 -1.0 7.3 7.4 5.1 8.0 8.8
149 2023 10.1 7.3 -0.3 7.1 6.7 5.0 7.0 8.0
150 2024 10.5 7.1 -0.7 NaN 7.4 4.3 7.8 8.9

This continues from the last assignment, where I asked ChatGPT for help. Now I want to create a visualisation using matplotlib and numpy (import matplotlib.pyplot as plt, import numpy as np). I want each Nordic country to have its own color and a continuous line, with the year on the x axis and the temperature on the y axis.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load your dataset
df = pd.read_csv("datasets/nordicaveragetemp.csv")

# Convert the Category column (years) to integers
df["Category"] = df["Category"].astype(int)

# Nordic country columns
countries = ["DK", "FO", "GL", "FI", "AX", "IS", "NO", "SE"]

# Assign a color to each country
colors = {
    "DK": "red",
    "FO": "purple",
    "GL": "blue",
    "FI": "green",
    "AX": "orange",
    "IS": "cyan",
    "NO": "black",
    "SE": "brown"
}

# Create the plot
plt.figure(figsize=(14, 7))

for country in countries:
    if country in df.columns:
        plt.plot(
            df["Category"],
            df[country],
            color=colors[country],
            label=country,
            linewidth=2
        )

plt.title("Average Annual Temperature in Nordic Countries (°C)")
plt.xlabel("Year")
plt.ylabel("Temperature (°C)")
plt.grid(True, linestyle="--", alpha=0.4)
plt.legend(title="Country", ncol=4)
plt.tight_layout()
plt.show()
[Figure: line plot of average annual temperature (°C) by year, one colored line per Nordic country]

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load your dataset
df = pd.read_csv("datasets/nordicaveragetemp.csv")

# Drop the year column ("Category") so only the temperature columns remain
df_numeric = df.drop(columns=["Category"])

# Pairplot with KDE curves on the diagonal; "fill" replaces the deprecated "shade" option
sns.pairplot(
    df_numeric,
    kind="scatter",
    diag_kind="kde",
    plot_kws={"alpha": 0.6, "s": 30},
    diag_kws={"fill": True}  # fill the area under each KDE curve
)

plt.show()
[Figure: pairplot of the temperature columns, with scatter plots off the diagonal and filled KDE curves on the diagonal]

As a newbie to data science, I am working on understanding things better, and for me ChatGPT is a great resource.

ChatGPT tells me to use a histogram when exploring a dataset for the first time, and when we want a sense of how many years fall into each temperature range.

Information about Seaborn: https://seaborn.pydata.org/generated/seaborn.histplot.html
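
To make the idea of "how many years fall into each temperature range" concrete, here is a minimal sketch that counts the years per bin by hand with numpy. It uses only the ten Denmark (DK) values visible in the df.head() and df.tail() outputs above, not the full dataset, and the 1 °C bin edges are my own assumption, chosen just for illustration.

```python
import numpy as np
import pandas as pd

# Only the ten DK rows shown in df.head() and df.tail() above,
# typed in by hand so the example runs without the CSV file.
sample = pd.DataFrame({
    "Category": [1874, 1875, 1876, 1877, 1878, 2020, 2021, 2022, 2023, 2024],
    "DK": [7.8, 6.9, 7.1, 6.8, 7.8, 10.7, 9.6, 10.3, 10.1, 10.5],
})

# Assumed bin edges: 1 °C wide, from 6 °C to 11 °C.
bin_edges = [6, 7, 8, 9, 10, 11]

# counts[i] is the number of years with bin_edges[i] <= temp < bin_edges[i + 1]
# (the last bin also includes its right edge).
counts, edges = np.histogram(sample["DK"].to_numpy(), bins=bin_edges)

# Dividing by the total turns the counts into an empirical share per bin.
probabilities = counts / counts.sum()

for low, high, n, p in zip(edges[:-1], edges[1:], counts, probabilities):
    print(f"{low}-{high} °C: {n} years, share {p:.1f}")
```

On the full dataset, sns.histplot draws these same kinds of counts as bars, and its bins argument plays the role of bin_edges here.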

In [9]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("datasets/nordicaveragetemp.csv")
df_numeric = df.drop(columns=["Category"])

sns.pairplot(
    df_numeric,
    kind="scatter",
    diag_kind="hist",
    diag_kws={"bins": 15, "edgecolor": "black"},
    plot_kws={"alpha": 0.6, "s": 30}
)

plt.show()
[Figure: pairplot of the temperature columns, with scatter plots off the diagonal and 15-bin histograms on the diagonal]

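If we take a look at the bottom row of the pairplot, the one for Sweden (SE), we can see that Denmark and Sweden, Sweden and Finland, and Sweden and Norway align quite well: the points cluster tightly along a diagonal, which suggests that their annual temperatures rise and fall together.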
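
One way to back up this visual impression with numbers is the Pearson correlation coefficient, which pandas computes with DataFrame.corr(). The sketch below is only an illustration: it uses the five rows for 1874 to 1878 shown in the df.head() output above, for DK, FI, NO and SE, which is far too few to draw conclusions from. With the full dataset, the same call would be df.drop(columns=["Category"]).corr().

```python
import pandas as pd

# The five rows from df.head() above (1874 to 1878), typed in by hand.
# Only columns without missing values in these rows are included.
sample = pd.DataFrame({
    "DK": [7.8, 6.9, 7.1, 6.8, 7.8],
    "FI": [4.8, 1.9, 3.1, 3.3, 5.2],
    "NO": [5.9, 4.3, 4.6, 3.6, 5.9],
    "SE": [6.0, 4.3, 4.9, 4.7, 6.2],
})

# Pearson correlation between every pair of columns:
# +1 means the two series move together perfectly,
# 0 means there is no linear relationship.
corr = sample.corr()

# The SE row of the matrix corresponds to the bottom row of the pairplot.
print(corr.loc["SE"].round(2))
```

Values close to 1 in the SE row would match the tight diagonal clouds seen in the pairplot, while lower values would match the more scattered panels.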
