Probability
Investigate the probability distribution of your data. Set up template notebooks and slides for your data set analysis.
Here is the dataset reloaded from the last assignment: average temperature in the Nordic capitals, downloaded from https://nordicstatistics.org/areas/geography-and-climate/
import pandas as pd
# Load the dataset
df = pd.read_csv("datasets/nordicaveragetemp.csv")
# Show the first 5 rows
df.head()
| | Category | DK | FO | GL | FI | AX | IS | NO | SE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1874 | 7.8 | NaN | NaN | 4.8 | NaN | NaN | 5.9 | 6.0 |
| 1 | 1875 | 6.9 | NaN | NaN | 1.9 | NaN | NaN | 4.3 | 4.3 |
| 2 | 1876 | 7.1 | NaN | NaN | 3.1 | NaN | NaN | 4.6 | 4.9 |
| 3 | 1877 | 6.8 | NaN | NaN | 3.3 | NaN | NaN | 3.6 | 4.7 |
| 4 | 1878 | 7.8 | NaN | NaN | 5.2 | NaN | NaN | 5.9 | 6.2 |
Show the last 5 rows
# Show the last 5 rows
df.tail()
| | Category | DK | FO | GL | FI | AX | IS | NO | SE |
|---|---|---|---|---|---|---|---|---|---|
| 146 | 2020 | 10.7 | 7.2 | -0.7 | 8.7 | 8.5 | 5.1 | 8.9 | 9.7 |
| 147 | 2021 | 9.6 | 7.0 | 0.1 | 6.6 | 6.9 | 5.4 | 7.3 | 8.1 |
| 148 | 2022 | 10.3 | 7.3 | -1.0 | 7.3 | 7.4 | 5.1 | 8.0 | 8.8 |
| 149 | 2023 | 10.1 | 7.3 | -0.3 | 7.1 | 6.7 | 5.0 | 7.0 | 8.0 |
| 150 | 2024 | 10.5 | 7.1 | -0.7 | NaN | 7.4 | 4.3 | 7.8 | 8.9 |
This continues from the last assignment, where I asked ChatGPT for help. Now I want to create a visualisation using matplotlib and numpy (`import matplotlib.pyplot as plt`, `import numpy as np`). I want each Nordic country to have its own color and a continuous line, with the year on the x-axis and the temperature on the y-axis.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load your dataset
df = pd.read_csv("datasets/nordicaveragetemp.csv")
# Convert the Category column (years) to integers
df["Category"] = df["Category"].astype(int)
# Nordic country columns
countries = ["DK", "FO", "GL", "FI", "AX", "IS", "NO", "SE"]
# Assign a color to each country
colors = {
    "DK": "red",
    "FO": "purple",
    "GL": "blue",
    "FI": "green",
    "AX": "orange",
    "IS": "cyan",
    "NO": "black",
    "SE": "brown",
}
# Create the plot
plt.figure(figsize=(14, 7))
for country in countries:
    # Skip any country code that is missing from the file
    if country in df.columns:
        plt.plot(
            df["Category"],
            df[country],
            color=colors[country],
            label=country,
            linewidth=2,
        )
plt.title("Average Annual Temperature in Nordic Countries (°C)")
plt.xlabel("Year")
plt.ylabel("Temperature (°C)")
plt.grid(True, linestyle="--", alpha=0.4)
plt.legend(title="Country", ncol=4)
plt.tight_layout()
plt.show()
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load your dataset
df = pd.read_csv("datasets/nordicaveragetemp.csv")
# Drop the year column if named 'Category' or similar
df_numeric = df.drop(columns=["Category"])
# Pairplot with KDE curves on the diagonal
sns.pairplot(
    df_numeric,
    kind="scatter",
    diag_kind="kde",
    plot_kws={"alpha": 0.6, "s": 30},
    diag_kws={"fill": True},  # "fill" replaces the deprecated "shade" option
)
plt.show()
As a newcomer to data science, I am working on understanding things better, and for me ChatGPT is a great resource.
ChatGPT tells me to use a histogram when exploring a dataset for the first time, to get a sense of how many years fall into each temperature range.
Information about Seaborn: https://seaborn.pydata.org/generated/seaborn.histplot.html
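To make "how many years fall into each temperature range" concrete, here is a minimal sketch that counts the bins by hand with `np.histogram`. It uses only the ten DK values visible in the head and tail tables above, and the 2 °C bin edges are my own assumption for illustration, so the counts say nothing about the full dataset.

```python
import numpy as np
import pandas as pd

# Illustrative excerpt: the ten DK values visible in the head and tail
# tables above (1874 to 1878 and 2020 to 2024), not the full dataset.
dk = pd.Series(
    [7.8, 6.9, 7.1, 6.8, 7.8, 10.7, 9.6, 10.3, 10.1, 10.5],
    name="DK",
)

# Assumed bin edges for this excerpt: three ranges, each 2 °C wide.
bin_edges = [6.0, 8.0, 10.0, 12.0]

# np.histogram counts how many values fall into each range.
# Missing values are dropped first so they cannot disturb the binning.
counts, edges = np.histogram(dk.dropna(), bins=bin_edges)

for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low:.0f} to {high:.0f} °C: {n} years")
```

A histogram plot draws exactly these counts as bars. On the full file, something like `sns.histplot(data=df, x="DK", bins=15)` should give the picture for one country, with seaborn choosing the bin edges.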
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("datasets/nordicaveragetemp.csv")
df_numeric = df.drop(columns=["Category"])
# Pairplot with histograms on the diagonal
sns.pairplot(
    df_numeric,
    kind="scatter",
    diag_kind="hist",
    diag_kws={"bins": 15, "edgecolor": "black"},
    plot_kws={"alpha": 0.6, "s": 30},
)
plt.show()
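If we take a look at the bottom row of the pairplot, which is Sweden, we can see that the points for Sweden and Denmark, Sweden and Finland, and Sweden and Norway line up quite well along a rising diagonal and sit close together. This suggests that the yearly temperatures in these countries tend to move together.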
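The visual impression that Sweden's temperatures follow those of Denmark, Finland, and Norway can also be checked with numbers. Below is a small sketch using `DataFrame.corr()`, which computes the Pearson correlation for every pair of columns. The data frame `sample` is a hypothetical excerpt built from the ten rows visible in the head and tail tables above, limited to four countries, so the results are only illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative excerpt: the ten rows visible in the head and tail tables
# above, restricted to four countries. This is not the full dataset.
sample = pd.DataFrame(
    {
        "DK": [7.8, 6.9, 7.1, 6.8, 7.8, 10.7, 9.6, 10.3, 10.1, 10.5],
        "FI": [4.8, 1.9, 3.1, 3.3, 5.2, 8.7, 6.6, 7.3, 7.1, np.nan],
        "NO": [5.9, 4.3, 4.6, 3.6, 5.9, 8.9, 7.3, 8.0, 7.0, 7.8],
        "SE": [6.0, 4.3, 4.9, 4.7, 6.2, 9.7, 8.1, 8.8, 8.0, 8.9],
    },
    index=[1874, 1875, 1876, 1877, 1878, 2020, 2021, 2022, 2023, 2024],
)

# Pearson correlation between every pair of columns.
# pandas skips missing values pair by pair (FI has no value for 2024).
corr = sample.corr()

# Correlation of each country with Sweden, leaving out Sweden itself.
corr_with_se = corr["SE"].drop("SE").sort_values(ascending=False)
print(corr_with_se.round(2))
```

Values close to 1 mean the two countries warm and cool in the same years. Because this excerpt mixes the 1870s with the 2020s, the long-term warming trend pushes the numbers up, so running `df_numeric.corr()` on the full file is the fairer check.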