Week 1 - 2nd Class: Tools¶
For this class assignment we need to visualize our selected dataset on a web page. To do so, we can use programming languages such as:
- Javascript
- Rust
- Python
Using Python to visualize Data¶
Trying Matplotlib¶
The class materials suggest visiting the official Matplotlib website. There, Matplotlib is described as a library that runs within Python and allows us to visualize data.
- I wanted to try a simple example first to generate an (X, Y) plot. I followed the instructions shown at Plot (X,Y), but without introducing math calculations. So I imported the library, provided X and Y values, and finally set different characteristics of the plot:
- Figure size
- Data point marker
- Plot title
- X and Y axis labels
- Grid
In [4]:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 14, 12, 18, 20]
plt.figure(figsize=(10,7))
plt.plot(x, y, marker="o")
plt.title("Simple Line Plot")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.grid(True)
plt.show()
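Since the goal of the assignment is to show the plot on a web page, one possible next step is to save the figure as an image file. The following is only a minimal sketch: the "Agg" backend, the temporary folder and the file name `simple_line_plot.png` are assumptions made for illustration and are not part of the class material.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: render to a file without opening a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 14, 12, 18, 20]

fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(x, y, marker="o")
ax.set_title("Simple Line Plot")
ax.set_xlabel("X values")
ax.set_ylabel("Y values")
ax.grid(True)

# Hypothetical output location; any folder served by the web page would work.
out_path = os.path.join(tempfile.gettempdir(), "simple_line_plot.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
```

Inside a notebook the `matplotlib.use("Agg")` line can probably be dropped, since the figure is already rendered inline.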
- Then I wanted to plot a histogram, following the instructions shown at Hist (X), this time introducing math calculations. So I imported the Matplotlib and NumPy libraries, generated 1000 random values drawn from a standard normal distribution (np.random.randn), and finally set different characteristics of the plot:
- Bins: sets the number of bins to plot
- Alpha: sets the color transparency
- alpha=1 → solid color
- alpha=0.7 → slight transparency
- alpha=0.3 → more transparency
- alpha=0 → invisible
In [11]:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
plt.figure(figsize=(10,7))
plt.hist(data, bins=100, color="pink", alpha=0.7)
plt.title("Histogram of Random Data")
plt.show()
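To see what the bins parameter does without drawing anything, the counts can be inspected with np.histogram, which applies the same binning rule as plt.hist. This is a small sketch; the fixed seed and the use of `default_rng` are assumptions added here so that the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumption: fixed seed for repeatability
data = rng.standard_normal(1000)

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(data, bins=100)

print(len(counts))   # 100 bins
print(len(edges))    # 101 edges, one more than the number of bins
print(counts.sum())  # 1000, every value falls in exactly one bin
```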
Trying Matplotlib with my Data¶
- I wanted to use my thesis pilot data to plot and visualize some characteristics of entrepreneurs. To do that, I needed to load the CSV file and then make sure that pandas reads the data correctly. At first I had a problem visualizing the table, because I used "," as the separator instead of ";".
In [2]:
import pandas as pd
df = pd.read_csv("datasets/2nd_Class_Assignmt_Data/Entrepreneurs.csv", sep=",")
df.head()
Out[2]:
| | Marca temporal;NOM;GEN;EAGE;FOUND;CAGE1;AFOUND;CBASED;CSECT;EEXP;EEDUC;INVT;MNGEXP;WEXP;SEBCK;FRUG1;FRUG2;FRUG3;FRUG4;FRUG5;FRUG6;FRUG7;BRIC1;BRIC2;BRIC3;BRIC4;BRIC5;BRIC6;BRIC7;BRIC8;INNOV1;INNOV2;INNOV3;INNOV4;CAGE2;TECHBS;ETEAM;EAOS;SEEDF;OPERF;INCC |
|---|---|
| 0 | 4/4/2025 18:10:28;iFurniture ;2;35;1;2;1;2;9;1... |
| 1 | 4/6/2025 13:09:46;Salvy Natural - Indes Perú ;... |
| 2 | 4/7/2025 16:07:37;AVR Technology;1;23;1;2;1;2;... |
| 3 | 4/7/2025 21:49:59;AIO SENSORS ;1;32;1;1;1;3;9;... |
| 4 | 4/8/2025 17:54:07;Face Me;1;30;1;2;1;3;5;0;1;1... |
- I corrected the separator to ";" and everything ran smoothly.
In [17]:
import pandas as pd
df = pd.read_csv("datasets/2nd_Class_Assignmt_Data/Entrepreneurs.csv", sep=";")
df.head()
Out[17]:
| | Marca temporal | NOM | GEN | EAGE | FOUND | CAGE1 | AFOUND | CBASED | CSECT | EEXP | ... | INNOV2 | INNOV3 | INNOV4 | CAGE2 | TECHBS | ETEAM | EAOS | SEEDF | OPERF | INCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4/4/2025 18:10:28 | iFurniture | 2 | 35 | 1 | 2 | 1 | 2 | 9 | 1 | ... | 4 | 2 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 4/6/2025 13:09:46 | Salvy Natural - Indes Perú | 2 | 37 | 1 | 2 | 1 | 2 | 12 | 1 | ... | 5 | 5 | 5 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 4/7/2025 16:07:37 | AVR Technology | 1 | 23 | 1 | 2 | 1 | 2 | 15 | 0 | ... | 4 | 4 | 4 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 4/7/2025 21:49:59 | AIO SENSORS | 1 | 32 | 1 | 1 | 1 | 3 | 9 | 0 | ... | 4 | 4 | 4 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| 4 | 4/8/2025 17:54:07 | Face Me | 1 | 30 | 1 | 2 | 1 | 3 | 5 | 0 | ... | 4 | 4 | 4 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
5 rows × 41 columns
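The separator problem can be reproduced on a tiny in-memory sample. The sketch below uses made-up rows (not the real survey data) and also shows `sep=None` with `engine="python"`, which asks pandas to detect the delimiter by itself.

```python
import io

import pandas as pd

# Hypothetical two-row sample that imitates a semicolon-separated file.
sample = "NOM;GEN;EAGE\nCompany A;2;35\nCompany B;1;23\n"

wrong = pd.read_csv(io.StringIO(sample), sep=",")
right = pd.read_csv(io.StringIO(sample), sep=";")
sniffed = pd.read_csv(io.StringIO(sample), sep=None, engine="python")

print(wrong.shape)    # (2, 1): everything lands in a single column
print(right.shape)    # (2, 3)
print(sniffed.shape)  # (2, 3)
```

A quick look at `df.shape` or `df.columns` right after loading is therefore a cheap way to notice a wrong separator.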
- Now I want to plot a histogram that shows the entrepreneurs' age distribution.
In [22]:
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
plt.hist(df["EAGE"].dropna(), bins=12, color="teal", edgecolor="black", alpha=0.7)
plt.title("Distribution of Entrepreneur Age", fontsize=14)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.grid(True, alpha=0.3)
plt.show()
- However, the automatic bins did not make it clear which ages fall into each bar. So I defined the bin edges explicitly at the beginning and then used the same edges as the x-axis ticks (plt.xticks).
In [19]:
import matplotlib.pyplot as plt
import numpy as np
# Define the bin edges
bins = [20, 25, 30, 35, 40, 45, 50, 55]
plt.figure(figsize=(8,5))
plt.hist(df["EAGE"].dropna(), bins=bins, color="teal", edgecolor="black", alpha=0.7)
# Use the same edges as the x-axis ticks
plt.xticks(bins)
plt.title("Distribution of Entrepreneur Age", fontsize=14)
plt.xlabel("Age (bin edges)")
plt.ylabel("Frequency")
plt.grid(True, alpha=0.3)
plt.show()
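The same bin edges can also be used to get the counts per age group as a table. This is a sketch with hypothetical ages (not the survey values), using pd.cut. With `right=False` each interval is closed on the left, for example [20, 25), which is how plt.hist treats its bins, except that plt.hist also includes the right edge in the last bin.

```python
import pandas as pd

# Hypothetical ages, used only to illustrate the binning.
ages = pd.Series([23, 25, 30, 32, 35, 37, 41, 52])
bins = [20, 25, 30, 35, 40, 45, 50, 55]

# right=False makes each interval closed on the left: [20, 25), [25, 30), ...
age_groups = pd.cut(ages, bins=bins, right=False)

# sort_index() lists the groups in age order, including empty ones.
per_bin = age_groups.value_counts().sort_index()
print(per_bin)
```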
- Then I wanted to plot gender, considering that my data is coded as:
- 1 → Male
- 2 → Female
- 3 → Other / Prefer not to say
- I need to map the numeric codes in the GEN column to text labels and store them in a new column, GEN_label.
In [23]:
gender_map = {1: "Male", 2: "Female", 3: "Other"}
df["GEN_label"] = df["GEN"].map(gender_map)
- Then I needed to make sure that all the data is recognized (39 surveyed entrepreneurs), so I used value_counts().
In [24]:
gender_counts = df["GEN_label"].value_counts()
gender_counts
Out[24]:
GEN_label
Male      27
Female    12
Name: count, dtype: int64
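One detail worth checking at this step: value_counts() skips missing values by default, and map() returns NaN for any code that is not in the dictionary. The sketch below uses a hypothetical mini table (with an invalid code 9 added on purpose) to show how `dropna=False` makes such cases visible.

```python
import pandas as pd

# Hypothetical codes; 9 is deliberately not in the mapping.
df_demo = pd.DataFrame({"GEN": [1, 2, 1, 9]})
gender_map = {1: "Male", 2: "Female", 3: "Other"}
df_demo["GEN_label"] = df_demo["GEN"].map(gender_map)

# dropna=False keeps unmapped codes visible as NaN instead of hiding them.
counts_all = df_demo["GEN_label"].value_counts(dropna=False)
n_unmapped = int(df_demo["GEN_label"].isna().sum())

print(counts_all)
print(n_unmapped)  # 1
```

If the number of unmapped rows is 0 and the counts add up to the number of rows, every answer was recognized.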
- Here, I created a bar chart
In [25]:
import matplotlib.pyplot as plt
plt.figure(figsize=(7,5))
plt.bar(gender_counts.index, gender_counts.values,
        color=["#4B8BBE", "#F89C74", "#9C27B0"],  # nice colors
        edgecolor="black")
plt.title("Entrepreneurs by Gender", fontsize=14)
plt.xlabel("Gender")
plt.ylabel("Count")
plt.grid(axis="y", alpha=0.3)
plt.show()
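The color list is matched to the bars by position, and value_counts() orders the labels by frequency, so the color of each gender depends on the counts. A possible way to keep one fixed color per label is sketched below. The dictionary name `gender_colors` is hypothetical, and the counts are typed in from the output shown earlier.

```python
import pandas as pd

# Counts as reported in the earlier output (27 male, 12 female).
gender_counts = pd.Series({"Male": 27, "Female": 12}, name="count")

# Hypothetical mapping from label to color (same colors as in the plot).
gender_colors = {"Male": "#4B8BBE", "Female": "#F89C74", "Other": "#9C27B0"}

# Build the color list in the same order as the bars.
bar_colors = [gender_colors[label] for label in gender_counts.index]
print(bar_colors)  # ['#4B8BBE', '#F89C74']
```

The resulting list could then be passed as `color=bar_colors` to plt.bar.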
- I also created a pie chart
In [26]:
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6))
plt.pie(
    gender_counts,
    labels=gender_counts.index,
    autopct="%1.1f%%",  # show percentages
    startangle=90,  # start from the top
    colors=["#4B8BBE", "#F89C74", "#9C27B0"],  # elegant colors
    explode=[0.03]*len(gender_counts)  # slight separation between slices
)
plt.title("Gender Distribution of Entrepreneurs", fontsize=14)
plt.show()
- Finally, we can try to put all the code together in one notebook. So I asked ChatGPT in Spanish: "Give me the code to make IPython map the numeric gender values by creating a new label, count the gender values from that label, and use the counts to plot a pie chart". I copied and pasted the code, but it resulted in an error.
In [1]:
# 1. Map the numeric gender values to text labels
gender_map = {1: "Male", 2: "Female", 3: "Other"}
df["GEN_label"] = df["GEN"].map(gender_map)
# 2. Count each category
gender_counts = df["GEN_label"].value_counts()
# 3. Pie chart
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6))
plt.pie(
    gender_counts,
    labels=gender_counts.index,
    autopct="%1.1f%%",
    startangle=90,
    colors=["#4B8BBE", "#F89C74", "#9C27B0"],
    explode=[0.05]*len(gender_counts)  # small separation between slices
)
plt.title("Gender Distribution of Entrepreneurs", fontsize=14)
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 3
      1 # 1. Map the numeric gender values to text labels
      2 gender_map = {1: "Male", 2: "Female", 3: "Other"}
----> 3 df["GEN_label"] = df["GEN"].map(gender_map)
      5 # 2. Count each category
      6 gender_counts = df["GEN_label"].value_counts()

NameError: name 'df' is not defined
- The error appeared because this was a new notebook session and df had not been loaded yet. I added the code that reads the CSV file at the top of the cell and ran it again.
In [3]:
import pandas as pd
df = pd.read_csv("datasets/2nd_Class_Assignmt_Data/Entrepreneurs.csv", sep=";")
df.head()
# 1. Map the numeric gender values to text labels
gender_map = {1: "Male", 2: "Female", 3: "Other"}
df["GEN_label"] = df["GEN"].map(gender_map)
# 2. Count each category
gender_counts = df["GEN_label"].value_counts()
# 3. Pie chart
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6))
plt.pie(
    gender_counts,
    labels=gender_counts.index,
    autopct="%1.1f%%",
    startangle=90,
    colors=["#4B8BBE", "#F89C74", "#9C27B0"],
    explode=[0.05]*len(gender_counts)  # small separation between slices
)
plt.title("Gender Distribution of Entrepreneurs", fontsize=14)
plt.show()
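One way to avoid depending on variables left over from earlier cells is to wrap the steps in a single function that loads the data itself. This is only a sketch: `plot_gender_pie` is a hypothetical helper name, the "Agg" backend is an assumption so that the code runs without a display, and the in-memory sample stands in for the real Entrepreneurs.csv file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: no display needed for this sketch
import matplotlib.pyplot as plt
import pandas as pd


def plot_gender_pie(csv_source, sep=";"):
    """Load the survey, label the GEN codes and draw a pie chart.

    plot_gender_pie is a hypothetical helper, not part of any library.
    """
    df = pd.read_csv(csv_source, sep=sep)
    gender_map = {1: "Male", 2: "Female", 3: "Other"}
    counts = df["GEN"].map(gender_map).value_counts()

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.pie(counts, labels=counts.index, autopct="%1.1f%%", startangle=90)
    ax.set_title("Gender Distribution of Entrepreneurs", fontsize=14)
    return counts, fig


# Hypothetical in-memory sample standing in for Entrepreneurs.csv.
sample = io.StringIO("NOM;GEN\nA;1\nB;2\nC;1\n")
counts, fig = plot_gender_pie(sample)
print(counts)
plt.close(fig)
```

With the real data, the call would take the file path used in the cells above instead of the in-memory sample.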