Yuichi TAMIYA - Fab Futures 2025 - Data Science
Home About Tips

< Home

2. Tools¶

Assignment¶

  • Visualize your data set(s)

Prompt (ChatGPT) What is a dataset in data science?

Answer
In data science, a dataset is a collection of data that is organized for analysis or machine learning.

  1. Basic Definition
  • A collection of data: It can include numbers, text, images, audio, or other formats.
  • Structured format: Usually organized into rows (records) and columns (features/attributes).
    • Example: Tables in CSV, Excel, SQL, folders of images, JSON files, etc.
  1. Components of a Dataset
  • Samples (instances): Each individual data point (row) for analysis.
    • Example: A record of a person’s height, weight, and age.
  • Features (variables): Attributes or properties of the samples (columns).
    • Example: height, weight, age, gender.
  • Labels (if applicable): The target or correct answer used in supervised learning.
    • Example: Whether a person has a disease (Yes/No), or product category.
  1. Types of Datasets
  • Structured datasets: Organized in tables with numbers or text (CSV, Excel, SQL).
  • Unstructured datasets: Images, audio, video, or text that cannot be directly analyzed.
  • Semi-structured datasets: JSON, XML, or similar formats with some structure.
  1. Role in Data Science
  • Modeling: Used to train and evaluate machine learning models or statistical analyses.
  • Visualization: Helps understand patterns and trends in the data.
  • Exploratory analysis: Used to check data quality and decide preprocessing steps.

Summary:
A dataset is a collection of data organized for analysis in data science, usually with rows = samples, columns = features, and sometimes labels.

How to read files¶

[.txt] Open and read .txt file in Python¶

w3schools: Python File Open

In [1]:
## You should always close your files. 
## In some cases, due to buffering, changes made to a file may not show until you close the file.

file = open("data/text.txt")
print(file.read())
file.close()
Hello World!
Hello Python!
In [2]:
## Then you do not have to worry about closing your files,
## the with statement takes care of that.

with open("data/text.txt") as file:
  print(file.read())
Hello World!
Hello Python!
In [3]:
## return one line by using the readline() method:

with open("data/text.txt") as file:
  print(file.readline())
Hello World!

In [4]:
## Read two lines of the file:

with open("data/text.txt") as file:
  print(file.readline())
  print(file.readline())
Hello World!

Hello Python!
In [5]:
### Return the 5 first characters of the file:
with open("data/text.txt") as file:
  print(file.read(5))
Hello

[.csv] Python csv Module¶

w3schools: Python csv Module

reader

  • Return a reader object that will iterate over lines in the CSV file.

DictReader

  • Read CSV data into dictionaries, using the first row as fieldnames.

reader vs DictReader

Aspect reader DictReader
Data handling Access via index in list Access via column name (similar to pandas)
Readability Hard to understand (row[0]) Intuitive (row["name"])
Step toward pandas A bit far Conceptually close to DataFrame
In [6]:
'''
DictReader
'''

import csv

with open("data/data.csv", newline="", encoding="utf-8") as f: ## f=file object
    reader = csv.DictReader(f)
    for row in reader:
        print(row["name"], row["age"])
Jon Snow 25
Daenerys Targaryen 23
Tyrion Lannister 32
Arya Stark 18
Cersei Lannister 42

[.csv][.json][.toml][.xml] Pandas¶

w3schools: Pandas Read CSV

CSV file

In [7]:
import pandas as pd

df = pd.read_csv('data/data.csv')
print(df.to_string()) #Display all rows and columns of a DataFrame as a string
                 name      house  age                         role
0            Jon Snow      Stark   25            King in the North
1  Daenerys Targaryen  Targaryen   23             Queen of Dragons
2    Tyrion Lannister  Lannister   32            Hand of the Queen
3          Arya Stark      Stark   18                     Assassin
4    Cersei Lannister  Lannister   42  Queen of the Seven Kingdoms
In [8]:
import pandas as pd

df = pd.read_csv("data/data.csv")
df.head(3) # Displays the first 5 rows of the DataFrame by default. The number of rows can be specified via an argument
Out[8]:
name house age role
0 Jon Snow Stark 25 King in the North
1 Daenerys Targaryen Targaryen 23 Queen of Dragons
2 Tyrion Lannister Lannister 32 Hand of the Queen

JSON File

In [9]:
import pandas as pd

df = pd.read_json("data/data.json")
print(df.head())
                 name      house  age                         role
0            Jon Snow      Stark   25            King in the North
1  Daenerys Targaryen  Targaryen   23             Queen of Dragons
2    Tyrion Lannister  Lannister   32            Hand of the Queen
3          Arya Stark      Stark   18                     Assassin
4    Cersei Lannister  Lannister   42  Queen of the Seven Kingdoms
In [10]:
import pandas as pd

df = pd.read_json("data/data.json")
df.head()
Out[10]:
name house age role
0 Jon Snow Stark 25 King in the North
1 Daenerys Targaryen Targaryen 23 Queen of Dragons
2 Tyrion Lannister Lannister 32 Hand of the Queen
3 Arya Stark Stark 18 Assassin
4 Cersei Lannister Lannister 42 Queen of the Seven Kingdoms

XML file

In [11]:
import pandas as pd

df = pd.read_xml("data/data.xml", parser="etree")#Use Python’s built-in xml.etree.ElementTree library without installing lxml
print(df.head())
                 name      house  age                         role
0            Jon Snow      Stark   25            King in the North
1  Daenerys Targaryen  Targaryen   23             Queen of Dragons
2    Tyrion Lannister  Lannister   32            Hand of the Queen
3          Arya Stark      Stark   18                     Assassin
4    Cersei Lannister  Lannister   42  Queen of the Seven Kingdoms
In [12]:
import pandas as pd

df = pd.read_xml("data/data.xml", parser="etree") 
df.head()
Out[12]:
name house age role
0 Jon Snow Stark 25 King in the North
1 Daenerys Targaryen Targaryen 23 Queen of Dragons
2 Tyrion Lannister Lannister 32 Hand of the Queen
3 Arya Stark Stark 18 Assassin
4 Cersei Lannister Lannister 42 Queen of the Seven Kingdoms

TOML

In [13]:
import tomllib
import pandas as pd

with open("data/data.toml", "rb") as f: # Use Python’s built-in tomllib library without installing the toml package
    data = tomllib.load(f)

df = pd.DataFrame(data["character"])
print(df.head())
                 name      house  age                         role
0            Jon Snow      Stark   25            King in the North
1  Daenerys Targaryen  Targaryen   23             Queen of Dragons
2    Tyrion Lannister  Lannister   32            Hand of the Queen
3          Arya Stark      Stark   18                     Assassin
4    Cersei Lannister  Lannister   42  Queen of the Seven Kingdoms
In [14]:
import tomllib
import pandas as pd

with open("data/data.toml", "rb") as f:
    data = tomllib.load(f)

df = pd.DataFrame(data["character"])
df.head()
Out[14]:
name house age role
0 Jon Snow Stark 25 King in the North
1 Daenerys Targaryen Targaryen 23 Queen of Dragons
2 Tyrion Lannister Lannister 32 Hand of the Queen
3 Arya Stark Stark 18 Assassin
4 Cersei Lannister Lannister 42 Queen of the Seven Kingdoms

Visualize data set(s)¶

I got some dataset from e-Stat which is a portal site for Japanese Government Statistics.

A free online course using e-Stat will start on January 13, 2026.

  • https://gacco.org/stat-japan3/

Population Pyramid¶

Population Pyramid

I downloaded the following dataset as a CSV file from Other Func:

  • Prefecture: 14 Kanagawa-ken
  • City: 14100 Yokohama-shi

I saved the CSV file (2KB) in datasets folder. All contents of this folder will be ignored because /datasets is listed in the .gitignore file.

.gitignore

.ipynb_checkpoints/
/public
/datasets
In [15]:
import pandas as pd

df = pd.read_csv("datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2025.csv")

df.head() 
Out[15]:
Region Time Age Group Population projection (Male) [person] Annotation Population projection (Female) [person] Annotation.1 Population projection (Both sexes) [person] Annotation.2
0 Kanagawa-ken Yokohama-shi 2025 95+ 4405 NaN 15477 NaN 19882 NaN
1 Kanagawa-ken Yokohama-shi 2025 90--94 18658 NaN 41397 NaN 60055 NaN
2 Kanagawa-ken Yokohama-shi 2025 85--89 43319 NaN 71519 NaN 114838 NaN
3 Kanagawa-ken Yokohama-shi 2025 80--84 70603 NaN 99732 NaN 170335 NaN
4 Kanagawa-ken Yokohama-shi 2025 75--79 101265 NaN 121690 NaN 222955 NaN

Visualizing the dataset¶

Prompt
I can read a CSV and show it with df.head(). How do I visualize the data?

Answer
Generated Code is below

In [16]:
import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV file
df = pd.read_csv("datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2025.csv")

# Select only the necessary columns and rename them
df = df[["Age Group", 
         "Population projection (Male) [person]", 
         "Population projection (Female) [person]"]]
df.columns = ["Age", "Male", "Female"]

# Display the first few rows to check the data
print(df.head())

# Prepare the age and population for males and females
age = df["Age"]
male = df["Male"] * -1  # Make male population negative to plot on the left
female = df["Female"]   # Female population stays positive for the right side

# Create the plot
plt.figure(figsize=(10, 8))

plt.barh(age, male, color="blue", label="Male")
plt.barh(age, female, color="pink", label="Female")

# Set labels and title
plt.xlabel("Population")
plt.ylabel("Age Group")
plt.title("Population Pyramid of Yokohama, 2025")
plt.legend()
plt.tight_layout()  # Adjust layout to prevent label cutoff

# Show the population pyramid
plt.show()
      Age    Male  Female
0     95+    4405   15477
1  90--94   18658   41397
2  85--89   43319   71519
3  80--84   70603   99732
4  75--79  101265  121690
No description has been provided for this image

Compare to official one), I want to adjust the graph:
1.Invert Y axis
2.Change the X axis unit to population from -2.0 ~2.0[1e6] to 200000 to 200,000
2-1. Change from 2x10^6 to 200000
2-2. 200000 to 200,000
3.Remove the row that contains the total (the last row in CSV)

To do this:
1.Add plt.gca().invert_yaxis() after plt.legend()
2-1. Add plt.ticklabel_format(style='plain', axis='x')
2-2. Add import matplotlib.ticker as ticker
plt.gca().xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: f"{int(x):,}"))
3. Addd df = df.iloc[:-1]

In [17]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker ##2.2.Add comma formatting to the x-axis


# Read the CSV file
df = pd.read_csv("datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2025.csv")

# Select only the necessary columns and rename them
df = df[["Age Group", 
         "Population projection (Male) [person]", 
         "Population projection (Female) [person]"]]
df.columns = ["Age", "Male", "Female"]

# Remove the last row
print(df.tail(1)) # print the last raw before remove it
df = df.iloc[:-1] # Remove the last row because it contains the total # --------------3.Remove the last row that contains the total
print(df.tail(1)) # print the last raw after remove it

# Display the first few rows to check the data
# print(df.head()) # stop printing

# Prepare the age and population for males and females
age = df["Age"]
male = df["Male"] * -1  # Make male population negative to plot on the left
female = df["Female"]   # Female population stays positive for the right side

# Create the plot
plt.figure(figsize=(10, 8))

plt.barh(age, male, color="blue", label="Male")
plt.barh(age, female, color="pink", label="Female")

# Set labels and title
plt.xlabel("Population")
plt.ylabel("Age Group")
plt.title("Population Pyramid of Yokohama, 2025")
plt.legend()
plt.gca().invert_yaxis()  #--------------------------------------- 1.invert Y axis

plt.ticklabel_format(style='plain', axis='x')  #-------------------2-1.Disable scientific notation on the x-axis
plt.gca().xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: f"{int(x):,}"))#------2.2.Add comma formatting to the x-axis


plt.tight_layout()  # Adjust layout to prevent label cutoff

# Show the population pyramid
plt.show()
         Age     Male   Female
20  All ages  1863429  1923273
     Age   Male  Female
19  0--4  63865   60757
No description has been provided for this image

Prompt(ChatGPT)
Male is plotted as a negative value, but the X-axis label is overwritten with a positive value.

Answer
Generated Code is below

In [18]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Read the CSV file
df = pd.read_csv("datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2025.csv")

# Select only the necessary columns and rename them
df = df[["Age Group", 
         "Population projection (Male) [person]", 
         "Population projection (Female) [person]"]]
df.columns = ["Age", "Male", "Female"]

# 3.Remove the last row
df = df.iloc[:-1]

# Prepare the age and population for males and females
age = df["Age"]
male = df["Male"] * -1 # Make male population negative to plot on the left
female = df["Female"] # Female population stays positive for the right side

# Create the plot
plt.figure(figsize=(10, 8))

plt.barh(age, male, color="blue", label="Male")
plt.barh(age, female, color="pink", label="Female")

# Set labels and title
plt.xlabel("Population")
plt.ylabel("Age Group")
plt.title("Population Pyramid of Yokohama, 2025")
plt.legend()
plt.gca().invert_yaxis() # 1.Invert Y axis

# Format x-axis: positive numbers even for negative bars
plt.gca().xaxis.set_major_formatter(
    ticker.FuncFormatter(lambda x, pos: f"{int(abs(x)):,}") # 2.convert to integer and add commas
)

plt.tight_layout() # Adjust layout
plt.show()
No description has been provided for this image

Prompt(ChatGPT)
I want to load datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2050.csv as well and use a toggle switch to switch between the plots. I also want to clearly indicate the year being displayed (2025 or 2050). Since changing the x-axis scale makes the graph hard to read, I want to fix the x-axis scale based on the 2025 data.

Answer
Generated Code is below

In [19]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import ipywidgets as widgets
from IPython.display import display

# Function to load CSV
def load_population_csv(file_path):
    df = pd.read_csv(file_path)
    df = df[["Age Group", 
             "Population projection (Male) [person]", 
             "Population projection (Female) [person]"]]
    df.columns = ["Age", "Male", "Female"]
    df = df.iloc[:-1]  # Remove the last row containing the total
    return df

# Load data
df_2025 = load_population_csv("datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2025.csv")
df_2050 = load_population_csv("datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2050.csv")

# Fix x-axis range based on 2025 data
max_population = max(df_2025["Male"].max(), df_2025["Female"].max())

# Plot function
def plot_population(year):
    plt.figure(figsize=(10, 8))
    
    if year == 2025:
        df = df_2025
    else:
        df = df_2050
    
    age = df["Age"]
    male = df["Male"] * -1  # Make male population negative for left side
    female = df["Female"]     # Female population stays positive for right side
    
    plt.barh(age, male, color="blue", label="Male")
    plt.barh(age, female, color="pink", label="Female")
    
    plt.xlabel("Population")
    plt.ylabel("Age Group")
    plt.title(f"Population Pyramid of Yokohama, {year}")
    plt.legend()
    plt.gca().invert_yaxis()  # Invert y-axis so youngest at the bottom
    
    # Format x-axis labels as positive integers with commas
    plt.gca().xaxis.set_major_formatter(
        ticker.FuncFormatter(lambda x, pos: f"{int(abs(x)):,}")
    )
    
    # Fix x-axis range based on 2025 data
    plt.xlim(-max_population, max_population)
    
    plt.tight_layout()
    plt.show()

# Create toggle button for year selection
toggle = widgets.ToggleButtons(
    options=[2025, 2050],
    description='Year:',
)
widgets.interact(plot_population, year=toggle)
interactive(children=(ToggleButtons(description='Year:', options=(2025, 2050), value=2025), Output()), _dom_cl…
Out[19]:
<function __main__.plot_population(year)>

Prompt(ChatGPT)
I want to load datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2035.csv and 2045.csv as well, and use a slider to switch between the plots. I also want to clearly indicate the year currently being displayed.

Answer
Generated Code is below

In [20]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import ipywidgets as widgets
from IPython.display import display

# Function to load CSV
def load_population_csv(file_path):
    df = pd.read_csv(file_path)
    df = df[["Age Group", 
             "Population projection (Male) [person]", 
             "Population projection (Female) [person]"]]
    df.columns = ["Age", "Male", "Female"]
    df = df.iloc[:-1]  # Remove the last row containing the total
    return df

# Load all data
dfs = {
    2025: load_population_csv("datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2025.csv"),
    2035: load_population_csv("datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2035.csv"),
    2045: load_population_csv("datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2045.csv"),
    2050: load_population_csv("datasets/Population Pyramid_Kanagawa-ken Yokohama-shi_2050.csv"),
}

# Fix x-axis range based on 2025 data
max_population = max(dfs[2025]["Male"].max(), dfs[2025]["Female"].max())

# Plot function
def plot_population(year):
    plt.figure(figsize=(10, 8))
    
    df = dfs[year]
    
    age = df["Age"]
    male = df["Male"] * -1  # Make male population negative for left side
    female = df["Female"]     # Female population stays positive for right side
    
    plt.barh(age, male, color="blue", label="Male")
    plt.barh(age, female, color="pink", label="Female")
    
    plt.xlabel("Population")
    plt.ylabel("Age Group")
    plt.title(f"Population Pyramid of Yokohama, {year}")
    plt.legend()
    plt.gca().invert_yaxis()  # Invert y-axis so youngest at the bottom
    
    # Format x-axis labels as positive integers with commas
    plt.gca().xaxis.set_major_formatter(
        ticker.FuncFormatter(lambda x, pos: f"{int(abs(x)):,}")
    )
    
    # Fix x-axis range based on 2025 data
    plt.xlim(-max_population, max_population)
    
    plt.tight_layout()
    plt.show()

# Create a slider for year selection
slider = widgets.IntSlider(
    value=2025,
    min=2025,
    max=2050,
    step=10,  # steps: 2025, 2035, 2045, 2050
    description='Year:',
    continuous_update=False
)

widgets.interact(plot_population, year=slider)
interactive(children=(IntSlider(value=2025, continuous_update=False, description='Year:', max=2050, min=2025, …
Out[20]:
<function __main__.plot_population(year)>

Number of Pilgrims by Gender¶

In [27]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('datasets/finalproject/gender.csv')

print(df.head())
   Gender    2004   2005   2006   2007   2008   2009    2010    2011    2012  \
0    male  100431  55706  59416  66780  72936  85946  151706  105831  108809   
1  female   79513  38218  40961  47246  52205  59932  120429   77535   83679   

     2016    2017    2018    2019   2020   2021    2022    2023    2024  Ratio  
0  144067  153169  162542  169777  30221  90128  207801  209593  210422  51.9%  
1  133841  147867  164836  177801  23923  88784  230506  236445  245677  48.1%  
In [55]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('datasets/finalproject/gender.csv')
print("df.head")
print(df.head())

df = df.set_index("Gender")

# Extract column names from df.columns that consist only of digits and store them in a list.
#years = [c for c in df.columns if c.isdigit()]

years = []
for c in df.columns:
    if c.isdigit():
        years.append(c)

df_years = df[years]

print("df_years.head()")
print(df_years.head())

plt.figure(figsize=(10,5)) # X-Axis 10inch, Y-Axis 5inch

plt.plot(years, df.loc["male", years],marker="o", label="Male")
plt.plot(years, df.loc["female", years],marker="o", label="Female")

plt.title("Number of Pilgrims by Gender (2004-2024)")
plt.xlabel("Year")
plt.ylabel("Number of People")
plt.legend(loc="upper left",title="Gender",frameon=True) #判例
#plt.grid(alpha=0.3)
#plt.tight_layout()
plt.show()
df.head
   Gender    2004   2005   2006   2007   2008   2009    2010    2011    2012  \
0    male  100431  55706  59416  66780  72936  85946  151706  105831  108809   
1  female   79513  38218  40961  47246  52205  59932  120429   77535   83679   

     2016    2017    2018    2019   2020   2021    2022    2023    2024  Ratio  
0  144067  153169  162542  169777  30221  90128  207801  209593  210422  51.9%  
1  133841  147867  164836  177801  23923  88784  230506  236445  245677  48.1%  
df_years.head()
          2004   2005   2006   2007   2008   2009    2010    2011    2012  \
Gender                                                                      
male    100431  55706  59416  66780  72936  85946  151706  105831  108809   
female   79513  38218  40961  47246  52205  59932  120429   77535   83679   

          2016    2017    2018    2019   2020   2021    2022    2023    2024  
Gender                                                                        
male    144067  153169  162542  169777  30221  90128  207801  209593  210422  
female  133841  147867  164836  177801  23923  88784  230506  236445  245677  
No description has been provided for this image

Unplot COVID years¶

In [59]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('datasets/finalproject/gender.csv')
print("df.head")
print(df.head())

df = df.set_index("Gender")

exclude_years = ["2020", "2021"] # COVID
years = [c for c in df.columns if c.isdigit() and c not in exclude_years]

df_years = df[years]

print("df_years.head()")
print(df_years.head())

plt.figure(figsize=(10,5)) # X-Axis 10inch, Y-Axis 5inch

plt.plot(years, df.loc["male", years],marker="o", label="Male")
plt.plot(years, df.loc["female", years],marker="o", label="Female")

plt.title("Number of Pilgrims by Gender (2004-2024)")
plt.xlabel("Year")
plt.ylabel("Number of People")
#plt.legend()
plt.legend(loc="upper left",title="Gender",frameon=True) #判例
#plt.grid(alpha=0.3)
#plt.tight_layout()
plt.show()
df.head
   Gender    2004   2005   2006   2007   2008   2009    2010    2011    2012  \
0    male  100431  55706  59416  66780  72936  85946  151706  105831  108809   
1  female   79513  38218  40961  47246  52205  59932  120429   77535   83679   

     2016    2017    2018    2019   2020   2021    2022    2023    2024  Ratio  
0  144067  153169  162542  169777  30221  90128  207801  209593  210422  51.9%  
1  133841  147867  164836  177801  23923  88784  230506  236445  245677  48.1%  
df_years.head()
          2004   2005   2006   2007   2008   2009    2010    2011    2012  \
Gender                                                                      
male    100431  55706  59416  66780  72936  85946  151706  105831  108809   
female   79513  38218  40961  47246  52205  59932  120429   77535   83679   

          2016    2017    2018    2019    2022    2023    2024  
Gender                                                          
male    144067  153169  162542  169777  207801  209593  210422  
female  133841  147867  164836  177801  230506  236445  245677  
No description has been provided for this image