[Your-Name-Here] - Fab Futures - Data Science

Week 1: Introduction to the course¶

What data set did I choose?¶

For this class, I chose to work with the greenhouse gas (GHG) emission factor dataset loaded below.

  • The dataset provides emission factors for 1,016 U.S. commodities, defined at the 6-digit level of the North American Industry Classification System (NAICS).
  • The factors are spend-based: kilograms of CO2-equivalent (kg CO2e) emitted per U.S. dollar spent, expressed in 2022 dollars at purchaser prices.
  • For each commodity (NAICS-6), the dataset includes three factors:
    • SEF - Supply Chain Emission Factors without Margins: the emissions from producing the commodity, up to the point where it leaves the producer
    • MEF - Margins of Supply Chain Emission Factors: the additional emissions from getting the product to the buyer (for example, distribution)
    • SEF + MEF - Supply Chain Emission Factors with Margins
In [1]:
import pandas as pd
df = pd.read_csv("datasets/GHGEmissionFactors_1.csv")
df.head(10)
Out[1]:
|   | 2017 NAICS Code | 2017 NAICS Title | GHG | Unit | Supply Chain Emission Factors without Margins | Margins of Supply Chain Emission Factors | Supply Chain Emission Factors with Margins | Reference USEEIO Code |
|---|---|---|---|---|---|---|---|---|
| 0 | 111110 | Soybean Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.488 | 0.044 | 0.532 | 1111A0 |
| 1 | 111120 | Oilseed (except Soybean) Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.488 | 0.044 | 0.532 | 1111A0 |
| 2 | 111130 | Dry Pea and Bean Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 3 | 111140 | Wheat Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 4 | 111150 | Corn Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 5 | 111160 | Rice Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 6 | 111191 | Oilseed and Grain Combination Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 7 | 111199 | All Other Grain Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 8 | 111211 | Potato Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.591 | 0.041 | 0.631 | 111200 |
| 9 | 111219 | Other Vegetable (except Potato) and Melon Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.591 | 0.041 | 0.631 | 111200 |
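
As a quick consistency check on the three factor columns, the sketch below re-enters three of the displayed rows by hand and verifies that the factor with margins is, up to rounding, the sum of the other two. It also shows how a spend-based factor might be applied to a purchase. The short column names (`sef`, `mef`, `sef_mef`), the helper `estimate_emissions`, and the 1,000 USD spend are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Three rows copied by hand from the df.head(10) output.
# The short column names are illustrative, not the dataset's own headers.
sample = pd.DataFrame(
    {
        "naics_title": ["Soybean Farming", "Wheat Farming", "Potato Farming"],
        "sef": [0.488, 0.809, 0.591],      # factors without margins
        "mef": [0.044, 0.040, 0.041],      # margins
        "sef_mef": [0.532, 0.848, 0.631],  # factors with margins
    }
)

# The published values are rounded to three decimals, so allow a small tolerance.
sample["gap"] = (sample["sef"] + sample["mef"] - sample["sef_mef"]).abs()
consistent = bool((sample["gap"] <= 0.0015).all())


def estimate_emissions(spend_usd: float, factor: float) -> float:
    """Spend-based estimate: dollars spent times kg CO2e per dollar."""
    return spend_usd * factor


# Hypothetical purchase: 1,000 USD (2022 dollars) of soybean farming output.
soy_factor = float(sample.loc[0, "sef_mef"])
soy_kg = estimate_emissions(1000.0, soy_factor)

print(sample)
print("Columns consistent within rounding:", consistent)
print("Estimated kg CO2e for 1,000 USD of soybeans:", round(soy_kg, 1))
```

The tolerance is needed because some displayed rows differ in the last digit (for example 0.809 + 0.040 versus 0.848), which is consistent with each column being rounded separately.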

Why does the data interest me?¶

I chose this dataset because greenhouse gases and global warming are important topics worldwide. The data comes from the United States and provides detailed emission factors for different industries. Using this dataset, I want to explore which industries have the highest greenhouse gas emissions and which have the lowest, and to examine how these patterns might change in the next few years. This analysis will help me understand patterns in greenhouse gas emissions and the role of different sectors.
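
A minimal sketch of the "highest and lowest" comparison, using pandas `nlargest` and `nsmallest` on three rows copied by hand from the output above. The column name `with_margins` is a shortened, illustrative stand-in for the dataset's "Supply Chain Emission Factors with Margins" header; the same calls should work on the full `df` with the real column name.

```python
import pandas as pd

# Rows copied by hand from the df.head(10) output; `with_margins` is an
# illustrative short name, not the dataset's own header.
sample = pd.DataFrame(
    {
        "naics_title": [
            "Soybean Farming",
            "Dry Pea and Bean Farming",
            "Potato Farming",
        ],
        "with_margins": [0.532, 0.848, 0.631],
    }
)

# Highest and lowest emission intensity (kg CO2e per 2022 USD) in the sample.
highest = sample.nlargest(1, "with_margins")
lowest = sample.nsmallest(1, "with_margins")

print(highest)
print(lowest)
```

Because the factors are intensities (kg CO2e per dollar spent), a ranking like this identifies the most emission-intensive commodities per dollar. Finding the largest total emitters would also require spending or output data, which this dataset does not include.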

U.S. Environmental Protection Agency. (2022). Supply chain greenhouse gas emission factors v1.3 by NAICS-6 [Data set]. Data.gov. https://catalog.data.gov/dataset/supply-chain-greenhouse-gas-emission-factors-v1-3-by-naics-6

Choosing a New Dataset¶

Hello, back here again.¶

I am updating this page after my fourth data science class. I initially tried to model greenhouse gas emission factors by sector, but the sectors are categorical: sector names and sector codes label discrete groups rather than points on a numeric or time axis. Without a continuous independent variable there is no meaningful functional relationship to fit, so the dataset is poorly suited to curve fitting, Gaussian modelling, or time-series prediction.
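
To illustrate the point, the sketch below keeps NAICS codes as string labels and summarises the factors by group, which is the kind of operation categorical data does support. The rows are copied by hand from the earlier output; the column names and the choice of a 4-digit prefix as the grouping key are illustrative assumptions.

```python
import pandas as pd

# NAICS codes are stored as strings because they are labels, not quantities:
# code 111140 is not "30 more" than code 111110 in any physical sense.
sample = pd.DataFrame(
    {
        "naics_code": ["111110", "111140", "111211"],
        "with_margins": [0.532, 0.848, 0.631],  # kg CO2e per 2022 USD
    }
)

# Group by the first four digits of the code (an illustrative grouping level).
sample["naics_4"] = sample["naics_code"].str[:4]
group_means = sample.groupby("naics_4")["with_margins"].mean()

print(group_means)
```

Grouped summaries and rankings make sense here, whereas fitting a curve of emission factor against the code value would treat an arbitrary labelling scheme as if it were a measurement.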

Bitcoin 1-minute price data, by contrast, is a continuous numerical time series. It has consistent timestamps, measurable trends, volatility, and statistical structure, which makes it appropriate for machine learning, regression, forecasting, and curve fitting.

For this reason, I chose BTCUSD minute-by-minute data as my dataset for analysis.
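
The properties listed above can be checked directly on a minute-indexed series. The sketch below uses a seeded random walk as a synthetic stand-in for 1-minute closing prices (it is not real Bitcoin data), and the 50,000 starting level, the step size, and the 60-minute window are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for 1-minute BTCUSD closes: a seeded random walk, NOT real prices.
rng = np.random.default_rng(0)
index = pd.date_range("2025-01-01 00:00", periods=240, freq="min")
log_steps = rng.normal(loc=0.0, scale=0.0005, size=len(index))
close = pd.Series(50_000.0 * np.exp(np.cumsum(log_steps)), index=index, name="close")

# Consistent timestamps: every gap between rows should be exactly one minute.
gaps = index.to_series().diff().dropna()
evenly_spaced = bool((gaps == pd.Timedelta(minutes=1)).all())

# Minute-to-minute returns and a rolling 60-minute volatility estimate.
returns = close.pct_change().dropna()
rolling_vol = returns.rolling(window=60).std()

# Coarser hourly view: the last close in each 60-minute bucket.
hourly_close = close.resample("60min").last()

print("Evenly spaced:", evenly_spaced)
print(hourly_close)
```

With a real minute-level file, the same checks would apply once the timestamp column has been parsed into a `DatetimeIndex`.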

In [3]:
import yfinance as yf
import pandas as pd

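# Define the ticker symbol and date range.
# Caution: on Yahoo Finance the Bitcoin/US dollar pair is listed as 'BTC-USD'.
# The plain symbol 'BTC' used here resolves to a different security, which is
# why the output starts on 2024-07-31 with prices near 29 USD.
ticker_symbol = 'BTC'
start_date = '2020-12-31'
end_date = '2025-12-02'

# Fetch the data (no interval is given, so yfinance returns daily bars, not 1-minute bars)
stock_data = yf.download(ticker_symbol, start=start_date, end=end_date)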

# Display the first few rows of the data
print(stock_data.head(10))
/tmp/ipykernel_21288/2121163880.py:10: FutureWarning: YF.download() has changed argument auto_adjust default to True
  stock_data = yf.download(ticker_symbol, start=start_date, end=end_date)
[*********************100%***********************]  1 of 1 completed
Price           Close       High        Low       Open   Volume
Ticker            BTC        BTC        BTC        BTC      BTC
Date                                                           
2024-07-31  28.950001  29.650000  28.799999  29.500000   930940
2024-08-01  28.100000  28.799999  27.600000  28.650000  8564500
2024-08-02  27.799999  29.049999  27.650000  28.750000  2412560
2024-08-05  23.750000  24.650000  22.000000  22.049999  3141140
2024-08-06  25.250000  25.350000  24.150000  24.500000  1690140
2024-08-07  24.299999  25.549999  24.250000  25.450001  1267920
2024-08-08  26.350000  26.575001  25.200001  25.650000   977760
2024-08-09  26.900000  27.125000  26.440001  26.850000  1328060
2024-08-12  26.200001  26.924999  25.650000  26.450001   751200
2024-08-13  27.049999  27.325001  26.100000  26.150000  1069980
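
The output above begins on 2024-07-31 even though the request started at 2020-12-31, and the prices are far from Bitcoin's spot price. A small sanity check like the sketch below can catch this kind of mismatch. The helper `starts_near` and its 7-day tolerance are hypothetical, and the three closes are copied by hand from the output. On Yahoo Finance the Bitcoin/US dollar pair is listed as `BTC-USD`, and `yf.download` accepts an `interval` argument (for example `"1m"`), though intraday history there appears to be limited to a short recent window, so a multi-year minute-level history would likely need another source such as the Kaggle dataset in the references.

```python
import pandas as pd

# Close prices copied by hand from the first three rows of the printed output.
closes = pd.Series(
    [28.950001, 28.100000, 27.799999],
    index=pd.to_datetime(["2024-07-31", "2024-08-01", "2024-08-02"]),
    name="Close",
)


def starts_near(series: pd.Series, requested_start: str, max_gap_days: int = 7) -> bool:
    """Return True if the first timestamp is within max_gap_days of the requested start."""
    gap = series.index.min() - pd.Timestamp(requested_start)
    return bool(gap <= pd.Timedelta(days=max_gap_days))


# The request asked for data from 2020-12-31 onwards.
covers_request = starts_near(closes, "2020-12-31")
print("Data starts near requested start date:", covers_request)
```

A `False` result here is a prompt to re-check the ticker symbol and the data source before doing any modelling.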

I find this data equally fascinating, as Bitcoin is the longest-standing and most widely recognized cryptocurrency. It operates as a decentralized digital medium of exchange: transactions are verified and recorded on a public distributed ledger (the blockchain), which removes the need for a trusted intermediary or central authority. Bitcoin’s initial exchange price in 2009 was just 0.00099 dollars per coin, whereas its price had reached 87,015.73 dollars as of December 2, 2025. This dramatic increase makes me curious about Bitcoin’s potential trajectory in the coming years and about which other cryptocurrencies might experience similar growth.

References¶

  1. Umair Hayat. Supply Chain Greenhouse Gas Emission Factors [dataset]. Kaggle. Available at: https://www.kaggle.com/datasets/umairhayat/supply-chain-greenhouse-gas-emission-factors
  2. mczielinski (Zielak). Bitcoin Historical Data [dataset]. Kaggle. Available at: https://www.kaggle.com/datasets/mczielinski/bitcoin-historical-data