Week 1: Introduction to the course
What data set did I choose?
For this class, I chose to work with the greenhouse gas emission data loaded below.
- The dataset provides emission factors for 1,016 U.S. commodities defined at the 6-digit level of the North American Industry Classification System (NAICS).
- It covers spend-based emission factors, i.e., kilograms of CO2-equivalent (kg CO2e) per 2022 US dollar spent, at purchaser price.
- For each commodity (NAICS-6), the dataset includes three factors:
  - SEF (Supply Chain Emission Factors without Margins): the emissions from producing the commodity itself, before any margins are added
  - MEF (Margins of Supply Chain Emission Factors): the additional emissions from margin activities such as distribution
  - SEF + MEF (Supply Chain Emission Factors with Margins): the sum of the two
import pandas as pd
df = pd.read_csv("datasets/GHGEmissionFactors_1.csv")
df.head(10)
| | 2017 NAICS Code | 2017 NAICS Title | GHG | Unit | Supply Chain Emission Factors without Margins | Margins of Supply Chain Emission Factors | Supply Chain Emission Factors with Margins | Reference USEEIO Code |
|---|---|---|---|---|---|---|---|---|
| 0 | 111110 | Soybean Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.488 | 0.044 | 0.532 | 1111A0 |
| 1 | 111120 | Oilseed (except Soybean) Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.488 | 0.044 | 0.532 | 1111A0 |
| 2 | 111130 | Dry Pea and Bean Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 3 | 111140 | Wheat Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 4 | 111150 | Corn Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 5 | 111160 | Rice Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 6 | 111191 | Oilseed and Grain Combination Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 7 | 111199 | All Other Grain Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.809 | 0.040 | 0.848 | 1111B0 |
| 8 | 111211 | Potato Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.591 | 0.041 | 0.631 | 111200 |
| 9 | 111219 | Other Vegetable (except Potato) and Melon Farming | All GHGs | kg CO2e/2022 USD, purchaser price | 0.591 | 0.041 | 0.631 | 111200 |
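The relationship between the three factor columns (SEF + MEF = factor with margins) can be checked directly. The sketch below is a minimal illustration that copies three rows from the table above instead of reading the CSV; the names `sample` and `sum_matches` and the tolerance of 0.0015 are assumptions made for this example. The tolerance is needed because the published values are rounded to three decimals, so a row such as 0.809 + 0.040 is listed as 0.848.

```python
import pandas as pd

# Three rows copied from the table above (not the full dataset).
sample = pd.DataFrame(
    {
        "2017 NAICS Code": [111110, 111130, 111211],
        "Supply Chain Emission Factors without Margins": [0.488, 0.809, 0.591],
        "Margins of Supply Chain Emission Factors": [0.044, 0.040, 0.041],
        "Supply Chain Emission Factors with Margins": [0.532, 0.848, 0.631],
    }
)

# Recompute the "with margins" factor as SEF + MEF.
recomputed = (
    sample["Supply Chain Emission Factors without Margins"]
    + sample["Margins of Supply Chain Emission Factors"]
)

# Values are published rounded to three decimals, so allow a small tolerance
# (0.0015 is an assumed threshold, not something the dataset specifies).
difference = (recomputed - sample["Supply Chain Emission Factors with Margins"]).abs()
sample["sum_matches"] = difference <= 0.0015

print(sample[["2017 NAICS Code", "sum_matches"]])
```

The same check should carry over to the full `df` loaded above by replacing `sample` with `df`, assuming the column names match the table header.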
Why does the data interest me?
I chose this dataset because greenhouse gases and global warming are important topics worldwide. The data comes from the United States and gives detailed emission factors for different industries. Using this dataset, I want to explore which industries have the highest and lowest greenhouse gas emissions per dollar spent, and to examine how these patterns might change in the next few years. This analysis will help me understand patterns in greenhouse gas emissions and the role of different sectors.
U.S. Environmental Protection Agency. (2022). Supply chain greenhouse gas emission factors v1.3 by NAICS-6 [Data set]. Data.gov. https://catalog.data.gov/dataset/supply-chain-greenhouse-gas-emission-factors-v1-3-by-naics-6
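A first pass at finding the highest and lowest emitters could look like the sketch below. It uses five rows copied from the table above, so the result only ranks those rows; `rows`, `factor_col`, and `by_code` are names chosen for this example. Because several NAICS codes share one Reference USEEIO Code and therefore the same factor, the sketch ranks by USEEIO code rather than by individual commodity. Note that the factors measure emissions per dollar spent, so this ranks emission intensity and not total emissions.

```python
import pandas as pd

# Five rows copied from the table above (not the full dataset).
rows = pd.DataFrame(
    {
        "2017 NAICS Title": [
            "Soybean Farming",
            "Oilseed (except Soybean) Farming",
            "Dry Pea and Bean Farming",
            "Wheat Farming",
            "Potato Farming",
        ],
        "Reference USEEIO Code": ["1111A0", "1111A0", "1111B0", "1111B0", "111200"],
        "Supply Chain Emission Factors with Margins": [0.532, 0.532, 0.848, 0.848, 0.631],
    }
)

factor_col = "Supply Chain Emission Factors with Margins"

# Commodities that map to the same USEEIO code carry identical factors,
# so keep one value per code and sort from highest to lowest.
by_code = (
    rows.groupby("Reference USEEIO Code")[factor_col]
    .first()
    .sort_values(ascending=False)
)

print(by_code)
print("Highest intensity:", by_code.index[0])
print("Lowest intensity:", by_code.index[-1])
```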
Choosing a New Dataset
Hello, back here again.
I am updating this after my fourth data science class. I initially tried to model greenhouse gas emission factors by sector, but the factors are indexed by sector, which is a categorical variable rather than a continuous one. Sector names and sector codes represent discrete groups, not a numeric sequence suitable for curve fitting, Gaussian modelling, or time-series prediction. There is therefore no meaningful functional relationship between sector and emission factor to fit.
Bitcoin 1-minute price data, on the other hand, is a continuous numerical time series. It has consistent timestamps, measurable trends, volatility, and statistical structure. This makes it appropriate for machine learning, regression, forecasting, and curve fitting.
For this reason, I chose BTCUSD minute-by-minute data as my dataset for analysis.
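To make "measurable trends and volatility" concrete, here is a minimal sketch of two quantities that a minute-indexed price series supports: log returns and rolling volatility. The prices are synthetic (a seeded random walk starting at an arbitrary 50,000), not real BTCUSD data, and the 60-minute window is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute price series (NOT real Bitcoin data): a geometric
# random walk with an arbitrary starting level and step size.
rng = np.random.default_rng(0)
index = pd.date_range("2025-01-01", periods=240, freq="min")
log_steps = rng.normal(loc=0.0, scale=0.001, size=len(index))
price = pd.Series(50_000 * np.exp(np.cumsum(log_steps)), index=index, name="close")

# Log return per minute: difference of consecutive log prices.
log_returns = np.log(price).diff().dropna()

# Rolling volatility: standard deviation of returns over the last 60 minutes.
rolling_vol = log_returns.rolling(window=60).std()

print(log_returns.describe())
print(rolling_vol.dropna().tail())
```

Neither quantity can be defined for the emission factors, because sector codes have no ordering or spacing that would make a "difference between consecutive values" meaningful.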
import pandas as pd
import yfinance as yf
# Define the ticker symbol and date range.
# Yahoo Finance lists the Bitcoin/US dollar pair as 'BTC-USD'.
ticker_symbol = 'BTC-USD'
start_date = '2020-12-31'
end_date = '2025-12-02'
# Fetch the data. No interval is passed, so yfinance returns daily bars
# (its default interval is '1d'). auto_adjust is set explicitly to avoid
# the FutureWarning about its changed default.
stock_data = yf.download(ticker_symbol, start=start_date, end=end_date, auto_adjust=True)
# Display the first 10 rows of the data
print(stock_data.head(10))
The output below is from my original run, which used ticker_symbol = 'BTC'. The rows start on 2024-07-31 with closing prices between roughly 24 and 29 USD, so they are not Bitcoin prices; on Yahoo Finance the Bitcoin/US dollar pair is listed as BTC-USD.
| Date | Close (BTC) | High (BTC) | Low (BTC) | Open (BTC) | Volume (BTC) |
|---|---|---|---|---|---|
| 2024-07-31 | 28.950001 | 29.650000 | 28.799999 | 29.500000 | 930940 |
| 2024-08-01 | 28.100000 | 28.799999 | 27.600000 | 28.650000 | 8564500 |
| 2024-08-02 | 27.799999 | 29.049999 | 27.650000 | 28.750000 | 2412560 |
| 2024-08-05 | 23.750000 | 24.650000 | 22.000000 | 22.049999 | 3141140 |
| 2024-08-06 | 25.250000 | 25.350000 | 24.150000 | 24.500000 | 1690140 |
| 2024-08-07 | 24.299999 | 25.549999 | 24.250000 | 25.450001 | 1267920 |
| 2024-08-08 | 26.350000 | 26.575001 | 25.200001 | 25.650000 | 977760 |
| 2024-08-09 | 26.900000 | 27.125000 | 26.440001 | 26.850000 | 1328060 |
| 2024-08-12 | 26.200001 | 26.924999 | 25.650000 | 26.450001 | 751200 |
| 2024-08-13 | 27.049999 | 27.325001 | 26.100000 | 26.150000 | 1069980 |
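The download above passes no `interval`, and its output is indexed by calendar date, so it contains daily bars, while the dataset I described is minute-by-minute. The two granularities are related by resampling. The sketch below shows how 1-minute closes can be aggregated into daily open/high/low/close bars with pandas; the prices are synthetic (a seeded random walk), and `minute_close` and `daily` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Two days of synthetic 1-minute closing prices (NOT real Bitcoin data).
rng = np.random.default_rng(1)
index = pd.date_range("2025-01-01", periods=2 * 24 * 60, freq="min")
minute_close = pd.Series(
    50_000 + rng.normal(loc=0.0, scale=5.0, size=len(index)).cumsum(),
    index=index,
    name="close",
)

# Aggregate each calendar day into open/high/low/close columns.
daily = minute_close.resample("1D").ohlc()

# Count how many 1-minute observations went into each daily bar.
daily["n_minutes"] = minute_close.resample("1D").count()

print(daily)
```

Going the other way is not possible: daily bars cannot be expanded back into minute-level prices, so minute-level analysis needs a minute-level source such as the Kaggle dataset listed in the references.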
I find these data equally fascinating, as Bitcoin is the longest-standing and most widely recognized cryptocurrency. It operates as a decentralized digital medium of exchange, with transactions verified and recorded on a public distributed ledger (the blockchain), eliminating the need for a trusted intermediary or central authority. Bitcoin's initial exchange price in 2009 was just 0.00099 dollars per coin, whereas its price had risen to 87,015.73 dollars as of December 2, 2025. This dramatic increase makes me curious about Bitcoin's potential trajectory in the coming years and about which other cryptocurrencies might experience similar growth.
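The size of that increase can be put into numbers with a short calculation. The sketch below uses the two prices quoted above; the 16-year span is an assumption (roughly 2009 to 2025), since the exact date of the first price is not given, so the annual figure is only approximate.

```python
# Prices as quoted in the paragraph above.
start_price = 0.00099   # USD per coin in 2009
end_price = 87_015.73   # USD per coin on 2025-12-02

# Assumption: about 16 years between the two quotes.
years = 16

# Overall growth multiple: how many times larger the end price is.
growth_multiple = end_price / start_price

# Average annual growth factor that would compound to the same multiple.
annual_factor = growth_multiple ** (1 / years)

print(f"Overall growth: {growth_multiple:,.0f}x")
print(f"Average annual growth: {(annual_factor - 1) * 100:.0f}% per year")
```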
References
- Hayat, Umair. Supply Chain Greenhouse Gas Emission Factors [dataset]. Kaggle. Available at: https://www.kaggle.com/datasets/umairhayat/supply-chain-greenhouse-gas-emission-factors
- mczielinski (Zielak). Bitcoin Historical Data [dataset]. Kaggle. Available at: https://www.kaggle.com/datasets/mczielinski/bitcoin-historical-data