Week 3: fitting - "House Property Sales Time Series" dataset
Context
- Source: Kaggle
- Description: property sales data for one specific region over the 2007-2019 period. The data contains sale prices for houses and units with 1 to 5 bedrooms. These variables are cross-dependent.
Load dataset
In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import random
raw_df = pd.read_csv("datasets/House_Property_Sales_Time_Series.csv", usecols=["datesold", "price","bedrooms"],parse_dates=["datesold"])
df = raw_df[(raw_df['bedrooms'] == 4) & (raw_df['price'] > 1300000)] # data is filtered with the intent to get a nice cloud to analyze
# 🧾 Display dataset information
print("House sales dataset shape:", df.shape)
#print(df.info)
House sales dataset shape: (395, 3)
Explore content
In [2]:
df.head()
Out[2]:
| | datesold | price | bedrooms |
|---|---|---|---|
| 7 | 2007-04-30 | 1530000 | 4 |
| 26 | 2007-07-21 | 1780000 | 4 |
| 691 | 2008-12-20 | 1375000 | 4 |
| 781 | 2009-01-27 | 2100000 | 4 |
| 880 | 2009-02-25 | 1580000 | 4 |
Display a nice chart
In [3]:
# Let's display a basic chart
plt.rcParams["figure.figsize"] = (20,9)
plt.plot(df['datesold'], df['price'],'o')
plt.xlabel('Date sold')
plt.ylabel('Price')
plt.show()
Fitting using radial basis function (RBF)
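As a reading aid for the cell that follows, the model it fits can be written as a weighted sum of cubic radial basis functions. The symbols $N$, $c_j$ and $b_j$ are shorthand introduced here for `ncenters`, the entries of `centers` and the entries of `b`; they do not appear in the notebook itself.

```latex
y(x) \approx \sum_{j=1}^{N} b_j \,\lvert x - c_j \rvert^{3},
\qquad
M_{ij} = \lvert x_i - c_j \rvert^{3},
\qquad
b = \arg\min_{b'} \,\lVert M b' - y \rVert_2
```

Here $x_i$ are the rescaled sale dates, $y$ is the vector of prices, and the minimization is the least-squares problem that `np.linalg.lstsq` solves.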
In [24]:
ncenters = 15
x = df['datesold'].to_numpy(dtype='int64') / 1e13 # nanoseconds since epoch, rescaled to units of 1e13 ns (10,000 s)
#print(x)
xmin = x.min()
print("MinX=", xmin)
xmax = x.max()
print("MaxX=", xmax)
npts = len(x) # number of data points
print("CountX=", npts)
y = df['price'].to_numpy(dtype='int64')
#print(y)
ymin = y.min()
print("MinY=", ymin)
ymax = y.max()
print("MaxY=", ymax)
indices = np.random.randint(0, len(x), size=ncenters) # choose random RBF centers from data (repeats are possible)
#print("Indices=",indices)
centers = x[indices]
print("Centers=", centers)
M = np.abs(np.outer(x,np.ones(ncenters)) # construct matrix of basis terms
-np.outer(np.ones(npts),centers))**3
#print("M=",M)
b,residuals,rank,values = np.linalg.lstsq(M,y,rcond=None) # least-squares fit via SVD; values holds the singular values of M
xfit = np.linspace(xmin,xmax,npts)
yfit = (np.abs(np.outer(xfit,np.ones(ncenters))-np.outer(np.ones(npts),centers))**3)@b # evaluate fit
#print("yfit=",yfit)
plt.plot(x,y,'o')
plt.plot(xfit,yfit,'g-',label='RBF fit')
plt.plot(xfit,np.abs(xfit-centers[0])**3,'b-',label='$r^3$ basis functions') # one basis function highlighted for the legend
#print(np.abs(xfit-centers[0])**3)
for i in range(ncenters):
    plt.plot(xfit,np.abs(xfit-centers[i])**3,color=(0.75,0.75,0.75)) # all basis functions in grey
plt.ylim(0,8100000)
plt.legend()
plt.show()
MinX= 117789.12
MaxX= 156098.88
CountX= 395
MinY= 1305000
MaxY= 8000000
Centers= [141557.76 126740.16 150914.88 151096.32 139285.44 145938.24 150595.2 151182.72 144780.48 142344. 149947.2 138542.4 144037.44 155122.56 151295.04]
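The fit above can also be packaged as small reusable functions. The following is a minimal sketch, not the notebook's own code: the function names (`cubic_rbf_matrix`, `fit_cubic_rbf`, `eval_cubic_rbf`) are hypothetical, and the data is a synthetic noisy sine wave rather than the house sales dataset. It uses NumPy broadcasting in place of the `np.outer` construction, which should give the same matrix.

```python
import numpy as np

def cubic_rbf_matrix(x, centers):
    """Return the matrix whose (i, j) entry is |x[i] - centers[j]|**3."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcasting builds the same matrix as the np.outer construction.
    return np.abs(x[:, None] - centers[None, :]) ** 3

def fit_cubic_rbf(x, y, centers):
    """Least-squares weights for a sum of cubic RBFs (hypothetical helper)."""
    M = cubic_rbf_matrix(x, centers)
    weights, _, _, _ = np.linalg.lstsq(M, y, rcond=None)
    return weights

def eval_cubic_rbf(x, centers, weights):
    """Evaluate the fitted sum of cubic RBFs at the points x."""
    return cubic_rbf_matrix(x, centers) @ weights

# Synthetic data (an assumption for illustration, not the sales data).
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 1.0, 200)
y_demo = np.sin(2 * np.pi * x_demo) + 0.1 * rng.standard_normal(x_demo.size)

# Draw 15 distinct centers from the sample points.
centers_demo = rng.choice(x_demo, size=15, replace=False)

weights_demo = fit_cubic_rbf(x_demo, y_demo, centers_demo)
y_hat = eval_cubic_rbf(x_demo, centers_demo, weights_demo)
rmse = np.sqrt(np.mean((y_hat - y_demo) ** 2))
print("RMSE on synthetic data:", rmse)
```

Drawing the centers with `replace=False` avoids picking the same sample index twice, which would put two identical columns in the matrix. On the real data, repeated sale dates could still produce duplicate centers, so deduplicating them first (for example with `np.unique`) may be worth considering.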