Welcome to the Airbnb price prediction in Paris project!
In this notebook we are going to predict the price of Airbnb listings in Paris. To achieve this goal, we carried out several steps, from data cleaning and feature engineering to model benchmarking and tuning.
The dataset is composed of 66,900 rows and 106 columns. The metrics used in this project are the Mean Squared Error (MSE) and the R² score.
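For reference, these two metrics are defined as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where $y_i$ is the observed price, $\hat{y}_i$ the predicted price and $\bar{y}$ the mean observed price. A lower MSE is better, and an $R^2$ closer to 1 indicates a better fit.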
You can get the Paris dataset here: Inside Airbnb
Also take a look at the notebooks produced by AWS SageMaker Autopilot:
Time to code!
import os, sys
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV, cross_validate
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA, TruncatedSVD
from category_encoders.target_encoder import TargetEncoder
from category_encoders import LeaveOneOutEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
pd.options.mode.chained_assignment = None  # default='warn'
First, we open our CSV file and read it:
path = '../../Datasets'
file_name = 'listings.csv'
filepath = os.path.join(path,file_name)
df = pd.read_csv(filepath)
We can now take a peek at what our data looks like:
print(f"The dataset is composed of {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()
Okay, we can see that our DataFrame is quite big...
We probably won't need that many features to predict the price, so we will have to remove some columns, then clean and process the remaining ones.
First, I checked for NaN values and applied a mask to remove the columns where the proportion of NaN values was >= 30%.
isna_mask = (df.isna().sum()/df.shape[0]*100) < 30
df = df.loc[:, isna_mask]
I also checked for duplicated rows based on the 'id' column, but none were found.
df_duplicated = df.duplicated(subset='id',keep=False)
df_duplicated[df_duplicated == True]
After taking some time to inspect the data, I removed a large number of columns (e.g. id, scraped_id, require_guest_profile_pic, ...).
Of course, I could have kept all the features and used an algorithm that is not affected by the curse of dimensionality, but the aim was to keep only the features which bring significant information.
columns_to_keep = [
'host_since',
'first_review',
'last_review',
'summary',
'description',
'zipcode',
'property_type',
'room_type',
'accommodates',
'bathrooms',
'bedrooms',
'beds',
'bed_type',
'amenities',
'price',
'security_deposit',
'cleaning_fee',
'guests_included',
'extra_people',
'minimum_nights',
'maximum_nights',
'number_of_reviews'
]
df = df.loc[:,columns_to_keep]
print(f"The dataset is now composed of {df.shape[0]} rows and {df.shape[1]} columns.")
Since I will be using some textual features ('summary' and 'description', which are quite similar but gave the best results) and cannot fill NaN values with new text, I removed these rows:
df = df[~np.logical_or(df['summary'].isna(),df['description'].isna())]
I also separated my data according to dtype:
df_numerical = df.select_dtypes(include=['int','float'])
df_others = df.select_dtypes(exclude=['int','float'])
A quick heatmap of the correlation matrix of the numerical data:
plt.figure(figsize=(9,7))
sns.heatmap(df_numerical.corr())
plt.title('Correlation Matrix plot of numerical data')
plt.show()
Based on the correlation matrix, I checked for highly correlated features (redundant information) using a mask (correlation > 0.8), but no such features were found:
corr_mask = np.triu(np.ones_like(df_numerical.corr().abs(), dtype = bool))
df_numerical_corr_masked = df_numerical.corr().abs().mask(corr_mask)
numerical_col_to_remove = [ c for c in df_numerical_corr_masked.columns if any(df_numerical_corr_masked[c] > 0.8)]
print(f'Highly correlated numerical features: {numerical_col_to_remove}')
I did some feature engineering on 'minimum_nights' and 'maximum_nights': basically, I just took the mean of the two features:
df_numerical['mean_num_nights'] = (df_numerical['minimum_nights'] + df_numerical['maximum_nights'])/2
df_numerical.drop(['minimum_nights','maximum_nights'], axis=1, inplace=True)
# Price columns cleaning: strip the '$' and ',' characters and convert to float
priced_col = ['price','security_deposit','cleaning_fee','extra_people']
for col in priced_col:
    df_others[col] = (df_others[col].str.replace('$', '', regex=False)
                                    .str.replace(',', '', regex=False)
                                    .astype('float'))
# Amenities cleaning: replace the raw '{...}' string by the number of amenities listed
def amenities_cleaning(x):
    '''Count the number of amenities in the raw amenities string.'''
    return len(x.replace('{', '').replace('}', '').split(','))

df_others['amenities'] = df_others['amenities'].apply(amenities_cleaning)
# Zipcode cleaning (zipcode values appearing fewer than 800 times are grouped into 'Other'):
zipcode_counts = df_others['zipcode'].value_counts()
rare_zipcodes = zipcode_counts.index[zipcode_counts < 800]
df_others.loc[df_others['zipcode'].isin(rare_zipcodes), 'zipcode'] = 'Other'
df_others['zipcode_clean'] = df_others['zipcode'].astype('str').apply(lambda x: x.replace(".0", ""))
df_others.drop(['zipcode'], axis=1, inplace=True)
# Property_type cleaning (property_type values appearing fewer than 10 times are grouped into 'Exotic'):
property_counts = df_others['property_type'].value_counts()
exotic_properties = property_counts[property_counts < 10].index
df_others.loc[df_others['property_type'].isin(exotic_properties), 'property_type'] = 'Exotic'
# Room_type and bed_type cleaning:
# No cleaning needed since these features have few unique values:
print(f"Number of unique values for 'room_type' feature: {len(df_others['room_type'].unique())}")
print(f"Number of unique values for 'bed_type' feature: {len(df_others['bed_type'].unique())}")
Finally, we concatenate our two DataFrames after cleaning:
df_final = pd.concat([df_others,df_numerical], axis=1)
Before going to the next step, there's still something quite important that we need to check: the price distribution.
fig, axes = plt.subplots(1,2,figsize=(15,7))
sns.distplot(df_final['price'], ax=axes[0])
axes[0].set_title('Price distribution')
sns.distplot(df_final['price'][df_final['price'] < 500], ax=axes[1])
axes[1].set_title('Price distribution with label thresholding')
plt.show()
Okay, as we can see, most of the prices are between 0 and 500, so I decided to remove the abnormal prices with a threshold at 500 and also to apply a log transform.
You may wonder: why a log transform?
Well, without a log transform on the target the algorithms performed quite badly. I tested various transforms and the logarithm transform gave the best performance.
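As an aside, a simple way to compare candidate transforms (this check is not part of the original notebook, it is just a sketch assuming scipy is installed) is to look at the skewness of the resulting distribution; values closer to 0 indicate a more symmetric, Gaussian-like shape:

# Hypothetical quick check: compare the skewness of the price under a few candidate
# transforms; the closer the skewness is to 0, the more symmetric the distribution.
from scipy.stats import skew

price = df_final.loc[df_final['price'] < 500, 'price'].dropna()
candidates = {
    'raw':   price,
    'sqrt':  np.sqrt(price),
    'log1p': np.log1p(price),
}
for name, values in candidates.items():
    print(f"{name:>6} skewness: {skew(values):.3f}")

With that said, the thresholding and the log transform are applied below: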
df_final = df_final[df_final['price'] < 500]
df_final['price_log'] = df_final['price'].apply(lambda x: np.log1p(x))
df_final.drop('price', axis=1, inplace=True)
Now, let's take a look at our target... The target feature is much more Gaussian-like now!
fig, ax = plt.subplots()
sns.distplot(df_final['price_log'], ax=ax)
plt.title('Price with Log Transform')
plt.show()
If you want to save the preprocessed CSV:
#df_final.to_csv('airbnb_paris_df.csv',index=False)
We can also load the preprocessed CSV, ready to use for ML purposes (uncomment the lines first):
#df_final = pd.read_csv('airbnb_paris_df.csv', index_col=False)
#df_final.drop('Unnamed: 0', axis=1, inplace=True)
We split our DataFrame into a target column y (price_log) and feature columns X, and perform the train-test split:
y = df_final['price_log']
X = df_final.drop(['price_log'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y, shuffle=True, random_state= 42)
Just before doing ML, I processed the date features (host_since, first_review, last_review).
I converted them into continuous values by filling the null values with the mean date and subtracting the earliest date from all date values.
date_columns = ['host_since','first_review','last_review']

for col in date_columns:
    X_train[col] = pd.to_datetime(X_train[col])
    X_test[col] = pd.to_datetime(X_test[col])
    # The mean date is computed on the train set only and used to fill both sets
    mean_date = str(X_train[col].mean())
    X_train[col] = X_train[col].astype('str')
    X_test[col] = X_test[col].astype('str')
    imputer_date = SimpleImputer(missing_values='NaT', strategy='constant', fill_value=mean_date)
    X_train[col] = imputer_date.fit_transform(X_train[[col]])
    X_train[col] = pd.to_datetime(X_train[col])
    X_test[col] = imputer_date.transform(X_test[[col]])
    X_test[col] = pd.to_datetime(X_test[col])

for col in date_columns:
    # Convert each date into the number of days elapsed since the earliest date of the set
    X_train[col] = (X_train[col] - X_train[col].min()).apply(lambda x: x.days)
    X_test[col] = (X_test[col] - X_test[col].min()).apply(lambda x: x.days)
Okay, we have already done a lot of work cleaning and processing our data. Still, there is more to do to complete the preparation, and we will call Pipelines to the rescue!
Pipelines allow us to chain various data preparation steps such as imputation, scaling, encoding and text vectorization.
Here is what I did:
numerical_columns = ['host_since','first_review','last_review','amenities','security_deposit', 'cleaning_fee', 'extra_people','accommodates', 'bathrooms', 'bedrooms', 'beds', 'guests_included','number_of_reviews', 'mean_num_nights']
categorical_columns = ['property_type', 'room_type', 'bed_type','zipcode_clean']
text_columns = ['summary']
final_stopwords_list = stopwords.words('english') + stopwords.words('french')
numerical_transformer = Pipeline(steps=[('imputer',SimpleImputer(strategy='median')),('scaler',StandardScaler())])
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                          ('encoder', OneHotEncoder(handle_unknown='ignore'))])  # ignore categories unseen during fit
text_transformer = Pipeline(steps=[('tfidf',TfidfVectorizer(stop_words=final_stopwords_list)),('svd', TruncatedSVD(n_components=50))])
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_columns),
('cat', categorical_transformer, categorical_columns),
('text_summary', text_transformer, 'summary')])
#preprocessor_pca = Pipeline(steps=[('preprocess',preprocessor),('pca',PCA(n_components=0.90))])
Finally, we can call fit_transform on the train set and only transform on the test set, which avoids data leakage (thanks, pipelines!):
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
Now that our dataset is prepared, we can start performing regression.
Because it is important to keep things simple at first, we will start with a linear regression.
#Baseline model with linear regression:
lr = LinearRegression()
lr.fit(X_train_processed,y_train)
#Predict on train set:
y_pred_train_lr = lr.predict(X_train_processed)
mse_train_lr = mean_squared_error(y_train, y_pred_train_lr)
r2_train_lr = r2_score(y_train, y_pred_train_lr)
print(f'MSE Training Set: {mse_train_lr}')
print(f'R2 Training Set: {r2_train_lr}')
# Cross Validation 10 folds multi-scoring
lr_cv = cross_validate(lr,X_train_processed, y_train, scoring=['neg_mean_squared_error','r2'], cv=10, n_jobs=-1)
mse_val_lr = -lr_cv['test_neg_mean_squared_error'].mean()
r2_val_lr = lr_cv['test_r2'].mean()
print(f'MSE Validation Set: {mse_val_lr}')
print(f'R2 Score Validation Set: {r2_val_lr}')
Linear regression performs poorly on our data. As we can see, the model doesn't learn well on the training set and therefore does not manage to generalize well.
A suggestion would be to use non-linear models such as decision trees or random forests.
#Baseline model with random forest regression:
rfr = RandomForestRegressor()
rfr.fit(X_train_processed,y_train)
y_pred_train_rfr = rfr.predict(X_train_processed)
mse_train_rfr = mean_squared_error(y_train, y_pred_train_rfr)
r2_train_rfr = r2_score(y_train, y_pred_train_rfr)
print(f'MSE Training Set: {mse_train_rfr}')
print(f'R2 Training Set: {r2_train_rfr}')
# Cross Validation 5 folds multi-scoring
cv_rfr = cross_validate(rfr,X_train_processed, y_train, scoring=['neg_mean_squared_error','r2'], cv=5, n_jobs=-1)
mse_val_rfr = -cv_rfr['test_neg_mean_squared_error'].mean()
r2_val_rfr = cv_rfr['test_r2'].mean()
print(f'MSE Validation Set: {mse_val_rfr}')
print(f'R2 Score Validation Set: {r2_val_rfr}')
The results show that with Random Forest the model is very good at learning the training set but tends to overfit.
Bootstrap aggregating, also called "bagging", can lead to mistakes when the majority of the estimators are wrong in the same regions, whereas boosting models mitigate this issue by training models sequentially, each one focusing on the errors of the previous ones.
Thus, we will finish our benchmark with XGBoost, an optimized distributed gradient boosting library.
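To make the idea concrete, here is a minimal sketch of boosting for squared-error loss. It only illustrates the principle of fitting each new tree on the residuals of the current ensemble; it is not how XGBoost is actually implemented, and the helper names are made up for this example:

# Toy illustration of boosting: each tree is fit on the residuals (the negative
# gradient of the squared-error loss) of the current ensemble prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_boosting_fit(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    base = float(np.mean(y))                      # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                    # the next tree focuses on these errors
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def toy_boosting_predict(base, trees, X, learning_rate=0.1):
    prediction = np.full(X.shape[0], base)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction

# Example usage (with the processed matrices from above):
# base, trees = toy_boosting_fit(X_train_processed, y_train.to_numpy())
# y_val_pred = toy_boosting_predict(base, trees, X_test_processed)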
#Baseline model with XGBoost:
xboost = xgb.XGBRegressor(tree_method='gpu_hist', gpu_id=0) #Use of GPU
xboost.fit(X_train_processed, y_train)
y_pred_train_xgb = xboost.predict(X_train_processed)
mse_train_xgb = mean_squared_error(y_train, y_pred_train_xgb)
r2_train_xgb = r2_score(y_train, y_pred_train_xgb)
print(f'MSE Training Set: {mse_train_xgb}')
print(f'R2 Training Set: {r2_train_xgb}')
xboost = xgb.XGBRegressor(tree_method='gpu_hist', gpu_id=0)
# Cross Validation 10 folds multi-scoring
xgb_cross_val = cross_validate(xboost, X_train_processed, y_train, scoring=['neg_mean_squared_error','r2'], cv=10)
mse_val_xgb = -xgb_cross_val['test_neg_mean_squared_error'].mean()
r2_val_xgb = xgb_cross_val['test_r2'].mean()
print(f'MSE Validation Set: {mse_val_xgb}')
print(f'R2 Score Validation Set: {r2_val_xgb}')
Yes! We improved our metrics on the validation set: using a baseline gradient boosting model with XGBoost reduced the overfitting we had with Random Forest.
Comparing our baseline models, we can clearly see that gradient boosting with XGBoost performed better on the validation set than linear regression and Random Forest.
Thus, tuning our boosting model could be quite interesting in order to reduce the overfitting further. Nevertheless, boosting models have a lot of hyperparameters (more than Random Forest), and tuning them is a hard task that takes time and computational power.
Due to computational limitations on my computer, I switched to cloud computing, but running the randomized search still took too much time. Implementing the built-in XGBoost in SageMaker would be preferable (see XGBoost in SageMaker).
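For the curious, launching a training job with the built-in SageMaker XGBoost algorithm could look roughly like the sketch below. This is not code from the original project: the bucket, role ARN and container version are placeholders, and the exact API may vary with your SageMaker SDK version.

# Rough sketch only: the S3 bucket, IAM role and container version are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"        # placeholder IAM role

# URI of the built-in XGBoost container for the current region
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", region=session.boto_region_name, version="1.5-1")

xgb_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path="s3://my-bucket/airbnb-paris/output",           # placeholder bucket
    sagemaker_session=session,
)
xgb_estimator.set_hyperparameters(objective="reg:squarederror", num_round=740)

# The built-in algorithm expects a headerless CSV with the target in the first column
xgb_estimator.fit({"train": TrainingInput("s3://my-bucket/airbnb-paris/train.csv",
                                          content_type="text/csv")})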
I decided to use the hyperparameters given by the best model of SageMaker Autopilot:
#Example of XGBoost tuning that can be made:
#params_xgb = {'n_estimators':range(100,1000,20),
# 'max_depth':range(1,30),
# 'learning_rate':np.linspace(0.001,0.3,num=50),
# 'reg_alpha':np.linspace(0.001,0.5,num=50)}
#rand_search_xgb = RandomizedSearchCV(xboost,params_xgb,n_iter=60, cv=5)
#rand_search_xgb_results = rand_search_xgb.fit(X_train_processed, y_train)
#rand_search_xgb_results.best_params_
#rand_search_xgb_results.best_score_
#Hyperparameters were found by SageMaker Autopilot when training its best model:
xboost = xgb.XGBRegressor(n_estimators=740,
max_depth= 6,
learning_rate=0.11214298095296928,
reg_alpha=0.317940445687246,
colsample_bytree= 0.38523465777592514,
gamma=4.465918775019842e-06,
min_child_weight = 0.8883356598036687,
subsample = 0.86520980408866,
tree_method='gpu_hist', gpu_id=0)
# Cross Validation 10 folds multi-scoring
xgb_cross_val = cross_validate(xboost, X_train_processed, y_train, scoring=['neg_mean_squared_error','r2'], cv=10)
mse_val_xgb = -xgb_cross_val['test_neg_mean_squared_error'].mean()
r2_val_xgb = xgb_cross_val['test_r2'].mean()
print(f'MSE Validation Set: {mse_val_xgb}')
print(f'R2 Score Validation Set: {r2_val_xgb}')
That's quite impressive! We improved our model once again using the SageMaker Autopilot hyperparameters.
Now it is time to try our model on new data!
We fit our XGBoost model on the training set and predict on the test set:
xboost = xgb.XGBRegressor(n_estimators=740,
max_depth= 6,
learning_rate=0.11214298095296928,
reg_alpha=0.317940445687246,
colsample_bytree= 0.38523465777592514,
gamma=4.465918775019842e-06,
min_child_weight = 0.8883356598036687,
subsample = 0.86520980408866)
xboost.fit(X_train_processed, y_train)
y_pred_xgb = xboost.predict(X_test_processed)
mse_xgb_test = mean_squared_error(y_test, y_pred_xgb)
r2_xgb_test = r2_score(y_test, y_pred_xgb)
print(f'MSE Test Set: {mse_xgb_test}')
print(f'R2 Score test Set: {r2_xgb_test}')
We reconstruct our test set and apply the inverse exponential transformation to price_log and to the predicted prices (which are also log-transformed).
test_set = pd.concat([X_test,y_test], axis=1)
test_set['price'] = np.expm1(test_set['price_log'])
test_set['price_predict'] = np.expm1(y_pred_xgb)
Let's take a look at our Airbnb predictions...
test_set.head(10)[['summary','price','price_predict']]
Sounds good!
We finally made it! This is the end of the Airbnb price prediction project. I really hope you found it interesting.
Also, check out this scientific paper if you are interested in the subject: Predicting Airbnb Listing Price Across Different Cities