This report provides insights about the dataset you provided as input to the AutoML job. It was automatically generated by the AutoML training job: airbnbautopilot.
As part of the AutoML job, the input dataset was randomly split into two pieces, one for training and one for validation. The training dataset was randomly sampled, and metrics were computed for each of the columns. This notebook provides these metrics so that you can:
We read 64095 rows from the training dataset.
The dataset has 22 columns, and the column named price_log is used as the target column.
This is identified as a Regression problem.
The labels were found to be within the range [0.0, 9.39599].
The following table is a random sample of 10 rows from the training dataset. For ease of presentation, we are only showing 20 of the 22 columns of the dataset.
Row | (unnamed) | host_since | first_review | last_review | summary | description | property_type | room_type | bed_type | amenities | ... | extra_people | zipcode_clean | accommodates | bathrooms | bedrooms | beds | guests_included | number_of_reviews | mean_num_nights | price_log |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 38291 | 2014-02-25 | 2018-07-08 | 2019-06-12 | Appartement très agréable au 7èm étage . | Appartement très agréable au 7èm étage . Soiré... | Apartment | Entire home/apt | Real Bed | 11 | ... | 0.0 | 75019 | 2 | 1.0 | 1.0 | 1.0 | 2 | 17 | 564.0 | 3.9318256327243257 |
1 | 23934 | 2013-01-28 | 2016-12-15 | 2020-03-14 | Au cœur du canal Saint-Martin, ancien atelier ... | Au cœur du canal Saint-Martin, ancien atelier ... | Apartment | Entire home/apt | Real Bed | 24 | ... | 0.0 | 75010 | 2 | 2.0 | 1.0 | 1.0 | 1 | 17 | 46.5 | 4.912654885736052 |
2 | 7380 | 2014-08-07 | 2018-01-02 | 2020-02-23 | Cute Parisian apartment with wonderful views t... | Cute Parisian apartment with wonderful views t... | Apartment | Entire home/apt | Real Bed | 24 | ... | 0.0 | 75020 | 2 | 1.0 | 1.0 | 1.0 | 1 | 37 | 11.0 | 4.02535169073515 |
3 | 42439 | 2018-10-29 | 2018-11-11 | 2020-01-20 | Beautiful apartment, both modern and bright. T... | Beautiful apartment, both modern and bright. T... | Apartment | Entire home/apt | Real Bed | 21 | ... | 0.0 | 75018 | 4 | 1.0 | 1.0 | 2.0 | 1 | 12 | 563.0 | 4.394449154672439 |
4 | 40742 | 2016-06-06 |  |  | A gem in the City of Lights. Walking distance ... | A gem in the City of Lights. Walking distance ... | Apartment | Private room | Real Bed | 29 | ... | 0.0 | Other | 4 | 1.0 | 1.0 | 1.0 | 1 | 0 | 566.0 | 6.2166061010848646 |
5 | 26763 | 2016-06-16 | 2017-05-14 | 2017-05-16 | Alone or with a few friends, you're looking fo... | Alone or with a few friends, you're looking fo... | Apartment | Private room | Real Bed | 15 | ... | 15.0 | 75018 | 2 | 1.0 | 0.0 | 2.0 | 2 | 2 | 563.5 | 3.828641396489095 |
6 | 61489 | 2012-09-19 | 2020-02-03 | 2020-03-04 | Contemporary apartment with a view of the Eiff... | Contemporary apartment with a view of the Eiff... | Apartment | Entire home/apt | Real Bed | 13 | ... | 0.0 | 75019 | 2 | 2.0 | 1.0 | 1.0 | 1 | 2 | 564.0 | 4.709530201312334 |
7 | 52201 | 2019-06-28 | 2019-07-05 | 2019-10-06 | Beautiful apartment of 30 square meters locate... | Beautiful apartment of 30 square meters locate... | Apartment | Entire home/apt | Real Bed | 17 | ... | 0.0 | 75018 | 2 | 1.0 | 1.0 | 1.0 | 1 | 18 | 563.0 | 4.02535169073515 |
8 | 51090 | 2013-07-05 | 2019-07-01 | 2020-01-05 | Bel appartement parisien de 29m2, spacieux, or... | Bel appartement parisien de 29m2, spacieux, or... | Apartment | Entire home/apt | Real Bed | 11 | ... | 0.0 | 75018 | 3 | 1.0 | 1.0 | 2.0 | 1 | 14 | 2.5 | 4.2626798770413155 |
9 | 8638 | 2015-04-12 |  |  | Atelier d'artiste parisien transformé en appar... | Atelier d'artiste parisien transformé en appar... | Apartment | Entire home/apt | Real Bed | 40 | ... | 0.0 | 75017 | 2 | 1.0 | 1.0 | 1.0 | 1 | 0 | 8.5 | 4.709530201312334 |
The AutoML job analyzed the 22 input columns to infer each data type and select the feature processing pipelines for each training algorithm.
For more details on the specific AutoML pipeline candidates, see Amazon SageMaker Autopilot Candidate Definition Notebook.ipynb.
Within the data sample, the following columns contained missing values, such as nan, white spaces, or empty fields.
SageMaker Autopilot will attempt to fill in missing values using various techniques. For example, missing values can be replaced with a new 'unknown' category for Categorical features, and missing Numerical values can be replaced with the mean or median of the column.
We found that 10 of the 22 columns contained missing values. The following table shows the 10 columns with the highest percentage of missing values.
Column | % of Missing Values |
---|---|
security_deposit | 28.27% |
cleaning_fee | 24.2% |
first_review | 20.45% |
last_review | 20.45% |
zipcode_clean | 0.91% |
beds | 0.37% |
bedrooms | 0.16% |
bathrooms | 0.03% |
host_since | 0.01% |
bed_type | 0.01% |
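The two imputation strategies described above can be sketched in plain Python. This is a minimal illustration, not Autopilot's actual implementation: the helper names `is_missing`, `impute_categorical`, and `impute_numerical` are hypothetical, and the toy values are made up rather than taken from the dataset.

```python
import math
from statistics import mean, median


def is_missing(value):
    """Treat None, NaN, and empty or whitespace-only strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def impute_categorical(values, fill="unknown"):
    """Replace missing entries with a new placeholder category."""
    return [fill if is_missing(v) else v for v in values]


def impute_numerical(values, strategy="median"):
    """Replace missing entries with the mean or median of the observed values."""
    observed = [float(v) for v in values if not is_missing(v)]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if is_missing(v) else float(v) for v in values]


# Hypothetical toy columns (assumed values, for illustration only).
bed_type = ["Real Bed", "", "Couch", None]
cleaning_fee = [20.0, float("nan"), 30.0, 70.0]

print(impute_categorical(bed_type))
print(impute_numerical(cleaning_fee, strategy="median"))
print(impute_numerical(cleaning_fee, strategy="mean"))
```

The median is less sensitive to extreme values than the mean, which matters for skewed columns such as security_deposit.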
For String features, it is important to count the number of unique values to determine whether to treat a feature as Categorical or Text, and then process the feature according to its type.
For example, SageMaker Autopilot counts the number of unique entries and the number of unique words. The following string column would have 3 total entries, 2 unique entries, and 3 unique words.
Row | String Column |
---|---|
0 | "red blue" |
1 | "red blue" |
2 | "red blue yellow" |
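The counts for this example column can be reproduced with a few lines of Python. The sketch assumes that words are separated by whitespace; the tokenization Autopilot actually uses is not stated in this report.

```python
# The example string column from the table above.
column = ["red blue", "red blue", "red blue yellow"]

total_entries = len(column)
unique_entries = len(set(column))
# Assumption: a "word" is any whitespace-separated token.
unique_words = len({word for entry in column for word in entry.split()})

print(total_entries, unique_entries, unique_words)  # 3 2 3
```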
If the feature is Categorical, SageMaker Autopilot can look at the total number of unique entries and transform it using techniques such as one-hot encoding.
If the field contains a Text string, we look at the number of unique words, or the vocabulary size, in the string.
We can then use the unique words to compute text-based features, such as Term Frequency-Inverse Document Frequency (tf-idf).
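To make the two transformations concrete, here is a small sketch of one-hot encoding and of a textbook tf-idf variant (term frequency times the log of inverse document frequency, without smoothing). The function names `one_hot` and `tf_idf` are hypothetical, and the exact formula and tokenization that Autopilot applies may differ.

```python
import math
from collections import Counter


def one_hot(values):
    """Return the sorted categories and one 0/1 vector per input value."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


def tf_idf(documents):
    """Textbook tf-idf: (count / document length) * log(N / document frequency)."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    rows = []
    for tokens in tokenized:
        if not tokens:  # an empty document has no term weights
            rows.append([0.0] * len(vocabulary))
            continue
        counts = Counter(tokens)
        rows.append([
            counts[w] / len(tokens) * math.log(n_docs / doc_freq[w])
            for w in vocabulary
        ])
    return vocabulary, rows


categories, encoded = one_hot(["Entire home/apt", "Private room", "Entire home/apt"])
vocabulary, weights = tf_idf(["red blue", "red blue", "red blue yellow"])
print(categories, encoded)
print(vocabulary, weights)
```

In the text example, "red" and "blue" occur in every entry, so their weight is zero; only "yellow" carries information that distinguishes the third entry.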
Note: If the number of unique values is too high, we risk data transformations expanding the dataset to too many features. In that case, SageMaker Autopilot will attempt to reduce the dimensionality of the post-processed data, for example by capping the number of vocabulary words for tf-idf, or by applying Principal Component Analysis (PCA) or other dimensionality reduction techniques.
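As one illustration of capping the vocabulary, the sketch below keeps only the most frequent words. The function name `capped_vocabulary` and the cap of 2 are hypothetical choices for the example; the report does not state which cap or selection rule Autopilot uses.

```python
from collections import Counter


def capped_vocabulary(documents, max_size):
    """Keep at most max_size words, preferring the most frequent ones."""
    counts = Counter(word for doc in documents for word in doc.split())
    return [word for word, _ in counts.most_common(max_size)]


vocab = capped_vocabulary(["red blue", "red blue", "red blue yellow"], max_size=2)
print(vocab)
```

Words outside the capped vocabulary are simply ignored when features are computed, which bounds the number of tf-idf columns at `max_size`.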
The table below shows all 22 columns ranked by the number of unique entries.
Column | Number of Unique Entries | Number of Unique Words (if Text) |
---|---|---|
room_type | 4 | n/a |
bed_type | 5 | n/a |
bedrooms | 14 | n/a |
guests_included | 17 | n/a |
property_type | 18 | n/a |
beds | 19 | n/a |
bathrooms | 19 | n/a |
accommodates | 19 | n/a |
zipcode_clean | 21 | n/a |
amenities | 68 | n/a |
extra_people | 101 | n/a |
cleaning_fee | 224 | n/a |
number_of_reviews | 429 | n/a |
mean_num_nights | 560 | n/a |
security_deposit | 722 | n/a |
price_log | 728 | n/a |
last_review | 2012 | 2012 |
first_review | 2941 | 2941 |
host_since | 3594 | 3594 |
summary | 62010 | 105113 |
description | 62850 | 190834 |
(unnamed) | 64059 | n/a |
For each of the numerical input features, several descriptive statistics are computed from the data sample.
SageMaker Autopilot may treat numerical features as Categorical if the number of unique entries is sufficiently low.
For Numerical features, we may apply numerical transformations such as normalization, log and quantile transforms, and binning to manage outlier values and differences in feature scales.
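The transformations named above can be sketched as follows. These are simplified, standard-library versions with hypothetical function names; the input values are made up, and Autopilot's own pipelines may use different variants and parameters.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Normalize to zero mean and unit variance (constant columns map to 0.0)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def log_transform(values):
    """Compress large values with log(1 + x); assumes non-negative inputs."""
    return [math.log1p(v) for v in values]


def quantile_bins(values, n_bins=4):
    """Assign each value an equal-frequency bin index based on its rank.

    Simplification: tied values are ranked in input order, so they can
    fall into different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Hypothetical skewed sample with one large outlier.
reviews = [0, 2, 6, 17, 867]

print(standardize(reviews))
print(log_transform(reviews))
print(quantile_bins(reviews, n_bins=5))  # [0, 1, 2, 3, 4]
```

The log and quantile transforms both reduce the influence of the outlier: after binning, the value 867 is just one step above 17 rather than fifty times larger.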
We found that 14 of the 22 columns contained at least one numerical value. The table below shows the 14 columns with the largest percentage of numerical values.
Column | % of Numerical Values | Mean | Median | Min | Max |
---|---|---|---|---|---|
(unnamed) | 100.0% | 33370.2 | 32895.0 | 0.0 | 66899.0 |
amenities | 100.0% | 18.7832 | 18.0 | 1.0 | 91.0 |
extra_people | 100.0% | 5.54269 | 0.0 | 0.0 | 277.0 |
accommodates | 100.0% | 3.08643 | 2.0 | 1.0 | 22.0 |
guests_included | 100.0% | 1.51823 | 1.0 | 1.0 | 100.0 |
number_of_reviews | 100.0% | 20.3424 | 6.0 | 0.0 | 867.0 |
mean_num_nights | 100.0% | 436.142 | 563.0 | 1.0 | 5e+06 |
price_log | 100.0% | 4.53285 | 4.45435 | 0.0 | 9.39599 |
bathrooms | 99.97% | 1.12773 | 1.0 | 0.0 | 50.0 |
bedrooms | 99.84% | 1.09691 | 1.0 | 0.0 | 50.0 |
beds | 99.63% | 1.6858 | 1.0 | 0.0 | 50.0 |
zipcode_clean | 91.82% | 75012.2 | 75012.0 | 75002.0 | 75020.0 |
cleaning_fee | 75.8% | 42.2892 | 35.0 | 0.0 | 735.0 |
security_deposit | 71.73% | 411.284 | 300.0 | 0.0 | 4740.0 |
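The statistics in this table can be approximated for a single column with the sketch below. It assumes that a value counts as numerical when it parses as a finite-or-infinite float and is not NaN; the helper names `to_number` and `summarize` are hypothetical, and the sample values are illustrative rather than drawn from the dataset.

```python
from statistics import mean, median


def to_number(value):
    """Return the value as a float, or None if it is not numerical."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return None if number != number else number  # NaN is not equal to itself


def summarize(values):
    """Compute the share of numerical values and basic statistics over them."""
    numbers = [n for n in map(to_number, values) if n is not None]
    if not numbers:
        return None
    return {
        "pct_numerical": round(100.0 * len(numbers) / len(values), 2),
        "mean": mean(numbers),
        "median": median(numbers),
        "min": min(numbers),
        "max": max(numbers),
    }


# Hypothetical column mixing zip codes, a category label, and an empty field.
zipcodes = ["75019", "75010", "Other", "", "75018"]
stats = summarize(zipcodes)
print(stats)
```

A column such as zipcode_clean shows why the percentage matters: it is mostly numerical, yet the non-numerical "Other" entries and its low number of unique values suggest it behaves more like a Categorical feature.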