Amazon SageMaker Autopilot Data Exploration

This report provides insights about the dataset you provided as input to the AutoML job. It was automatically generated by the AutoML training job: airbnbautopilot.

As part of the AutoML job, the input dataset was randomly split into two pieces, one for training and one for validation. The training dataset was randomly sampled, and metrics were computed for each of the columns. This notebook provides these metrics so that you can:

  1. Understand how the job analyzed features to select the candidate pipelines.
  2. Modify and improve the generated AutoML pipelines using knowledge that you have about the dataset.
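
The random split and sampling described above can be sketched as follows. This is a minimal illustration only: the report does not state the split ratio, the sample size, or the seed that the AutoML job used, so `validation_fraction`, `max_rows`, and `seed` are assumed values, and both function names are hypothetical.

```python
import random

def split_train_validation(rows, validation_fraction=0.2, seed=0):
    """Shuffle row indices and split the rows into training and validation sets."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_validation = int(len(rows) * validation_fraction)
    validation = [rows[i] for i in indices[:n_validation]]
    training = [rows[i] for i in indices[n_validation:]]
    return training, validation

def sample_rows(rows, max_rows, seed=0):
    """Draw a random sample without replacement, capped at max_rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(max_rows, len(rows)))

rows = list(range(100))  # stand-in for the dataset rows
training, validation = split_train_validation(rows)
sample = sample_rows(training, max_rows=10)
print(len(training), len(validation), len(sample))
```

Column metrics would then be computed on `sample` rather than on the full training set.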

We read 64095 rows from the training dataset. The dataset has 22 columns and the column named price_log is used as the target column. This is identified as a Regression problem. The labels were found to be within the range [0.0, 9.39599].

💡 Suggested Action Items - Look for sections like this for recommended actions that you can take.

Contents

  1. Dataset Sample
  2. Column Analysis

Dataset Sample

The following table is a random sample of 10 rows from the training dataset. For ease of presentation, we are only showing 20 of the 22 columns of the dataset.

💡 Suggested Action Items - Verify the input headers correctly align with the columns of the dataset sample. If they are incorrect, update the header names of your input dataset in Amazon Simple Storage Service (Amazon S3).
| | (unnamed) | host_since | first_review | last_review | summary | description | property_type | room_type | bed_type | amenities | ... | extra_people | zipcode_clean | accommodates | bathrooms | bedrooms | beds | guests_included | number_of_reviews | mean_num_nights | price_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38291 | 2014-02-25 | 2018-07-08 | 2019-06-12 | Appartement très agréable au 7èm étage . | Appartement très agréable au 7èm étage . Soiré... | Apartment | Entire home/apt | Real Bed | 11 | ... | 0.0 | 75019 | 2 | 1.0 | 1.0 | 1.0 | 2 | 17 | 564.0 | 3.9318256327243257 |
| 1 | 23934 | 2013-01-28 | 2016-12-15 | 2020-03-14 | Au cœur du canal Saint-Martin, ancien atelier ... | Au cœur du canal Saint-Martin, ancien atelier ... | Apartment | Entire home/apt | Real Bed | 24 | ... | 0.0 | 75010 | 2 | 2.0 | 1.0 | 1.0 | 1 | 17 | 46.5 | 4.912654885736052 |
| 2 | 7380 | 2014-08-07 | 2018-01-02 | 2020-02-23 | Cute Parisian apartment with wonderful views t... | Cute Parisian apartment with wonderful views t... | Apartment | Entire home/apt | Real Bed | 24 | ... | 0.0 | 75020 | 2 | 1.0 | 1.0 | 1.0 | 1 | 37 | 11.0 | 4.02535169073515 |
| 3 | 42439 | 2018-10-29 | 2018-11-11 | 2020-01-20 | Beautiful apartment, both modern and bright. T... | Beautiful apartment, both modern and bright. T... | Apartment | Entire home/apt | Real Bed | 21 | ... | 0.0 | 75018 | 4 | 1.0 | 1.0 | 2.0 | 1 | 12 | 563.0 | 4.394449154672439 |
| 4 | 40742 | 2016-06-06 | | | A gem in the City of Lights. Walking distance ... | A gem in the City of Lights. Walking distance ... | Apartment | Private room | Real Bed | 29 | ... | 0.0 | Other | 4 | 1.0 | 1.0 | 1.0 | 1 | 0 | 566.0 | 6.2166061010848646 |
| 5 | 26763 | 2016-06-16 | 2017-05-14 | 2017-05-16 | Alone or with a few friends, you're looking fo... | Alone or with a few friends, you're looking fo... | Apartment | Private room | Real Bed | 15 | ... | 15.0 | 75018 | 2 | 1.0 | 0.0 | 2.0 | 2 | 2 | 563.5 | 3.828641396489095 |
| 6 | 61489 | 2012-09-19 | 2020-02-03 | 2020-03-04 | Contemporary apartment with a view of the Eiff... | Contemporary apartment with a view of the Eiff... | Apartment | Entire home/apt | Real Bed | 13 | ... | 0.0 | 75019 | 2 | 2.0 | 1.0 | 1.0 | 1 | 2 | 564.0 | 4.709530201312334 |
| 7 | 52201 | 2019-06-28 | 2019-07-05 | 2019-10-06 | Beautiful apartment of 30 square meters locate... | Beautiful apartment of 30 square meters locate... | Apartment | Entire home/apt | Real Bed | 17 | ... | 0.0 | 75018 | 2 | 1.0 | 1.0 | 1.0 | 1 | 18 | 563.0 | 4.02535169073515 |
| 8 | 51090 | 2013-07-05 | 2019-07-01 | 2020-01-05 | Bel appartement parisien de 29m2, spacieux, or... | Bel appartement parisien de 29m2, spacieux, or... | Apartment | Entire home/apt | Real Bed | 11 | ... | 0.0 | 75018 | 3 | 1.0 | 1.0 | 2.0 | 1 | 14 | 2.5 | 4.2626798770413155 |
| 9 | 8638 | 2015-04-12 | | | Atelier d'artiste parisien transformé en appar... | Atelier d'artiste parisien transformé en appar... | Apartment | Entire home/apt | Real Bed | 40 | ... | 0.0 | 75017 | 2 | 1.0 | 1.0 | 1.0 | 1 | 0 | 8.5 | 4.709530201312334 |

Column Analysis

The AutoML job analyzed the 22 input columns to infer each data type and select the feature processing pipelines for each training algorithm. For more details on the specific AutoML pipeline candidates, see Amazon SageMaker Autopilot Candidate Definition Notebook.ipynb.

Percent of Missing Values

Within the data sample, the following columns contained missing values, such as nan, whitespace, or empty fields.

SageMaker Autopilot will attempt to fill in missing values using various techniques. For example, missing values can be replaced with a new 'unknown' category for Categorical features and missing Numerical values can be replaced with the mean or median of the column.

We found that 10 of the 22 columns contained missing values. The following table shows the 10 columns with the highest percentage of missing values.

💡 Suggested Action Items
- Investigate the governance of the training dataset. Do you expect this many missing values? Are you able to fill in the missing values with real data?
- Use domain knowledge to define an appropriate default value for the feature. Either:
  - Replace all missing values with the new default value in your dataset in Amazon S3.
  - Add a step to the feature pre-processing pipeline to fill missing values, for example with a [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html).

| Column | % of Missing Values |
|---|---|
| security_deposit | 28.27% |
| cleaning_fee | 24.2% |
| first_review | 20.45% |
| last_review | 20.45% |
| zipcode_clean | 0.91% |
| beds | 0.37% |
| bedrooms | 0.16% |
| bathrooms | 0.03% |
| host_since | 0.01% |
| bed_type | 0.01% |
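
The imputation strategies mentioned in this section (an 'unknown' category for Categorical features, the mean or median for Numerical features) can be sketched with the standard library alone. This is an illustration under assumptions, not the code Autopilot runs: the function names are hypothetical, the sample values are made up, and the definition of "missing" used here (None, NaN, empty or whitespace-only strings) is an interpretation of the description above.

```python
import math
import statistics

def is_missing(value):
    """Treat None, NaN, empty strings, and whitespace-only strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

def percent_missing(values):
    """Percentage of entries in a column that are missing."""
    return 100.0 * sum(is_missing(v) for v in values) / len(values)

def impute_numeric(values, strategy="median"):
    """Replace missing numbers with the median or mean of the observed values."""
    observed = [v for v in values if not is_missing(v)]
    if strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mean":
        fill = statistics.mean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if is_missing(v) else v for v in values]

def impute_categorical(values, fill_value="unknown"):
    """Replace missing categories with a dedicated placeholder category."""
    return [fill_value if is_missing(v) else v for v in values]

cleaning_fee = [35.0, float("nan"), 50.0, None, 20.0]  # made-up values
bed_type = ["Real Bed", "", "Futon", "Real Bed"]       # made-up values

print(percent_missing(cleaning_fee))
print(impute_numeric(cleaning_fee))
print(impute_categorical(bed_type))
```

In a scikit-learn pipeline, `SimpleImputer` with `strategy="median"` or `strategy="constant"` plays the same role.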

Count Statistics

For String features, it is important to count the number of unique values to determine whether to treat a feature as Categorical or Text, and then process the feature according to its type.

For example, SageMaker Autopilot counts the number of unique entries and the number of unique words. The following string column would have 3 total entries, 2 unique entries, and 3 unique words.

String Column
0 "red blue"
1 "red blue"
2 "red blue yellow"
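
The counts for this example column can be reproduced with a few lines of Python. The sketch assumes words are separated by whitespace; the tokenization Autopilot actually uses is not described in this report, and `count_statistics` is a hypothetical helper name.

```python
def count_statistics(column):
    """Return (total entries, unique entries, unique words) for a string column."""
    total_entries = len(column)
    unique_entries = len(set(column))
    # Split each entry on whitespace and collect the distinct words
    unique_words = len({word for entry in column for word in entry.split()})
    return total_entries, unique_entries, unique_words

string_column = ["red blue", "red blue", "red blue yellow"]
print(count_statistics(string_column))  # (3, 2, 3)
```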

If the feature is Categorical, SageMaker Autopilot can look at the total number of unique entries and transform it using techniques such as one-hot encoding. If the field contains a Text string, we look at the number of unique words, or the vocabulary size, in the string. We can use the unique words to then compute text-based features, such as Term Frequency-Inverse Document Frequency (tf-idf).
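
As an illustration of the two transformations named above, the following sketch one-hot encodes a small categorical column and computes tf-idf weights for the example string column. It uses the textbook definition, term frequency times log(N / document frequency), as an assumption; library implementations often use smoothed variants, and the exact formula Autopilot applies is not stated here. Both function names are hypothetical.

```python
import math
from collections import Counter

def one_hot_encode(values):
    """Map each value to a 0/1 vector with one position per distinct category."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded

def tf_idf(documents):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each word appears
    document_frequency = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / document_frequency[word])
            for word, count in counts.items()
        })
    return weights

categories, encoded = one_hot_encode(["Private room", "Entire home/apt", "Private room"])
weights = tf_idf(["red blue", "red blue", "red blue yellow"])
print(categories, encoded)
print(weights[2])
```

Words that appear in every entry ("red", "blue") receive a weight of zero under this definition, while the rarer "yellow" receives a positive weight.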

Note: If the number of unique values is too high, we risk data transformations expanding the dataset to too many features. In that case, SageMaker Autopilot will attempt to reduce the dimensionality of the post-processed data, such as by capping the number of vocabulary words for tf-idf, applying Principal Component Analysis (PCA), or using other dimensionality reduction techniques.
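
One way to cap a vocabulary is to keep only the most frequent words. The sketch below shows that idea; ranking by frequency is an assumption, since the report does not say how Autopilot chooses which words to keep, and `cap_vocabulary` is a hypothetical helper name. PCA is not shown.

```python
from collections import Counter

def cap_vocabulary(documents, max_words):
    """Keep the max_words most frequent words across all documents."""
    counts = Counter(word for doc in documents for word in doc.split())
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_words]]

vocabulary = cap_vocabulary(["red blue", "red blue", "red blue yellow"], max_words=2)
print(vocabulary)  # ['blue', 'red']
```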

The table below shows all 22 columns ranked by the number of unique entries.

💡 Suggested Action Items
- Verify the number of unique values of a feature is expected with respect to domain knowledge. If it differs, one explanation could be multiple encodings of a value. For example `US` and `U.S.` will count as two different words. You could correct the error at the data source or pre-process your dataset in your S3 bucket.
- If the number of unique values seems too high for Categorical variables, investigate if using domain knowledge to group the feature to a new feature with a smaller set of possible values improves performance.

| Column | Number of Unique Entries | Number of Unique Words (if Text) |
|---|---|---|
| room_type | 4 | n/a |
| bed_type | 5 | n/a |
| bedrooms | 14 | n/a |
| guests_included | 17 | n/a |
| property_type | 18 | n/a |
| beds | 19 | n/a |
| bathrooms | 19 | n/a |
| accommodates | 19 | n/a |
| zipcode_clean | 21 | n/a |
| amenities | 68 | n/a |
| extra_people | 101 | n/a |
| cleaning_fee | 224 | n/a |
| number_of_reviews | 429 | n/a |
| mean_num_nights | 560 | n/a |
| security_deposit | 722 | n/a |
| price_log | 728 | n/a |
| last_review | 2012 | 2012 |
| first_review | 2941 | 2941 |
| host_since | 3594 | 3594 |
| summary | 62010 | 105113 |
| description | 62850 | 190834 |
| (unnamed) | 64059 | n/a |
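
The two action items above (merging multiple encodings of one value, and grouping a high-cardinality Categorical feature into fewer values) can be sketched as follows. The `CANONICAL` mapping, the `min_count` threshold, and the sample values are hypothetical and would come from domain knowledge in practice.

```python
from collections import Counter

# Hypothetical mapping from variant spellings to one canonical value
CANONICAL = {"u.s.": "US", "us": "US", "usa": "US"}

def normalize(value):
    """Trim whitespace and map known variants to their canonical spelling."""
    stripped = value.strip()
    return CANONICAL.get(stripped.lower(), stripped)

def group_rare(values, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

raw = ["US", "U.S.", "us ", "France"]
before = Counter(raw)
after = Counter(normalize(v) for v in raw)
print(len(before), len(after))  # 4 2

property_type = ["Apartment", "Apartment", "Loft", "Boat"]  # made-up values
print(group_rare(property_type, min_count=2))
```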

Descriptive Statistics

For each of the numerical input features, several descriptive statistics are computed from the data sample.

SageMaker Autopilot may treat numerical features as Categorical if the number of unique entries is sufficiently low. For Numerical features, we may apply numerical transformations such as normalization, log and quantile transforms, and binning to manage outlier values and differences in feature scales.
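
A rough sketch of three such transformations is shown below: standardization, a log transform, and quantile-based binning. The exact variants and parameters Autopilot applies are not given in this report, so the choices here (population standard deviation, `log1p`, and cut points from `statistics.quantiles`) are assumptions, and the function names are hypothetical.

```python
import bisect
import math
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]

def log_transform(values):
    """Compress large values with log(1 + v), which is defined at v = 0."""
    return [math.log1p(v) for v in values]

def quantile_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using quantile cut points."""
    cut_points = statistics.quantiles(values, n=n_bins)
    return [bisect.bisect_right(cut_points, v) for v in values]

number_of_reviews = [0, 2, 6, 17, 867]  # made-up values with one extreme entry
print(standardize([1.0, 2.0, 3.0]))
print(log_transform(number_of_reviews))
print(quantile_bins(number_of_reviews, n_bins=4))
```

Binning by quantile places the extreme value 867 in the top bin alongside any other large values, which limits its influence on scale-sensitive models.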

We found that 14 of the 22 columns contained at least one numerical value. The table below shows these 14 columns, ranked by the percentage of numerical values.

💡 Suggested Action Items
- Investigate the origin of the data field. Are some values non-finite (e.g. infinity, nan)? Are they missing or is it an error in data input?
- Missing and extreme values may indicate a bug in the data collection process. Verify the numerical descriptions align with expectations. For example, use domain knowledge to check that the range of values for a feature meets expectations.

| Column | % of Numerical Values | Mean | Median | Min | Max |
|---|---|---|---|---|---|
| (unnamed) | 100.0% | 33370.2 | 32895.0 | 0.0 | 66899.0 |
| amenities | 100.0% | 18.7832 | 18.0 | 1.0 | 91.0 |
| extra_people | 100.0% | 5.54269 | 0.0 | 0.0 | 277.0 |
| accommodates | 100.0% | 3.08643 | 2.0 | 1.0 | 22.0 |
| guests_included | 100.0% | 1.51823 | 1.0 | 1.0 | 100.0 |
| number_of_reviews | 100.0% | 20.3424 | 6.0 | 0.0 | 867.0 |
| mean_num_nights | 100.0% | 436.142 | 563.0 | 1.0 | 5e+06 |
| price_log | 100.0% | 4.53285 | 4.45435 | 0.0 | 9.39599 |
| bathrooms | 99.97% | 1.12773 | 1.0 | 0.0 | 50.0 |
| bedrooms | 99.84% | 1.09691 | 1.0 | 0.0 | 50.0 |
| beds | 99.63% | 1.6858 | 1.0 | 0.0 | 50.0 |
| zipcode_clean | 91.82% | 75012.2 | 75012.0 | 75002.0 | 75020.0 |
| cleaning_fee | 75.8% | 42.2892 | 35.0 | 0.0 | 735.0 |
| security_deposit | 71.73% | 411.284 | 300.0 | 0.0 | 4740.0 |
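
A range check of the kind suggested above could look like the following sketch. The bounds in `EXPECTED_RANGES` are hypothetical values chosen for illustration, not thresholds taken from this report; the second sample row reuses the Max values that the table reports for bedrooms and mean_num_nights.

```python
import math

# Hypothetical plausible ranges; replace with bounds from domain knowledge
EXPECTED_RANGES = {
    "bedrooms": (0, 10),
    "mean_num_nights": (1, 1125),
}

def find_out_of_range(rows, expected_ranges):
    """Return (row_index, column, value) for each value outside its expected range."""
    violations = []
    for index, row in enumerate(rows):
        for column, (low, high) in expected_ranges.items():
            value = row.get(column)
            # Missing values are handled by imputation, so skip them here
            if value is None or (isinstance(value, float) and math.isnan(value)):
                continue
            # Infinite values fail this comparison and are reported as violations
            if not (low <= value <= high):
                violations.append((index, column, value))
    return violations

rows = [
    {"bedrooms": 1.0, "mean_num_nights": 563.0},
    {"bedrooms": 50.0, "mean_num_nights": 5e6},  # Max values from the table above
]
print(find_out_of_range(rows, EXPECTED_RANGES))
```

Rows flagged this way can then be traced back to the data source to decide whether they are genuine extremes or input errors.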