Amazon SageMaker Autopilot Data Exploration

This report provides insights about the dataset you provided as input to the AutoML job. It was automatically generated by the AutoML training job: airbnbautopilot.

As part of the AutoML job, the input dataset was randomly split into two pieces, one for training and one for validation. The training dataset was randomly sampled, and metrics were computed for each of the columns. This notebook provides these metrics so that you can:

- Understand how the job analyzed features to select the candidate pipelines.
- Modify and improve the generated AutoML pipelines using knowledge that you have about the dataset.

We read 64095 rows from the training dataset. The dataset has 22 columns and the column named price_log is used as the target column. This is identified as a Regression problem. The labels were found to be within the range [0.0, 9.39599].

💡 Suggested Action Items - Look for sections like this for recommended actions that you can take.

Dataset Sample

The following table is a random sample of 10 rows from the training dataset. For ease of presentation, we are only showing 20 of the 22 columns of the dataset.

💡 Suggested Action Items - Verify the input headers correctly align with the columns of the dataset sample. If they are incorrect, update the header names of your input dataset in Amazon Simple Storage Service (Amazon S3).

		host_since	first_review	last_review	summary	description	property_type	room_type	bed_type	amenities	...	extra_people	zipcode_clean	accommodates	bathrooms	bedrooms	beds	guests_included	number_of_reviews	mean_num_nights	price_log
0	38291	2014-02-25	2018-07-08	2019-06-12	Appartement très agréable au 7èm étage .	Appartement très agréable au 7èm étage . Soiré...	Apartment	Entire home/apt	Real Bed	11	...	0.0	75019	2	1.0	1.0	1.0	2	17	564.0	3.9318256327243257
1	23934	2013-01-28	2016-12-15	2020-03-14	Au cœur du canal Saint-Martin, ancien atelier ...	Au cœur du canal Saint-Martin, ancien atelier ...	Apartment	Entire home/apt	Real Bed	24	...	0.0	75010	2	2.0	1.0	1.0	1	17	46.5	4.912654885736052
2	7380	2014-08-07	2018-01-02	2020-02-23	Cute Parisian apartment with wonderful views t...	Cute Parisian apartment with wonderful views t...	Apartment	Entire home/apt	Real Bed	24	...	0.0	75020	2	1.0	1.0	1.0	1	37	11.0	4.02535169073515
3	42439	2018-10-29	2018-11-11	2020-01-20	Beautiful apartment, both modern and bright. T...	Beautiful apartment, both modern and bright. T...	Apartment	Entire home/apt	Real Bed	21	...	0.0	75018	4	1.0	1.0	2.0	1	12	563.0	4.394449154672439
4	40742	2016-06-06			A gem in the City of Lights. Walking distance ...	A gem in the City of Lights. Walking distance ...	Apartment	Private room	Real Bed	29	...	0.0	Other	4	1.0	1.0	1.0	1	0	566.0	6.2166061010848646
5	26763	2016-06-16	2017-05-14	2017-05-16	Alone or with a few friends, you're looking fo...	Alone or with a few friends, you're looking fo...	Apartment	Private room	Real Bed	15	...	15.0	75018	2	1.0	0.0	2.0	2	2	563.5	3.828641396489095
6	61489	2012-09-19	2020-02-03	2020-03-04	Contemporary apartment with a view of the Eiff...	Contemporary apartment with a view of the Eiff...	Apartment	Entire home/apt	Real Bed	13	...	0.0	75019	2	2.0	1.0	1.0	1	2	564.0	4.709530201312334
7	52201	2019-06-28	2019-07-05	2019-10-06	Beautiful apartment of 30 square meters locate...	Beautiful apartment of 30 square meters locate...	Apartment	Entire home/apt	Real Bed	17	...	0.0	75018	2	1.0	1.0	1.0	1	18	563.0	4.02535169073515
8	51090	2013-07-05	2019-07-01	2020-01-05	Bel appartement parisien de 29m2, spacieux, or...	Bel appartement parisien de 29m2, spacieux, or...	Apartment	Entire home/apt	Real Bed	11	...	0.0	75018	3	1.0	1.0	2.0	1	14	2.5	4.2626798770413155
9	8638	2015-04-12			Atelier d'artiste parisien transformé en appar...	Atelier d'artiste parisien transformé en appar...	Apartment	Entire home/apt	Real Bed	40	...	0.0	75017	2	1.0	1.0	1.0	1	0	8.5	4.709530201312334

Column Analysis

The AutoML job analyzed the 22 input columns to infer each data type and select the feature processing pipelines for each training algorithm. For more details on the specific AutoML pipeline candidates, see Amazon SageMaker Autopilot Candidate Definition Notebook.ipynb.

Percent of Missing Values

Within the data sample, the following columns contained missing values, such as nan, whitespace, or empty fields.
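As a rough illustration of how such a percentage could be computed, the sketch below counts `None`, NaN, empty, and whitespace-only values as missing. The exact rules SageMaker Autopilot applies are not stated in this report, so `is_missing` and `percent_missing` are hypothetical helpers, and the sample rows are made up for the example.

```python
import math


def is_missing(value):
    """Treat None, NaN, empty strings, and whitespace-only strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    # Assumption: the literal string "nan" also counts as missing.
    if isinstance(value, str) and value.strip().lower() in ("", "nan"):
        return True
    return False


def percent_missing(rows, column):
    """Return the percentage of rows whose value in `column` is missing."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if is_missing(row.get(column)))
    return 100.0 * missing / len(rows)


# Hypothetical mini-sample; the values are illustrative, not taken from the dataset.
sample_rows = [
    {"first_review": "2018-07-08", "cleaning_fee": 30.0},
    {"first_review": "", "cleaning_fee": float("nan")},
    {"first_review": "   ", "cleaning_fee": 45.0},
    {"first_review": "2019-07-01", "cleaning_fee": None},
]

for name in ("first_review", "cleaning_fee"):
    print(f"{name}: {percent_missing(sample_rows, name):.2f}%")
```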

SageMaker Autopilot will attempt to fill in missing values using various techniques. For example, missing values can be replaced with a new 'unknown' category for Categorical features and missing Numerical values can be replaced with the mean or median of the column.

We found that 10 of the 22 columns contained missing values. The following table shows the 10 columns with the highest percentage of missing values.

💡 Suggested Action Items
- Investigate the governance of the training dataset. Do you expect this many missing values? Are you able to fill in the missing values with real data?
- Use domain knowledge to define an appropriate default value for the feature. Either:
  - Replace all missing values with the new default value in your dataset in Amazon S3.
  - Add a step to the feature pre-processing pipeline to fill missing values, for example with a [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html).

	% of Missing Values
security_deposit	28.27%
cleaning_fee	24.2%
first_review	20.45%
last_review	20.45%
zipcode_clean	0.91%
beds	0.37%
bedrooms	0.16%
bathrooms	0.03%
host_since	0.01%
bed_type	0.01%
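
The imputation step suggested in the action items can be sketched with `sklearn.impute.SimpleImputer`. This is a minimal example on made-up numbers, not the pipeline Autopilot generates. The column meaning and the choice of `0.0` as a default are assumptions you would need to validate against your own data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns: [security_deposit, cleaning_fee]; values are illustrative.
X = np.array([
    [300.0, 35.0],
    [np.nan, 50.0],
    [500.0, np.nan],
    [100.0, 20.0],
])

# Replace each missing numeric value with the median of its column.
median_imputer = SimpleImputer(strategy="median")
X_filled = median_imputer.fit_transform(X)

# Alternatively, fill with a domain-specific default such as 0 (no fee charged).
# Whether 0 is a sensible default is an assumption to check against your data.
default_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
X_default = default_imputer.fit_transform(X)

print(X_filled)
print(X_default)
```

Median imputation keeps the filled values inside the observed range, while a constant default encodes a domain statement ("missing means no fee") that may or may not hold.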

Count Statistics

For String features, it is important to count the number of unique values to determine whether to treat a feature as Categorical or Text, and then process the feature according to its type.

For example, SageMaker Autopilot counts the number of unique entries and the number of unique words. The following string column would have 3 total entries, 2 unique entries, and 3 unique words.

	String Column
0	"red blue"
1	"red blue"
2	"red blue yellow"
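
The three counts for this example column can be reproduced with a few lines of Python. The sketch assumes words are split on whitespace and compared case-sensitively; the report does not say how Autopilot tokenizes, so treat `count_statistics` as a hypothetical helper.

```python
def count_statistics(values):
    """Return (total entries, unique entries, unique words) for a list of strings."""
    total_entries = len(values)
    unique_entries = len(set(values))
    # Assumption: words are separated by whitespace and compared case-sensitively.
    unique_words = len({word for value in values for word in value.split()})
    return total_entries, unique_entries, unique_words


string_column = ["red blue", "red blue", "red blue yellow"]
print(count_statistics(string_column))  # (3, 2, 3)
```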

If the feature is Categorical, SageMaker Autopilot can look at the total number of unique entries and transform it using techniques such as one-hot encoding. If the field contains a Text string, we look at the number of unique words, or the vocabulary size, in the string. We can use the unique words to then compute text-based features, such as Term Frequency-Inverse Document Frequency (tf-idf).
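
As an illustration of the one-hot encoding mentioned above, here is a minimal sketch using scikit-learn's `OneHotEncoder`. The category values are made up for the example, and this is not necessarily the encoder configuration that Autopilot uses.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical room_type values, one feature per row; illustrative only.
room_type = [
    ["Entire home/apt"],
    ["Private room"],
    ["Entire home/apt"],
    ["Shared room"],
]

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(room_type).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(encoded)                 # one column per category, one 1.0 per row
```

Each unique entry becomes its own column, which is why the number of unique entries matters: a feature with thousands of distinct values would add thousands of columns.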

Note: If the number of unique values is too high, we risk data transformations expanding the dataset to too many features. In that case, SageMaker Autopilot will attempt to reduce the dimensionality of the post-processed data, such as by capping the number of vocabulary words for tf-idf, applying Principal Component Analysis (PCA), or other dimensionality reduction techniques.
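
The two techniques named in this note can be sketched with scikit-learn as follows. The text snippets, the vocabulary cap of 5, and the choice of 2 components are all illustrative assumptions, not the settings Autopilot selects.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical listing summaries; illustrative only.
summaries = [
    "bright apartment near the canal",
    "cozy studio near the metro",
    "bright studio with a view",
    "spacious apartment with a balcony",
]

# Cap the vocabulary at the 5 most frequent terms across the corpus.
vectorizer = TfidfVectorizer(max_features=5)
tfidf = vectorizer.fit_transform(summaries)  # sparse matrix, shape (4, 5)

# Reduce the tf-idf features to 2 principal components.
# PCA is applied to a dense copy here, which is only practical for small data.
pca = PCA(n_components=2)
reduced = pca.fit_transform(tfidf.toarray())

print(tfidf.shape, reduced.shape)
```

For a large sparse tf-idf matrix, a method that works directly on sparse input (for example `sklearn.decomposition.TruncatedSVD`) is usually a more practical choice than densifying the matrix first.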

The table below shows 22 of the 22 columns ranked by the number of unique entries.

💡 Suggested Action Items
- Verify the number of unique values of a feature is expected with respect to domain knowledge. If it differs, one explanation could be multiple encodings of a value. For example, `US` and `U.S.` will count as two different words. You could correct the error at the data source or pre-process your dataset in your S3 bucket.
- If the number of unique values seems too high for Categorical variables, investigate if using domain knowledge to group the feature to a new feature with a smaller set of possible values improves performance.

	Number of Unique Entries	Number of Unique Words (if Text)
room_type	4	n/a
bed_type	5	n/a
bedrooms	14	n/a
guests_included	17	n/a
property_type	18	n/a
beds	19	n/a
bathrooms	19	n/a
accommodates	19	n/a
zipcode_clean	21	n/a
amenities	68	n/a
extra_people	101	n/a
cleaning_fee	224	n/a
number_of_reviews	429	n/a
mean_num_nights	560	n/a
security_deposit	722	n/a
price_log	728	n/a
last_review	2012	2012
first_review	2941	2941
host_since	3594	3594
summary	62010	105113
description	62850	190834
	64059	n/a
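
The two clean-up ideas from the action items, unifying multiple encodings of one value and grouping rare categories, can be sketched in plain Python. Both helpers are hypothetical, the normalization rule (strip periods, upper-case) is only one possible choice, and the example values are made up.

```python
from collections import Counter


def normalize_label(value):
    """Collapse simple encoding variants, for example 'U.S.' and 'US'."""
    # Assumption: periods and letter case carry no meaning for this feature.
    return value.replace(".", "").strip().upper()


def group_rare_categories(values, min_count, other_label="Other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


print(sorted({normalize_label(v) for v in ["US", "U.S.", "us"]}))  # ['US']

# Hypothetical property_type values; illustrative only.
property_type = ["Apartment", "Apartment", "Loft", "Apartment", "Houseboat", "Loft", "Cave"]
print(group_rare_categories(property_type, min_count=2))
```

Whether grouping helps is an empirical question, so compare model performance with and without the grouped feature.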

Descriptive Statistics

For each of the numerical input features, several descriptive statistics are computed from the data sample.

SageMaker Autopilot may treat numerical features as Categorical if the number of unique entries is sufficiently low. For Numerical features, we may apply numerical transformations such as normalization, log and quantile transforms, and binning to manage outlier values and differences in feature scales.
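
To make these transformations concrete, the sketch below applies each one to a small made-up feature with a single extreme value. It uses NumPy and scikit-learn, and the specific choices (`log1p`, a uniform output distribution, three quantile bins) are assumptions for illustration rather than what Autopilot picks.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Hypothetical skewed feature with one extreme value; illustrative only.
x = np.array([[1.0], [2.0], [3.0], [5.0], [8.0], [5000.0]])

# Normalization: zero mean and unit variance (still dominated by the outlier).
standardized = StandardScaler().fit_transform(x)

# Log transform: compresses large values; log1p handles zeros safely.
logged = np.log1p(x)

# Quantile transform: maps values to their rank-based position in [0, 1].
quantiled = QuantileTransformer(
    n_quantiles=len(x), output_distribution="uniform"
).fit_transform(x)

# Binning: assign each value to one of three bins with roughly equal counts.
edges = np.quantile(x, [1 / 3, 2 / 3])
binned = np.digitize(x.ravel(), edges)

print(logged.ravel())
print(quantiled.ravel())
print(binned)  # [0 0 1 1 2 2]
```

The quantile transform and binning depend only on the ordering of values, so the extreme value 5000 has no more influence than any other largest value would.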

We found that 14 of the 22 columns contained at least one numerical value. The table below shows the 14 columns with the largest percentage of numerical values.

💡 Suggested Action Items
- Investigate the origin of the data field. Are some values non-finite (e.g. infinity, nan)? Are they missing or is it an error in data input?
- Missing and extreme values may indicate a bug in the data collection process. Verify the numerical descriptions align with expectations. For example, use domain knowledge to check that the range of values for a feature meets expectations.

	% of Numerical Values	Mean	Median	Min	Max
	100.0%	33370.2	32895.0	0.0	66899.0
amenities	100.0%	18.7832	18.0	1.0	91.0
extra_people	100.0%	5.54269	0.0	0.0	277.0
accommodates	100.0%	3.08643	2.0	1.0	22.0
guests_included	100.0%	1.51823	1.0	1.0	100.0
number_of_reviews	100.0%	20.3424	6.0	0.0	867.0
mean_num_nights	100.0%	436.142	563.0	1.0	5e+06
price_log	100.0%	4.53285	4.45435	0.0	9.39599
bathrooms	99.97%	1.12773	1.0	0.0	50.0
bedrooms	99.84%	1.09691	1.0	0.0	50.0
beds	99.63%	1.6858	1.0	0.0	50.0
zipcode_clean	91.82%	75012.2	75012.0	75002.0	75020.0
cleaning_fee	75.8%	42.2892	35.0	0.0	735.0
security_deposit	71.73%	411.284	300.0	0.0	4740.0
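
A check for non-finite and out-of-range values, as recommended in the action items, might look like the following sketch. `check_numeric_column` is a hypothetical helper, the sample values are made up, and the expected bounds of 1 to 1000 are placeholders that you would replace with limits from your own domain knowledge.

```python
import math


def check_numeric_column(values, lower, upper):
    """Count non-finite values and finite values outside [lower, upper]."""
    non_finite = sum(1 for v in values if not math.isfinite(v))
    out_of_range = sum(
        1 for v in values if math.isfinite(v) and not (lower <= v <= upper)
    )
    return {"non_finite": non_finite, "out_of_range": out_of_range}


# Hypothetical values; the bounds below are assumed, not derived from the report.
mean_num_nights = [2.5, 11.0, 564.0, float("nan"), 5e6, float("inf")]
report = check_numeric_column(mean_num_nights, lower=1, upper=1000)
print(report)  # {'non_finite': 2, 'out_of_range': 1}
```

A maximum such as the 5e+06 reported for mean_num_nights in the table above is the kind of value this check is meant to surface for manual review.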