This notebook was automatically generated by the AutoML job airbnbautopilot. It allows you to customize the candidate definitions and execute the SageMaker Autopilot workflow.
The dataset has 22 columns, and the column named price_log is used as the target column. This is treated as a Regression problem. This notebook will build a Regression model that minimizes the "MSE" quality metric of the trained models. MSE stands for mean squared error: the average squared difference between the model's predictions and the true target values, so smaller values are better.
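For concreteness, here is a minimal sketch of how MSE is computed; the values below are hypothetical, and the actual metric is computed by SageMaker on the validation split.
import numpy as np
# Illustration only: hypothetical predictions vs. true price_log values
y_true = np.array([4.2, 5.0, 3.8])
y_pred = np.array([4.0, 5.1, 4.1])
mse = np.mean((y_true - y_pred) ** 2)  # mean squared error
print(mse)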
As part of the AutoML job, the input dataset has been randomly split into two pieces, one for training and one for validation. This notebook helps you inspect and modify the data transformation approaches proposed by Amazon SageMaker Autopilot. You can interactively train the data transformation models and use them to transform the data. Finally, you can execute a multiple algorithm hyperparameter optimization (multi-algo HPO) job that helps you find the best model for your dataset by jointly optimizing the data transformations and machine learning algorithms.
Before you launch the SageMaker Autopilot jobs, we'll set up the environment for Amazon SageMaker.
Minimal Environment Requirements
- JupyterLab 1.0.6, jupyter_core 4.5.0, and IPython 6.4.0
- conda_python3 kernel
- sagemaker-python-sdk>=v1.43.4
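You can optionally confirm that the kernel environment meets these requirements with a quick version check; the versions installed in your environment may be newer than the minimums listed above.
import sys
import IPython
import jupyter_core
import sagemaker
print("Python:", sys.version.split()[0])
print("IPython:", IPython.__version__)
print("jupyter_core:", jupyter_core.__version__)
print("sagemaker:", sagemaker.__version__)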
Download the generated data transformation modules and a SageMaker Autopilot helper module used by this notebook. These artifacts will be downloaded to the airbnbautopilot-artifacts folder.
!mkdir -p airbnbautopilot-artifacts
!aws s3 sync s3://sagemaker-studio-924252573936-6upcy2kudh2/airbnbautopilot/sagemaker-automl-candidates/pr-1-509a669e98294abcb2ee0626ecc272570922ae3e5f194471b1a91ae131/generated_module airbnbautopilot-artifacts/generated_module --only-show-errors
!aws s3 sync s3://sagemaker-studio-924252573936-6upcy2kudh2/airbnbautopilot/sagemaker-automl-candidates/pr-1-509a669e98294abcb2ee0626ecc272570922ae3e5f194471b1a91ae131/notebooks/sagemaker_automl airbnbautopilot-artifacts/sagemaker_automl --only-show-errors
import sys
sys.path.append("airbnbautopilot-artifacts")
The following configuration has been derived from the SageMaker Autopilot job. These items configure where this notebook will look for generated candidates, and where input and output data is stored on Amazon S3.
from sagemaker_automl import uid, AutoMLLocalRunConfig
# Where the preprocessed data from the existing AutoML job is stored
BASE_AUTOML_JOB_NAME = 'airbnbautopilot'
BASE_AUTOML_JOB_CONFIG = {
'automl_job_name': BASE_AUTOML_JOB_NAME,
'automl_output_s3_base_path': 's3://sagemaker-studio-924252573936-6upcy2kudh2/airbnbautopilot',
'data_transformer_image_repo_version': '0.2-1-cpu-py3',
'algo_image_repo_versions': {'xgboost': '1.0-1-cpu-py3', 'linear-learner': 'latest'}
}
# Path conventions of the output data storage path from the local AutoML job run of this notebook
LOCAL_AUTOML_JOB_NAME = 'airbnbauto-notebook-run-{}'.format(uid())
LOCAL_AUTOML_JOB_CONFIG = {
'local_automl_job_name': LOCAL_AUTOML_JOB_NAME,
'local_automl_job_output_s3_base_path': 's3://sagemaker-studio-924252573936-6upcy2kudh2/airbnbautopilot/{}'.format(LOCAL_AUTOML_JOB_NAME),
'data_processing_model_dir': 'data-processor-models',
'data_processing_transformed_output_dir': 'transformed-data',
'multi_algo_tuning_output_dir': 'multi-algo-tuning'
}
AUTOML_LOCAL_RUN_CONFIG = AutoMLLocalRunConfig(
role='arn:aws:iam::924252573936:role/service-role/AmazonSageMaker-ExecutionRole-20201011T111659',
base_automl_job_config=BASE_AUTOML_JOB_CONFIG,
local_automl_job_config=LOCAL_AUTOML_JOB_CONFIG,
security_config={'EnableInterContainerTrafficEncryption': False, 'VpcConfig': {}})
AUTOML_LOCAL_RUN_CONFIG.display()
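As an optional sanity check, you can also print a few of the derived values directly; only attributes that this notebook uses later (region, role, and the local job name) are shown here.
# Optional: inspect a few derived configuration values used later in this notebook
print("Region:", AUTOML_LOCAL_RUN_CONFIG.region)
print("Execution role:", AUTOML_LOCAL_RUN_CONFIG.role)
print("Local AutoML job name:", AUTOML_LOCAL_RUN_CONFIG.local_automl_job_name)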
The AutoMLInteractiveRunner keeps track of selected candidates and automates many of the steps needed to execute the feature engineering and tuning steps.
from sagemaker_automl import AutoMLInteractiveRunner, AutoMLLocalCandidate
automl_interactive_runner = AutoMLInteractiveRunner(AUTOML_LOCAL_RUN_CONFIG)
The SageMaker Autopilot job has analyzed the dataset and generated 10 machine learning pipelines that use 2 algorithms. Each pipeline contains a set of feature transformers and an algorithm.
dpp0-xgboost: This data transformation strategy first transforms 'numeric' features using RobustImputer (converts missing values to nan), 'categorical' features using ThresholdOneHotEncoder, and 'text' features using MultiColumnTfidfVectorizer. It merges all the generated features and applies RobustStandardScaler. The transformed data will be used to tune an xgboost model. Here is the definition:
automl_interactive_runner.select_candidate({
"data_transformer": {
"name": "dpp0",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
"volume_size_in_gb": 50
},
"transform_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
},
"transforms_label": False,
"transformed_data_format": "application/x-recordio-protobuf",
"sparse_encoding": True
},
"algorithm": {
"name": "xgboost",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
}
}
})
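To make the dpp0 strategy more concrete, here is a purely conceptual sketch of a similar pipeline expressed with plain scikit-learn building blocks. This is an illustration only: the real transformers (RobustImputer, ThresholdOneHotEncoder, MultiColumnTfidfVectorizer, RobustStandardScaler) come from sagemaker-scikit-learn-extension and are generated for you in the airbnbautopilot-artifacts/generated_module folder, and the column names below are hypothetical.
# Conceptual illustration only -- NOT the generated dpp0 module
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer                          # stand-in for RobustImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler   # stand-ins for ThresholdOneHotEncoder / RobustStandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer       # stand-in for MultiColumnTfidfVectorizer
numeric_cols = ["accommodates", "bathrooms"]        # hypothetical column names
categorical_cols = ["neighbourhood", "room_type"]   # hypothetical column names
text_col = "description"                            # hypothetical column name
features = ColumnTransformer([
    ("numeric", SimpleImputer(strategy="median"), numeric_cols),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("text", TfidfVectorizer(), text_col),
])
# Merge all generated features, then scale -- analogous to RobustStandardScaler in dpp0
conceptual_dpp0 = Pipeline([("features", features), ("scale", StandardScaler(with_mean=False))])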
dpp1-xgboost: This data transformation strategy first transforms 'numeric' features using RobustImputer (converts missing values to nan), 'categorical' features using ThresholdOneHotEncoder, and 'text' features using MultiColumnTfidfVectorizer. It merges all the generated features and applies RobustStandardScaler. The transformed data will be used to tune an xgboost model. Here is the definition:
automl_interactive_runner.select_candidate({
"data_transformer": {
"name": "dpp1",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
"volume_size_in_gb": 50
},
"transform_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
},
"transforms_label": False,
"transformed_data_format": "application/x-recordio-protobuf",
"sparse_encoding": True
},
"algorithm": {
"name": "xgboost",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
}
}
})
dpp2-xgboost: This data transformation strategy first transforms 'numeric' features using RobustImputer (converts missing values to nan), 'categorical' features using ThresholdOneHotEncoder, and 'text' features using MultiColumnTfidfVectorizer. It merges all the generated features and applies RobustStandardScaler. The transformed data will be used to tune an xgboost model. Here is the definition:
automl_interactive_runner.select_candidate({
"data_transformer": {
"name": "dpp2",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
"volume_size_in_gb": 50
},
"transform_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
},
"transforms_label": False,
"transformed_data_format": "application/x-recordio-protobuf",
"sparse_encoding": True
},
"algorithm": {
"name": "xgboost",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
}
}
})
dpp3-xgboost: This data transformation strategy first transforms 'numeric' features using RobustImputer (converts missing values to nan), 'categorical' features using ThresholdOneHotEncoder, and 'text' features using MultiColumnTfidfVectorizer. It merges all the generated features and applies RobustStandardScaler. The transformed data will be used to tune an xgboost model. Here is the definition:
automl_interactive_runner.select_candidate({
"data_transformer": {
"name": "dpp3",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
"volume_size_in_gb": 50
},
"transform_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
},
"transforms_label": False,
"transformed_data_format": "application/x-recordio-protobuf",
"sparse_encoding": True
},
"algorithm": {
"name": "xgboost",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
}
}
})
dpp4-linear-learner: This data transformation strategy first transforms 'numeric' features using combined RobustImputer and RobustMissingIndicator followed by QuantileExtremeValuesTransformer, 'categorical' features using ThresholdOneHotEncoder, and 'text' features using MultiColumnTfidfVectorizer. It merges all the generated features and applies RobustPCA followed by RobustStandardScaler. The transformed data will be used to tune a linear-learner model. Here is the definition:
automl_interactive_runner.select_candidate({
"data_transformer": {
"name": "dpp4",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
"volume_size_in_gb": 50
},
"transform_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
},
"transforms_label": False,
"transformed_data_format": "application/x-recordio-protobuf",
"sparse_encoding": False
},
"algorithm": {
"name": "linear-learner",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
}
}
})
dpp5-linear-learner: This data transformation strategy first transforms 'numeric' features using RobustImputer, 'categorical' features using ThresholdOneHotEncoder, and 'text' features using MultiColumnTfidfVectorizer. It merges all the generated features and applies RobustStandardScaler. The transformed data will be used to tune a linear-learner model. Here is the definition:
automl_interactive_runner.select_candidate({
"data_transformer": {
"name": "dpp5",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
"volume_size_in_gb": 50
},
"transform_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
},
"transforms_label": False,
"transformed_data_format": "application/x-recordio-protobuf",
"sparse_encoding": True
},
"algorithm": {
"name": "linear-learner",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
}
}
})
dpp6-xgboost: This data transformation strategy first transforms 'numeric' features using RobustImputer (converts missing values to nan), 'categorical' features using ThresholdOneHotEncoder, and 'text' features using MultiColumnTfidfVectorizer. It merges all the generated features and applies RobustStandardScaler. The transformed data will be used to tune an xgboost model. Here is the definition:
automl_interactive_runner.select_candidate({
"data_transformer": {
"name": "dpp6",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
"volume_size_in_gb": 50
},
"transform_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
},
"transforms_label": False,
"transformed_data_format": "application/x-recordio-protobuf",
"sparse_encoding": True
},
"algorithm": {
"name": "xgboost",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
}
}
})
dpp7-xgboost: This data transformation strategy first transforms 'numeric' features using RobustImputer (converts missing values to nan), 'categorical' features using ThresholdOneHotEncoder, and 'text' features using MultiColumnTfidfVectorizer. It merges all the generated features and applies RobustStandardScaler. The transformed data will be used to tune an xgboost model. Here is the definition:
automl_interactive_runner.select_candidate({
"data_transformer": {
"name": "dpp7",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
"volume_size_in_gb": 50
},
"transform_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
},
"transforms_label": False,
"transformed_data_format": "application/x-recordio-protobuf",
"sparse_encoding": True
},
"algorithm": {
"name": "xgboost",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
}
}
})
dpp8-xgboost: This data transformation strategy first transforms 'numeric' features using RobustImputer (converts missing values to nan), 'categorical' features using ThresholdOneHotEncoder, and 'text' features using MultiColumnTfidfVectorizer. It merges all the generated features and applies RobustStandardScaler. The transformed data will be used to tune an xgboost model. Here is the definition:
automl_interactive_runner.select_candidate({
"data_transformer": {
"name": "dpp8",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
"volume_size_in_gb": 50
},
"transform_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
},
"transforms_label": False,
"transformed_data_format": "application/x-recordio-protobuf",
"sparse_encoding": True
},
"algorithm": {
"name": "xgboost",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
}
}
})
dpp9-xgboost: This data transformation strategy first transforms 'numeric' features using RobustImputer, 'categorical' features using ThresholdOneHotEncoder, and 'text' features using MultiColumnTfidfVectorizer. It merges all the generated features and applies RobustPCA followed by RobustStandardScaler. The transformed data will be used to tune an xgboost model. Here is the definition:
automl_interactive_runner.select_candidate({
"data_transformer": {
"name": "dpp9",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
"volume_size_in_gb": 50
},
"transform_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
},
"transforms_label": False,
"transformed_data_format": "text/csv",
"sparse_encoding": False
},
"algorithm": {
"name": "xgboost",
"training_resource_config": {
"instance_type": "ml.m5.4xlarge",
"instance_count": 1,
}
}
})
You have selected the following candidates (please run the cell below and click on the feature transformer links for details):
automl_interactive_runner.display_candidates()
The feature engineering pipeline consists of two SageMaker jobs:
- a training job that fits the feature transformers for each candidate pipeline, and
- a batch transform job that applies the fitted transformers to the dataset and writes the featurized output to S3.
The transformers and their training pipeline are built using the open-source sagemaker-scikit-learn-container and sagemaker-scikit-learn-extension.
Each candidate pipeline consists of two steps: feature transformation and algorithm training. For efficiency, first execute the feature transformation step, which generates a featurized dataset on S3 for each pipeline.
After each featurized dataset is prepared, execute a multi-algorithm tuning job that runs tuning jobs in parallel for each pipeline. This tuning job executes training jobs to find the best set of hyperparameters for each pipeline, as well as the overall best-performing pipeline.
Now you are ready to execute all the data transformation steps. The cell below may take some time to finish, so feel free to go grab a cup of coffee. To expedite the process you can set the number of parallel_jobs to be up to 10. Please check your account limits and request an increase if needed before raising the number of jobs to run in parallel.
automl_interactive_runner.fit_data_transformers(parallel_jobs=7)
Now that the algorithm-compatible transformed datasets are ready, you can start the multi-algorithm model tuning job to find the best predictive model. The training job configuration below for each algorithm was auto-generated by the AutoML job as part of its recommendation.
The AutoML job has recommended the following hyperparameters, objectives, and accuracy metrics for the algorithms and problem type:
ALGORITHM_OBJECTIVE_METRICS = {
'xgboost': 'validation:mse',
'linear-learner': 'validation:objective_loss',
}
STATIC_HYPERPARAMETERS = {
'xgboost': {
'objective': 'reg:squarederror',
'save_model_on_termination': 'true',
},
'linear-learner': {
'predictor_type': 'regressor',
'epochs': 50,
'loss': 'auto',
'mini_batch_size': 800,
},
}
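These static hyperparameters are plain Python dictionaries, so you can adjust them before launching the tuning job if you want to deviate from the recommendation. The overrides below are hypothetical examples, not part of the generated recommendation:
# Optional: adjust recommended static hyperparameters before tuning (hypothetical values)
STATIC_HYPERPARAMETERS['linear-learner']['epochs'] = 75
STATIC_HYPERPARAMETERS['linear-learner']['mini_batch_size'] = 1000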
The following tunable hyperparameter search ranges are recommended for the Multi-Algo tuning job:
from sagemaker.parameter import CategoricalParameter, ContinuousParameter, IntegerParameter
ALGORITHM_TUNABLE_HYPERPARAMETER_RANGES = {
'xgboost': {
'num_round': IntegerParameter(2, 1024, scaling_type='Logarithmic'),
'max_depth': IntegerParameter(2, 8, scaling_type='Logarithmic'),
'eta': ContinuousParameter(1e-3, 1.0, scaling_type='Logarithmic'),
'gamma': ContinuousParameter(1e-6, 64.0, scaling_type='Logarithmic'),
'min_child_weight': ContinuousParameter(1e-6, 32.0, scaling_type='Logarithmic'),
'subsample': ContinuousParameter(0.5, 1.0, scaling_type='Linear'),
'colsample_bytree': ContinuousParameter(0.3, 1.0, scaling_type='Linear'),
'lambda': ContinuousParameter(1e-6, 2.0, scaling_type='Logarithmic'),
'alpha': ContinuousParameter(1e-6, 2.0, scaling_type='Logarithmic'),
},
'linear-learner': {
'wd': ContinuousParameter(1e-7, 1.0, scaling_type='Logarithmic'),
'l1': ContinuousParameter(1e-7, 1.0, scaling_type='Logarithmic'),
'learning_rate': ContinuousParameter(1e-5, 1.0, scaling_type='Logarithmic'),
},
}
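The search ranges are also editable. If prior experience suggests a narrower region, you can override an entry before creating the tuner; the adjustment below is a hypothetical example:
# Optional: narrow the xgboost learning-rate (eta) search range (hypothetical adjustment)
ALGORITHM_TUNABLE_HYPERPARAMETER_RANGES['xgboost']['eta'] = ContinuousParameter(
    0.01, 0.3, scaling_type='Logarithmic')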
To use the multi-algorithm HPO tuner, prepare some inputs and parameters. The helper below combines, for each trained pipeline candidate, the objective metric, static hyperparameters, and tunable hyperparameter search ranges defined above:
multi_algo_tuning_parameters = automl_interactive_runner.prepare_multi_algo_parameters(
objective_metrics=ALGORITHM_OBJECTIVE_METRICS,
static_hyperparameters=STATIC_HYPERPARAMETERS,
hyperparameters_search_ranges=ALGORITHM_TUNABLE_HYPERPARAMETER_RANGES)
Next, prepare the input data for the multi-algo tuner:
multi_algo_tuning_inputs = automl_interactive_runner.prepare_multi_algo_inputs()
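Before launching the tuner, you can take a quick look at what was prepared. The exact structure comes from the sagemaker_automl helper, so treat this as a simple inspection rather than a documented interface:
from pprint import pprint
# Inspect the prepared tuning parameters and inputs before creating the tuner
pprint(list(multi_algo_tuning_parameters.keys()))
pprint(list(multi_algo_tuning_inputs.keys()))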
With the recommended hyperparameter ranges and the transformed datasets, create a multi-algorithm model tuning job that coordinates hyperparameter optimization across the different algorithms and feature-processing strategies.
from sagemaker.tuner import HyperparameterTuner
base_tuning_job_name = "{}-tuning".format(AUTOML_LOCAL_RUN_CONFIG.local_automl_job_name)
tuner = HyperparameterTuner.create(
base_tuning_job_name=base_tuning_job_name,
strategy='Bayesian',
objective_type='Minimize',
max_parallel_jobs=7,
max_jobs=250,
**multi_algo_tuning_parameters,
)
Now you are ready to start running the Multi-Algo tuning job. After the job is finished, store the tuning job name, which you will use to select models in the next section. The tuning process will take some time; please track the progress in the Amazon SageMaker Hyperparameter tuning jobs console.
from IPython.display import display, Markdown
# Run tuning
tuner.fit(inputs=multi_algo_tuning_inputs, include_cls_metadata=None)
tuning_job_name = tuner.latest_tuning_job.name
display(
Markdown(f"Tuning Job {tuning_job_name} started, please track the progress from [here](https://{AUTOML_LOCAL_RUN_CONFIG.region}.console.aws.amazon.com/sagemaker/home?region={AUTOML_LOCAL_RUN_CONFIG.region}#/hyper-tuning-jobs/{tuning_job_name})"))
# Wait for tuning job to finish
tuner.wait()
This section guides you through the model selection process. Afterward, you construct an inference pipeline on Amazon SageMaker to host the best candidate.
Because you executed the feature transformation and algorithm training in two separate steps, you now need to manually link each trained model with the feature transformer that it is associated with. When running a regular Amazon SageMaker Autopilot job, this will automatically be done for you.
The performance of each candidate pipeline can be viewed as a Pandas dataframe. For more interactive exploration, please refer to the model tuning monitor.
from pprint import pprint
from sagemaker.analytics import HyperparameterTuningJobAnalytics
SAGEMAKER_SESSION = AUTOML_LOCAL_RUN_CONFIG.sagemaker_session
SAGEMAKER_ROLE = AUTOML_LOCAL_RUN_CONFIG.role
tuner_analytics = HyperparameterTuningJobAnalytics(
tuner.latest_tuning_job.name, sagemaker_session=SAGEMAKER_SESSION)
df_tuning_job_analytics = tuner_analytics.dataframe()
# Sort the tuning job analytics by the final metrics value
df_tuning_job_analytics.sort_values(
by=['FinalObjectiveValue'],
inplace=True,
ascending=False if tuner.objective_type == "Maximize" else True)
# Show detailed analytics for the top 20 models
df_tuning_job_analytics.head(20)
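Because this is a multi-algorithm tuning job, each row corresponds to one training job from one candidate pipeline. Assuming the analytics dataframe includes a TrainingJobDefinitionName column (as is typical for multi-algorithm tuning jobs), you can also compare the best objective value reached by each candidate:
# Best (lowest) objective value per candidate pipeline; guarded because this assumes
# the analytics dataframe exposes a TrainingJobDefinitionName column
if 'TrainingJobDefinitionName' in df_tuning_job_analytics.columns:
    display(df_tuning_job_analytics.groupby('TrainingJobDefinitionName')['FinalObjectiveValue'].min())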
The best training job can be selected as shown below:
attached_tuner = HyperparameterTuner.attach(tuner.latest_tuning_job.name, sagemaker_session=SAGEMAKER_SESSION)
best_training_job = attached_tuner.best_training_job()
print("Best Multi Algorithm HPO training job name is {}".format(best_training_job))
Finally, deploy the best training job to Amazon SageMaker along with its companion feature engineering models. At the end of this section, you will have an endpoint that is ready to serve online inference or start batch transform jobs.
Deploy a PipelineModel that chains multiple containers: the data transformation model, the algorithm model, and, if the label was transformed, an inverse label transformer model.
Get both the best data transformation model and the algorithm model from the best training job and create a pipeline model:
from sagemaker.estimator import Estimator
from sagemaker.pipeline import PipelineModel
from sagemaker_automl import select_inference_output
# Get the data transformation model from the chosen candidate
best_candidate = automl_interactive_runner.choose_candidate(df_tuning_job_analytics, best_training_job)
best_data_transformer_model = best_candidate.get_data_transformer_model(role=SAGEMAKER_ROLE, sagemaker_session=SAGEMAKER_SESSION)
# Our first data transformation container will always return recordio-protobuf format
best_data_transformer_model.env["SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT"] = 'application/x-recordio-protobuf'
# Add environment variable for sparse encoding
if best_candidate.data_transformer_step.sparse_encoding:
best_data_transformer_model.env["AUTOML_SPARSE_ENCODE_RECORDIO_PROTOBUF"] = '1'
# Get the algorithm model from the chosen training job of the candidate
algo_estimator = Estimator.attach(best_training_job)
best_algo_model = algo_estimator.create_model(env={'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT':"text/csv"})
# Final pipeline model is composed of data transformation models and algo model and an
# inverse label transform model if we need to transform the intermediates back to non-numerical value
model_containers = [best_data_transformer_model, best_algo_model]
if best_candidate.transforms_label:
model_containers.append(best_candidate.get_data_transformer_model(
transform_mode="inverse-label-transform",
role=SAGEMAKER_ROLE,
sagemaker_session=SAGEMAKER_SESSION))
pipeline_model = PipelineModel(
name="AutoML-{}".format(AUTOML_LOCAL_RUN_CONFIG.local_automl_job_name),
role=SAGEMAKER_ROLE,
models=model_containers,
vpc_config=AUTOML_LOCAL_RUN_CONFIG.vpc_config)
Finally, deploy the pipeline model to a SageMaker endpoint so it can serve requests.
pipeline_model.deploy(initial_instance_count=1,
instance_type='ml.m5.2xlarge',
endpoint_name=pipeline_model.name,
wait=True)
Congratulations! You can now visit the SageMaker endpoints console page to find the deployed endpoint (it will take a few minutes to be in service). The endpoint name matches the endpoint_name specified in the previous code block.
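Once the endpoint is in service, you can send it a raw feature row for a real-time prediction. Below is a minimal sketch using the low-level SageMaker runtime client; the payload is a placeholder and must be replaced with a real CSV row containing the same feature columns (everything except price_log) in the same order as the training data.
import boto3
runtime = boto3.client('sagemaker-runtime', region_name=AUTOML_LOCAL_RUN_CONFIG.region)
# Placeholder payload: replace with one CSV row of raw feature values
# (all dataset columns except the price_log target, in the original column order)
sample_row = ""
response = runtime.invoke_endpoint(
    EndpointName=pipeline_model.name,   # the endpoint_name used in the deploy call above
    ContentType='text/csv',
    Accept='text/csv',
    Body=sample_row)
print(response['Body'].read().decode('utf-8'))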