
AutoAI incremental learning implementation details

Last updated: Jul 25, 2025

Review the technical details for training an AutoAI experiment on batches of data for a large data set.

Incremental learning for large tabular data sets

If you are training an experiment by using a large, structured data source, the data is subsampled, so initial training takes place with a portion of the data. The training data limit depends on the environment size that is selected for the experiment. For details, see AutoAI overview.

Incremental learning algorithms can then continue the training by using the remaining data from the subsampled source, dividing it into batches if needed. Each batch of training data is scored independently by using the optimized metric, so you can review the performance of each batch when you explore the results.
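To picture how per-batch scoring works, the following sketch splits the remaining rows into fixed-size batches and scores each batch with the optimized metric. The variable names (remaining_df, fitted_pipeline), the target column, the batch size, and the roc_auc metric are placeholders for this illustration; AutoAI performs this step for you.

from sklearn.metrics import get_scorer

# Placeholders for illustration: remaining_df holds the rows that were not part of
# the initial subsample, fitted_pipeline is an already trained pipeline, 'target'
# is the label column, and 'roc_auc' stands in for the optimized metric.
batch_size = 10000
scorer = get_scorer('roc_auc')

batch_scores = []
for start in range(0, len(remaining_df), batch_size):
    batch_df = remaining_df.iloc[start:start + batch_size]
    X_batch = batch_df.drop(columns=['target'])
    y_batch = batch_df['target']
    # Each batch is scored on its own so that per-batch performance can be reviewed.
    batch_scores.append(scorer(fitted_pipeline, X_batch, y_batch))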

Incremental learning process overview

The process of incremental learning and model ensembling enables the training of a model with up to 100 GB of structured data. Depending on whether the experiment is a classification or regression model, AutoAI applies either the Snap Machine Learning BatchedTreeEnsembleClassifier or BatchedTreeEnsembleRegressor in combination with standard machine learning algorithms to support incremental training of model pipelines.

Configuring your experiment to support incremental learning adds two phases to the training process. The first phase applies a Batched tree ensemble algorithm to prepare the pipelines. The final phase trains the pipelines with the batches of data. Pipelines are scored and ranked based on how well they perform on the experiment's holdout data, as measured by the optimized metric.
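As a conceptual illustration only (not the Snap Machine Learning implementation), the idea behind a batched tree ensemble is to fit a small sub-ensemble on each batch and combine their predictions, so the full data set never has to be loaded at once:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Conceptual sketch only -- not the Snap ML BatchedTreeEnsemble implementation.
# A small sub-ensemble is fit on each batch and predictions are averaged.
class NaiveBatchedEnsembleRegressor:
    def __init__(self, n_estimators_per_batch=50):
        self.n_estimators_per_batch = n_estimators_per_batch
        self.sub_ensembles_ = []

    def partial_fit(self, X_batch, y_batch):
        sub = RandomForestRegressor(n_estimators=self.n_estimators_per_batch)
        self.sub_ensembles_.append(sub.fit(X_batch, y_batch))
        return self

    def predict(self, X):
        # Average the predictions of all sub-ensembles trained so far.
        return np.mean([sub.predict(X) for sub in self.sub_ensembles_], axis=0)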

Algorithms for classification models that support incremental learning

  • ExtraTreesClassifier
  • XGBClassifier
  • LGBMClassifier
  • RandomForestClassifier
  • SnapRandomForestClassifier
  • SnapBoostingMachineClassifier

Algorithms for regression models that support incremental learning

  • ExtraTreesRegressor
  • LGBMRegressor
  • RandomForestRegressor
  • SnapBoostingMachineRegressor
  • SnapRandomForestRegressor
  • XGBRegressor

For descriptions of the algorithms, see AutoAI implementation details.

Creating the batched ensemble

When you configure your experiment, you can opt to train with incremental learning if your data set is large. The list of algorithms includes the appropriate ensemble estimator for the model type.

This illustration maps the incremental learning process. In a standard experiment, each algorithm trains four pipelines. When incremental learning is enabled, training the model creates a fifth pipeline, a batched ensemble, for each algorithm that supports incremental learning. The batched ensemble pipeline has partial_fit capability, which means that it can be trained incrementally on successive batches of data.
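For example, the following sketch shows how such a pipeline can continue training on successive batches, based on the partial_fit call that appears in the workarounds later in this topic. The batch_iterator variable is a placeholder for whatever yields the batches of data as pandas DataFrames:

# Sketch of continuing training in batches. pipeline_model is a BatchedTreeEnsemble
# pipeline; batch_iterator is a placeholder for the source that yields pandas
# DataFrame batches (for example, the ExperimentIterableDataset in the notebook).
target = experiment_metadata['target_column']
for batch_df in batch_iterator:
    X_batch = batch_df.drop(columns=[target])
    y_batch = batch_df[target]
    # freeze_trained_prefix=True keeps the previously trained pipeline prefix and
    # updates only the batched ensemble step (as in the workarounds below).
    pipeline_model.partial_fit(X_batch, y_batch, freeze_trained_prefix=True)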

AutoAI incremental learning process

The incremental learning sequence

This figure shows the flow of data in the AutoAI experiment from ingestion to training in batches, as follows:

  1. When a large data set is used for an experiment, AutoAI automatically samples the data. The default is random sampling, but you can configure sampling to use first values (reading the data from the first value to the cut-off point), or stratified sampling instead. See Using incremental learning to train with a large data set for configuration details.
  2. When the experiment training begins, AutoAI uses the sampled data to train the model candidate pipelines, and creates the BatchedTreeEnsemble pipeline for each supporting algorithm.
  3. The final training step trains the pipelines by using the batches of data.

Process flow for training batches of data using incremental learning

For more technical details on incremental learning, see the blog post Large tabular data and AutoAI.

Saving the experiment as a notebook

You can save any pipeline as an auto-generated notebook so you can review the code for training the experiment. You can then run the notebook as is to re-create the experiment. The batch training step uses ExperimentIterableDataset, which is a torch-compatible data loader. You can customize the data loader for the batch training to include:

- custom data loader (must return batches of data in the form of pandas DataFrames; see the sketch after this list)
- custom scorer function (metrics)
- learning stop constraints
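For example, a minimal custom data loader can be any iterable that yields batches as pandas DataFrames. The file path and batch size below are placeholders:

import pandas as pd

# Minimal custom loader sketch: yields each chunk of a CSV file as a pandas DataFrame.
# The file path and batch size are placeholders for this example.
def csv_batches(path='training_data.csv', batch_rows=10000):
    for chunk in pd.read_csv(path, chunksize=batch_rows):
        yield chunk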

Best practices and troubleshooting

Review these items for best practices and troubleshooting tips for training an experiment with incremental learning.

Reviewing pipeline scores

When you review pipeline scores in the leaderboard, you might find that pipelines with more transformations applied do not score better than a pipeline without the transformations. This scenario can happen because, for performance reasons, the Feature Engineering phase finds the best transformations on a sample of the training data. As a result, a newly generated feature might not significantly improve the score of a pipeline that is trained on the full data set. That result is not faulty or unexpected.

If you upload a holdout data file for the incremental learning notebook, the calculation of the number of ensembles that are required to process the holdout data might be incorrect because it is based on file size rather than the number of rows. To be sure that the number of ensembles is correct for your holdout data, manually update the code that sets the number of ensembles by using these commands:

batches_count = total_number_of_rows//number_of_batch_rows
pipeline_model.steps[-1][1].impl.max_sub_ensembles = batches_count
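For example, assuming the holdout data is in a CSV file, you might derive the values that these commands use as follows (the file name and batch size are placeholders for this example):

import pandas as pd

# Placeholder file name; derive the row counts used in the commands above.
holdout_df = pd.read_csv('holdout_data.csv')
total_number_of_rows = len(holdout_df)
number_of_batch_rows = 10000  # match the batch size that the experiment uses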

You can then run the notebook to complete the experiment training.

Working with incremental learning in an auto-generated notebook

If you save a pipeline as a notebook, you can review the code that is used to train each pipeline. You can also make modifications to customize the training, such as manually correcting for imbalanced data and outliers.

Workaround for imbalanced data and outliers

If the data passed to partial_fit is highly imbalanced (for example, >1:10), consider applying the sample_weight parameter to correct for the imbalance:

from sklearn.utils.class_weight import compute_sample_weight

pipeline_model.partial_fit(X_train, y_train, freeze_trained_prefix=True, sample_weight=compute_sample_weight('balanced', y_train))

If your data contains outliers and the learning curve shows suspicious drops, follow these steps to compensate. A combined sketch follows the steps.

  1. Filter the outliers out before calling partial_fit.

    batch_df = batch_df[batch_df[experiment_metadata['target_column']] < threshold]
    
  2. Increase the outer_lr_scaling parameter in the BatchedTreeEnsemble. This technique reduces the learning rate applied across batches, and slows the learning down, making it more robust to outliers. Note that this could also degrade the overall performance.

    pipeline_model.steps[-1][1].impl.outer_lr_scaling=0.65
    
  3. Increase the batch size, if possible (the max limit is 100,000). This change makes the learning algorithms more robust to outliers.
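The following sketch combines these adjustments in one batch-training loop. The batch_iterator and threshold variables are placeholders for this example:

# Sketch that combines the outlier adjustments in one batch-training loop.
# batch_iterator and threshold are placeholders for this example.
pipeline_model.steps[-1][1].impl.outer_lr_scaling = 0.65

target = experiment_metadata['target_column']
for batch_df in batch_iterator:
    # Drop rows whose target value exceeds the chosen outlier threshold.
    batch_df = batch_df[batch_df[target] < threshold]
    X_batch = batch_df.drop(columns=[target])
    y_batch = batch_df[target]
    pipeline_model.partial_fit(X_batch, y_batch, freeze_trained_prefix=True)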

Learn more

For more information on training AutoAI experiments with large data, read the blog article Large tabular data & AutoAI.

Parent topic: Incremental learning for AutoAI experiments