
Conversation

LennartPurucker
Collaborator

@LennartPurucker LennartPurucker commented Nov 2, 2024

Description of changes:

This PR starts with the implementation of distributed AutoGluon, which is based on our Kaggle Grand Prix code.

I am adding some documentation below; feel free to ignore it for now, as it is mostly notes for myself.

Road Map

  • Distributed fit
  • Distributed refit
  • Distributed predict

Open Problems / Questions

  • How do we test distributed AutoGluon (in the CI)?
  • Do we plan to support users in initializing the ray clusters?
  • How do we support cloud and network file system setups?
  • How do we document the usage of this part of the code for expert users?
  • Given that this allows parallel model fitting even in a local setup, should we consider making this a default with enough CPUs? Likewise, should we automatically disable this mode if there are not enough CPUs available?
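The last question above suggests a simple CPU-count heuristic. A minimal sketch of what such a check could look like (the function name and the 32-CPU threshold are assumptions for illustration, not a decided default):

```python
import os

def should_enable_distributed(min_cpus: int = 32) -> bool:
    """Hypothetical heuristic: enable parallel model fitting only when the
    machine has enough CPUs. The 32-CPU threshold is an illustrative assumption."""
    return (os.cpu_count() or 1) >= min_cpus
```

The same check, inverted, could automatically disable the mode on small machines.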

Minor Open Questions (can be ignored for a merge)

  • Can we avoid calling _add_model multiple times?
  • How do we select the number of CPUs per fit of a model?
  • Can we manage memory at all?
  • See other new todos in code (try to resolve most before merging)
  • Delaying the scheduling of training favors one-CPU models over multi-CPU models, so we might need to default to disabling use_child_oof for RF/XT.
  • Improve logging messages.
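For the `use_child_oof` point, one way to express the override is through per-model `ag_args_ensemble` entries in the hyperparameters dict. A hedged sketch (whether these exact keys end up being the mechanism used by the parallel mode is still an open question):

```python
# Sketch: disable use_child_oof for RandomForest (RF) and ExtraTrees (XT)
# via the ag_args_ensemble mechanism. Illustrative only; whether this becomes
# the parallel-mode default is undecided.
hyperparameters = {
    "RF": {"ag_args_ensemble": {"use_child_oof": False}},
    "XT": {"ag_args_ensemble": {"use_child_oof": False}},
}
# This dict would then be passed to TabularPredictor.fit(..., hyperparameters=hyperparameters).
```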

Local Testing Script:

from __future__ import annotations

import os
from pathlib import Path

import ray

# --- Parameters ---
"""The Ray address to connect to. Provide "auto" to start a new cluster."""
ray_address: str | None = None

"""If AutoGluon should be run in a distributed mode across several nodes."""
autogluon_distributed: bool = True

"""Whether to use a shared filesystem for the distributed AutoGluon setup.
This is necessary for some setups, e.g. SLURM. and I tested all of this with this set to True."""
autogluon_distributed_network_shared_filesystem: bool = True

# -- Setup
env_vars = {}

if autogluon_distributed:
    os.environ["AG_DISTRIBUTED_MODE"] = "True"
    env_vars["AG_DISTRIBUTED_MODE"] = "True"

if autogluon_distributed_network_shared_filesystem:
    os.environ["AG_DISTRIBUTED_FILESYSTEM"] = "NFS"
    env_vars["AG_DISTRIBUTED_FILESYSTEM"] = "NFS"

if ray_address is not None:
    os.environ["RAY_ADDRESS"] = ray_address
    env_vars["RAY_ADDRESS"] = ray_address

working_dir = Path(__file__).parent
# These settings are likely only correct for NFS setups.
runtime_env = {
    "working_dir": str(working_dir),
    "excludes": [
        "*",  # exclude everything
    ],
    "env_vars": env_vars,
}

ray.init(runtime_env=runtime_env, namespace="autogluon")


# --- Run AutoGluon ---
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("./train.csv")
test_data = TabularDataset("./test.csv")

predictor = TabularPredictor(label="class", path=str(working_dir / "ag_path")).fit(
    train_data=train_data,
    time_limit=int(60 * 2),
    presets="best_quality",
    num_bag_folds=2,
    num_bag_sets=1,
    verbosity=2,
    dynamic_stacking=False,
)
leaderboard = predictor.leaderboard(test_data)
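The script above talks to AutoGluon purely through environment variables. As a minimal sketch of how such a truthy flag is typically parsed on the consumer side (the helper name and parsing rules are assumptions, not the actual AutoGluon internals):

```python
import os

def distributed_mode_enabled(env=None) -> bool:
    """Hypothetical helper: interpret AG_DISTRIBUTED_MODE as a boolean flag.

    The accepted values here are an assumption; AutoGluon's real parsing may differ.
    """
    env = os.environ if env is None else env
    return env.get("AG_DISTRIBUTED_MODE", "").strip().lower() in ("true", "1")
```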

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma Innixma added enhancement New feature or request module: tabular priority: 0 Maximum priority labels Nov 4, 2024
@Innixma Innixma added this to the 1.2 Release milestone Nov 4, 2024

github-actions bot commented Nov 5, 2024

Job PR-4606-9e7b9c3 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-4606/9e7b9c3/index.html

@Innixma Innixma linked an issue Nov 5, 2024 that may be closed by this pull request
(Subsequent CI build notifications omitted.)

@LennartPurucker
Collaborator Author

At this point, I am considering not adding distributed prediction to this PR.

The use case for distributed prediction is still vague to me (besides Kaggle competitions). Moreover, it would be cleaner as a separate PR (maybe after 1.2).

@Innixma Innixma changed the title [WIP] [tabular] AutoGluon Distributed [tabular] AutoGluon Distributed Nov 20, 2024
@Innixma Innixma marked this pull request as ready for review November 20, 2024 01:40
@Innixma
Contributor

Innixma commented Nov 20, 2024

Marking PR as ready for review. I have addressed many of the prior limitations / TODOs.

Benchmark results on TabRepo with m6i.16xlarge (64 CPU cores)

Parallel logic has zero failures across 1464 tasks.
Parallel logic has a 64% win-rate vs sequential on tasks with >=10000 rows under a 4 hour time limit.

Parallel logic produces identical* results to sequential if both are given infinite time, but parallel trains over 2x faster.

*With the exception of NeuralNetFastAI, which differs depending on how many CPU cores were used to train it, however the results are not better nor worse on average.

Elo Table on TabRepo datasets >=10000 samples (258 tasks):

Note: "pr4606" == parallel mode, "pr4606_seq" == sequential mode, aka mainline

| Rank | Model | Elo | 95% CI | Winrate | Rescaled Acc | Champ Delta % |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | AutoGluon_bq_pr4606_4h64c_2024_11_18 | 1208 | +22/-20 | 0.77 | 0.88 | 7.5 |
| 2 | AutoGluon_bq_pr4606_seq_4h64c_2024_11_18 | 1146 | +19/-18 | 0.69 | 0.86 | 9.1 |
| 2 | AutoGluon_bq_pr4606_1h64c_2024_11_18 | 1119 | +18/-17 | 0.66 | 0.84 | 8 |
| 4 | AutoGluon_bq_pr4606_seq_1h64c_2024_11_18 | 1102 | +18/-17 | 0.64 | 0.84 | 9.8 |
| 5 | AutoGluon_bq_24h8c_2023_11_14 | 1022 | +16/-17 | 0.53 | 0.72 | 15 |
| 6 | AutoGluon_bq_4h8c_2023_11_14 | 976 | +17/-15 | 0.47 | 0.72 | 15.5 |
| 7 | H2OAutoML_4h8c_2023_11_14 | 819 | +20/-20 | 0.26 | 0.38 | 22.2 |
| 7 | autosklearn2_4h8c_2023_11_14 | 815 | +19/-20 | 0.26 | 0.43 | 21 |
| 7 | autosklearn_4h8c_2023_11_14 | 793 | +20/-22 | 0.23 | 0.33 | 22.1 |

At the 4 hour runtime, parallel mode gains 62 Elo over sequential; at the 1 hour runtime, it gains 17 Elo.

Current limitations

  1. Parallel does not support model refit. Sequential will be used for refit. Will try to add in a follow-up PR.
  2. Parallel does not support GPU. Sequential will be used if GPUs are present/specified. Will try to add in a follow-up PR.
  3. Parallel does not support hyperparameter tuning. Sequential will be used if HPO is enabled. Will not plan to add HPO support in v1.2.
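The three limitations imply a simple dispatch rule: fall back to sequential whenever an unsupported feature is requested. A hedged sketch of that rule (function and parameter names are illustrative, not the actual implementation):

```python
def choose_fit_strategy(num_gpus: int = 0, hpo_enabled: bool = False, refit: bool = False) -> str:
    """Hypothetical dispatch: parallel mode currently lacks GPU, HPO, and refit
    support, so any such request falls back to the sequential fitting path."""
    if num_gpus > 0 or hpo_enabled or refit:
        return "sequential"
    return "parallel"
```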

Other notes

  1. Will add a logging message recommending that users try parallel mode, in a follow-up PR.

Collaborator Author

@LennartPurucker LennartPurucker left a comment

Minor comments, otherwise LGTM!

Innixma and others added 5 commits November 20, 2024 18:06
Co-authored-by: Lennart Purucker <contact@lennart-purucker.com>
Co-authored-by: Lennart Purucker <contact@lennart-purucker.com>

Contributor

@Innixma Innixma left a comment

LGTM! Approving after extensive benchmarking and testing.

@Innixma Innixma merged commit 22971d7 into autogluon:master Nov 21, 2024
27 checks passed
Successfully merging this pull request may close these issues.

[tabular] Parallel Model Training