
Conversation

LennartPurucker
Collaborator

@LennartPurucker LennartPurucker commented Nov 2, 2024

Description of changes:

This PR starts with the implementation of distributed AutoGluon, which is based on our Kaggle Grand Prix code.

I am adding some documentation below; feel free to ignore it for now, as it is mostly notes for myself.

Road Map

  • Distributed fit
  • Distributed refit
  • Distributed predict

Open Problems / Questions

  • How do we test distributed AutoGluon (in the CI)?
  • Do we plan to support users in initializing the ray clusters?
  • How do we support cloud and network file system setups?
  • How do we document the usage of this part of the code for expert users?
  • Given that this allows parallel model fitting even in a local setup, should we consider making this a default with enough CPUs? Likewise, should we automatically disable this mode if there are not enough CPUs available?
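The last question above suggests a simple CPU-count heuristic. A minimal sketch of what such a check could look like (the function name and the 32-CPU threshold are assumptions for illustration, not a decided default):

```python
import os

def should_enable_distributed(min_cpus: int = 32) -> bool:
    """Hypothetical heuristic: enable parallel model fitting only when the
    machine has enough CPUs. The 32-CPU threshold is an illustrative assumption."""
    return (os.cpu_count() or 1) >= min_cpus
```

The same check, inverted, could automatically disable the mode on small machines.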

Minor Open Questions (can be ignored for a merge)

  • Can we avoid calling _add_model multiple times?
  • How do we select the number of CPUs per fit of a model?
  • Can we manage memory at all?
  • See other new todos in code (try to resolve most before merging)
  • Delaying the scheduling of training favors one-CPU models over multi-CPU models, so we might need to default to disabling use_child_oof for RF/XT.
  • Improve logging messages.
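For the `use_child_oof` point, one way to express the override is through per-model `ag_args_ensemble` entries in the hyperparameters dict. A hedged sketch (whether these exact keys end up being the mechanism used by the parallel mode is still an open question):

```python
# Sketch: disable use_child_oof for RandomForest (RF) and ExtraTrees (XT)
# via the ag_args_ensemble mechanism. Illustrative only; whether this becomes
# the parallel-mode default is undecided.
hyperparameters = {
    "RF": {"ag_args_ensemble": {"use_child_oof": False}},
    "XT": {"ag_args_ensemble": {"use_child_oof": False}},
}
# This dict would then be passed to TabularPredictor.fit(..., hyperparameters=hyperparameters).
```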

Local Testing Script:

from __future__ import annotations

import os
from pathlib import Path

import ray

# --- Parameters ---
"""The Ray address to connect to. Provide "auto" to start a new cluster."""
ray_address: str | None = None

"""If AutoGluon should be run in a distributed mode across several nodes."""
autogluon_distributed: bool = True

"""Whether to use a shared filesystem for the distributed AutoGluon setup.
This is necessary for some setups, e.g. SLURM. and I tested all of this with this set to True."""
autogluon_distributed_network_shared_filesystem: bool = True

# -- Setup
env_vars = {}

if autogluon_distributed:
    os.environ["AG_DISTRIBUTED_MODE"] = "True"
    env_vars["AG_DISTRIBUTED_MODE"] = "True"

if autogluon_distributed_network_shared_filesystem:
    os.environ["AG_DISTRIBUTED_FILESYSTEM"] = "NFS"
    env_vars["AG_DISTRIBUTED_FILESYSTEM"] = "NFS"

if ray_address is not None:
    os.environ["RAY_ADDRESS"] = ray_address
    env_vars["RAY_ADDRESS"] = ray_address

working_dir = Path(__file__).parent
# These settings are likely only correct for NFS setups.
runtime_env = {
    "working_dir": str(working_dir),
    "excludes": [
        "*",  # exclude everything
    ],
    "env_vars": env_vars,
}

ray.init(runtime_env=runtime_env, namespace="autogluon")


# --- Run AutoGluon ---
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("./train.csv")
test_data = TabularDataset("./test.csv")

predictor = TabularPredictor(label="class", path=str(working_dir / "ag_path")).fit(
    train_data=train_data,
    time_limit=int(60 * 2),
    presets="best_quality",
    num_bag_folds=2,
    num_bag_sets=1,
    verbosity=2,
    dynamic_stacking=False,
)
leaderboard = predictor.leaderboard(test_data)
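The script above talks to AutoGluon purely through environment variables. As a minimal sketch of how such a truthy flag is typically parsed on the consumer side (the helper name and parsing rules are assumptions, not the actual AutoGluon internals):

```python
import os

def distributed_mode_enabled(env=None) -> bool:
    """Hypothetical helper: interpret AG_DISTRIBUTED_MODE as a boolean flag.

    The accepted values here are an assumption; AutoGluon's real parsing may differ.
    """
    env = os.environ if env is None else env
    return env.get("AG_DISTRIBUTED_MODE", "").strip().lower() in ("true", "1")
```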

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma Innixma added enhancement New feature or request module: tabular priority: 0 Maximum priority labels Nov 4, 2024
@Innixma Innixma added this to the 1.2 Release milestone Nov 4, 2024

github-actions bot commented Nov 5, 2024

Job PR-4606-9e7b9c3 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-4606/9e7b9c3/index.html

@Innixma Innixma linked an issue Nov 5, 2024 that may be closed by this pull request
(Subsequent CI build notifications omitted.)

@LennartPurucker
Collaborator Author

At this point, I am considering not adding distributed prediction to this PR.

The use case for distributed prediction is still vague to me (besides Kaggle competitions). Moreover, it would be cleaner as a separate PR (maybe after 1.2).

@Innixma Innixma changed the title [WIP] [tabular] AutoGluon Distributed [tabular] AutoGluon Distributed Nov 20, 2024
@Innixma Innixma marked this pull request as ready for review November 20, 2024 01:40
@Innixma
Contributor

Innixma commented Nov 20, 2024

Marking PR as ready for review. I have addressed many of the prior limitations / TODOs.

Benchmark results on TabRepo with m6i.16xlarge (64 CPU cores)

Parallel logic has zero failures across 1464 tasks.
Parallel logic has a 64% win-rate vs sequential on tasks with >=10000 rows under a 4 hour time limit.

Parallel logic produces identical* results to sequential if both are given infinite time, but parallel trains over 2x faster.

*With the exception of NeuralNetFastAI, which differs depending on how many CPU cores were used to train it, however the results are not better nor worse on average.

Elo Table on TabRepo datasets >=10000 samples (258 tasks):

Note: "pr4606" == parallel mode, "pr4606_seq" == sequential mode, aka mainline

| Rank | Model | Elo | 95% CI | Winrate | Rescaled Acc | Champ Delta % |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | AutoGluon_bq_pr4606_4h64c_2024_11_18 | 1208 | +22/-20 | 0.77 | 0.88 | 7.5 |
| 2 | AutoGluon_bq_pr4606_seq_4h64c_2024_11_18 | 1146 | +19/-18 | 0.69 | 0.86 | 9.1 |
| 2 | AutoGluon_bq_pr4606_1h64c_2024_11_18 | 1119 | +18/-17 | 0.66 | 0.84 | 8 |
| 4 | AutoGluon_bq_pr4606_seq_1h64c_2024_11_18 | 1102 | +18/-17 | 0.64 | 0.84 | 9.8 |
| 5 | AutoGluon_bq_24h8c_2023_11_14 | 1022 | +16/-17 | 0.53 | 0.72 | 15 |
| 6 | AutoGluon_bq_4h8c_2023_11_14 | 976 | +17/-15 | 0.47 | 0.72 | 15.5 |
| 7 | H2OAutoML_4h8c_2023_11_14 | 819 | +20/-20 | 0.26 | 0.38 | 22.2 |
| 7 | autosklearn2_4h8c_2023_11_14 | 815 | +19/-20 | 0.26 | 0.43 | 21 |
| 7 | autosklearn_4h8c_2023_11_14 | 793 | +20/-22 | 0.23 | 0.33 | 22.1 |

At the 4 hour runtime, parallel mode gains 62 Elo over sequential; at the 1 hour runtime, it gains 17 Elo.

Current limitations

  1. Parallel does not support model refit. Sequential will be used for refit. Will try to add in a follow-up PR.
  2. Parallel does not support GPU. Sequential will be used if GPUs are present/specified. Will try to add in a follow-up PR.
  3. Parallel does not support hyperparameter tuning. Sequential will be used if HPO is enabled. Will not plan to add HPO support in v1.2.
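The three limitations imply a simple dispatch rule: fall back to sequential whenever an unsupported feature is requested. A hedged sketch of that rule (function and parameter names are illustrative, not the actual implementation):

```python
def choose_fit_strategy(num_gpus: int = 0, hpo_enabled: bool = False, refit: bool = False) -> str:
    """Hypothetical dispatch: parallel mode currently lacks GPU, HPO, and refit
    support, so any such request falls back to the sequential fitting path."""
    if num_gpus > 0 or hpo_enabled or refit:
        return "sequential"
    return "parallel"
```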

Other notes

  1. Will add a logging message recommending that users try parallel mode, in a follow-up PR.

Collaborator Author

@LennartPurucker LennartPurucker left a comment

Minor comments, otherwise LGTM!

Innixma and others added 5 commits November 20, 2024 18:06
Co-authored-by: Lennart Purucker <contact@lennart-purucker.com>
Co-authored-by: Lennart Purucker <contact@lennart-purucker.com>

Contributor

@Innixma Innixma left a comment

LGTM! Approving after extensive benchmarking and testing.

@Innixma Innixma merged commit 22971d7 into autogluon:master Nov 21, 2024
27 checks passed
Successfully merging this pull request may close these issues.

[tabular] Parallel Model Training