Closed
Labels
bug, module: tabular, priority: 0
Description
Describe the bug
I attempted to train models using TabularPredictor on a regression task where the target is a continuous value.
When I set the validation_procedure of dynamic stacking to 'cv', an error occurred at the point where dynamic stacking was initiated.
To Reproduce
This is a sample code of the TabularPredictor that I set up and used:
from autogluon.tabular import TabularPredictor

# train_data is a DataFrame (loading not shown) whose 'amount' column is continuous.
predictor_params = {
    "label": "amount",
    "eval_metric": "mean_absolute_error",
    "verbosity": 4,
}
fit_params = {
    "train_data": train_data,
    "hyperparameters": "zeroshot",
    "dynamic_stacking": True,
    "ds_args": {
        "detection_time_frac": 0.2,
        "validation_procedure": "cv",  # Set to 'cv'
        "n_folds": 2,
        "n_repeats": 1,
    },
}
predictor = TabularPredictor(**predictor_params).fit(**fit_params)
Screenshots / Logs
Traceback (most recent call last):
    predictor = TabularPredictor(**predictor_params).fit(**fit_params)
  File "/***/***/lib/python3.10/site-packages/autogluon/core/utils/decorators.py", line 31, in _call
    return f(*gargs, **gkwargs)
  File "/***/***/lib/python3.10/site-packages/autogluon/tabular/predictor/predictor.py", line 1151, in fit
    num_stack_levels, time_limit = self._dynamic_stacking(**ds_args, ag_fit_kwargs=ag_fit_kwargs, ag_post_fit_kwargs=ag_post_fit_kwargs)
  File "/***/***/lib/python3.10/site-packages/autogluon/tabular/predictor/predictor.py", line 1255, in _dynamic_stacking
    splits = CVSplitter(n_splits=n_folds, n_repeats=n_repeats, groups=self._learner.groups, stratified=is_stratified, random_state=42).split(
  File "/***/***/lib/python3.10/site-packages/autogluon/core/utils/utils.py", line 80, in split
    out = [[train_index, test_index] for train_index, test_index in self._splitter.split(X_dummy, y_dummy)]
  File "/***/***/lib/python3.10/site-packages/autogluon/core/utils/utils.py", line 80, in <listcomp>
    out = [[train_index, test_index] for train_index, test_index in self._splitter.split(X_dummy, y_dummy)]
  File "/***/***/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 1600, in split
    for train_index, test_index in cv.split(X, y, groups):
  File "/***/***/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 416, in split
    for train, test in super().split(X, y, groups):
  File "/***/***/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 147, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/***/***/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 809, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
  File "/***/***/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 752, in _make_test_folds
    raise ValueError(
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
Installed Versions
python : 3.10.12
autogluon : 1.1.1
- For regression problems, the code incorrectly sets is_stratified = True and passes stratified=True to CVSplitter.
In autogluon/tabular/predictor/predictor.py, starting around line 1373:
# -- Validation Method
if validation_procedure == "holdout":
    if holdout_data is None:
        ds_fit_kwargs.update(dict(holdout_frac=holdout_frac, ds_fit_context=os.path.join(ds_fit_context, "sub_fit_ho")))
    else:
        _, holdout_data, _, _ = self._validate_fit_data(train_data=X, tuning_data=holdout_data)
        ds_fit_kwargs["ds_fit_context"] = os.path.join(ds_fit_context, "sub_fit_custom_ho")
    stacked_overfitting = self._sub_fit_memory_save_wrapper(
        train_data=X,
        time_limit=time_limit,
        time_start=time_start,
        ds_fit_kwargs=ds_fit_kwargs,
        ag_fit_kwargs=inner_ag_fit_kwargs,
        ag_post_fit_kwargs=inner_ag_post_fit_kwargs,
        holdout_data=holdout_data,
    )
else:
    # Holdout is false, use (repeated) cross-validation
    is_stratified = self.problem_type in [REGRESSION, QUANTILE, SOFTCLASS]
    self._learner._validate_groups(X=X, X_val=X_val)  # Validate splits before splitting
    splits = CVSplitter(n_splits=n_folds, n_repeats=n_repeats, groups=self._learner.groups, stratified=is_stratified, random_state=42).split(
        X=X.drop(self.label, axis=1), y=X[self.label]
    )
    n_splits = len(splits)
- The _get_splitter_cls function then selects RepeatedStratifiedKFold because stratified=True was passed.
In autogluon/core/src/autogluon/core/utils/utils.py, starting around line 37:
class CVSplitter:
    def __init__(self, splitter_cls=None, n_splits=5, n_repeats=1, random_state=0, stratified=False, groups=None):
        self.n_splits = n_splits
        self.n_repeats = n_repeats
        self.random_state = random_state
        self.stratified = stratified
        self.groups = groups
        if splitter_cls is None:
            splitter_cls = self._get_splitter_cls()
        self._splitter = self._get_splitter(splitter_cls)

    def _get_splitter_cls(self):
        if self.groups is not None:
            num_groups = len(self.groups.unique())
            if self.n_repeats != 1:
                raise AssertionError(f"n_repeats must be 1 when split groups are specified. (n_repeats={self.n_repeats})")
            self.n_splits = num_groups
            splitter_cls = LeaveOneGroupOut
            # pass
        elif self.stratified:
            splitter_cls = RepeatedStratifiedKFold
        else:
            splitter_cls = RepeatedKFold
        return splitter_cls

    def _get_splitter(self, splitter_cls):
        if splitter_cls == LeaveOneGroupOut:
            return splitter_cls()
        elif splitter_cls in [RepeatedKFold, RepeatedStratifiedKFold]:
            return splitter_cls(n_splits=self.n_splits, n_repeats=self.n_repeats, random_state=self.random_state)
        else:
            raise AssertionError(f"{splitter_cls} is not supported as a valid `splitter_cls` input to CVSplitter.")
- RepeatedStratifiedKFold is designed to repeat StratifiedKFold, which is specifically intended for binary and multiclass classification tasks.
This selection is problematic because StratifiedKFold is not suitable for regression problems.
As noted in the scikit-learn documentation (scikit-learn/sklearn/model_selection/_split.py):
StratifiedKFold : Takes class information into account to avoid building folds with imbalanced class proportions (for binary or multiclass classification tasks).
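The failure can likely be reproduced with scikit-learn alone, without AutoGluon. The following is a minimal sketch under that assumption: the random data and the try_split helper are made up for illustration, and the n_splits, n_repeats and random_state values simply mirror the ones in the report.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold

# Illustrative data only: 20 rows with a continuous (float) target.
rng = np.random.default_rng(0)
X_dummy = np.zeros((20, 1))
y_continuous = rng.normal(size=20)


def try_split(splitter, X, y):
    """Hypothetical helper: return (number of splits, error message or None)."""
    try:
        return len(list(splitter.split(X, y))), None
    except ValueError as exc:
        return 0, str(exc)


stratified = RepeatedStratifiedKFold(n_splits=2, n_repeats=1, random_state=42)
plain = RepeatedKFold(n_splits=2, n_repeats=1, random_state=42)

# The stratified splitter rejects a continuous target.
n_strat, err_strat = try_split(stratified, X_dummy, y_continuous)
# The plain splitter ignores the target values and splits by row index.
n_plain, err_plain = try_split(plain, X_dummy, y_continuous)

print("stratified:", n_strat, err_strat)
print("plain:", n_plain, err_plain)
```

If this behaves the same way in your environment, it suggests the problem is entirely in which splitter class gets selected, not in the data.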
In my opinion, the fix is to use RepeatedKFold instead of RepeatedStratifiedKFold for regression problems. That means changing the logic that sets the stratification flag so that is_stratified = False for regression tasks.
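One way the proposed change could look is sketched below. This is not the maintainers' patch: the constant values, the should_stratify and splitter_name helpers, and the idea of simply negating the membership test are all assumptions made for illustration. The splitter is returned by name so the sketch runs without scikit-learn or AutoGluon installed.

```python
# Problem-type names, defined locally for the sketch (assumed values).
BINARY = "binary"
MULTICLASS = "multiclass"
REGRESSION = "regression"
QUANTILE = "quantile"
SOFTCLASS = "softclass"

# Problem types whose targets are not discrete class labels.
NON_STRATIFIABLE = frozenset({REGRESSION, QUANTILE, SOFTCLASS})


def should_stratify(problem_type: str) -> bool:
    """Stratify only when the target consists of discrete class labels."""
    return problem_type not in NON_STRATIFIABLE


def splitter_name(problem_type: str, has_groups: bool = False) -> str:
    """Mirror the selection order quoted from CVSplitter._get_splitter_cls."""
    if has_groups:
        return "LeaveOneGroupOut"
    if should_stratify(problem_type):
        return "RepeatedStratifiedKFold"
    return "RepeatedKFold"


for pt in (BINARY, MULTICLASS, REGRESSION, QUANTILE, SOFTCLASS):
    print(pt, "->", splitter_name(pt))
```

With this selection, regression, quantile and softclass tasks would fall through to RepeatedKFold, while binary and multiclass tasks would keep stratified folds.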