[tabular][BUG] When the validation procedure of dynamic stacking is set to 'cv', the CVSplitter is incorrectly assigned for regression tasks. #4771

**Describe the bug**
I tried to train models with TabularPredictor on a regression task where the target is a continuous value.
When I set the validation_procedure of dynamic stacking to 'cv', an error was raised as soon as dynamic stacking started.

**To Reproduce**
This is the TabularPredictor setup I used:
```
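from autogluon.tabular import TabularPredictor

# train_data: the training DataFrame; its 'amount' column is the continuous target
predictor_params = {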
    "label": 'amount',
    "eval_metric": "mean_absolute_error",
    'verbosity': 4
}

fit_params = {
    'train_data': train_data,
    'hyperparameters': 'zeroshot',
    'dynamic_stacking': True,
    'ds_args': {
        'detection_time_frac': 0.2,
        'validation_procedure': 'cv',  # Set to 'cv'
        'n_folds': 2,
        'n_repeats': 1
    }
}
predictor = TabularPredictor(**predictor_params).fit(**fit_params)
```

**Screenshots / Logs**
```
Traceback (most recent call last):
    predictor = TabularPredictor(**predictor_params).fit(**fit_params)
  File "/***/***/lib/python3.10/site-packages/autogluon/core/utils/decorators.py", line 31, in _call
    return f(*gargs, **gkwargs)
  File "/***/***/lib/python3.10/site-packages/autogluon/tabular/predictor/predictor.py", line 1151, in fit
    num_stack_levels, time_limit = self._dynamic_stacking(**ds_args, ag_fit_kwargs=ag_fit_kwargs, ag_post_fit_kwargs=ag_post_fit_kwargs)
  File "/***/***/lib/python3.10/site-packages/autogluon/tabular/predictor/predictor.py", line 1255, in _dynamic_stacking
    splits = CVSplitter(n_splits=n_folds, n_repeats=n_repeats, groups=self._learner.groups, stratified=is_stratified, random_state=42).split(
  File "/***/***/lib/python3.10/site-packages/autogluon/core/utils/utils.py", line 80, in split
    out = [[train_index, test_index] for train_index, test_index in self._splitter.split(X_dummy, y_dummy)]
  File "/***/***/lib/python3.10/site-packages/autogluon/core/utils/utils.py", line 80, in <listcomp>
    out = [[train_index, test_index] for train_index, test_index in self._splitter.split(X_dummy, y_dummy)]
  File "/***/***/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 1600, in split
    for train_index, test_index in cv.split(X, y, groups):
  File "/***/***/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 416, in split
    for train, test in super().split(X, y, groups):
  File "/***/***/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 147, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/***/***/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 809, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
  File "/***/***/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 752, in _make_test_folds
    raise ValueError(
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
```

**Installed Versions**

python : 3.10.12
autogluon : 1.1.1

</details>
I suspect the error occurs because CVSplitter selects RepeatedStratifiedKFold instead of RepeatedKFold, even though the problem type is regression.


1. For regression problems, the code incorrectly sets is_stratified = True and passes stratified=True to CVSplitter.
In autogluon/tabular/predictor/predictor.py, line 1373~:
```
# -- Validation Method
        if validation_procedure == "holdout":
            if holdout_data is None:
                ds_fit_kwargs.update(dict(holdout_frac=holdout_frac, ds_fit_context=os.path.join(ds_fit_context, "sub_fit_ho")))
            else:
                _, holdout_data, _, _ = self._validate_fit_data(train_data=X, tuning_data=holdout_data)
                ds_fit_kwargs["ds_fit_context"] = os.path.join(ds_fit_context, "sub_fit_custom_ho")

            stacked_overfitting = self._sub_fit_memory_save_wrapper(
                train_data=X,
                time_limit=time_limit,
                time_start=time_start,
                ds_fit_kwargs=ds_fit_kwargs,
                ag_fit_kwargs=inner_ag_fit_kwargs,
                ag_post_fit_kwargs=inner_ag_post_fit_kwargs,
                holdout_data=holdout_data,
            )
        else:
            # Holdout is false, use (repeated) cross-validation
            is_stratified = self.problem_type in [REGRESSION, QUANTILE, SOFTCLASS]
            self._learner._validate_groups(X=X, X_val=X_val)  # Validate splits before splitting
            splits = CVSplitter(n_splits=n_folds, n_repeats=n_repeats, groups=self._learner.groups, stratified=is_stratified, random_state=42).split(
                X=X.drop(self.label, axis=1), y=X[self.label]
            )
            n_splits = len(splits)
```
2. The _get_splitter_cls function then selects RepeatedStratifiedKFold because stratified=True was passed.
In autogluon/core/src/autogluon/core/utils/utils.py, line 37~:
```
class CVSplitter:
    def __init__(self, splitter_cls=None, n_splits=5, n_repeats=1, random_state=0, stratified=False, groups=None):
        self.n_splits = n_splits
        self.n_repeats = n_repeats
        self.random_state = random_state
        self.stratified = stratified
        self.groups = groups
        if splitter_cls is None:
            splitter_cls = self._get_splitter_cls()
        self._splitter = self._get_splitter(splitter_cls)

    def _get_splitter_cls(self):
        if self.groups is not None:
            num_groups = len(self.groups.unique())
            if self.n_repeats != 1:
                raise AssertionError(f"n_repeats must be 1 when split groups are specified. (n_repeats={self.n_repeats})")
            self.n_splits = num_groups
            splitter_cls = LeaveOneGroupOut
            # pass
        elif self.stratified:
            splitter_cls = RepeatedStratifiedKFold
        else:
            splitter_cls = RepeatedKFold
        return splitter_cls

    def _get_splitter(self, splitter_cls):
        if splitter_cls == LeaveOneGroupOut:
            return splitter_cls()
        elif splitter_cls in [RepeatedKFold, RepeatedStratifiedKFold]:
            return splitter_cls(n_splits=self.n_splits, n_repeats=self.n_repeats, random_state=self.random_state)
        else:
            raise AssertionError(f"{splitter_cls} is not supported as a valid `splitter_cls` input to CVSplitter.")

```
3. RepeatedStratifiedKFold repeats StratifiedKFold, which is intended only for binary and multiclass classification tasks.
Selecting it here is a problem because StratifiedKFold cannot stratify a continuous regression target.
As noted in the scikit-learn documentation (scikit-learn/sklearn/model_selection/_split.py):
StratifiedKFold: Takes class information into account to avoid building folds with imbalanced class proportions (for binary or multiclass classification tasks).
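
The behaviour described in step 3 can be reproduced with scikit-learn alone, without AutoGluon. The sketch below is only an illustration: the random data and the `try_split` helper are made up for this example, and the fold settings mirror the `n_folds=2`, `n_repeats=1` values from the report.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold

# Synthetic data standing in for a regression dataset (not the reporter's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)  # continuous target, as in a regression task


def try_split(splitter_cls):
    """Hypothetical helper: return (number of splits, error message or None)."""
    splitter = splitter_cls(n_splits=2, n_repeats=1, random_state=42)
    try:
        # split() is a generator, so the error only surfaces once it is consumed.
        return len(list(splitter.split(X, y))), None
    except ValueError as exc:
        return 0, str(exc)


stratified_result = try_split(RepeatedStratifiedKFold)
plain_result = try_split(RepeatedKFold)

print("RepeatedStratifiedKFold:", stratified_result)
print("RepeatedKFold:", plain_result)
```

With a continuous `y`, the stratified splitter should fail with the same kind of `ValueError` as in the traceback above, while `RepeatedKFold` should produce the two splits.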

In my opinion, the fix is to use RepeatedKFold instead of RepeatedStratifiedKFold for regression problems. This means changing the logic in _dynamic_stacking so that is_stratified is False for regression tasks.
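
One way the proposed change could look is sketched below. This is a guess at a minimal fix, not the maintainers' actual patch: the helper name `use_stratified_split` is hypothetical, and the string values of the problem-type constants are assumed here rather than imported from AutoGluon.

```python
# Assumed string values for AutoGluon's problem-type constants;
# in the real code these would be imported rather than redefined.
REGRESSION = "regression"
QUANTILE = "quantile"
SOFTCLASS = "softclass"
BINARY = "binary"
MULTICLASS = "multiclass"


def use_stratified_split(problem_type: str) -> bool:
    """Hypothetical helper: stratify only when the target is made of discrete classes.

    The quoted code uses `in [...]`, which enables stratification for exactly
    the problem types that cannot be stratified. Negating the membership test
    is the smallest change that inverts that behaviour.
    """
    return problem_type not in [REGRESSION, QUANTILE, SOFTCLASS]


for problem_type in [BINARY, MULTICLASS, REGRESSION, QUANTILE, SOFTCLASS]:
    print(f"{problem_type}: stratified={use_stratified_split(problem_type)}")
```

With this logic, CVSplitter would receive stratified=False for regression and fall through to RepeatedKFold in _get_splitter_cls.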
