[BUG]: Default value for data_split_shuffle should be False when fold_strategy is 'timeseries' #3631

@admo1

Issue Description

The dataset should never be shuffled when fold_strategy is 'timeseries'. By default, the data is shuffled before the train/test split, which leaks future observations into the training set and produces misleading results. The 'timeseries' fold strategy relies on the data already being ordered by date, so that nothing from the holdout (future) set ends up in the train (past) set. The current behavior is easy to miss: users who see fold_strategy='timeseries' may assume their evaluation is correct without knowing that they also need to set data_split_shuffle to False.
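To make the leakage concrete, here is a minimal sketch that uses only the standard library and does not call pycaret. The helpers `holdout_split` and `leaks_future` are hypothetical names introduced for illustration, and the 70/30 split and seed are arbitrary assumptions. Rows are assumed to be sorted by date, so a larger index means a later observation.

```python
import random


def holdout_split(n_rows, train_size=0.7, shuffle=True, seed=0):
    """Split row indices into train/holdout (hypothetical helper).

    Rows are assumed to be sorted by date: index 0 is the oldest row.
    """
    idx = list(range(n_rows))
    if shuffle:
        # Mimics a shuffled train/test split: order is destroyed here.
        random.Random(seed).shuffle(idx)
    cut = round(n_rows * train_size)
    return idx[:cut], idx[cut:]


def leaks_future(train_idx, test_idx):
    """True when some training row is later than the earliest holdout row."""
    return max(train_idx) > min(test_idx)


ordered_train, ordered_test = holdout_split(100, shuffle=False)
shuffled_train, shuffled_test = holdout_split(100, shuffle=True)

print(leaks_future(ordered_train, ordered_test))    # False: train is strictly in the past
print(leaks_future(shuffled_train, shuffled_test))  # True: future rows ended up in train
```

With shuffling enabled, the training set contains rows that come after rows in the holdout set, so the holdout score no longer measures performance on unseen future data.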

Reproducible Example

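import pandas as pd

from pycaret.classification import ClassificationExperiment

# Toy dataset; rows are assumed to be in chronological order.
df = pd.DataFrame({'feature': [1, 2, 3, 4, 1, 2, 3, 4], 'y': [1, 2, 3, 4, 1, 2, 3, 4]})

experiment = ClassificationExperiment()
# data_split_shuffle is left at its default, so the rows are shuffled before the train/test split.
experiment.setup(df, target="y", fold_strategy="timeseries")

model = experiment.compare_models()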

Expected Behavior

Data should not be shuffled during the train/test split when fold_strategy is 'timeseries'.
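One way the requested default could be expressed is sketched below. This is not pycaret's implementation: `resolve_data_split_shuffle` is a hypothetical helper, using `None` as a "not set by the user" sentinel is an assumption, and only string values of fold_strategy are handled.

```python
def resolve_data_split_shuffle(fold_strategy, data_split_shuffle=None):
    """Pick the effective shuffle flag (hypothetical helper, not pycaret code).

    An explicit user choice always wins. Otherwise shuffling is disabled
    for the 'timeseries' strategy and enabled for every other strategy.
    """
    if data_split_shuffle is not None:
        return data_split_shuffle
    return fold_strategy != "timeseries"


print(resolve_data_split_shuffle("timeseries"))       # False
print(resolve_data_split_shuffle("stratifiedkfold"))  # True
print(resolve_data_split_shuffle("timeseries", True)) # True (explicit override)
```

Until the default changes, the workaround is to pass data_split_shuffle=False explicitly to setup alongside fold_strategy="timeseries".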

Actual Results

Data was shuffled during the train/test split when fold_strategy was 'timeseries', causing data leakage.

Installed Versions

3.0.4

Labels: bug, classification, regression
