Description
pycaret version checks
- I have checked that this issue has not already been reported here.
- I have confirmed this bug exists on the latest version of pycaret.
- I have confirmed this bug exists on the master branch of pycaret (pip install -U git+https://github.com/pycaret/pycaret.git@master).
Issue Description
The dataset should never be shuffled when fold_strategy='timeseries' is used. By default, the dataset is shuffled before the train/test split, which causes data leakage and misleading results. The 'timeseries' fold strategy relies on the data already being ordered by date, so that no rows from the holdout (future) set end up in the train (past) set. The current behavior is easy to miss: users who see fold_strategy='timeseries' may assume their evaluation is correct without knowing that they also need to set the data_split_shuffle parameter to False.
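To make the leakage concrete, here is a minimal standard-library sketch of a holdout split over rows that are assumed to be sorted by date. It does not use pycaret's internals; the helper names split_holdout and leaks_future are hypothetical and exist only for this illustration.

```python
import random


def split_holdout(n_rows, test_fraction=0.3, shuffle=True, seed=0):
    """Split row positions 0..n_rows-1 (assumed ordered by date) into train/test.

    Hypothetical helper for illustration only; this is not pycaret code.
    """
    indices = list(range(n_rows))
    if shuffle:
        # Shuffling destroys the chronological order before the cut is made.
        random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(n_rows * test_fraction)))
    cut = n_rows - n_test
    return indices[:cut], indices[cut:]


def leaks_future(train_idx, test_idx):
    """Return True if any training row is later than the earliest holdout row."""
    return max(train_idx) > min(test_idx)


shuffled_train, shuffled_test = split_holdout(100, shuffle=True)
ordered_train, ordered_test = split_holdout(100, shuffle=False)

print("shuffled split leaks future rows:", leaks_future(shuffled_train, shuffled_test))
print("ordered split leaks future rows:", leaks_future(ordered_train, ordered_test))
```

With shuffling disabled, every training row precedes every holdout row, which is the property a time-series evaluation depends on.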
Reproducible Example
import pandas as pd
from pycaret.classification import ClassificationExperiment
df = pd.DataFrame({'feature': [1, 2, 3, 4, 1, 2, 3, 4], 'y': [1, 2, 3, 4, 1, 2, 3, 4]})
experiment = ClassificationExperiment()
experiment.setup(df, target="y", fold_strategy="timeseries")
model = experiment.compare_models()
Expected Behavior
Data should not be shuffled during the train/test split when fold_strategy is 'timeseries'.
Actual Results
Data was shuffled during the train/test split when fold_strategy was 'timeseries', causing data leakage.
Installed Versions
3.0.4