[BUG]: Data leakage in pycaret 3 classification with unbalanced dataset? #3507

@sorenwacker

Issue Description

I am getting a test AUC of 1 with tree-based models on random data, where the target is generated independently of the features.

Models tested: rf, xgboost, lightgbm, et, dt.

Reproducible Example

from pycaret.classification import setup, create_model, plot_model, predict_model

# Import libraries
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Define number of samples
N = 30000

# Generate features
numeric_feature = np.random.normal(0, 1, N)
cat_feature = np.random.choice(31, N).astype(int)
cat_feature2 = np.random.choice(2, N).astype(str)

# Combine features into a single matrix
X = np.column_stack((cat_feature, cat_feature2))

# Generate target variable
y = np.random.binomial(1, 0.2, N)

# Convert to DataFrame and add target variable
df = pd.DataFrame(X, columns=['categorical-cardinal', 'categorical-binary'])
df['target'] = y

# Print the first five rows of the dataset
print("Synthetic Dataset for Machine Learning:\n")
print(df.head())

setup(df, target='target', numeric_features=[], categorical_features=['categorical-cardinal', 'categorical-binary'])

model = create_model('lightgbm')

plot_model(model, 'auc')

predict_model(model)

Expected Behavior

The test AUC should be around 0.5, as in PyCaret version 2.
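
For comparison, the expected value can be checked outside PyCaret. The snippet below is only a rough sketch, not part of the original report: it rebuilds a similar random dataset and scores a plain scikit-learn decision tree on a hold-out split. The one-hot encoding, the 70/30 split, and the names `clf` and `baseline_auc` are assumptions made for illustration. Because the target is independent of the features, a model fitted only on the training rows should land close to 0.5 on the hold-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
N = 30000

# Two categorical features and a target drawn independently of them
df = pd.DataFrame({
    "categorical-cardinal": rng.integers(0, 31, N).astype(str),
    "categorical-binary": rng.integers(0, 2, N).astype(str),
})
y = rng.binomial(1, 0.2, N)

# Hold out 30% of the rows (assumed split); the encoder and the tree
# are fitted on the training rows only, so no test information leaks in
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(random_state=42),
)
clf.fit(X_train, y_train)

# AUC on rows the model has never seen
baseline_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"Hold-out AUC: {baseline_auc:.3f}")
```

A hold-out AUC far above 0.5 on data like this would suggest that information about the test rows or the target reached the model during preprocessing or training.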


Actual Results


No error message, but unrealistic results.

Installed Versions

System:
python: 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0]
executable: /bulk/LSARP/envs/conda/pycaret3/bin/python
machine: Linux-4.18.0-425.10.1.el8_7.x86_64-x86_64-with-glibc2.31

PyCaret required dependencies:
pip: 23.0.1
setuptools: 66.0.0
pycaret: 2.1.post14082020
IPython: 7.34.0
ipywidgets: 7.7.5
tqdm: 4.64.1
numpy: 1.23.5
pandas: 1.5.3
jinja2: 3.1.2
scipy: 1.9.3
joblib: 1.2.0
sklearn: 1.2.2
pyod: 1.0.9
imblearn: 0.10.1
category_encoders: 2.6.0
lightgbm: 3.3.5
numba: 0.56.4
requests: 2.28.2
matplotlib: 3.6.3
scikitplot: 0.3.7
yellowbrick: 1.5
plotly: 5.14.1
kaleido: 0.2.1
statsmodels: 0.13.5
sktime: 0.17.0
tbats: 1.1.3
pmdarima: 2.0.3
psutil: 5.9.5

PyCaret optional dependencies:
shap: 0.41.0
interpret: 0.3.2
umap: 0.5.3
pandas_profiling: 3.6.6
explainerdashboard: 0.4.2.1
autoviz: 0.1.601
fairlearn: 0.7.0
xgboost: 1.7.5
catboost: 1.1.1
kmodes: 0.12.2
mlxtend: 0.22.0
statsforecast: 1.5.0
tune_sklearn: 0.4.5
ray: 2.3.1
hyperopt: 0.2.7
optuna: 3.1.1
skopt: 0.9.0
mlflow: 1.30.1
gradio: 3.27.0
fastapi: 0.95.1
uvicorn: 0.21.1
m2cgen: 0.10.0
evidently: 0.3.0
fugue: 0.8.3
streamlit: Not installed
prophet: Not installed

Labels: bug, classification