
Categorical data with too many levels stops setup() without throwing an error #1352

@mandar-karhade

Describe the bug
If training and test data are passed explicitly and the target column does not exist in test_data, then
setup(data=data, test_data=test_data, silent=True, target=target) raises an error only when the data has few columns. With many columns it hangs indefinitely without raising anything.

On further investigation, a single feature, v22, causes the hang. It is a categorical feature with 18,210 distinct levels:
pd.DataFrame(train.v22.value_counts()).shape (18210, 1)
After dropping this feature, setup() completed in 15 seconds.
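For anyone hitting the same hang, a quick way to spot columns like v22 before calling setup() is to count the distinct values in each non-numeric column. This is a minimal sketch using plain pandas only. The helper name high_cardinality_columns, the threshold value, and the toy frame are hypothetical and are not part of PyCaret.

```python
import pandas as pd


def high_cardinality_columns(df, threshold=1000):
    """Return {column: n_unique} for non-numeric columns above the threshold.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    flagged = {}
    for col in df.columns:
        # Numeric (and boolean) columns are skipped; only categorical-like ones count.
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        n_unique = df[col].nunique()  # NaN is not counted as a level
        if n_unique > threshold:
            flagged[col] = int(n_unique)
    return flagged


# Toy frame standing in for the real train.csv (column names reused for clarity).
toy = pd.DataFrame({
    "v22": [f"cat_{i}" for i in range(50)],  # 50 distinct levels
    "v3": ["A", "B"] * 25,                   # 2 distinct levels
    "v1": range(50),                         # numeric, ignored
})

flagged = high_cardinality_columns(toy, threshold=10)
print(flagged)
```

On the real data, any column this reports could be added to ignore_features to check whether it is the one stalling setup().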

To Reproduce

#!/usr/bin/env python
# coding: utf-8

# In[1]:
from pathlib import Path

import pandas as pd
from pycaret.classification import setup


# In[3]:
# source data https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data

# Load the train and test frames
train = pd.read_csv(Path('../../paribas/train.csv'))
test = pd.read_csv(Path('../../paribas/test.csv'))


# In[ ]:
# Raises no error but hangs indefinitely
model = setup(data=train, target='target', ignore_features = ['ID'], session_id = 123, log_experiment = False, experiment_name = 'test1', silent=True)


# In[4]:
# Raises an error as expected once only the first 10 columns are kept

train = train.iloc[:,0:10]
test = test.iloc[:,0:10]
model = setup(data=train, target='target', ignore_features = ['ID'], session_id = 123, log_experiment = False, experiment_name = 'test1', silent=True)

Expected behavior
When train and test data are passed explicitly, setup() should assert that both frames have the same structure and that the target column is present in both.
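The kind of check meant here could look roughly like the sketch below. It is an illustration in plain pandas and not PyCaret's actual code. The function name check_train_test and the toy frames are hypothetical.

```python
import pandas as pd


def check_train_test(train, test, target):
    """Raise ValueError if train and test do not share the same columns.

    Hypothetical validation helper; it collects every problem before raising.
    """
    problems = []
    for name, frame in (("train", train), ("test", test)):
        if target not in frame.columns:
            problems.append(
                f"target column {target!r} is missing from the {name} data"
            )
    # Compare the feature columns, leaving the target out of the comparison.
    only_in_train = sorted(set(train.columns) - set(test.columns) - {target})
    only_in_test = sorted(set(test.columns) - set(train.columns) - {target})
    if only_in_train:
        problems.append(f"columns only in train: {only_in_train}")
    if only_in_test:
        problems.append(f"columns only in test: {only_in_test}")
    if problems:
        raise ValueError("; ".join(problems))


# Toy frames mirroring the situation in this report: test has no target column.
toy_train = pd.DataFrame({"ID": [1, 2], "target": [0, 1], "v1": [0.5, 0.7]})
toy_test = pd.DataFrame({"ID": [3, 4], "v1": [0.1, 0.9]})

try:
    check_train_test(toy_train, toy_test, target="target")
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)
```

Failing fast like this, before any preprocessing starts, would surface the problem regardless of how many columns the data has.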

Versions
'2.3.1'

Labels: bug