NaN values and Scikit-Learn RFECV #5401

@benjaminvdb

I couldn't find the issue associated with this XGBoost forum topic, so I assume none was created. I can confirm this problem persists with the latest nightly of xgboost (a38e7bd19c461e0bed7bd96ec72d56132157d4af) and scikit-learn (018c6dc57d21c89c7d1278c686c7d5d62f32ee48).

I agree with Mike Creeth's statement in the previously mentioned forum post:

I believe this is because RFECV does some checking based on the tags that it gets from the estimator. It uses the tag ‘allow_nan’ to determine whether or not to check X for NaN values. It seems that currently XGBoost simply inherits the default “allow_nan” tag value from the scikit-learn estimator class, which is False. As XGB does in fact handle null values in X, I believe this behavior is incorrect.

from xgboost import XGBClassifier
from sklearn.feature_selection import RFECV

estimator = XGBClassifier()
selector = RFECV(estimator, cv=3)
selector = selector.fit(X, y)

When X contains one or more np.nan values, this raises the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-6d04ec0892c9> in <module>
     18 
     19 selector = RFECV(model, cv=3)#, scoring=neg_rmse)
---> 20 selector = selector.fit(X_train.values, y_train.values)

/local/burghbvander/miniconda3/envs/fastai/lib/python3.7/site-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, groups)
    498             X, y, accept_sparse="csr", ensure_min_features=2,
    499             force_all_finite=not tags.get('allow_nan', True),
--> 500             multi_output=True
    501         )
    502 

/local/burghbvander/miniconda3/envs/fastai/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, **check_params)
    404             out = X
    405         else:
--> 406             X, y = check_X_y(X, y, **check_params)
    407             out = X, y
    408 

/local/burghbvander/miniconda3/envs/fastai/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    724                     ensure_min_samples=ensure_min_samples,
    725                     ensure_min_features=ensure_min_features,
--> 726                     estimator=estimator)
    727     if multi_output:
    728         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/local/burghbvander/miniconda3/envs/fastai/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    571         if force_all_finite:
    572             _assert_all_finite(array,
--> 573                                allow_nan=force_all_finite == 'allow-nan')
    574 
    575     if ensure_min_samples > 0:

/local/burghbvander/miniconda3/envs/fastai/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     60                     msg_err.format
     61                     (type_err,
---> 62                      msg_dtype if msg_dtype is not None else X.dtype)
     63             )
     64     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The allow_nan tag should probably be set to True in XGBClassifier.
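
To make the mechanism concrete, here is a minimal sketch using stand-in classes. It is not the real scikit-learn or XGBoost code: `EstimatorBase`, `get_tags`, `InheritsDefault`, `DeclaresAllowNan` and `validate` are hypothetical names, and the NaN check is simplified to plain Python lists. The only parts taken from the traceback above are the `allow_nan` tag and the expression `not tags.get('allow_nan', True)`.

```python
import math

# Stand-in for the default estimator tags (only the tag that matters here).
DEFAULT_TAGS = {"allow_nan": False}


class EstimatorBase:
    """Hypothetical stand-in for an estimator base class (not the sklearn one)."""

    def _more_tags(self):
        return {}

    def get_tags(self):
        # Start from the defaults and apply each class's overrides,
        # base classes first so that subclasses win.
        tags = dict(DEFAULT_TAGS)
        for cls in reversed(type(self).__mro__):
            more = cls.__dict__.get("_more_tags")
            if more is not None:
                tags.update(more(self))
        return tags


class InheritsDefault(EstimatorBase):
    """Models the reported behaviour: the default allow_nan=False is inherited."""


class DeclaresAllowNan(EstimatorBase):
    """Models the proposed fix: the estimator declares that it accepts NaN."""

    def _more_tags(self):
        return {"allow_nan": True}


def validate(estimator, X):
    """Reject NaN in X unless the estimator's tags allow it.

    Simplified: the real check also rejects infinities and oversized values.
    """
    tags = estimator.get_tags()
    force_all_finite = not tags.get("allow_nan", True)
    if force_all_finite and any(math.isnan(v) for row in X for v in row):
        raise ValueError("Input contains NaN")
    return X


X = [[1.0, float("nan")], [3.0, 4.0]]

validate(DeclaresAllowNan(), X)  # accepted: the tag allows NaN

try:
    validate(InheritsDefault(), X)
except ValueError as exc:
    print("rejected:", exc)
```

In this sketch the estimator that inherits the default tag is rejected before it ever sees the data, while the one that declares `allow_nan` as True is passed through. That is the behaviour change requested for XGBClassifier.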
