Description
I couldn't find the issue associated with this XGBoost forum topic, so I assume none was created. I can confirm this problem persists with the latest nightly of xgboost (`a38e7bd19c461e0bed7bd96ec72d56132157d4af`) and scikit-learn (`018c6dc57d21c89c7d1278c686c7d5d62f32ee48`).
I agree with Mike Creeth's statement in the previously mentioned forum post:

> I believe this is because RFECV does some checking based on the tags that it gets from the estimator. It uses the tag `allow_nan` to determine whether or not to check X for NaN values. It seems that currently XGBoost simply inherits the default `allow_nan` tag value from the scikit-learn estimator class, which is False. As XGB does in fact handle null values in X, I believe this behavior is incorrect.
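
This is easy to verify; a minimal check, assuming scikit-learn's private `_get_tags()` accessor (inherited from `BaseEstimator`), which is the tag mechanism RFECV consults during input validation:

```python
from xgboost import XGBClassifier

# _get_tags() is scikit-learn's private estimator-tag accessor. RFECV looks at
# the 'allow_nan' entry to decide whether to reject NaN values in X.
tags = XGBClassifier()._get_tags()
print(tags['allow_nan'])  # False with the versions noted above, hence the error below
```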
```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.feature_selection import RFECV

# Minimal synthetic data; any X with at least one NaN reproduces the error.
X = np.random.rand(100, 5)
X[0, 0] = np.nan
y = np.random.randint(0, 2, size=100)

estimator = XGBClassifier()
selector = RFECV(estimator, cv=3)
selector = selector.fit(X, y)
```
Since `X` contains one or more `np.nan` values, this raises the following error:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-6d04ec0892c9> in <module>
     18
     19 selector = RFECV(model, cv=3)#, scoring=neg_rmse)
---> 20 selector = selector.fit(X_train.values, y_train.values)

/local/burghbvander/miniconda3/envs/fastai/lib/python3.7/site-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, groups)
    498             X, y, accept_sparse="csr", ensure_min_features=2,
    499             force_all_finite=not tags.get('allow_nan', True),
--> 500             multi_output=True
    501         )
    502

/local/burghbvander/miniconda3/envs/fastai/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, **check_params)
    404                 out = X
    405             else:
--> 406                 X, y = check_X_y(X, y, **check_params)
    407                 out = X, y
    408

/local/burghbvander/miniconda3/envs/fastai/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    724                     ensure_min_samples=ensure_min_samples,
    725                     ensure_min_features=ensure_min_features,
--> 726                     estimator=estimator)
    727     if multi_output:
    728         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/local/burghbvander/miniconda3/envs/fastai/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    571         if force_all_finite:
    572             _assert_all_finite(array,
--> 573                                allow_nan=force_all_finite == 'allow-nan')
    574
    575     if ensure_min_samples > 0:

/local/burghbvander/miniconda3/envs/fastai/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     60                     msg_err.format
     61                     (type_err,
---> 62                      msg_dtype if msg_dtype is not None else X.dtype)
     63             )
     64 # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
```
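
For contrast, a minimal check (reusing the synthetic `X` and `y` from the snippet above): fitting the estimator directly on the same NaN-containing data succeeds, because XGBoost treats `np.nan` as missing by default.

```python
# Fitting XGBClassifier directly on the same data works: XGBoost handles
# np.nan as a missing value, so only RFECV's input validation rejects it.
XGBClassifier().fit(X, y)
```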
The `allow_nan` tag should probably be set to `True` in `XGBClassifier`.
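
In the meantime, a possible user-side workaround (and a sketch of the direction a fix could take) is to override scikit-learn's `_more_tags()` hook; the subclass name below is hypothetical, purely for illustration:

```python
from sklearn.feature_selection import RFECV
from xgboost import XGBClassifier

class XGBClassifierAllowNaN(XGBClassifier):
    """Hypothetical subclass that advertises NaN support to scikit-learn."""

    def _more_tags(self):
        # BaseEstimator._get_tags() merges _more_tags() from every class in the
        # MRO, so returning only the overridden entry is sufficient.
        return {'allow_nan': True}

# With the tag set, RFECV skips the NaN check during input validation.
selector = RFECV(XGBClassifierAllowNaN(), cv=3)
```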