Closed
Description
Sklearn Specification for Classifiers
The .predict() method of xgboost.dask.DaskXGBClassifier currently returns probabilities. Per the specification, the .predict() method is supposed to return class labels. This is also inconsistent with the behavior of the .predict() method of xgboost.XGBClassifier, which properly returns class labels.
Other functionality in dask (specifically in dask_ml.model_selection) depends on the behavior being correct.
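To make the downstream impact concrete, here is a minimal sketch of what happens when a label-based metric receives probabilities instead of labels. The accuracy helper and the y_true values are hypothetical and only stand in for whatever scorer a model-selection routine actually uses; the probabilities are illustrative.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Hypothetical exact-match accuracy: fraction of predictions equal to the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Assumed ground truth, for illustration only.
y_true = np.array([0, 0, 1, 1, 1])

# What a classifier's predict() is expected to return: class labels.
labels = np.array([0, 0, 1, 1, 1])

# What predict() returns when it emits probabilities instead.
probs = np.array([0.03111755, 0.00773133, 0.99876463, 0.99792993, 0.9944484])

print(accuracy(y_true, labels))  # 1.0
print(accuracy(y_true, probs))   # 0.0, since no probability is exactly 0 or 1
```

The model is just as good in both cases, but an exact-match score collapses to zero when it is handed probabilities, so any search that ranks candidates by such a score would be misled.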
Example of correct behavior in xgboost.XGBClassifier:
import xgboost as xgb
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_informative=5, n_classes=2, random_state=1234)
clf = xgb.XGBClassifier(objective="binary:logistic")
clf.fit(X, y)
print(clf.predict(X)[:5])
# [0 0 1 1 1]
Example of incorrect behavior in xgboost.dask.DaskXGBClassifier:
import xgboost as xgb
import dask.dataframe as dd
import dask.distributed
from sklearn.datasets import make_classification
cluster = dask.distributed.LocalCluster(n_workers=2, threads_per_worker=1)
client = dask.distributed.Client(cluster)
X, y = make_classification(n_samples=1000, n_informative=5, n_classes=2, random_state=1234)
X_ = dd.from_array(X, chunksize=500)
y_ = dd.from_array(y, chunksize=500)
clf = xgb.dask.DaskXGBClassifier(objective="binary:logistic")
clf.fit(X_, y_)
print(clf.predict(X_).compute()[:5])
# [0.03111755 0.00773133 0.99876463 0.99792993 0.9944484 ]
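Until this is fixed, one possible workaround for the binary case is to threshold the returned probabilities manually. The sketch below assumes a 0.5 cutoff on the positive-class probability; proba_to_labels is a hypothetical helper, not part of xgboost, and it does not cover multi-class objectives.

```python
import numpy as np

def proba_to_labels(probs, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels (binary case only).

    The 0.5 threshold is an assumption, not something taken from xgboost.
    """
    return (probs > threshold).astype(int)

# Probabilities printed by the DaskXGBClassifier example above.
probs = np.array([0.03111755, 0.00773133, 0.99876463, 0.99792993, 0.9944484])

print(proba_to_labels(probs))
# [0 0 1 1 1]
```

With these values the result matches the labels printed by xgboost.XGBClassifier. The comparison and astype calls are also available on dask collections, so the same helper should be applicable to the lazy result of clf.predict(X_) before calling .compute(), though that has not been verified here.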