
Training fails when using external memory version for large datasets with instance weights #5866

@prvnsmpth

Description


I am attempting to train an XGBoost model on a large dataset that I cannot load completely into memory, so I decided to use XGBoost's external memory training feature, like so:

import xgboost as xgb

# feature_names is defined earlier in the training script
dtrain = xgb.DMatrix("data/train.libsvm#train.cache", feature_names=feature_names)

Now I also need to specify instance weights, so I tried providing them in a separate train.libsvm.weight file:

$ head -5 data/train.libsvm.weight
5.28486226776928e-7
5.28486226776928e-7
5.28486226776928e-7
5.28486226776928e-7
5.28486226776928e-7
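
For reference, this sidecar file is plain text with one weight per line, in the same row order as train.libsvm. A minimal sketch of how such a file could be generated (the NumPy usage and the constant weight value here are illustrative, not the exact script used):

import numpy as np

# One weight per row of data/train.libsvm, written one value per line.
weights = np.full(11252555, 5.28486226776928e-7)
np.savetxt("data/train.libsvm.weight", weights)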

However, training fails with the following error:

[22:15:43] 11252555x174 matrix with 44859377 entries loaded from data/train.libsvm#train.cache
[22:15:46] 11252555 weights are loaded from data/train.libsvm.weight
Traceback (most recent call last):
  File "train.py", line 104, in <module>
    train(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
  File "train.py", line 66, in train
    model = xgb.train(params, dtrain, num_rounds, watchlist)
  File "/home/praveen/auto-test-web/auto-test-web/src/ml/venv/lib/python3.8/site-packages/xgboost/training.py", line 208, in train
    return _train_internal(params, dtrain,
  File "/home/praveen/auto-test-web/auto-test-web/src/ml/venv/lib/python3.8/site-packages/xgboost/training.py", line 75, in _train_internal
    bst.update(dtrain, i, obj)
  File "/home/praveen/auto-test-web/auto-test-web/src/ml/venv/lib/python3.8/site-packages/xgboost/core.py", line 1367, in update
    _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
  File "/home/praveen/auto-test-web/auto-test-web/src/ml/venv/lib/python3.8/site-packages/xgboost/core.py", line 190, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [22:15:46] /workspace/src/tree/updater_gpu_hist.cu:952: Exception in gpu_hist: [22:15:46] /workspace/src/common/hist_util.cu:287: Check failed: weights.size() == page.offset.Size() - 1 (11252555 vs. 921785

So from the error message, it appears the weights file is recognized and weights for all 11252555 instances are loaded. However, because external memory training processes the data in pages, only 921785 rows are loaded at a time, and the requirement that the weight vector be the same length as the current batch of data no longer holds.
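
To spell out the mismatch: the check compares the length of the full weight vector against the number of rows in a single external-memory page. A rough sketch using the numbers from the log above:

# Simplified restatement of the failing check in src/common/hist_util.cu
total_weights = 11252555  # weights loaded from data/train.libsvm.weight
rows_in_page = 921785     # rows in the current external-memory page
assert total_weights == rows_in_page  # fails, raising the XGBoostError above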

I have also tried specifying the weights directly in the LibSVM input file, by replacing each label entry with label:weight, but I get the exact same error.
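
For clarity, by that I mean rows of the form label:weight followed by the usual feature:value pairs, e.g. (the feature indices and values here are made up):

1:5.28486226776928e-7 3:0.4 17:1.0 52:2.5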

I'm using XGBoost version 1.1.0.
