Description
CatBoost Version: 1.2.2
Environment: corporate JupyterLab server
Problem Description
When creating a Pool from a pandas DataFrame, memory usage doubles (one copy for the DataFrame, another for the Pool). To mitigate this, we implemented the following workflow:
- Create Pool
- Save Pool
- Restart kernel to clear memory
- Reload Pool
- Train model
Issue: Training consistently fails at iteration 998/1000 when using a quantized Pool containing a high-cardinality categorical feature ("registration address").
Key Details About the Feature
Name: registration address
Type: String-based categorical
Unique values: ~1.15 million (~80% uniqueness across 1,443,378 samples)
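For reference, a quick way to check this cardinality figure in pandas (a minimal sketch; the column name and train_df are assumed from the reproduction code below):

# Hypothetical cardinality check; 'registration address' and train_df are taken from the report
n_unique = train_df['registration address'].nunique()
print(f"{n_unique} unique values, {n_unique / len(train_df):.0%} of {len(train_df)} samples")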
Reproduction Code
from catboost import Pool, CatBoostClassifier

# Steps 1-2: Create, quantize, and save the Pool
train_pool = Pool(
    data=train_df[all_features],
    label=train_df[TARGET],
    cat_features=cat_feats
)
train_pool.quantize()
train_pool.save('volumes/my_work/Tarasov/Model/Pool.bin')

# Step 3: Restart the kernel to clear memory (re-run the import above)

# Step 4: Reload the quantized Pool
train_pool = Pool('quantized://volumes/my_work/Tarasov/Model/Pool.bin')

# Step 5: Train
model = CatBoostClassifier(
    iterations=1000,
    loss_function='Logloss',
    eval_metric='AUC',
    random_state=888,
    thread_count=30
)
model.fit(train_pool, logging_level='Debug')  # Kernel dies here (fails at 998/1000 iterations)
Observations
The failure only occurs when using the quantized Pool with the "registration address" feature. It fails even if the Pool consists only of the "registration address" feature (see the single-feature sketch after these observations).
Removing this feature allows training to complete successfully.
Memory usage appears normal during training (no OOM errors observed).
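For clarity, the single-feature case mentioned above looks roughly like this (a sketch assuming the same train_df and TARGET as in the reproduction code; the same save/restart/reload/fit steps then trigger the crash):

# Hypothetical single-feature Pool that still reproduces the crash
from catboost import Pool

single_feature_pool = Pool(
    data=train_df[['registration address']],
    label=train_df[TARGET],
    cat_features=['registration address']
)
single_feature_pool.quantize()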
Questions
Pool Serialization: Is there a way to save a Pool with raw string categorical features without quantization?
(Current workaround forces quantization via quantize() to reduce memory usage)
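To illustrate what this first question is after: a hedged sketch of keeping the raw (non-quantized) data on disk and letting CatBoost read it directly via a column-description file, instead of holding a second copy in a DataFrame. The file paths and column layout below are assumptions for illustration:

# Hypothetical file-based loading; avoids the in-memory DataFrame copy entirely.
# 'train.cd' is a tab-separated column-description file, e.g.:
#   0<TAB>Label
#   5<TAB>Categ<TAB>registration address
from catboost import Pool

train_pool = Pool(
    data='volumes/my_work/Tarasov/Model/train.tsv',           # assumed raw TSV export
    column_description='volumes/my_work/Tarasov/Model/train.cd',
    delimiter='\t'
)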
Potential Bug: Could quantization of high-cardinality categorical features cause instability during training?
The consistent failure at iteration 998 suggests a possible edge-case bug in the quantization/training pipeline.
Unfortunately, I can't share the dataset here. The machine has 128 GB of RAM and 32 cores, and peak RAM usage was, as far as I could tell, no more than ~10 GB.
Update
The kernel crashes even on a small subsample of the example above.
I managed to reproduce this issue in version 1.2.7: https://colab.research.google.com/drive/1O27wEymA_jrcRdijSrxdoTTNz3PDSG7v#scrollTo=v5HS8y4gZCQN
Please note that the kernel crashed at iteration 998.