Kernel dies after quantizing a categorical variable #2816

@FedorTarasow

Description

CatBoost Version: 1.2.2

Environment: Corporate Jupyter Lab Server

Problem Description

When creating a Pool from a pandas DataFrame, memory usage doubles (one copy for the DataFrame, another for the Pool; see the measurement sketch after the list below). To mitigate this, we implemented the following workflow:

  1. Create Pool
  2. Save Pool
  3. Restart kernel to clear memory
  4. Reload Pool
  5. Train model
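
To illustrate the doubling described above, here is a minimal measurement sketch. It assumes psutil is installed; train_df, all_features, TARGET and cat_feats are the same names used in the reproduction code below.

import os

import psutil
from catboost import Pool

def rss_gb():
    # Resident set size of the current process, in GB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3

print(f'RSS before Pool: {rss_gb():.1f} GB')  # DataFrame only
train_pool = Pool(
    data=train_df[all_features],
    label=train_df[TARGET],
    cat_features=cat_feats,
)
print(f'RSS after Pool:  {rss_gb():.1f} GB')  # DataFrame + Pool copy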

Issue: Training consistently fails at iteration 998/1000 when using a quantized Pool containing a large categorical feature ("registration address").

Key Details About the Feature
Name: registration address
Type: String-based categorical
Unique values: ~1.15 million (≈80% uniqueness across 1,443,378 samples)
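
The cardinality figures above can be checked directly with pandas (a sketch; the column name "registration address" and the train_df DataFrame are as described in this report):

n_unique = train_df['registration address'].nunique()
n_rows = len(train_df)
print(n_unique, n_rows, n_unique / n_rows)  # ~1.15M unique, 1,443,378 rows, ~0.8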

Reproduction Code

from catboost import Pool, CatBoostClassifier

# Steps 1-2: create, quantize and save the Pool
train_pool = Pool(
    data=train_df[all_features],
    label=train_df[TARGET],
    cat_features=cat_feats
)
train_pool.quantize()
train_pool.save('volumes/my_work/Tarasov/Model/Pool.bin')

# After kernel restart
# Step 4: Reload Pool
train_pool = Pool(
    'quantized://volumes/my_work/Tarasov/Model/Pool.bin'
)

# Step 5: Train
model = CatBoostClassifier(
    iterations=1000,
    loss_function='Logloss',
    eval_metric='AUC',
    random_state=888,
    thread_count=30
)
model.fit(train_pool, logging_level='Debug')  # Kernel dies here  (fails at 998/1000 iterations)

Observations
The failure occurs only when the quantized Pool contains the "registration address" feature; it fails even if the Pool consists of nothing but that feature (a minimal sketch of this case follows these observations).
Removing this feature allows training to complete successfully.
Memory usage appears normal during training (no OOM errors observed).
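
A minimal sketch of the single-feature case mentioned above (names are those from the reproduction code; the quantize/save/reload flow is otherwise unchanged):

from catboost import Pool, CatBoostClassifier

single_pool = Pool(
    data=train_df[['registration address']],
    label=train_df[TARGET],
    cat_features=['registration address'],
)
single_pool.quantize()

model = CatBoostClassifier(iterations=1000, loss_function='Logloss', random_state=888)
model.fit(single_pool)  # kernel still dies around iteration 998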

Questions
Pool Serialization: Is there a way to save a Pool with raw string categorical features without quantization?
(Current workaround forces quantization via quantize() to reduce memory usage)
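
For reference, one direction I have not tried yet (a sketch only, with hypothetical file names train.tsv / train.cd): keep the raw data on disk as a TSV plus a column-description file and let Pool read it directly, which should avoid both quantize() and a second in-memory copy of the DataFrame.

from catboost import Pool

# Hypothetical paths; tabs/newlines inside the address strings would need escaping.
train_df[[TARGET] + all_features].to_csv('train.tsv', sep='\t', header=False, index=False)

# train.cd is written by hand, one line per column: "<zero-based index>\t<type>", e.g.
#   0\tLabel
#   5\tCateg
train_pool = Pool('train.tsv', column_description='train.cd', delimiter='\t')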

Potential Bug: Could quantization of high-cardinality categorical features cause instability during training?
The consistent failure at iteration 998 suggests a possible edge-case bug in the quantization/training pipeline.
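
If high cardinality is indeed the trigger, one way to narrow it down (a sketch; the threshold of 3 occurrences is arbitrary) would be to collapse rare addresses into a single bucket and check whether training still crashes:

counts = train_df['registration address'].value_counts()
rare = counts[counts < 3].index
train_df['registration address reduced'] = train_df['registration address'].where(
    ~train_df['registration address'].isin(rare), 'OTHER'
)
print(train_df['registration address reduced'].nunique())  # far fewer categories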

Unfortunately I can't share the dataset here. The machine has 128 GB of RAM and 32 cores, and as far as I could tell peak RAM usage during training was no more than about 10 GB.

Update

The kernel crashes even on a small subsample of the example above.
I managed to reproduce the issue in version 1.2.7: https://colab.research.google.com/drive/1O27wEymA_jrcRdijSrxdoTTNz3PDSG7v#scrollTo=v5HS8y4gZCQN
Please note that the kernel again crashed at iteration 998.

(screenshot of the crash attached)
