Multidimensional (multi-task) classification with embedding features #2308

@ybm11

Problem: Fitting a multidimensional (multitask, i.e. several label columns in y) classification model with embedding features appears to be unsupported.
catboost version: 1.1.1
Operating System: macOS Ventura 13.2.1

CatBoost's built-in embedding_features support has been very useful for me, as has its multidimensional (multitask) training functionality.
However, the combination of the two, training a multitask classifier (using MultiLogloss) with embedding features, appears to be unsupported.

From my reading of the CatBoost documentation, the information derived from embedding features (projection to a lower-dimensional space and nearest-neighbor search) seems to be agnostic to the target label. My understanding is therefore that nothing logically prevents training multitask classifiers with embedding features.

My goal in opening this issue is to understand whether this combination is blocked for logical or mathematical reasons and, if it is not, whether support for it is planned.

The following code reproduces the problem by fitting a multidimensional classification model with an embedding feature:

import catboost
import numpy as np  # needed for np.random.rand below
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import make_multilabel_classification

print('catboost version:', catboost.__version__)

X, y = make_multilabel_classification(n_samples=500, n_features=20, n_classes=5, random_state=0)
X = pd.DataFrame(X)
X.loc[:,'embeddings'] = list(np.random.rand(500, 100))

train_pool = Pool(X, y, embedding_features=['embeddings'])

clf = CatBoostClassifier(
    loss_function='MultiLogloss',
    iterations=500,
    class_names=['A', 'B', 'C', 'D', 'E'],

)
clf.fit(train_pool, metric_period=10, verbose=50)

Output:

catboost version: 1.1.1
Traceback (most recent call last):
  File "/Users/benmalka/miniforge3/envs/OnboardingAnalysis/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3378, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-51-9c85739ea8f5>", line 21, in <module>
    clf.fit(train_pool, metric_period=10, verbose=50)
  File "/Users/benmalka/miniforge3/envs/OnboardingAnalysis/lib/python3.8/site-packages/catboost/core.py", line 5128, in fit
    self._fit(X, y, cat_features, text_features, embedding_features, None, sample_weight, None, None, None, None, baseline, use_best_model,
  File "/Users/benmalka/miniforge3/envs/OnboardingAnalysis/lib/python3.8/site-packages/catboost/core.py", line 2355, in _fit
    self._train(
  File "/Users/benmalka/miniforge3/envs/OnboardingAnalysis/lib/python3.8/site-packages/catboost/core.py", line 1759, in _train
    self._object._train(train_pool, test_pool, params, allow_clear_pool, init_model._object if init_model else None)
  File "_catboost.pyx", line 4623, in _catboost._CatBoost._train
  File "_catboost.pyx", line 4672, in _catboost._CatBoost._train
_catboost.CatBoostError: catboost/libs/data/target.h:315: Attempt to use multi-dimensional target as one-dimensional
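
A possible workaround, sketched here under assumptions and not confirmed by the CatBoost maintainers: expand each embedding vector into ordinary numeric columns, so that the MultiLogloss model only sees plain numerical features. This gives up CatBoost's embedding-specific processing and treats every embedding coordinate as an independent feature. The helper names `expand_embeddings` and `fit_multilabel` are hypothetical, and the data is random, mirroring the shapes used in the reproduction above.

```python
import numpy as np


def expand_embeddings(features, embeddings):
    """Append each embedding coordinate to the feature matrix as a numeric column."""
    features = np.asarray(features, dtype=float)
    emb = np.vstack(embeddings)  # list of 1-D vectors -> (n_samples, emb_dim)
    if emb.shape[0] != features.shape[0]:
        raise ValueError("features and embeddings must have the same number of rows")
    return np.hstack([features, emb])


def fit_multilabel(X_flat, y, iterations=100):
    """Fit a MultiLogloss model on purely numeric features (requires catboost)."""
    from catboost import CatBoostClassifier  # imported lazily on purpose

    clf = CatBoostClassifier(
        loss_function='MultiLogloss',
        iterations=iterations,
        verbose=False,
    )
    clf.fit(X_flat, y)
    return clf


rng = np.random.default_rng(0)
X_num = rng.random((500, 20))            # 20 ordinary numeric features
emb = list(rng.random((500, 100)))       # one 100-dimensional embedding per row
y = rng.integers(0, 2, size=(500, 5))    # 5 binary labels per row

X_flat = expand_embeddings(X_num, emb)
print(X_flat.shape)

# clf = fit_multilabel(X_flat, y)  # uncomment when catboost is installed
```

Whether this matches the quality of native embedding handling is an open question; it only sidesteps the unsupported combination.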
