[BUG]: when using create_api the values for y are referenced wrongly (pandas selection) for the data model with named index #3816

@pascal456

pycaret version checks

Issue Description

When creating an API with create_api, the pydantic data model is built from the first entries of experiment.X and experiment.y. For y, that entry is selected by label (self.y[0]) rather than by position. This raises a KeyError when the label 0 is not present in the index of the (training) data.

For the independent features, the entry is already selected correctly by position (experiment.X.iloc[0]).
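The difference between the two lookups can be reproduced with pandas alone. The snippet below is a minimal sketch, not PyCaret code: `y` is a hypothetical stand-in for experiment.y, and the customer IDs in its index are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for experiment.y: the index holds customer IDs,
# so the label 0 does not exist.
y = pd.Series(
    [16884.924, 1725.5523, 4449.462],
    index=pd.Index([48213, 10977, 73560], name="customer_id"),
    name="charges",
)

try:
    y[0]  # on an integer index this is a label lookup, not a positional one
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Positional lookup returns the first row regardless of the index labels.
first_value = y.iloc[0]
```

With an integer index, `y[0]` looks for the label 0, while `y.iloc[0]` takes the first row by position.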

The current workaround is to drop the index before training, which is undesirable when observations need to be identified later. It is also not very robust.
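For reference, the workaround might look roughly like the sketch below, shown with pandas only. The frame, the customer IDs, and the names `id_lookup` and `train_data` are illustrative assumptions; `train_data` is what would be passed to setup in place of the indexed frame.

```python
import pandas as pd

# Hypothetical frame indexed by customer ID (values are made up).
data = pd.DataFrame(
    {"age": [19, 18, 28], "charges": [16884.924, 1725.5523, 4449.462]},
    index=pd.Index([48213, 10977, 73560], name="customer_id"),
)

# Keep a position -> customer ID lookup so observations can be identified later.
id_lookup = pd.Series(data.index, name="customer_id")

# Drop the named index; the frame now has a default 0..n-1 index,
# so a label lookup for 0 succeeds.
train_data = data.reset_index(drop=True)
```

The extra lookup has to be carried around alongside the experiment, which is why this approach is fragile.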

Reproducible Example

import random

from pycaret.datasets import get_data
from pycaret.regression import (
    compare_models,
    create_api,
    create_model,
    setup,
    tune_model,
)

data = get_data("insurance")

# Let's say we have customer IDs that we want to use as the observation names:
customer_ids = random.sample(range(10000, 99999), len(data))
data["customer_id"] = customer_ids
data.set_index("customer_id", inplace=True)

s = setup(data, target="charges", session_id=123)

best = compare_models(include=["rf", "gbr"])
model = create_model(best)
tuned_model = tune_model(model)

create_api(tuned_model, "trained_models/my_first_api")

Expected Behavior

Creating an API should also work when the index of the underlying data does not contain the label 0. The reference to the X data already handles this correctly (.iloc[0]).
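A possible fix is to build the output model's default value from the first row by position. The sketch below mirrors the template line shown in the traceback; the helper `build_output_model_line` is hypothetical and not part of PyCaret, and the data is made up.

```python
import pandas as pd


def build_output_model_line(api_name: str, target: str, y: pd.Series) -> str:
    """Hypothetical helper: render the generated line using a positional lookup."""
    # y.iloc[0] replaces y[0], so the index labels no longer matter.
    return (
        f'output_model = create_model("{api_name}_output", '
        f"{target}={repr(y.iloc[0])})"
    )


# Target series with customer IDs as index labels (illustrative values).
y = pd.Series([16884.924, 1725.5523], index=[48213, 10977], name="charges")
line = build_output_model_line("my_first_api", "charges", y)
```

This only illustrates the selection change; the actual code in tabular_experiment.py may need further adjustments.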

Actual Results

A KeyError is raised when there is no label `0` in the index of `experiment.y`.

Output:

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
                    Description             Value
0                    Session id               123
1                        Target           charges
2                   Target type        Regression
3           Original data shape         (1338, 7)
4        Transformed data shape        (1338, 10)
5   Transformed train set shape         (936, 10)
6    Transformed test set shape         (402, 10)
7              Ordinal features                 2
8              Numeric features                 3
9          Categorical features                 3
10                   Preprocess              True
11              Imputation type            simple
12           Numeric imputation              mean
13       Categorical imputation              mode
14     Maximum one-hot encoding                25
15              Encoding method              None
16               Fold Generator             KFold
17                  Fold Number                10
18                     CPU Jobs                -1
19                      Use GPU             False
20               Log Experiment             False
21              Experiment Name  reg-default-name
22                          USI              f227
                           Model        MAE           MSE       RMSE      R2  \                                                            
gbr  Gradient Boosting Regressor  2701.9919  2.354866e+07  4832.9329  0.8320   
rf       Random Forest Regressor  2771.4583  2.541650e+07  5028.6343  0.8172   

      RMSLE    MAPE  TT (Sec)  
gbr  0.4447  0.3137     0.265  
rf   0.4690  0.3303     0.413  
            MAE           MSE       RMSE      R2   RMSLE    MAPE                                                                           
Fold                                                            
0     2651.0179  2.031297e+07  4506.9911  0.8787  0.4506  0.3416
1     3047.5028  3.178944e+07  5638.2121  0.8152  0.4560  0.2993
2     2526.0336  2.221968e+07  4713.7761  0.7187  0.4909  0.3007
3     2975.2686  2.309014e+07  4805.2205  0.8072  0.4866  0.3810
4     2847.7050  2.716372e+07  5211.8830  0.7980  0.5103  0.3367
5     2580.5742  1.905227e+07  4364.8910  0.8774  0.3340  0.2437
6     2366.1844  1.924113e+07  4386.4713  0.8691  0.3504  0.2649
7     2671.5877  2.550159e+07  5049.9101  0.8598  0.4414  0.2748
8     2325.6224  1.856430e+07  4308.6309  0.8801  0.3889  0.2888
9     3028.4227  2.855132e+07  5343.3434  0.8161  0.5379  0.4058
Mean  2701.9919  2.354866e+07  4832.9329  0.8320  0.4447  0.3137
Std    250.0988  4.311402e+06   437.5116  0.0488  0.0643  0.0491
Processing:   0%|          | 0/7 [00:00<?, ?it/s]
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
            MAE           MSE       RMSE      R2   RMSLE    MAPE
Fold                                                            
0     3385.2881  2.980593e+07  5459.4806  0.8220  0.5659  0.4748
1     3609.4719  3.409537e+07  5839.1241  0.8018  0.5114  0.4118
2     3710.7020  3.587883e+07  5989.8940  0.5457  0.7977  0.5374
3     3845.1139  3.313262e+07  5756.0945  0.7233  0.7606  0.5996
4     3801.3662  3.986850e+07  6314.1507  0.7035  0.6453  0.4905
5     3617.8384  3.086773e+07  5555.8735  0.8014  0.5108  0.3681
6     3353.2437  2.737254e+07  5231.8769  0.8137  0.5943  0.4230
7     3171.2929  2.978108e+07  5457.2044  0.8362  0.7103  0.3373
8     3010.7330  2.390871e+07  4889.6532  0.8456  0.5965  0.5119
9     3850.3856  4.028390e+07  6346.9600  0.7405  0.7465  0.5663
Mean  3535.5436  3.249952e+07  5684.0312  0.7634  0.6439  0.4721
Std    277.9305  4.966506e+06   437.3902  0.0860  0.0991  0.0814
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.10.13/envs/pycaret-development/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 2263, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 2273, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/projects/pycaret/trained_models/delete-me-experiment.py", line 26, in <module>
    create_api(tuned_model, "trained_models/my_first_api")
  File "/home/ubuntu/projects/pycaret/pycaret/utils/generic.py", line 965, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/projects/pycaret/pycaret/regression/functional.py", line 2885, in create_api
    return _CURRENT_EXPERIMENT.create_api(
  File "/home/ubuntu/projects/pycaret/pycaret/internal/pycaret_experiment/tabular_experiment.py", line 2588, in create_api
    output_model = create_model("{api_name}_output", {target}={repr(self.y[0])})
  File "/home/ubuntu/.pyenv/versions/3.10.13/envs/pycaret-development/lib/python3.10/site-packages/pandas/core/series.py", line 981, in __getitem__
    return self._get_value(key)
  File "/home/ubuntu/.pyenv/versions/3.10.13/envs/pycaret-development/lib/python3.10/site-packages/pandas/core/series.py", line 1089, in _get_value
    loc = self.index.get_loc(label)
  File "/home/ubuntu/.pyenv/versions/3.10.13/envs/pycaret-development/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 0


### Installed Versions

<details>
System:
    python: 3.10.13 (main, Nov  7 2023, 10:19:12) [GCC 9.4.0]
executable: /home/ubuntu/.pyenv/versions/pycaret-development/bin/python
   machine: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31

PyCaret required dependencies:
                 pip: 23.3.1
          setuptools: 65.5.0
             pycaret: 3.1.0
             IPython: 8.17.2
          ipywidgets: 8.1.1
                tqdm: 4.66.1
               numpy: 1.23.5
              pandas: 1.5.3
              jinja2: 3.1.2
               scipy: 1.10.1
              joblib: 1.3.2
             sklearn: 1.2.2
                pyod: 1.1.1
            imblearn: 0.11.0
   category_encoders: 2.6.3
            lightgbm: 4.1.0
               numba: 0.58.1
            requests: 2.31.0
          matplotlib: 3.6.0
          scikitplot: 0.3.7
         yellowbrick: 1.5
              plotly: 5.18.0
    plotly-resampler: Not installed
             kaleido: 0.2.1
           schemdraw: 0.15
         statsmodels: 0.14.0
              sktime: 0.21.1
               tbats: 1.1.3
            pmdarima: 2.0.4
              psutil: 5.9.6
          markupsafe: 2.1.3
             pickle5: Not installed
         cloudpickle: 2.2.1
         deprecation: 2.1.0
              xxhash: 3.4.1
           wurlitzer: 3.0.3

PyCaret optional dependencies:
                shap: 0.43.0
           interpret: 0.4.4
                umap: 0.5.4
     ydata_profiling: 4.6.0
  explainerdashboard: 0.4.3
             autoviz: Not installed
           fairlearn: 0.7.0
          deepchecks: Not installed
             xgboost: 2.0.1
            catboost: 1.2.2
              kmodes: 0.12.2
             mlxtend: 0.23.0
       statsforecast: 1.5.0
        tune_sklearn: 0.5.0
                 ray: 2.8.0
            hyperopt: 0.2.7
              optuna: 3.4.0
               skopt: 0.9.0
              mlflow: 1.30.1
              gradio: 3.50.2
             fastapi: 0.104.1
             uvicorn: 0.24.0.post1
              m2cgen: 0.10.0
           evidently: 0.2.8
               fugue: 0.8.6
           streamlit: Not installed
             prophet: Not installed
</details>

Labels: bug
