Threshold effect due to multiple cut point strategies in histogram mode

Hi,

Looking at the AddCutPoint() method, I noticed a heuristic with a size threshold (16) that determines which strategy is used to define the cut point values.
Is there a particular reason (other than avoiding division operations and being slightly faster) not to always use the mid-point strategy, even for large cardinalities?

Indeed, when I encode my categorical features based on category frequency and the cardinality is greater than 16, the last 2 categories (the most frequent ones) and the first 2 categories cannot be split. As a consequence, if the 2 categories at either extreme boundary are significantly discriminative, a split that would improve model accuracy cannot occur between them.
This issue is more pronounced with frequency encoding, since the most frequent categories are placed at the upper boundary.

Thanks a lot for any clarification.
Regards

Jacques


void AddCutPoint(WXQSketch::SummaryContainer const& summary) {
  if (summary.size > 1 && summary.size <= 16) {
    /* specialized code for categorical / ordinal data -- use midpoints */
    for (size_t i = 1; i < summary.size; ++i) {
      bst_float cpt = (summary.data[i].value + summary.data[i - 1].value) / 2.0f;
      if (i == 1 || cpt > p_cuts_->cut_values_.back()) {
        p_cuts_->cut_values_.push_back(cpt);
      }
    }
  } else {
    /* otherwise use the interior summary values themselves as cut points */
    for (size_t i = 2; i < summary.size; ++i) {
      bst_float cpt = summary.data[i - 1].value;
      if (i == 2 || cpt > p_cuts_->cut_values_.back()) {
        p_cuts_->cut_values_.push_back(cpt);
      }
    }
  }
}
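
To make the difference concrete, here is a minimal standalone sketch of the two strategies from the snippet above. It is not the XGBoost implementation: the summary is reduced to a sorted vector of distinct values, and the names MidpointCuts, InteriorValueCuts and HasCutStrictlyBetween are hypothetical helpers introduced only for this illustration. It only checks whether a cut lies strictly between two neighbouring values; how a cut that is equal to a value assigns that value to a bin is decided downstream and is not modelled here.

```cpp
#include <cstddef>
#include <vector>

// Illustration only: `values` stands in for the sorted, distinct
// summary.data[i].value entries of the quoted AddCutPoint().

// Branch taken when summary.size <= 16: one cut at the midpoint
// between each pair of neighbouring values.
std::vector<float> MidpointCuts(const std::vector<float>& values) {
  std::vector<float> cuts;
  for (std::size_t i = 1; i < values.size(); ++i) {
    float cpt = (values[i] + values[i - 1]) / 2.0f;
    if (cuts.empty() || cpt > cuts.back()) {
      cuts.push_back(cpt);
    }
  }
  return cuts;
}

// Branch taken otherwise: the cuts are the interior values themselves,
// i.e. values[1] .. values[size - 2].
std::vector<float> InteriorValueCuts(const std::vector<float>& values) {
  std::vector<float> cuts;
  for (std::size_t i = 2; i < values.size(); ++i) {
    float cpt = values[i - 1];
    if (cuts.empty() || cpt > cuts.back()) {
      cuts.push_back(cpt);
    }
  }
  return cuts;
}

// Returns true if some cut c satisfies lo < c < hi.
bool HasCutStrictlyBetween(const std::vector<float>& cuts, float lo, float hi) {
  for (float c : cuts) {
    if (c > lo && c < hi) {
      return true;
    }
  }
  return false;
}
```

With 20 frequency-encoded categories coded 0 to 19, MidpointCuts yields 19 cuts (0.5 through 18.5), whereas InteriorValueCuts yields 18 cuts (1 through 18), so no cut falls strictly between the two most frequent codes 18 and 19.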

Threshold effect due to multiple cut point strategies in histogram mode #5095
