Threshold effect due to multiple cut point strategies in histogram mode #5095

@DoanHuuJacques

Hi,

Looking at the AddCutPoint() method, I noticed a heuristic with a size threshold (16) that determines which strategy is used to define the cut point values.
Is there a particular reason (other than avoiding division operations and being slightly faster) not to always use the midpoint strategy, even for large cardinality?

Indeed, when I encode my categorical features by category frequency and the cardinality is greater than 16, the last 2 categories (the most frequent ones) and the first 2 categories cannot be split from each other. As a consequence, if 2 categories at either extreme boundary are strongly discriminative, no split can occur between them, even though it would improve model accuracy.
This issue is more pronounced with frequency encoding, since the most frequent categories are placed at the upper boundary.

Thanks a lot for any clarification.
Regards

Jacques

void AddCutPoint(WXQSketch::SummaryContainer const& summary) {
  if (summary.size > 1 && summary.size <= 16) {
    /* specialized code categorial / ordinal data -- use midpoints */
    for (size_t i = 1; i < summary.size; ++i) {
      bst_float cpt = (summary.data[i].value + summary.data[i - 1].value) / 2.0f;
      if (i == 1 || cpt > p_cuts_->cut_values_.back()) {
        p_cuts_->cut_values_.push_back(cpt);
      }
    }
  } else {
    for (size_t i = 2; i < summary.size; ++i) {
      bst_float cpt = summary.data[i - 1].value;
      if (i == 2 || cpt > p_cuts_->cut_values_.back()) {
        p_cuts_->cut_values_.push_back(cpt);
      }
    }
  }
}
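
To make the threshold effect easier to reproduce, here is a minimal standalone sketch of the two branches above. It is not the library code: `MakeCuts` and `FrequencyRanks` are hypothetical helpers, the summary is replaced by a plain sorted vector of distinct values, and the threshold is exposed as a parameter purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for AddCutPoint(): takes sorted values instead of a
// WXQSketch summary and returns the cut values instead of appending to p_cuts_.
std::vector<float> MakeCuts(const std::vector<float>& values,
                            std::size_t threshold = 16) {
  assert(std::is_sorted(values.begin(), values.end()));
  std::vector<float> cuts;
  const std::size_t n = values.size();
  if (n > 1 && n <= threshold) {
    // Small cardinality: one cut at the midpoint of each adjacent pair.
    for (std::size_t i = 1; i < n; ++i) {
      const float cpt = (values[i] + values[i - 1]) / 2.0f;
      if (i == 1 || cpt > cuts.back()) cuts.push_back(cpt);
    }
  } else {
    // Large cardinality: cuts are the interior values themselves,
    // values[1] .. values[n - 2]; the first and last values are skipped.
    for (std::size_t i = 2; i < n; ++i) {
      const float cpt = values[i - 1];
      if (i == 2 || cpt > cuts.back()) cuts.push_back(cpt);
    }
  }
  return cuts;
}

// Hypothetical frequency encoding: categories mapped to ranks 0, 1, ..., n - 1.
std::vector<float> FrequencyRanks(std::size_t n) {
  std::vector<float> ranks(n);
  for (std::size_t i = 0; i < n; ++i) ranks[i] = static_cast<float>(i);
  return ranks;
}
```

With 16 ranks (0 to 15), `MakeCuts(FrequencyRanks(16))` gives the 15 midpoints 0.5, 1.5, ..., 14.5, so every pair of adjacent categories has a cut strictly between them. With 17 ranks (0 to 16), it gives the 15 values 1, 2, ..., 15: no cut lies strictly between 15 and 16. Considering only the cuts produced by this function, and assuming they are consumed with a `value < cut` comparison, the two most frequent categories always fall on the same side of every cut.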
