Description
Hi,
Looking at the AddCutPoint() method, I noticed a heuristic with a size threshold (16) that determines how cut point values are chosen.
Is there a particular reason (other than avoiding division operations and being slightly faster) not to always use the midpoint strategy, even for high cardinality?
Indeed, when I encode my categorical features by category frequency and the cardinality is greater than 16, the last two categories (the most frequent ones) and the first two categories cannot be split. As a consequence, if two categories at either extreme are discriminative enough to improve model accuracy, no split can occur between them.
This issue is more pronounced with frequency encoding, since the most frequent categories are placed at the upper boundary.
Thanks a lot for any clarification.
Regards
Jacques
void AddCutPoint(WXQSketch::SummaryContainer const& summary) {
  if (summary.size > 1 && summary.size <= 16) {
    /* specialized code categorial / ordinal data -- use midpoints */
    for (size_t i = 1; i < summary.size; ++i) {
      bst_float cpt = (summary.data[i].value + summary.data[i - 1].value) / 2.0f;
      if (i == 1 || cpt > p_cuts_->cut_values_.back()) {
        p_cuts_->cut_values_.push_back(cpt);
      }
    }
  } else {
    for (size_t i = 2; i < summary.size; ++i) {
      bst_float cpt = summary.data[i - 1].value;
      if (i == 2 || cpt > p_cuts_->cut_values_.back()) {
        p_cuts_->cut_values_.push_back(cpt);
      }
    }
  }
}
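
To make the difference concrete, here is a minimal standalone sketch of the two branches quoted above. It is not the library code: the summary is replaced by a plain sorted vector of distinct values, and the names MidpointCuts, InteriorValueCuts and OrdinalCodes are hypothetical helpers introduced only for this illustration. How the resulting cut values are mapped to histogram bins depends on code not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: `values` stands in for summary.data[i].value, i.e. the distinct
// feature values in ascending order. The quoted code appends to a shared
// cut_values_ vector and special-cases the first iteration; this sketch uses a
// fresh vector per call, so checking for an empty vector plays the same role.

// Branch taken when 1 < size <= 16: one cut halfway between each adjacent pair.
std::vector<float> MidpointCuts(const std::vector<float>& values) {
  assert(std::is_sorted(values.begin(), values.end()));
  std::vector<float> cuts;
  for (std::size_t i = 1; i < values.size(); ++i) {
    float cpt = (values[i] + values[i - 1]) / 2.0f;
    if (cuts.empty() || cpt > cuts.back()) cuts.push_back(cpt);
  }
  return cuts;
}

// Branch taken otherwise: the interior values themselves become the cuts,
// so the first and last values never contribute a cut.
std::vector<float> InteriorValueCuts(const std::vector<float>& values) {
  assert(std::is_sorted(values.begin(), values.end()));
  std::vector<float> cuts;
  for (std::size_t i = 2; i < values.size(); ++i) {
    float cpt = values[i - 1];
    if (cuts.empty() || cpt > cuts.back()) cuts.push_back(cpt);
  }
  return cuts;
}

// Hypothetical helper: category codes 0, 1, ..., n-1, as an ordinal or
// frequency-rank encoding might produce.
std::vector<float> OrdinalCodes(std::size_t n) {
  std::vector<float> codes;
  codes.reserve(n);
  for (std::size_t i = 0; i < n; ++i) codes.push_back(static_cast<float>(i));
  return codes;
}
```

For 20 categories coded 0 to 19, MidpointCuts(OrdinalCodes(20)) yields 19 cuts (0.5, 1.5, ..., 18.5), one between every pair of adjacent categories. InteriorValueCuts(OrdinalCodes(20)) yields 18 cuts (1, 2, ..., 18), and none of them lies strictly between codes 18 and 19.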