Description
In file src/common/quantile.h, at lines 210-211:
CHECK(i != src.size - 1);
if (dx2 < src.data[i].RMinNext() + src.data[i + 1].RMaxPrev()) { ... ... }
The CHECK logs an error message if i == src.size - 1, but execution then continues to the next line, where src.data[i + 1] is accessed. This appears to be an out-of-bounds array access. With a large dataset (e.g. a 64GB mortgage dataset) in distributed training on Spark, we see task failures that can be attributed to this bug.
The two lines of code mentioned above are in the function WQSummary::SetPrune() and have been around for years, but the problem manifested itself only recently, when this PR was merged. One thing the PR changed was switching from WXQSketch to WQSketch. As a result, WQSummary::SetPrune() replaced WXQSummary::SetPrune() in the execution path. WXQSummary::SetPrune() had a similar check, but when that check fails it breaks out of the enclosing for-loop instead of continuing; see lines 425-426 in file quantile.h:
if (i == end) break;
if (dx2 < src.data[i].RMinNext() + src.data[i + 1].RMaxPrev()) { ... ... }
I believe we should do the same (break out of the for-loop) in WQSummary::SetPrune(). Thanks.
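To make the proposal concrete, here is a minimal, self-contained sketch of the guard pattern. It is not the XGBoost code: Entry, SelectGuarded and the queries vector are hypothetical, simplified stand-ins, and the bodies of RMinNext() and RMaxPrev() are my assumption of what those helpers compute. The only point is that the loop is left before data[i + 1] can be read past the end.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a summary entry (not the XGBoost type).
struct Entry {
  double rmin, rmax, wmin;
  // Assumed helper definitions, for illustration only.
  double RMinNext() const { return rmin + wmin; }
  double RMaxPrev() const { return rmax - wmin; }
};

// For each query value (assumed ascending), advance i and pick either entry i
// or entry i + 1. Returns the picked indices. The guard stops the loop as soon
// as i reaches the last entry, so data[i + 1] is never read out of bounds.
std::vector<std::size_t> SelectGuarded(const std::vector<Entry>& data,
                                       const std::vector<double>& queries) {
  std::vector<std::size_t> picked;
  if (data.size() < 2) return picked;
  const std::size_t end = data.size() - 1;
  std::size_t i = 0;
  for (double dx2 : queries) {
    while (i < end && dx2 >= data[i + 1].rmax + data[i + 1].rmin) ++i;
    if (i == end) break;  // no data[i + 1] exists: leave the loop, do not continue
    if (dx2 < data[i].RMinNext() + data[i + 1].RMaxPrev()) {
      picked.push_back(i);
    } else {
      picked.push_back(i + 1);
    }
  }
  return picked;
}
```

With three entries and the queries 0.5 and 100.0, the second query pushes i to the last entry, and the guard ends the loop there instead of touching data[3].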