[#8218] fix(doris): Fix BUCKETS AUTO parsing for RANDOM distribution #8240

mbaurin · 2025-08-21T13:50:38Z

What changes were proposed in this pull request?

Fixed the DorisUtils.extractBucketNum() method to properly handle BUCKETS AUTO parsing for DISTRIBUTED BY RANDOM statements. The method was incorrectly calling matcher.find(5) which advances the matcher position, causing the bucket value extraction to fail and resulting in a NumberFormatException when parsing null values.

Changes:

Changed if (matcher.find(5)) to if (matcher.group(5) != null) in DorisUtils.extractBucketNum()
Added test case for DISTRIBUTED BY RANDOM BUCKETS AUTO in TestDorisUtils.testDistributedInfoPattern()

Why are the changes needed?

This fixes a critical bug where Doris table operations fail when encountering DISTRIBUTED BY RANDOM BUCKETS AUTO syntax. The root cause was that matcher.find(5) advances the matcher position after the initial match, and since there are no subsequent matches, it returns false and never reaches the bucket value extraction logic. This leaves bucketValue as null, causing Integer.valueOf(bucketValue) to throw NumberFormatException: Cannot parse null string.

The bug affects users trying to load tables with auto bucket distribution using random strategy, which is a valid Doris SQL syntax.

Fix: #8218

Does this PR introduce any user-facing change?

No user-facing API changes. This is a bug fix that enables proper parsing of existing Doris SQL syntax that was previously failing. Users can now successfully work with tables using DISTRIBUTED BY RANDOM BUCKETS AUTO without encountering parsing errors.

How was this patch tested?

Existing tests: Verified all existing Doris utility tests continue to pass, ensuring no regression
New test case: Added TestDorisUtils.testDistributedInfoPattern() test case specifically for DISTRIBUTED BY RANDOM BUCKETS AUTO to prevent future regressions
Manual verification: Created and ran a standalone test to reproduce the original bug and verify the fix resolves the NumberFormatException
Code formatting: Applied Spotless formatting to ensure code style compliance

Test command:

./gradlew :catalogs:catalog-jdbc-doris:test --tests="*TestDorisUtils*"

All tests pass successfully with the fix applied.

yuqi1129 · 2025-08-22T11:57:43Z

@mbaurin
Thank you for your help. I have a question? In which version does Doris support the keyword AUTO? Currently, Doris catalog is based on version 1.2.x, I remembered that there is no AUTO there, please correct me if I'm wrong.

mbaurin · 2025-08-22T12:47:57Z

@mbaurin Thank you for your help. I have a question? In which version does Doris support the keyword AUTO? Currently, Doris catalog is based on version 1.2.x, I remembered that there is no AUTO there, please correct me if I'm wrong.

Thanks for raising this ! I did some research and found that BUCKETS AUTO was actually introduced in Apache Doris 1.2.2, not earlier 1.2.x versions.

From the Doris 1.2.2 release notes and documentation:

Feature was added as "Auto Bucket" in 1.2.2
The 1.2-lts branch has regression tests for BUCKETS AUTO dating from April 2023
It's not available in 1.2.0/1.2.1

What specific Doris version is Gravitino's catalog currently targeting? If we're supporting 1.2.2+, then this fix is valid. If we're still on 1.2.0/1.2.1, we might need to either:

Keep the fix - if we want forward compatibility with 1.2.2+
Add version validation - reject BUCKETS AUTO syntax for older versions
Update our target Doris version - if it's time to move to 1.2.2+

The existing codebase already had partial support for this (the HASH variant was working), so it seems like there was already some expectation of BUCKETS AUTO support.

What's your preference on how to handle this version compatibility ? Happy to adjust the approach accordingly !

yuqi1129 · 2025-08-22T13:18:21Z

Accordingly to docs, the version is 1.2.x, it hard to tell whether it is 1.2.0/1.2.1 or a higher version(1.2.2+)? So, is there any way to support both?

mbaurin · 2025-08-22T13:23:25Z

Accordingly to docs, the version is 1.2.x, it hard to tell whether it is 1.2.0/1.2.1 or a higher version(1.2.2+)? So, is there any way to support both?

Good point, here are some approaches to support both:

Option 1: Graceful degradation

Keep the parsing logic to handle BUCKETS AUTO
When actually executing against Doris, if it fails due to unsupported syntax, we could catch the error and provide a
helpful message
This way newer Doris versions work seamlessly, older versions get clear feedback

Option 2: Runtime version detection

Add a version check against the Doris instance
Parse BUCKETS AUTO only if version >= 1.2.2
Requires adding version detection logic to the catalog

Option 3: Configuration-based

Add a catalog property like doris.support.auto.buckets=true/false
Let users configure based on their Doris version
Default to false for backward compatibility

Option 4: Keep current approach

The existing codebase already had partial AUTO support (HASH variant worked)
This suggests the maintainers expected AUTO support
We're just fixing the RANDOM variant bug

I'd lean toward Option 1, it's the least disruptive and provides the best user experience. What do you think?

yuqi1129 · 2025-08-25T12:43:09Z

Accordingly to docs, the version is 1.2.x, it hard to tell whether it is 1.2.0/1.2.1 or a higher version(1.2.2+)? So, is there any way to support both?

Good point, here are some approaches to support both:

Option 1: Graceful degradation

Keep the parsing logic to handle BUCKETS AUTO

When actually executing against Doris, if it fails due to unsupported syntax, we could catch the error and provide a
helpful message

This way newer Doris versions work seamlessly, older versions get clear feedback

Option 2: Runtime version detection

Add a version check against the Doris instance

Parse BUCKETS AUTO only if version >= 1.2.2

Requires adding version detection logic to the catalog

Option 3: Configuration-based

Add a catalog property like doris.support.auto.buckets=true/false

Let users configure based on their Doris version

Default to false for backward compatibility

Option 4: Keep current approach

The existing codebase already had partial AUTO support (HASH variant worked)

This suggests the maintainers expected AUTO support

We're just fixing the RANDOM variant bug

I'd lean toward Option 1, it's the least disruptive and provides the best user experience. What do you think?

Yeah, solution one is more graceful, and the change is less intrusive.

mbaurin · 2025-09-01T11:29:17Z

@yuqi1129 Option 1 implemented 👍

yuqi1129 · 2025-09-02T03:07:06Z

@mbaurin

Please resolve the code format issue.

…ution

…ection

mbaurin · 2025-09-02T09:47:24Z

@mbaurin

Please resolve the code format issue.

done @yuqi1129

…8240)  ### What changes were proposed in this pull request? Fixed the `DorisUtils.extractBucketNum()` method to properly handle `BUCKETS AUTO` parsing for `DISTRIBUTED BY RANDOM` statements. The method was incorrectly calling `matcher.find(5)` which advances the matcher position, causing the bucket value extraction to fail and resulting in a `NumberFormatException` when parsing null values. **Changes:** - Changed `if (matcher.find(5))` to `if (matcher.group(5) != null)` in `DorisUtils.extractBucketNum()` - Added test case for `DISTRIBUTED BY RANDOM BUCKETS AUTO` in `TestDorisUtils.testDistributedInfoPattern()` ### Why are the changes needed? This fixes a critical bug where Doris table operations fail when encountering `DISTRIBUTED BY RANDOM BUCKETS AUTO` syntax. The root cause was that `matcher.find(5)` advances the matcher position after the initial match, and since there are no subsequent matches, it returns `false` and never reaches the bucket value extraction logic. This leaves `bucketValue` as null, causing `Integer.valueOf(bucketValue)` to throw `NumberFormatException: Cannot parse null string`. The bug affects users trying to load tables with auto bucket distribution using random strategy, which is a valid Doris SQL syntax. Fix: #8218 ### Does this PR introduce _any_ user-facing change? No user-facing API changes. This is a bug fix that enables proper parsing of existing Doris SQL syntax that was previously failing. Users can now successfully work with tables using `DISTRIBUTED BY RANDOM BUCKETS AUTO` without encountering parsing errors. ### How was this patch tested? 1. **Existing tests**: Verified all existing Doris utility tests continue to pass, ensuring no regression 2. **New test case**: Added `TestDorisUtils.testDistributedInfoPattern()` test case specifically for `DISTRIBUTED BY RANDOM BUCKETS AUTO` to prevent future regressions 3. **Manual verification**: Created and ran a standalone test to reproduce the original bug and verify the fix resolves the `NumberFormatException` 4. **Code formatting**: Applied Spotless formatting to ensure code style compliance **Test command:** ``` ./gradlew :catalogs:catalog-jdbc-doris:test --tests="*TestDorisUtils*" ``` All tests pass successfully with the fix applied.

FANNG1 requested a review from yuqi1129 August 22, 2025 03:01

mbaurin force-pushed the fix/doris-buckets-auto-parsing branch from 46bebdd to d4f581b Compare August 23, 2025 19:55

mbaurin force-pushed the fix/doris-buckets-auto-parsing branch 3 times, most recently from 09aa88a to 65f6da5 Compare September 1, 2025 11:28

mbaurin force-pushed the fix/doris-buckets-auto-parsing branch from 65f6da5 to 4c44994 Compare September 1, 2025 14:00

yuqi1129 previously approved these changes Sep 2, 2025

View reviewed changes

mbaurin dismissed yuqi1129’s stale review via 8631743 September 2, 2025 09:46

mbaurin force-pushed the fix/doris-buckets-auto-parsing branch from 8631743 to 86a58fa Compare September 2, 2025 09:46

mbaurin added 2 commits September 2, 2025 11:47

[apache#8218] fix(doris): Fix BUCKETS AUTO parsing for RANDOM distrib…

7befe3f

…ution

[apache#8218] fix(doris): Improve BUCKETS AUTO error handling and det…

06029e6

…ection

mbaurin force-pushed the fix/doris-buckets-auto-parsing branch from 86a58fa to 06029e6 Compare September 2, 2025 09:47

Eipi-Tong mentioned this pull request Sep 3, 2025

[Improvement] Fix AUTO handling in DorisUtils.java #8344

Open

yuqi1129 mentioned this pull request Sep 3, 2025

[#8344] improvement(catalog): set default bucket number to be 1 when group is missing in dorisUtils #8403

Open

yuqi1129 approved these changes Sep 5, 2025

View reviewed changes

yuqi1129 added the branch-1.0 Automatically cherry-pick commit to branch-1.0 label Sep 5, 2025

yuqi1129 assigned mbaurin Sep 5, 2025

yuqi1129 merged commit a3eda2e into apache:main Sep 5, 2025
28 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[#8218] fix(doris): Fix BUCKETS AUTO parsing for RANDOM distribution #8240

[#8218] fix(doris): Fix BUCKETS AUTO parsing for RANDOM distribution #8240

Uh oh!

mbaurin commented Aug 21, 2025

Uh oh!

yuqi1129 commented Aug 22, 2025

Uh oh!

mbaurin commented Aug 22, 2025 •

edited

Loading

Uh oh!

yuqi1129 commented Aug 22, 2025

Uh oh!

mbaurin commented Aug 22, 2025 •

edited

Loading

Uh oh!

yuqi1129 commented Aug 25, 2025

Uh oh!

mbaurin commented Sep 1, 2025

Uh oh!

yuqi1129 commented Sep 2, 2025

Uh oh!

mbaurin commented Sep 2, 2025

Uh oh!

Uh oh!

Uh oh!

[#8218] fix(doris): Fix BUCKETS AUTO parsing for RANDOM distribution #8240

[#8218] fix(doris): Fix BUCKETS AUTO parsing for RANDOM distribution #8240

Uh oh!

Conversation

mbaurin commented Aug 21, 2025

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

yuqi1129 commented Aug 22, 2025

Uh oh!

mbaurin commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yuqi1129 commented Aug 22, 2025

Uh oh!

mbaurin commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yuqi1129 commented Aug 25, 2025

Uh oh!

mbaurin commented Sep 1, 2025

Uh oh!

yuqi1129 commented Sep 2, 2025

Uh oh!

mbaurin commented Sep 2, 2025

Uh oh!

Uh oh!

Uh oh!

mbaurin commented Aug 22, 2025 •

edited

Loading

mbaurin commented Aug 22, 2025 •

edited

Loading