Skip to content

Conversation

mbaurin
Copy link
Contributor

@mbaurin mbaurin commented Aug 21, 2025

What changes were proposed in this pull request?

Fixed the DorisUtils.extractBucketNum() method to properly handle BUCKETS AUTO parsing for DISTRIBUTED BY RANDOM statements. The method was incorrectly calling matcher.find(5) which advances the matcher position, causing the bucket value extraction to fail and resulting in a NumberFormatException when parsing null values.

Changes:

  • Changed if (matcher.find(5)) to if (matcher.group(5) != null) in DorisUtils.extractBucketNum()
  • Added test case for DISTRIBUTED BY RANDOM BUCKETS AUTO in TestDorisUtils.testDistributedInfoPattern()

Why are the changes needed?

This fixes a critical bug where Doris table operations fail when encountering DISTRIBUTED BY RANDOM BUCKETS AUTO syntax. The root cause was that matcher.find(5) advances the matcher position after the initial match, and since there are no subsequent matches, it returns false and never reaches the bucket value extraction logic. This leaves bucketValue as null, causing Integer.valueOf(bucketValue) to throw NumberFormatException: Cannot parse null string.

The bug affects users trying to load tables with auto bucket distribution using random strategy, which is a valid Doris SQL syntax.

Fix: #8218

Does this PR introduce any user-facing change?

No user-facing API changes. This is a bug fix that enables proper parsing of existing Doris SQL syntax that was previously failing. Users can now successfully work with tables using DISTRIBUTED BY RANDOM BUCKETS AUTO without encountering parsing errors.

How was this patch tested?

  1. Existing tests: Verified all existing Doris utility tests continue to pass, ensuring no regression
  2. New test case: Added TestDorisUtils.testDistributedInfoPattern() test case specifically for DISTRIBUTED BY RANDOM BUCKETS AUTO to prevent future regressions
  3. Manual verification: Created and ran a standalone test to reproduce the original bug and verify the fix resolves the NumberFormatException
  4. Code formatting: Applied Spotless formatting to ensure code style compliance

Test command:

./gradlew :catalogs:catalog-jdbc-doris:test --tests="*TestDorisUtils*"

All tests pass successfully with the fix applied.

@FANNG1 FANNG1 requested a review from yuqi1129 August 22, 2025 03:01
@yuqi1129
Copy link
Contributor

@mbaurin
Thank you for your help. I have a question? In which version does Doris support the keyword AUTO? Currently, Doris catalog is based on version 1.2.x, I remembered that there is no AUTO there, please correct me if I'm wrong.

@mbaurin
Copy link
Contributor Author

mbaurin commented Aug 22, 2025

@mbaurin Thank you for your help. I have a question? In which version does Doris support the keyword AUTO? Currently, Doris catalog is based on version 1.2.x, I remembered that there is no AUTO there, please correct me if I'm wrong.

Thanks for raising this ! I did some research and found that BUCKETS AUTO was actually introduced in Apache Doris 1.2.2, not earlier 1.2.x versions.

From the Doris 1.2.2 release notes and documentation:

  • Feature was added as "Auto Bucket" in 1.2.2
  • The 1.2-lts branch has regression tests for BUCKETS AUTO dating from April 2023
  • It's not available in 1.2.0/1.2.1

What specific Doris version is Gravitino's catalog currently targeting? If we're supporting 1.2.2+, then this fix is valid. If we're still on 1.2.0/1.2.1, we might need to either:

  1. Keep the fix - if we want forward compatibility with 1.2.2+
  2. Add version validation - reject BUCKETS AUTO syntax for older versions
  3. Update our target Doris version - if it's time to move to 1.2.2+

The existing codebase already had partial support for this (the HASH variant was working), so it seems like there was already some expectation of BUCKETS AUTO support.

What's your preference on how to handle this version compatibility ? Happy to adjust the approach accordingly !

@yuqi1129
Copy link
Contributor

Accordingly to docs, the version is 1.2.x, it hard to tell whether it is 1.2.0/1.2.1 or a higher version(1.2.2+)? So, is there any way to support both?

@mbaurin
Copy link
Contributor Author

mbaurin commented Aug 22, 2025

Accordingly to docs, the version is 1.2.x, it hard to tell whether it is 1.2.0/1.2.1 or a higher version(1.2.2+)? So, is there any way to support both?

Good point, here are some approaches to support both:

Option 1: Graceful degradation

  • Keep the parsing logic to handle BUCKETS AUTO
  • When actually executing against Doris, if it fails due to unsupported syntax, we could catch the error and provide a
    helpful message
  • This way newer Doris versions work seamlessly, older versions get clear feedback

Option 2: Runtime version detection

  • Add a version check against the Doris instance
  • Parse BUCKETS AUTO only if version >= 1.2.2
  • Requires adding version detection logic to the catalog

Option 3: Configuration-based

  • Add a catalog property like doris.support.auto.buckets=true/false
  • Let users configure based on their Doris version
  • Default to false for backward compatibility

Option 4: Keep current approach

  • The existing codebase already had partial AUTO support (HASH variant worked)
  • This suggests the maintainers expected AUTO support
  • We're just fixing the RANDOM variant bug

I'd lean toward Option 1, it's the least disruptive and provides the best user experience. What do you think?

@mbaurin mbaurin force-pushed the fix/doris-buckets-auto-parsing branch from 46bebdd to d4f581b Compare August 23, 2025 19:55
@yuqi1129
Copy link
Contributor

Accordingly to docs, the version is 1.2.x, it hard to tell whether it is 1.2.0/1.2.1 or a higher version(1.2.2+)? So, is there any way to support both?

Good point, here are some approaches to support both:

Option 1: Graceful degradation

  • Keep the parsing logic to handle BUCKETS AUTO
  • When actually executing against Doris, if it fails due to unsupported syntax, we could catch the error and provide a
    helpful message
  • This way newer Doris versions work seamlessly, older versions get clear feedback

Option 2: Runtime version detection

  • Add a version check against the Doris instance
  • Parse BUCKETS AUTO only if version >= 1.2.2
  • Requires adding version detection logic to the catalog

Option 3: Configuration-based

  • Add a catalog property like doris.support.auto.buckets=true/false
  • Let users configure based on their Doris version
  • Default to false for backward compatibility

Option 4: Keep current approach

  • The existing codebase already had partial AUTO support (HASH variant worked)
  • This suggests the maintainers expected AUTO support
  • We're just fixing the RANDOM variant bug

I'd lean toward Option 1, it's the least disruptive and provides the best user experience. What do you think?

Yeah, solution one is more graceful, and the change is less intrusive.

@mbaurin mbaurin force-pushed the fix/doris-buckets-auto-parsing branch 3 times, most recently from 09aa88a to 65f6da5 Compare September 1, 2025 11:28
@mbaurin
Copy link
Contributor Author

mbaurin commented Sep 1, 2025

@yuqi1129 Option 1 implemented 👍

@mbaurin mbaurin force-pushed the fix/doris-buckets-auto-parsing branch from 65f6da5 to 4c44994 Compare September 1, 2025 14:00
yuqi1129
yuqi1129 previously approved these changes Sep 2, 2025
@yuqi1129
Copy link
Contributor

yuqi1129 commented Sep 2, 2025

@mbaurin

Please resolve the code format issue.

@mbaurin mbaurin force-pushed the fix/doris-buckets-auto-parsing branch from 86a58fa to 06029e6 Compare September 2, 2025 09:47
@mbaurin
Copy link
Contributor Author

mbaurin commented Sep 2, 2025

@mbaurin

Please resolve the code format issue.

done @yuqi1129

@yuqi1129 yuqi1129 added the branch-1.0 Automatically cherry-pick commit to branch-1.0 label Sep 5, 2025
@yuqi1129 yuqi1129 merged commit a3eda2e into apache:main Sep 5, 2025
28 checks passed
github-actions bot pushed a commit that referenced this pull request Sep 5, 2025
…8240)

<!--
1. Title: [#<issue>] <type>(<scope>): <subject>
   Examples:
     - "[#123] feat(operator): support xxx"
     - "[#233] fix: check null before access result in xxx"
     - "[MINOR] refactor: fix typo in variable name"
     - "[MINOR] docs: fix typo in README"
     - "[#255] test: fix flaky test NameOfTheTest"
   Reference: https://www.conventionalcommits.org/en/v1.0.0/
2. If the PR is unfinished, please mark this PR as draft.
-->

### What changes were proposed in this pull request?

Fixed the `DorisUtils.extractBucketNum()` method to properly handle
`BUCKETS AUTO` parsing for `DISTRIBUTED BY RANDOM` statements. The
method was incorrectly calling `matcher.find(5)` which advances the
matcher position, causing the bucket value extraction to fail and
resulting in a `NumberFormatException` when parsing null values.

**Changes:**
- Changed `if (matcher.find(5))` to `if (matcher.group(5) != null)` in
`DorisUtils.extractBucketNum()`
- Added test case for `DISTRIBUTED BY RANDOM BUCKETS AUTO` in
`TestDorisUtils.testDistributedInfoPattern()`

### Why are the changes needed?

This fixes a critical bug where Doris table operations fail when
encountering `DISTRIBUTED BY RANDOM BUCKETS AUTO` syntax. The root cause
was that `matcher.find(5)` advances the matcher position after the
initial match, and since there are no subsequent matches, it returns
`false` and never reaches the bucket value extraction logic. This leaves
`bucketValue` as null, causing `Integer.valueOf(bucketValue)` to throw
`NumberFormatException: Cannot parse null string`.

The bug affects users trying to load tables with auto bucket
distribution using random strategy, which is a valid Doris SQL syntax.

  Fix: #8218 

### Does this PR introduce _any_ user-facing change?

No user-facing API changes. This is a bug fix that enables proper
parsing of existing Doris SQL syntax that was previously failing. Users
can now successfully work with tables using `DISTRIBUTED BY RANDOM
BUCKETS AUTO` without encountering parsing errors.

### How was this patch tested?


1. **Existing tests**: Verified all existing Doris utility tests
continue to pass, ensuring no regression
2. **New test case**: Added
`TestDorisUtils.testDistributedInfoPattern()` test case specifically for
`DISTRIBUTED BY RANDOM BUCKETS AUTO` to prevent future regressions
3. **Manual verification**: Created and ran a standalone test to
reproduce the original bug and verify the fix resolves the
`NumberFormatException`
4. **Code formatting**: Applied Spotless formatting to ensure code style
compliance

**Test command:**
```
./gradlew :catalogs:catalog-jdbc-doris:test --tests="*TestDorisUtils*"
```

All tests pass successfully with the fix applied.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-1.0 Automatically cherry-pick commit to branch-1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug report] Doris Bucket Not Number
2 participants