Description
simdutf BOM Detection Bug Report
Title
BOM detection fails for UTF-8 BOM when input length is exactly 3 bytes
Labels
bug, BOM, UTF-8, encoding
Bug Description
The BOM::check_bom function incorrectly requires length >= 4 to detect the UTF-8 BOM, but the UTF-8 BOM consists of exactly 3 bytes (0xEF 0xBB 0xBF). As a result, the function returns encoding_type::unspecified instead of encoding_type::UTF8 when the input contains only the BOM with no additional content.
Affected Code
File: src/simdutf.cpp (around line 943)
Current incorrect code:
} else if (length >= 4 && byte[0] == 0xef and byte[1] == 0xbb and
           byte[2] == 0xbf) {
  return encoding_type::UTF8;
}
Should be:
} else if (length >= 3 && byte[0] == 0xef and byte[1] == 0xbb and
           byte[2] == 0xbf) {
  return encoding_type::UTF8;
}
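To isolate the off-by-one, the sketch below puts the two length guards side by side. It is a standalone illustration rather than simdutf's code: has_utf8_bom_current and has_utf8_bom_fixed are hypothetical helper names, and the only difference between them is the length guard quoted above.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; they are not part of simdutf.

// Mirrors the guard currently in check_bom: demands one byte more than the BOM has.
constexpr bool has_utf8_bom_current(const unsigned char *byte, std::size_t length) {
  return length >= 4 && byte[0] == 0xef && byte[1] == 0xbb && byte[2] == 0xbf;
}

// Proposed guard: the UTF-8 BOM is three bytes, so three bytes are enough.
constexpr bool has_utf8_bom_fixed(const unsigned char *byte, std::size_t length) {
  return length >= 3 && byte[0] == 0xef && byte[1] == 0xbb && byte[2] == 0xbf;
}

// Sample inputs: the BOM alone, and the BOM followed by one content byte.
constexpr unsigned char bom_only[] = {0xef, 0xbb, 0xbf};
constexpr unsigned char bom_plus_h[] = {0xef, 0xbb, 0xbf, 'H'};

// The guards disagree only on the 3-byte, BOM-only input.
static_assert(!has_utf8_bom_current(bom_only, 3), "current guard misses BOM-only input");
static_assert(has_utf8_bom_fixed(bom_only, 3), "fixed guard accepts BOM-only input");
static_assert(has_utf8_bom_current(bom_plus_h, 4), "current guard accepts BOM + content");
static_assert(has_utf8_bom_fixed(bom_plus_h, 4), "fixed guard accepts BOM + content");
```

Because the checks are constexpr, the disagreement on the BOM-only input is verified at compile time, and the short-circuiting length test means neither helper reads past the end of the 3-byte buffer.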
Steps to Reproduce
#include "simdutf.h"
#include <iostream>

int main() {
  // Test case 1: UTF-8 BOM only (3 bytes) - FAILS
  const char utf8_bom_only[] = "\xEF\xBB\xBF";
  auto result1 = simdutf::BOM::check_bom(utf8_bom_only, 3);
  std::cout << "BOM only (3 bytes): " << (int)result1 << std::endl;
  // Expected: 1 (UTF8), Actual: 0 (unspecified)

  // Test case 2: UTF-8 BOM + content (4+ bytes) - WORKS
  const char utf8_bom_content[] = "\xEF\xBB\xBFH";
  auto result2 = simdutf::BOM::check_bom(utf8_bom_content, 4);
  std::cout << "BOM + content (4+ bytes): " << (int)result2 << std::endl;
  // Expected: 1 (UTF8), Actual: 1 (UTF8)

  return 0;
}
Expected vs Actual Behavior
Input | Length | Expected Result | Actual Result |
---|---|---|---|
"\xEF\xBB\xBF" | 3 | UTF8 (1) | unspecified (0) |
"\xEF\xBB\xBFH" | 4 | UTF8 (1) | UTF8 (1) ✓ |
Impact
- Files containing only UTF-8 BOM are not correctly identified
- Inconsistent behavior between "BOM-only" and "BOM+content" cases
- Contradicts the Unicode definition of the UTF-8 BOM as the 3-byte sequence EF BB BF
- Affects edge cases in encoding detection workflows
Environment
- simdutf version: 7.3.2 (based on current source)
- Compiler: Any (logic error independent of compiler)
- Platform: All platforms
Proposed Fix
Single character change: change 4 to 3 in the condition:
- } else if (length >= 4 && byte[0] == 0xef and byte[1] == 0xbb and
+ } else if (length >= 3 && byte[0] == 0xef and byte[1] == 0xbb and
             byte[2] == 0xbf) {
Additional Context
This bug was discovered while implementing comprehensive encoding detection in a C++ project. The UTF-8 BOM is standardized as exactly 3 bytes (EF BB BF), so requiring 4 bytes is incorrect.
All other BOM detections in the same function correctly use the minimum required length:
- UTF-16 LE/BE: length >= 2 ✓
- UTF-32 LE/BE: length >= 4 ✓
- UTF-8: length >= 4 ❌ (should be 3)
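One way to keep such guards from drifting is to derive the minimum length from the signature itself instead of writing it by hand. The following is a minimal sketch under that assumption. Encoding, BomSignature, kSignatures, matches and detect_bom are hypothetical names, and the table-driven layout does not reflect how simdutf structures check_bom.

```cpp
#include <cstddef>

// Hypothetical names for illustration; this is not simdutf's implementation.
enum class Encoding { unspecified, UTF8, UTF16_LE, UTF16_BE, UTF32_LE, UTF32_BE };

struct BomSignature {
  Encoding encoding;
  unsigned char bytes[4];
  std::size_t size; // number of meaningful bytes in `bytes`
};

// UTF-32 LE is listed before UTF-16 LE because its BOM also starts with FF FE.
constexpr BomSignature kSignatures[] = {
    {Encoding::UTF32_LE, {0xff, 0xfe, 0x00, 0x00}, 4},
    {Encoding::UTF32_BE, {0x00, 0x00, 0xfe, 0xff}, 4},
    {Encoding::UTF8, {0xef, 0xbb, 0xbf, 0x00}, 3},
    {Encoding::UTF16_LE, {0xff, 0xfe, 0x00, 0x00}, 2},
    {Encoding::UTF16_BE, {0xfe, 0xff, 0x00, 0x00}, 2},
};

constexpr bool matches(const BomSignature &sig, const unsigned char *data,
                       std::size_t length) {
  if (length < sig.size) return false; // minimum length comes from the table
  for (std::size_t i = 0; i < sig.size; ++i) {
    if (data[i] != sig.bytes[i]) return false;
  }
  return true;
}

// Returns the encoding of the first signature that prefixes the input.
constexpr Encoding detect_bom(const unsigned char *data, std::size_t length) {
  for (const BomSignature &sig : kSignatures) {
    if (matches(sig, data, length)) return sig.encoding;
  }
  return Encoding::unspecified;
}

// BOM-only inputs for a quick compile-time check.
constexpr unsigned char utf8_bom[] = {0xef, 0xbb, 0xbf};
constexpr unsigned char utf16le_bom[] = {0xff, 0xfe};
constexpr unsigned char utf32le_bom[] = {0xff, 0xfe, 0x00, 0x00};

static_assert(detect_bom(utf8_bom, 3) == Encoding::UTF8, "3-byte UTF-8 BOM");
static_assert(detect_bom(utf16le_bom, 2) == Encoding::UTF16_LE, "2-byte UTF-16 LE BOM");
static_assert(detect_bom(utf32le_bom, 4) == Encoding::UTF32_LE, "4-byte UTF-32 LE BOM");
```

With this layout a mismatch between the stated length and the number of bytes compared cannot occur, because both come from the same size field. The trade-off is a loop over a small table in place of the hand-unrolled branches in the current function.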
Analysis Summary
✅ Working "BOM-only" cases:
- UTF-16 LE BOM (0xFF 0xFE, 2 bytes): length >= 2 ✓
- UTF-16 BE BOM (0xFE 0xFF, 2 bytes): length >= 2 ✓
- UTF-32 LE BOM (0xFF 0xFE 0x00 0x00, 4 bytes): length >= 4 ✓
- UTF-32 BE BOM (0x00 0x00 0xFE 0xFF, 4 bytes): length >= 4 ✓
❌ Broken "BOM-only" case:
- UTF-8 BOM (0xEF 0xBB 0xBF, 3 bytes): length >= 4 ❌ (should be >= 3)
This is the only inconsistency in the BOM detection logic.
References
Issue prepared by: [@nszhsl]
Date: 2025-07-07
Severity: Medium (edge case but clear bug)
Fix complexity: Trivial (single character change)