BOM detection fails for UTF-8 BOM when input length is exactly 3 bytes #818

@nszhsl

simdutf BOM Detection Bug Report

Labels

bug, BOM, UTF-8, encoding


Bug Description

The BOM::check_bom function requires length >= 4 before it will detect a UTF-8 BOM, but the UTF-8 BOM is exactly 3 bytes long (0xEF 0xBB 0xBF). As a result, the function returns encoding_type::unspecified instead of encoding_type::UTF8 when the input consists of the BOM alone, with no additional content.

Affected Code

File: src/simdutf.cpp (around line 943)

Current incorrect code:

} else if (length >= 4 && byte[0] == 0xef and byte[1] == 0xbb and
           byte[2] == 0xbf) {
  return encoding_type::UTF8;
}

Should be:

} else if (length >= 3 && byte[0] == 0xef and byte[1] == 0xbb and
           byte[2] == 0xbf) {
  return encoding_type::UTF8;
}
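
To isolate the effect of the length guard, here is a minimal standalone sketch. It does not call simdutf: `has_utf8_bom`, its `min_length` parameter, and `kBomOnly` are hypothetical names introduced only to compare the two guards side by side.

```cpp
#include <cstddef>

// Hypothetical helper (not part of simdutf): reports whether `bytes` starts
// with the UTF-8 BOM, using `min_length` as the length guard under test.
// `min_length` must be at least 3 so that bytes[0..2] are only read when valid.
constexpr bool has_utf8_bom(const unsigned char *bytes, std::size_t length,
                            std::size_t min_length) {
  return length >= min_length && bytes[0] == 0xef && bytes[1] == 0xbb &&
         bytes[2] == 0xbf;
}

// The UTF-8 BOM on its own: exactly three bytes.
constexpr unsigned char kBomOnly[] = {0xef, 0xbb, 0xbf};

// Guard of 4 (as in the current condition): the BOM-only input is rejected.
static_assert(!has_utf8_bom(kBomOnly, sizeof(kBomOnly), 4),
              "a guard of 4 misses a BOM-only input");
// Guard of 3 (as proposed): the same input is accepted.
static_assert(has_utf8_bom(kBomOnly, sizeof(kBomOnly), 3),
              "a guard of 3 detects a BOM-only input");
```

Both checks run at compile time, so the snippet fails to build if either guard behaves differently from what the issue describes.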

Steps to Reproduce

#include "simdutf.h"
#include <iostream>

int main() {
    // Test case 1: UTF-8 BOM only (3 bytes) - FAILS
    const char utf8_bom_only[] = "\xEF\xBB\xBF";
    auto result1 = simdutf::BOM::check_bom(utf8_bom_only, 3);
    std::cout << "BOM only (3 bytes): " << static_cast<int>(result1) << '\n';
    // Expected: 1 (UTF8), Actual: 0 (unspecified)

    // Test case 2: UTF-8 BOM + content (4+ bytes) - WORKS
    const char utf8_bom_content[] = "\xEF\xBB\xBFH";
    auto result2 = simdutf::BOM::check_bom(utf8_bom_content, 4);
    std::cout << "BOM + content (4+ bytes): " << static_cast<int>(result2) << '\n';
    // Expected: 1 (UTF8), Actual: 1 (UTF8)

    return 0;
}

Expected vs Actual Behavior

| Input           | Length | Expected Result | Actual Result   |
|-----------------|--------|-----------------|-----------------|
| "\xEF\xBB\xBF"  | 3      | UTF8 (1)        | unspecified (0) |
| "\xEF\xBB\xBFH" | 4      | UTF8 (1)        | UTF8 (1) ✓      |

Impact

  • Files containing only UTF-8 BOM are not correctly identified
  • Inconsistent behavior between "BOM-only" and "BOM+content" cases
  • Conflicts with the Unicode Standard's definition of the UTF-8 BOM as the 3-byte sequence EF BB BF, which is complete without any following byte
  • Affects edge cases in encoding detection workflows

Environment

  • simdutf version: 7.3.2 (based on current source)
  • Compiler: Any (logic error independent of compiler)
  • Platform: All platforms

Proposed Fix

Single character change: Change 4 to 3 in the condition:

- } else if (length >= 4 && byte[0] == 0xef and byte[1] == 0xbb and
+ } else if (length >= 3 && byte[0] == 0xef and byte[1] == 0xbb and
             byte[2] == 0xbf) {
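
A regression check for this change could look like the sketch below. It is written against a generic callable so that it stands alone; `utf8_bom_detected_at_exact_length` is a hypothetical name, and `detect` stands in for any function of the shape `(const char *, std::size_t) -> bool` that reports whether a UTF-8 BOM was found.

```cpp
#include <cstddef>

// Hypothetical regression check (not part of simdutf's test suite).
// `detect(data, length)` must return true when it recognizes a UTF-8 BOM.
template <typename Detector>
bool utf8_bom_detected_at_exact_length(Detector detect) {
  // BOM followed by one content byte, so every prefix length below is valid.
  const char input[] = {'\xEF', '\xBB', '\xBF', 'H'};

  // Prefixes shorter than the BOM must never match.
  for (std::size_t length = 0; length < 3; ++length) {
    if (detect(input, length)) return false;
  }
  // The 3-byte prefix is the BOM alone: this is the case the issue reports.
  if (!detect(input, 3)) return false;
  // BOM plus content must keep working.
  return detect(input, 4);
}
```

Assuming the `const char *` overload used in the reproduction above, the callable could be a thin lambda such as `[](const char *p, std::size_t n) { return simdutf::BOM::check_bom(p, n) == simdutf::encoding_type::UTF8; }`. With the current guard the check would be expected to return false, and true once the guard is 3.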

Additional Context

This bug was discovered while implementing comprehensive encoding detection in a C++ project. The UTF-8 BOM is standardized as exactly 3 bytes (EF BB BF), so requiring 4 bytes is incorrect.

All other BOM detections in the same function correctly use the minimum required length:

  • UTF-16 LE/BE: length >= 2
  • UTF-32 LE/BE: length >= 4
  • UTF-8: length >= 4 ❌ (should be 3)

Analysis Summary

✅ Working "BOM-only" cases:

  1. UTF-16 LE BOM (0xFF 0xFE, 2 bytes): length >= 2
  2. UTF-16 BE BOM (0xFE 0xFF, 2 bytes): length >= 2
  3. UTF-32 LE BOM (0xFF 0xFE 0x00 0x00, 4 bytes): length >= 4
  4. UTF-32 BE BOM (0x00 0x00 0xFE 0xFF, 4 bytes): length >= 4

❌ Broken "BOM-only" case:

  1. UTF-8 BOM (0xEF 0xBB 0xBF, 3 bytes): length >= 4 ❌ (should be >= 3)

This is the only inconsistency in the BOM detection logic.
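
The consistency rule described above (each guard equals the size of its BOM) can be made structural by deriving the guard from the signature itself. The sketch below is one way to do that under stated assumptions: `Encoding`, `Signature`, and `detect_bom` are hypothetical names, not simdutf's types or its implementation. Signatures are tried longest first because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical result type for this sketch (not simdutf::encoding_type).
enum class Encoding { Unspecified, UTF8, UTF16_LE, UTF16_BE, UTF32_LE, UTF32_BE };

struct Signature {
  Encoding encoding;
  const unsigned char *bytes;
  std::size_t size;  // doubles as the minimum input length for this BOM
};

inline Encoding detect_bom(const unsigned char *data, std::size_t length) {
  static const unsigned char utf32_le[] = {0xff, 0xfe, 0x00, 0x00};
  static const unsigned char utf32_be[] = {0x00, 0x00, 0xfe, 0xff};
  static const unsigned char utf8[] = {0xef, 0xbb, 0xbf};
  static const unsigned char utf16_le[] = {0xff, 0xfe};
  static const unsigned char utf16_be[] = {0xfe, 0xff};
  // Longest first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE).
  static const Signature table[] = {
      {Encoding::UTF32_LE, utf32_le, sizeof(utf32_le)},
      {Encoding::UTF32_BE, utf32_be, sizeof(utf32_be)},
      {Encoding::UTF8, utf8, sizeof(utf8)},
      {Encoding::UTF16_LE, utf16_le, sizeof(utf16_le)},
      {Encoding::UTF16_BE, utf16_be, sizeof(utf16_be)},
  };
  for (const Signature &s : table) {
    // The length guard is the signature size, so it cannot drift from it.
    if (length >= s.size && std::memcmp(data, s.bytes, s.size) == 0) {
      return s.encoding;
    }
  }
  return Encoding::Unspecified;
}
```

Because the guard is `sizeof` of the signature, a 3-byte BOM cannot end up with a 4-byte requirement. Note that FF FE 00 00 is inherently ambiguous (it is also a UTF-16 LE BOM followed by U+0000); this sketch resolves it as UTF-32 LE.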

Issue prepared by: [@nszhsl]
Date: 2025-07-07
Severity: Medium (edge case but clear bug)
Fix complexity: Trivial (single character change)
