BOM detection fails for UTF-8 BOM when input length is exactly 3 bytes

# simdutf BOM Detection Bug Report

## Title
`BOM detection fails for UTF-8 BOM when input length is exactly 3 bytes`

## Labels
`bug`, `BOM`, `UTF-8`, `encoding`

---

## Bug Description

The `BOM::check_bom` function requires `length >= 4` before it will detect a UTF-8 BOM, but the UTF-8 BOM is exactly 3 bytes (`0xEF 0xBB 0xBF`). As a result, the function returns `encoding_type::unspecified` instead of `encoding_type::UTF8` when the input consists of the BOM alone, with no content after it.

## Affected Code

**File:** `src/simdutf.cpp` (around line 943)

**Current incorrect code:**
```cpp
} else if (length >= 4 && byte[0] == 0xef and byte[1] == 0xbb and
           byte[2] == 0xbf) {
  return encoding_type::UTF8;
}
```

**Should be:**
```cpp
} else if (length >= 3 && byte[0] == 0xef and byte[1] == 0xbb and
           byte[2] == 0xbf) {
  return encoding_type::UTF8;
}
```

## Steps to Reproduce

```cpp
#include "simdutf.h"
#include <iostream>

int main() {
    // Test case 1: UTF-8 BOM only (3 bytes) - FAILS
    const char utf8_bom_only[] = "\xEF\xBB\xBF";
    auto result1 = simdutf::BOM::check_bom(utf8_bom_only, 3);
    std::cout << "BOM only (3 bytes): " << static_cast<int>(result1) << '\n';
    // Expected: 1 (UTF8), Actual: 0 (unspecified)

    // Test case 2: UTF-8 BOM followed by one content byte (4 bytes) - WORKS
    const char utf8_bom_content[] = "\xEF\xBB\xBFH";
    auto result2 = simdutf::BOM::check_bom(utf8_bom_content, 4);
    std::cout << "BOM + content (4 bytes): " << static_cast<int>(result2) << '\n';
    // Expected: 1 (UTF8), Actual: 1 (UTF8)

    return 0;
}
```

## Expected vs Actual Behavior

| Input | Length | Expected Result | Actual Result |
|-------|--------|----------------|---------------|
| `"\xEF\xBB\xBF"` | 3 | `UTF8` (1) | `unspecified` (0) |
| `"\xEF\xBB\xBFH"` | 4 | `UTF8` (1) | `UTF8` (1) ✓ |

## Impact

- Files containing only UTF-8 BOM are not correctly identified
- Inconsistent behavior between "BOM-only" and "BOM+content" cases
- Disagrees with the 3-byte UTF-8 BOM signature (`EF BB BF`) documented by Unicode
- Affects edge cases in encoding detection workflows

## Environment

- **simdutf version:** 7.3.2 (based on current source)
- **Compiler:** Any (logic error independent of compiler)
- **Platform:** All platforms

## Proposed Fix

**Single character change:** Change `4` to `3` in the condition:

```diff
- } else if (length >= 4 && byte[0] == 0xef and byte[1] == 0xbb and
+ } else if (length >= 3 && byte[0] == 0xef and byte[1] == 0xbb and
             byte[2] == 0xbf) {
```

## Additional Context

This bug was discovered while implementing comprehensive encoding detection in a C++ project. The UTF-8 BOM is standardized as exactly 3 bytes (`EF BB BF`), so requiring 4 bytes is incorrect.

All other BOM detections in the same function correctly use the minimum required length:
- UTF-16 LE/BE: `length >= 2` ✓
- UTF-32 LE/BE: `length >= 4` ✓  
- UTF-8: `length >= 4` ❌ (should be 3)

## Analysis Summary

### ✅ Working "BOM-only" cases:
1. **UTF-16 LE BOM** (`0xFF 0xFE`, 2 bytes): `length >= 2` ✓
2. **UTF-16 BE BOM** (`0xFE 0xFF`, 2 bytes): `length >= 2` ✓
3. **UTF-32 LE BOM** (`0xFF 0xFE 0x00 0x00`, 4 bytes): `length >= 4` ✓
4. **UTF-32 BE BOM** (`0x00 0x00 0xFE 0xFF`, 4 bytes): `length >= 4` ✓

### ❌ Broken "BOM-only" case:
5. **UTF-8 BOM** (`0xEF 0xBB 0xBF`, 3 bytes): `length >= 4` ❌ (should be `>= 3`)

This is the **only inconsistency** in the BOM detection logic.
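
For reference, the five cases above can be sketched as a single stand-alone function in which every branch is guarded by the byte count of its own BOM. This is an illustrative sketch, not the simdutf source: the `bom_kind` enum and the `detect_bom` name are assumptions made up for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the library's encoding enum.
enum class bom_kind { none, utf8, utf16_le, utf16_be, utf32_le, utf32_be };

// Sketch: each branch requires exactly as many bytes as the BOM it matches.
inline bom_kind detect_bom(const std::uint8_t *byte, std::size_t length) {
  if (length >= 2 && byte[0] == 0xff && byte[1] == 0xfe) {
    // FF FE is also the prefix of the UTF-32 LE BOM (FF FE 00 00),
    // so check two more bytes before settling on UTF-16 LE.
    if (length >= 4 && byte[2] == 0x00 && byte[3] == 0x00) {
      return bom_kind::utf32_le;
    }
    return bom_kind::utf16_le;
  }
  if (length >= 2 && byte[0] == 0xfe && byte[1] == 0xff) {
    return bom_kind::utf16_be;
  }
  if (length >= 4 && byte[0] == 0x00 && byte[1] == 0x00 &&
      byte[2] == 0xfe && byte[3] == 0xff) {
    return bom_kind::utf32_be;
  }
  // UTF-8 BOM is 3 bytes, so 3 is the minimum length to test for it.
  if (length >= 3 && byte[0] == 0xef && byte[1] == 0xbb && byte[2] == 0xbf) {
    return bom_kind::utf8;
  }
  return bom_kind::none;
}
```

With this layout, a buffer holding only `EF BB BF` yields `bom_kind::utf8`, and a 2-byte prefix of it yields `bom_kind::none`. Note that `FF FE 00 00` is inherently ambiguous (it could also be a UTF-16 LE BOM followed by a NUL character); this sketch resolves it as UTF-32 LE.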

## References

- [Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM](https://www.unicode.org/faq/utf_bom.html#bom4)
- [RFC 3629 - UTF-8](https://tools.ietf.org/html/rfc3629)

---

- **Issue prepared by:** [@nszhsl](https://github.com/nszhsl)
- **Date:** 2025-07-07
- **Severity:** Medium (edge case but clear bug)
- **Fix complexity:** Trivial (single character change)

BOM detection fails for UTF-8 BOM when input length is exactly 3 bytes #818
