HTML file is not reported as UTF-8 after conversion

**Provide the file**
[110-original.zip](https://github.com/Ousret/charset_normalizer/files/13298606/110-original.zip)

**Verbose output**
Using the CLI, run `normalizer -v ./my-file.txt` and paste the result in here.

```
❯ # rm+unzip

❯ normalizer -mvvv 110-original.htm
2023-11-08 16:42:49,817 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:42:49,821 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:42:49,821 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:42:49,830 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:42:49,830 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250


❯ normalizer -rfnvvv 110-original.htm
2023-11-08 16:39:42,180 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:39:42,183 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:39:42,184 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:39:42,192 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:39:42,192 | DEBUG | Encoding detection: cp1250 is most likely the one.
{
    "path": "/home/adax/code/other/encoding/110-original.htm",
    "encoding": "cp1250",
    "encoding_aliases": [
        "1250",
        "windows_1250"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "General Punctuation",
        "Latin Extended-A",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.783,
    "coherence": 66.66,
    "unicode_path": "/home/adax/code/other/encoding/110-original.htm",
    "is_preferred": true
}

❯ normalizer -mvvv 110-original.htm
2023-11-08 16:41:07,958 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:41:07,961 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 1.267000 %
2023-11-08 16:41:07,962 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:41:07,970 | Level 5 | We detected language [('English', 0.7029), ('Indonesian', 0.572), ('Dutch', 0.51), ('Italian', 0.4949), ('Czech', 0.4862), ('Spanish', 0.4806), ('Croatian', 0.4724), ('Norwegian', 0.4692), ('Slovene', 0.4669), ('Romanian', 0.4632), ('Hungarian', 0.4624), ('Slovak', 0.4605), ('Finnish', 0.4565), ('German', 0.4533), ('Swedish', 0.4453), ('French', 0.443), ('Danish', 0.4366), ('Portuguese', 0.4116), ('Polish', 0.4113), ('Lithuanian', 0.3931), ('Estonian', 0.3828), ('Turkish', 0.3828), ('Vietnamese', 0.3795)] using cp1250
2023-11-08 16:41:07,970 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250
```

**However, `enca` detects UTF-8 after the conversion, as it should**

```
❯ # rm+unzip

❯ enca -L hr 110-original.htm
Unrecognized encoding

❯ normalizer -rfnvvv 110-original.htm

❯ enca -L hr 110-original.htm
Universal transformation format 8 bits; UTF-8
  CRLF line terminators
```

**Expected encoding**
I expected normalizer to report UTF-8 once the file had been converted to UTF-8.
Am I wrong here?

**Desktop (please complete the following information):**
 - OS: Linux
 - Python version: 3.11.5
 - Package version: charset-normalizer==3.3.2

**Additional context**
I know that HTML is not the same as plain text, but I want to document this here anyway.

I think the "declarative mark" (presumably the charset declared inside the HTML) should not take over like that. But I am new to the encoding world...
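
To illustrate what I suspect is happening, here is a minimal stdlib-only sketch. It is not charset-normalizer's actual code: the sample HTML is made up, and `declared_charset` is a hypothetical helper that only mimics reading a charset declaration from the markup. It assumes the converted file still carries the old `windows-1250` declaration while its bytes are already UTF-8.

```python
import re

# Made-up sample page: Croatian letters plus a meta tag declaring windows-1250.
html = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">'
    '</head><body>čćšžđ</body></html>'
)


def declared_charset(raw: bytes):
    """Hypothetical helper: return the charset named in the markup, if any."""
    match = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', raw[:4096])
    return match.group(1).decode("ascii").lower() if match else None


# Simulate the file after conversion: bytes are UTF-8, the meta tag is unchanged.
converted = html.encode("utf-8")

as_utf8 = converted.decode("utf-8")     # strict decode succeeds: valid UTF-8
as_cp1250 = converted.decode("cp1250")  # also succeeds, but the letters are garbled

print(declared_charset(converted))  # the stale declaration is still in the bytes
print(as_utf8 == html)              # UTF-8 round-trips to the original text
print(as_cp1250 == html)            # cp1250 does not, yet raises no error
```

Because the same bytes decode without error under both codecs, a detector that gives the declared charset priority can plausibly stop at cp1250 before UTF-8 is ever considered. That would match the "Priority +1 given for cp1250" line in the logs above, but I have not confirmed it against the library's source.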