Closed
Labels
CLI: Anything related to the CLI script (normalizer)
bug: Something isn't working
Description
Provide the file
110-original.zip
Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result here.
❯ # rm+unzip
❯ normalizer -mvvv 110-original.htm
2023-11-08 16:42:49,817 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:42:49,821 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:42:49,821 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:42:49,830 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:42:49,830 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250
❯ normalizer -rfnvvv 110-original.htm
2023-11-08 16:39:42,180 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:39:42,183 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:39:42,184 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:39:42,192 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:39:42,192 | DEBUG | Encoding detection: cp1250 is most likely the one.
{
"path": "/home/adax/code/other/encoding/110-original.htm",
"encoding": "cp1250",
"encoding_aliases": [
"1250",
"windows_1250"
],
"alternative_encodings": [],
"language": "English",
"alphabets": [
"Basic Latin",
"Control character",
"General Punctuation",
"Latin Extended-A",
"Latin-1 Supplement"
],
"has_sig_or_bom": false,
"chaos": 0.783,
"coherence": 66.66,
"unicode_path": "/home/adax/code/other/encoding/110-original.htm",
"is_preferred": true
}
❯ normalizer -mvvv 110-original.htm
2023-11-08 16:41:07,958 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:41:07,961 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 1.267000 %
2023-11-08 16:41:07,962 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:41:07,970 | Level 5 | We detected language [('English', 0.7029), ('Indonesian', 0.572), ('Dutch', 0.51), ('Italian', 0.4949), ('Czech', 0.4862), ('Spanish', 0.4806), ('Croatian', 0.4724), ('Norwegian', 0.4692), ('Slovene', 0.4669), ('Romanian', 0.4632), ('Hungarian', 0.4624), ('Slovak', 0.4605), ('Finnish', 0.4565), ('German', 0.4533), ('Swedish', 0.4453), ('French', 0.443), ('Danish', 0.4366), ('Portuguese', 0.4116), ('Polish', 0.4113), ('Lithuanian', 0.3931), ('Estonian', 0.3828), ('Turkish', 0.3828), ('Vietnamese', 0.3795)] using cp1250
2023-11-08 16:41:07,970 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250
enca, however, detects UTF-8 after the conversion, as it should:
❯ # rm+unzip
❯ enca -L hr 110-original.htm
Unrecognized encoding
❯ normalizer -rfnvvv 110-original.htm
❯ enca -L hr 110-original.htm
Universal transformation format 8 bits; UTF-8
CRLF line terminators
Expected encoding
I expected normalizer to report UTF-8 once the file had been converted to UTF-8, but it still reports cp1250.
Am I wrong here?
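As a detector-independent cross-check, a strict decode shows whether a byte sequence is valid UTF-8 at all. The following is a minimal sketch using only the Python standard library; the sample text is a made-up Croatian phrase, not the content of 110-original.htm, and `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample text (assumption: not taken from the attached file).
sample = "Čuvaj šumu, žir i đak"
utf8_bytes = sample.encode("utf-8")
cp1250_bytes = sample.encode("cp1250")

print(is_valid_utf8(utf8_bytes))    # the UTF-8 encoding passes the strict decode
print(is_valid_utf8(cp1250_bytes))  # the cp1250 encoding does not
```

The same function can be applied to the real file with `is_valid_utf8(open("110-original.htm", "rb").read())`, before and after running normalizer.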
Desktop (please complete the following information):
- OS: Linux
- Python version: 3.11.5
- Package version: charset-normalizer==3.3.2
Additional context
I know that HTML is not the same as plain text, but I am documenting this here anyway.
I think the "declarative mark" should not take precedence like that, but I am new to the world of encodings.
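To illustrate why a declaration alone is a weak signal, here is a small sketch, again standard library only. It assumes the declarative mark is an HTML meta charset declaration that stays in the file unchanged after conversion; that is a guess based on the log line "Detected declarative mark in sequence", not something verified against the attachment. The sample document, the regular expression, and the helper names are all hypothetical, and this is not how charset-normalizer works internally.

```python
import re

# Assumption: the declarative mark is a <meta ... charset=...> declaration.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def declared_charset(data: bytes, probe: int = 4096):
    """Return the charset named in a <meta> tag near the start, or None."""
    match = META_CHARSET.search(data[:probe])
    return match.group(1).decode("ascii").lower() if match else None


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes decode without error in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical document: UTF-8 bytes that still declare windows-1250.
html = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=windows-1250"></head>'
    "<body>Čuvaj šumu, žir i đak</body></html>"
).encode("utf-8")

print(declared_charset(html))       # charset named in the stale declaration
print(decodes_as(html, "utf-8"))    # the bytes are valid UTF-8
print(decodes_as(html, "cp1250"))   # they also decode as cp1250, as mojibake
```

In this sample both decodes succeed, so decodability alone cannot separate the two encodings, and a declaration left over from before the conversion could plausibly tip a detector toward cp1250.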