Description
Notice
I hereby announce that my raw input is not :
- Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
- Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter
Provide the file
An accessible way of retrieving the file concerned. Host it somewhere with untouched encoding.
https://jouniseppanen.fi/tmp/finnish-utf-8-latin-1-confusion.html
(Note that the web server adds a content type of text/html; charset=utf-8 which is correct, so your browser will likely show the text correctly.)
Verbose output
2024-10-02 08:40:59,849 | Level 5 | Detected declarative mark in sequence. Priority +1 given for latin_1.
2024-10-02 08:40:59,852 | Level 5 | latin_1 passed initial chaos probing. Mean measured chaos is 0.533000 %
2024-10-02 08:40:59,852 | Level 5 | latin_1 should target any language(s) of ['Latin Based']
2024-10-02 08:40:59,857 | Level 5 | We detected language [('English', 0.656), ('Hungarian', 0.5849), ('French', 0.578), ('Spanish', 0.5486), ('Norwegian', 0.5294), ('Dutch', 0.5243), ('Finnish', 0.5221), ('Indonesian', 0.5191), ('Italian', 0.5174), ('Estonian', 0.5152), ('Danish', 0.5047), ('Swedish', 0.4706), ('Slovene', 0.4669), ('Croatian', 0.4662), ('Portuguese', 0.4648), ('Czech', 0.4546), ('Romanian', 0.4492), ('German', 0.4409), ('Slovak', 0.4296), ('Turkish', 0.4224), ('Polish', 0.3995), ('Lithuanian', 0.3933), ('Vietnamese', 0.3714)] using latin_1
2024-10-02 08:40:59,857 | DEBUG | Encoding detection: latin_1 is most likely the one.
{
"path": "/tmp/finnish-utf-8-latin-1-confusion.html",
"encoding": "latin_1",
"encoding_aliases": [
"8859",
"cp819",
"csisolatin1",
"ibm819",
"iso8859",
"iso8859_1",
"iso_8859_1",
"iso_8859_1_1987",
"iso_ir_100",
"l1",
"latin",
"latin1"
],
"alternative_encodings": [],
"language": "English",
"alphabets": [
"Basic Latin",
"Control character",
"Latin-1 Supplement"
],
"has_sig_or_bom": false,
"chaos": 0.533,
"coherence": 65.6,
"unicode_path": null,
"is_preferred": true
}
Expected encoding
This should be UTF-8. One clue is that the output includes the word PÃ¤Ã¤tÃ¶sehdotus, which is a mangled version of Päätösehdotus.
Most nontrivial Finnish text will include several instances of the character ä and possibly ö. Upper-case versions Ä and Ö are possible but less common. When UTF-8 is interpreted as Latin-1 or Windows-1252, these become:
- ä → \xc3\xa4 → Ã¤
- ö → \xc3\xb6 → Ã¶
- Ä → \xc3\x84 → Ã and a control character, or Ã„
- Ö → \xc3\x96 → Ã and a control character, or Ã–
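The mapping above can be reproduced with the standard library alone. This is a minimal sketch; the helper name `misread` is made up for illustration and is not part of any detector's API.

```python
def misread(text: str, codec: str) -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(codec)


for ch in "äöÄÖ":
    utf8_bytes = ch.encode("utf-8")
    as_latin1 = misread(ch, "latin_1")
    as_cp1252 = misread(ch, "cp1252")
    # repr() makes the invisible C1 control characters (Latin-1 case) visible.
    print(ch, utf8_bytes, repr(as_latin1), repr(as_cp1252))
```

Latin-1 assigns a character to every byte value, so the wrong decoding never raises an error. For the upper-case letters the second byte falls in the 0x80 to 0x9F range, which Latin-1 treats as control characters and Windows-1252 as punctuation.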
The characters Ã, ¤, ¶ and „ do not appear in normal Finnish text. Ã could possibly appear in foreign names, but even then it would be very unlikely in the middle of a word. ¤ is an obscure "currency sign" character whose codepoint Latin-9 (ISO-8859-15) reassigned to the euro sign; the euro sign does occur in Finnish text but would still be very unlikely in the combination Ã€. (The pilcrow might appear in some typography text and the lowered quote might appear in old-fashioned literature. The en dash is normal.)
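One way to turn this observation into a check is to test whether text decoded as Latin-1 or Windows-1252 can be re-encoded and then read as valid UTF-8. The sketch below is only a heuristic under that assumption; `looks_like_misread_utf8` is a hypothetical helper, not how the package scores candidates.

```python
def looks_like_misread_utf8(text: str) -> bool:
    """Return True if text re-encodes to bytes that form different, valid UTF-8."""
    for codec in ("latin_1", "cp1252"):
        try:
            repaired = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not representable in this codec, or the bytes are not valid UTF-8.
            continue
        if repaired != text:
            # Pure ASCII round-trips unchanged, so it is never flagged.
            return True
    return False


print(looks_like_misread_utf8("PÃ¤Ã¤tÃ¶sehdotus"))
print(looks_like_misread_utf8("Päätösehdotus"))
```

Genuine Latin-1 Finnish text rarely passes this test, because a lone byte such as 0xE4 (ä) followed by an ASCII letter or another 0xE4 is not a valid UTF-8 sequence.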
Desktop (please complete the following information):
- OS: macOS 14.7
- Python version 3.12.6
- Package version 3.3.2
Additional context
My guess is that this kind of thing happens when someone set up a CMS in the 1990s when Finnish text was commonly encoded in Latin-1 or Windows-1252, and later the data store was changed to use UTF-8 but the meta tags were neglected.