[DETECTION] mistake GB2312 encoded content as Big5 #587

@Marsman1996

Notice
I hereby announce that my raw input is not:

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file
dct.h.zip

Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result here.

$ normalizer -v ./dct.h 
2025-01-10 20:20:37,568 | Level 5 | override steps (5) and chunk_size (512) as content does not fit (402 byte(s) given) parameters.
2025-01-10 20:20:37,568 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xba in position 101: ordinal not in range(128)
2025-01-10 20:20:37,568 | Level 5 | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xba in position 101: invalid start byte
2025-01-10 20:20:37,568 | Level 5 | Code page big5 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,569 | Level 5 | big5 passed initial chaos probing. Mean measured chaos is 0.000000 %
2025-01-10 20:20:37,569 | Level 5 | big5 should target any language(s) of ['Chinese']
2025-01-10 20:20:37,570 | Level 5 | Code page big5hkscs is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,570 | Level 5 | big5hkscs passed initial chaos probing. Mean measured chaos is 0.000000 %
2025-01-10 20:20:37,570 | Level 5 | big5hkscs should target any language(s) of ['Chinese']
2025-01-10 20:20:37,570 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 432.700000 %.
2025-01-10 20:20:37,571 | Level 5 | cp1006 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 64.300000 %.
2025-01-10 20:20:37,571 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,572 | Level 5 | cp1125 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 60.000000 %.
2025-01-10 20:20:37,572 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,573 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 7.500000 %
2025-01-10 20:20:37,574 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2025-01-10 20:20:37,575 | Level 5 | We detected language [('Estonian', 0.3333), ('French', 0.303), ('Swedish', 0.303), ('Dutch', 0.303), ('Finnish', 0.303), ('Italian', 0.303), ('Romanian', 0.2727), ('Portuguese', 0.2727), ('Norwegian', 0.2727), ('Vietnamese', 0.2727), ('Danish', 0.2424), ('Slovak', 0.2424), ('Croatian', 0.2424), ('Czech', 0.2424), ('German', 0.2121), ('Spanish', 0.2121), ('Hungarian', 0.1818), ('Turkish', 0.1515), ('Polish', 0.1212)] using cp1250
2025-01-10 20:20:37,576 | Level 5 | cp1251 passed initial chaos probing. Mean measured chaos is 4.100000 %
2025-01-10 20:20:37,580 | Level 5 | cp1251 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2025-01-10 20:20:37,581 | Level 5 | cp1252 passed initial chaos probing. Mean measured chaos is 7.500000 %
2025-01-10 20:20:37,581 | Level 5 | cp1252 should target any language(s) of ['Latin Based']
2025-01-10 20:20:37,582 | Level 5 | We detected language [('Estonian', 0.3438), ('French', 0.3125), ('Portuguese', 0.3125), ('Swedish', 0.3125), ('Dutch', 0.3125), ('Finnish', 0.3125), ('Romanian', 0.2812), ('Italian', 0.2812), ('Norwegian', 0.2812), ('Slovak', 0.2812), ('Vietnamese', 0.2812), ('German', 0.25), ('Danish', 0.25), ('Croatian', 0.25), ('Czech', 0.25), ('Spanish', 0.2188), ('Hungarian', 0.1875), ('Polish', 0.125), ('Turkish', 0.125)] using cp1252
2025-01-10 20:20:37,583 | Level 5 | cp1253 passed initial chaos probing. Mean measured chaos is 4.100000 %
2025-01-10 20:20:37,584 | Level 5 | cp1253 should target any language(s) of ['Greek']
2025-01-10 20:20:37,585 | Level 5 | cp1254 passed initial chaos probing. Mean measured chaos is 7.500000 %
2025-01-10 20:20:37,585 | Level 5 | cp1254 should target any language(s) of ['Latin Based']
2025-01-10 20:20:37,586 | Level 5 | We detected language [('Estonian', 0.3438), ('French', 0.3125), ('Portuguese', 0.3125), ('Swedish', 0.3125), ('Dutch', 0.3125), ('Finnish', 0.3125), ('Romanian', 0.2812), ('Italian', 0.2812), ('Norwegian', 0.2812), ('Slovak', 0.2812), ('Vietnamese', 0.2812), ('German', 0.25), ('Danish', 0.25), ('Croatian', 0.25), ('Czech', 0.25), ('Spanish', 0.2188), ('Hungarian', 0.1875), ('Turkish', 0.125), ('Polish', 0.125)] using cp1254
2025-01-10 20:20:37,587 | Level 5 | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xca in position 103: character maps to <undefined>
2025-01-10 20:20:37,587 | Level 5 | cp1256 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 32.000000 %.
2025-01-10 20:20:37,588 | Level 5 | cp1257 passed initial chaos probing. Mean measured chaos is 7.500000 %
2025-01-10 20:20:37,588 | Level 5 | cp1257 should target any language(s) of ['Latin Based']
2025-01-10 20:20:37,589 | Level 5 | We detected language [('Estonian', 0.3333), ('French', 0.303), ('Swedish', 0.303), ('Croatian', 0.303), ('Dutch', 0.303), ('Finnish', 0.303), ('Italian', 0.303), ('Portuguese', 0.2727), ('Norwegian', 0.2727), ('Romanian', 0.2727), ('Vietnamese', 0.2727), ('Danish', 0.2424), ('Czech', 0.2424), ('Slovak', 0.2424), ('German', 0.2121), ('Spanish', 0.2121), ('Hungarian', 0.1818), ('Polish', 0.1515), ('Turkish', 0.1212)] using cp1257
2025-01-10 20:20:37,590 | Level 5 | cp1258 passed initial chaos probing. Mean measured chaos is 7.700000 %
2025-01-10 20:20:37,590 | Level 5 | cp1258 should target any language(s) of ['Latin Based']
2025-01-10 20:20:37,592 | Level 5 | We detected language [('French', 0.3548), ('Romanian', 0.3548), ('Estonian', 0.3548), ('Italian', 0.3548), ('Swedish', 0.3226), ('Dutch', 0.3226), ('Portuguese', 0.3226), ('Finnish', 0.3226), ('Vietnamese', 0.3226), ('Slovak', 0.3226), ('Norwegian', 0.2903), ('Danish', 0.2903), ('German', 0.2581), ('Spanish', 0.2581), ('Croatian', 0.2581), ('Czech', 0.2581), ('Hungarian', 0.2258), ('Polish', 0.1613), ('Turkish', 0.129)] using cp1258
2025-01-10 20:20:37,592 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,592 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x70 in position 64: character maps to <undefined>
2025-01-10 20:20:37,593 | Level 5 | cp437 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 50.400000 %.
2025-01-10 20:20:37,593 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,593 | Level 5 | cp720 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 48.700000 %.
2025-01-10 20:20:37,594 | Level 5 | cp737 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 60.400000 %.
2025-01-10 20:20:37,595 | Level 5 | cp775 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 55.200000 %.
2025-01-10 20:20:37,595 | Level 5 | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,596 | Level 5 | cp852 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 62.600000 %.
2025-01-10 20:20:37,596 | Level 5 | cp855 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 25.600000 %.
2025-01-10 20:20:37,597 | Level 5 | Code page cp856 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xb6 in position 117: character maps to <undefined>
2025-01-10 20:20:37,597 | Level 5 | cp857 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 57.900000 %.
2025-01-10 20:20:37,597 | Level 5 | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,598 | Level 5 | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,598 | Level 5 | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,598 | Level 5 | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,599 | Level 5 | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,599 | Level 5 | cp864 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 57.400000 %.
2025-01-10 20:20:37,600 | Level 5 | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,600 | Level 5 | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,600 | Level 5 | cp869 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 62.600000 %.
2025-01-10 20:20:37,601 | Level 5 | Code page cp874 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfd in position 104: character maps to <undefined>
2025-01-10 20:20:37,601 | Level 5 | cp875 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 154.100000 %.
2025-01-10 20:20:37,601 | Level 5 | Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0xf7 in position 108: illegal multibyte sequence
2025-01-10 20:20:37,602 | Level 5 | Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0xc9 in position 105: illegal multibyte sequence
2025-01-10 20:20:37,602 | Level 5 | Code page cp950 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,602 | Level 5 | cp950 passed initial chaos probing. Mean measured chaos is 0.000000 %
2025-01-10 20:20:37,602 | Level 5 | cp950 should target any language(s) of ['Chinese']
2025-01-10 20:20:37,602 | Level 5 | Code page euc_jis_2004 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,603 | Level 5 | euc_jis_2004 passed initial chaos probing. Mean measured chaos is 0.000000 %
2025-01-10 20:20:37,603 | Level 5 | euc_jis_2004 should target any language(s) of ['Japanese']
2025-01-10 20:20:37,603 | Level 5 | Code page euc_jisx0213 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,604 | Level 5 | euc_jisx0213 passed initial chaos probing. Mean measured chaos is 0.000000 %
2025-01-10 20:20:37,604 | Level 5 | euc_jisx0213 should target any language(s) of ['Japanese']
2025-01-10 20:20:37,604 | Level 5 | Code page euc_jp is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,604 | Level 5 | euc_jp passed initial chaos probing. Mean measured chaos is 0.000000 %
2025-01-10 20:20:37,604 | Level 5 | euc_jp should target any language(s) of ['Japanese']
2025-01-10 20:20:37,604 | Level 5 | Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0xc9 in position 105: illegal multibyte sequence
2025-01-10 20:20:37,604 | Level 5 | Code page gb18030 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,605 | Level 5 | gb18030 passed initial chaos probing. Mean measured chaos is 0.000000 %
2025-01-10 20:20:37,605 | Level 5 | gb18030 should target any language(s) of ['Chinese']
2025-01-10 20:20:37,606 | Level 5 | Code page gb2312 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,606 | Level 5 | gb2312 passed initial chaos probing. Mean measured chaos is 0.000000 %
2025-01-10 20:20:37,606 | Level 5 | gb2312 should target any language(s) of ['Chinese']
2025-01-10 20:20:37,606 | Level 5 | Code page gbk is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,606 | Level 5 | gbk passed initial chaos probing. Mean measured chaos is 0.000000 %
2025-01-10 20:20:37,606 | Level 5 | gbk should target any language(s) of ['Chinese']
2025-01-10 20:20:37,607 | Level 5 | hp_roman8 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 23.500000 %.
2025-01-10 20:20:37,607 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xba in position 101: illegal multibyte sequence
2025-01-10 20:20:37,607 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xba in position 101: illegal multibyte sequence
2025-01-10 20:20:37,607 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xba in position 101: illegal multibyte sequence
2025-01-10 20:20:37,608 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xba in position 101: illegal multibyte sequence
2025-01-10 20:20:37,608 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xba in position 101: illegal multibyte sequence
2025-01-10 20:20:37,608 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xba in position 101: illegal multibyte sequence
2025-01-10 20:20:37,608 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xba in position 101: illegal multibyte sequence
2025-01-10 20:20:37,609 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xba in position 101: illegal multibyte sequence
2025-01-10 20:20:37,610 | Level 5 | iso8859_10 passed initial chaos probing. Mean measured chaos is 6.100000 %
2025-01-10 20:20:37,610 | Level 5 | iso8859_10 should target any language(s) of ['Latin Based']
2025-01-10 20:20:37,611 | Level 5 | We detected language [('Croatian', 0.2895), ('Finnish', 0.2895), ('Portuguese', 0.2632), ('Swedish', 0.2632), ('Dutch', 0.2632), ('French', 0.2368), ('Estonian', 0.2368), ('Italian', 0.2368), ('German', 0.2105), ('Romanian', 0.2105), ('Norwegian', 0.2105), ('Danish', 0.2105), ('Slovak', 0.2105), ('Vietnamese', 0.2105), ('Spanish', 0.1842), ('Polish', 0.1842), ('Czech', 0.1842), ('Hungarian', 0.1579), ('Turkish', 0.1579)] using iso8859_10
2025-01-10 20:20:37,612 | Level 5 | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfd in position 104: character maps to <undefined>
2025-01-10 20:20:37,612 | Level 5 | iso8859_13 passed initial chaos probing. Mean measured chaos is 7.500000 %
2025-01-10 20:20:37,612 | Level 5 | iso8859_13 should target any language(s) of ['Latin Based']
2025-01-10 20:20:37,612 | Level 5 | We detected language [('Estonian', 0.3333), ('French', 0.303), ('Swedish', 0.303), ('Croatian', 0.303), ('Dutch', 0.303), ('Finnish', 0.303), ('Italian', 0.303), ('Portuguese', 0.2727), ('Norwegian', 0.2727), ('Romanian', 0.2727), ('Vietnamese', 0.2727), ('Danish', 0.2424), ('Czech', 0.2424), ('Slovak', 0.2424), ('German', 0.2121), ('Spanish', 0.2121), ('Hungarian', 0.1818), ('Polish', 0.1515), ('Turkish', 0.1212)] using iso8859_13
2025-01-10 20:20:37,613 | Level 5 | iso8859_14 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 27.100000 %.
2025-01-10 20:20:37,613 | Level 5 | iso8859_15 is deemed too similar to code page iso8859_14 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,613 | Level 5 | iso8859_16 is deemed too similar to code page iso8859_14 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,614 | Level 5 | iso8859_2 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 22.300000 %.
2025-01-10 20:20:37,614 | Level 5 | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc3 in position 107: character maps to <undefined>
2025-01-10 20:20:37,614 | Level 5 | iso8859_4 is deemed too similar to code page iso8859_2 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,615 | Level 5 | iso8859_5 passed initial chaos probing. Mean measured chaos is 1.500000 %
2025-01-10 20:20:37,615 | Level 5 | iso8859_5 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2025-01-10 20:20:37,616 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xba in position 101: character maps to <undefined>
2025-01-10 20:20:37,617 | Level 5 | iso8859_7 passed initial chaos probing. Mean measured chaos is 4.400000 %
2025-01-10 20:20:37,617 | Level 5 | iso8859_7 should target any language(s) of ['Greek']
2025-01-10 20:20:37,617 | Level 5 | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xca in position 103: character maps to <undefined>
2025-01-10 20:20:37,618 | Level 5 | iso8859_9 is deemed too similar to code page iso8859_14 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,618 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xc3 in position 107: illegal multibyte sequence
2025-01-10 20:20:37,619 | Level 5 | koi8_r was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 23.400000 %.
2025-01-10 20:20:37,619 | Level 5 | Code page koi8_t does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xba in position 101: character maps to <undefined>
2025-01-10 20:20:37,619 | Level 5 | koi8_u was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 23.000000 %.
2025-01-10 20:20:37,620 | Level 5 | kz1048 passed initial chaos probing. Mean measured chaos is 4.100000 %
2025-01-10 20:20:37,621 | Level 5 | kz1048 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2025-01-10 20:20:37,621 | Level 5 | latin_1 is deemed too similar to code page iso8859_14 and was consider unsuited already. Continuing!
2025-01-10 20:20:37,622 | Level 5 | mac_cyrillic passed initial chaos probing. Mean measured chaos is 3.600000 %
2025-01-10 20:20:37,622 | Level 5 | mac_cyrillic should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2025-01-10 20:20:37,623 | Level 5 | mac_greek passed initial chaos probing. Mean measured chaos is 1.600000 %
2025-01-10 20:20:37,624 | Level 5 | mac_greek should target any language(s) of ['Greek']
2025-01-10 20:20:37,624 | Level 5 | mac_iceland was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 50.600000 %.
2025-01-10 20:20:37,625 | Level 5 | mac_latin2 passed initial chaos probing. Mean measured chaos is 5.400000 %
2025-01-10 20:20:37,626 | Level 5 | mac_latin2 should target any language(s) of ['Latin Based']
2025-01-10 20:20:37,627 | Level 5 | We detected language [('Estonian', 0.3235), ('French', 0.2941), ('Dutch', 0.2941), ('Italian', 0.2941), ('Finnish', 0.2941), ('Swedish', 0.2647), ('Croatian', 0.2647), ('Romanian', 0.2647), ('Danish', 0.2647), ('Slovak', 0.2647), ('Vietnamese', 0.2647), ('Norwegian', 0.2353), ('German', 0.2059), ('Portuguese', 0.2059), ('Czech', 0.2059), ('Spanish', 0.1765), ('Hungarian', 0.1765), ('Turkish', 0.1471), ('Polish', 0.1176)] using mac_latin2
2025-01-10 20:20:37,627 | Level 5 | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2025-01-10 20:20:37,627 | Level 5 | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2025-01-10 20:20:37,628 | Level 5 | ptcp154 passed initial chaos probing. Mean measured chaos is 2.900000 %
2025-01-10 20:20:37,629 | Level 5 | ptcp154 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2025-01-10 20:20:37,629 | Level 5 | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xfd in position 104: illegal multibyte sequence
2025-01-10 20:20:37,629 | Level 5 | Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0xfd in position 104: illegal multibyte sequence
2025-01-10 20:20:37,630 | Level 5 | Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0xfd in position 104: illegal multibyte sequence
2025-01-10 20:20:37,630 | Level 5 | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfd in position 104: character maps to <undefined>
2025-01-10 20:20:37,630 | Level 5 | Encoding utf_16 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2025-01-10 20:20:37,630 | Level 5 | Code page utf_16_be is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,630 | Level 5 | utf_16_be was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 143.800000 %.
2025-01-10 20:20:37,631 | Level 5 | Code page utf_16_le is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2025-01-10 20:20:37,631 | Level 5 | utf_16_le was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 121.200000 %.
2025-01-10 20:20:37,631 | Level 5 | Encoding utf_32 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2025-01-10 20:20:37,631 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2025-01-10 20:20:37,634 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2025-01-10 20:20:37,634 | Level 5 | Encoding utf_7 won't be tested as-is because detection is unreliable without BOM/SIG.
2025-01-10 20:20:37,634 | DEBUG | Encoding detection: Found big5 as plausible (best-candidate) for content. With 17 alternatives.
{
    "path": "/home/yuwei/afgen/afgenllm/database/ffjpeg/latest/code/src/dct.h",
    "encoding": "big5",
    "encoding_aliases": [
        "big5_tw",
        "csbig5",
        "x_mac_trad_chinese"
    ],
    "alternative_encodings": [
        "big5hkscs",
        "cp950"
    ],
    "language": "Chinese",
    "alphabets": [
        "Basic Latin",
        "CJK Unified Ideographs",
        "Control character"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}
....
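
The tie between candidates is easier to see when the verbose log above is summarized. The following is only a sketch: the regular expression is written against the log lines shown in this report and is an assumption about their format, not a stable interface of charset-normalizer, and `passed_chaos` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   "... | Level 5 | big5 passed initial chaos probing. Mean measured chaos is 0.000000 %"
# Assumption: the log format is the one shown in this report.
PASSED = re.compile(
    r"\| (?P<cp>\S+) passed initial chaos probing\. "
    r"Mean measured chaos is (?P<chaos>[\d.]+) %"
)


def passed_chaos(log_text):
    """Map each code page that passed chaos probing to its mean chaos (%)."""
    return {m["cp"]: float(m["chaos"]) for m in PASSED.finditer(log_text)}


# A few lines copied from the log above.
sample = """\
2025-01-10 20:20:37,569 | Level 5 | big5 passed initial chaos probing. Mean measured chaos is 0.000000 %
2025-01-10 20:20:37,573 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 7.500000 %
2025-01-10 20:20:37,606 | Level 5 | gb2312 passed initial chaos probing. Mean measured chaos is 0.000000 %
"""

scores = passed_chaos(sample)
# Code pages that are indistinguishable by chaos alone.
tied = sorted(cp for cp, chaos in scores.items() if chaos == 0.0)
print(tied)
```

In the full log, nine code pages pass with a mean chaos of 0 % (big5, big5hkscs, cp950, euc_jis_2004, euc_jisx0213, euc_jp, gb18030, gb2312, gbk), and the final JSON reports a coherence of 0.0, so neither measure separates big5 from gb2312 for this file.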

Expected encoding
Expected GB2312.
chardet is able to detect it.
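
For context, GB2312 and Big5 both encode Chinese text as two-byte sequences with high-bit lead bytes, so bytes written in one can often be decoded by the other without error, only into different characters. A minimal standard-library sketch of how to check this follows; the sample string and the helper name `try_decodings` are made up for illustration and are not taken from dct.h.

```python
def try_decodings(data, encodings):
    """Return {encoding: decoded text, or None if the bytes do not fit}."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Hypothetical sample ("function declaration"), encoded as GB2312.
sample = "函数声明".encode("gb2312")

# Print what each candidate codec makes of the same bytes.
candidates = ["gb2312", "gbk", "gb18030", "big5"]
for name, text in try_decodings(sample, candidates).items():
    print(f"{name:8} -> {text!r}")
```

Running the same helper on the attached file should show which of the candidates accept the bytes; a successful decode alone does not say which reading is the intended one.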

Desktop (please complete the following information):

  • OS: Linux
  • Python version 3.13
  • Package version 3.4.0

Additional context
chardet detects the expected encoding:

>>> import chardet
>>> file = "./dct.h"
>>> with open(file, "rb") as f:
...     content = f.read()
...     chardet_res = chardet.detect(content)
...     print(chardet_res)
...     
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}


    Labels

    detection (Related to the charset detection mechanism, chaos/mess/coherence), help wanted (Extra attention is needed)
