Description
`charset_normalizer.utils.iana_name('utf-8')` returns `'utf_8'`, which does not appear at all on https://www.iana.org/assignments/character-sets/character-sets.xhtml; there it is called `UTF-8`, or possibly `utf-8` (as the table notes, "no distinction is made between use of upper and lower case letters").
(The concrete use case that brought this up was serving arbitrary files over HTTP and generating an appropriate `Content-Type: text/plain; charset=UTF-8` header for them. I was quite surprised to get `charset=utf_8` instead, which browsers don't understand, so they end up interpreting the content wrongly.)
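To make the use case concrete, here is a minimal sketch of the kind of workaround I would otherwise need. It uses only the stdlib `codecs` module; the `IANA_LABELS` table and the `content_type_for_text` helper are hypothetical names of my own, not part of charset_normalizer, and the table is hand-maintained and far from complete.

```python
import codecs

# Hypothetical, hand-maintained table mapping the canonical name reported by
# Python's codec registry to the label listed by IANA. Deliberately tiny.
IANA_LABELS = {
    "utf-8": "UTF-8",
    "iso8859-1": "ISO-8859-1",
    "ascii": "US-ASCII",
    "cp1252": "windows-1252",
}


def content_type_for_text(python_encoding):
    """Build a Content-Type value for a text file (hypothetical helper)."""
    try:
        # CodecInfo.name is the codec's own canonical spelling.
        codec_name = codecs.lookup(python_encoding).name
    except LookupError:
        codec_name = None
    label = IANA_LABELS.get(codec_name)
    if label is None:
        # Unknown label: omit the charset parameter rather than guess one.
        return "text/plain"
    return f"text/plain; charset={label}"


header_value = content_type_for_text("utf_8")
```

Omitting the parameter for unknown encodings is just one possible choice; it seemed safer to me than sending a label that browsers do not recognise.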
I've looked at the current implementation, which is based on `encodings.aliases` from the stdlib. That module, however, explicitly talks about normalizing the names beforehand, because AFAIU it is meant for looking up Python modules, whose syntax rules are quite different from those of IANA encoding names. So I'm not sure that it is actually an appropriate data source for this use case. Or am I completely misunderstanding something here? I'd be grateful for any light that someone could shed on this.
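For reference, this is a small stdlib-only sketch of what I mean (no charset_normalizer involved). It shows that the names coming out of `encodings` are module-style identifiers, while the codec registry itself reports a hyphenated name.

```python
import codecs
import encodings
from encodings.aliases import aliases

# normalize_encoding() replaces punctuation with underscores so that the
# result is usable as a Python module name.
normalized = encodings.normalize_encoding("utf-8")

# The alias table maps already-normalized spellings to module names
# under the `encodings` package.
utf8_module = aliases["utf8"]
latin1_module = aliases["latin1"]

# The codec registry reports its own canonical name for the same codec.
registry_name = codecs.lookup("utf_8").name

print(normalized)     # utf_8
print(utf8_module)    # utf_8
print(latin1_module)  # latin_1
print(registry_name)  # utf-8
```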