Should utils.iana_name() return the actual IANA name? #572

@wosc

charset_normalizer.utils.iana_name('utf-8') returns 'utf_8', which does not appear anywhere on https://www.iana.org/assignments/character-sets/character-sets.xhtml. The registry calls it UTF-8, or possibly utf-8, since the table notes that "no distinction is made between use of upper and lower case letters".

(The concrete use case that brought this up was serving arbitrary files over HTTP and generating an appropriate Content-Type: text/plain; charset=UTF-8 header for them. I was quite surprised to get charset=utf_8 instead, which browsers don't understand, so they interpret the content wrongly.)
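A minimal sketch of one possible workaround for the header use case, using only the stdlib. The function name `http_charset` and the `_IANA_PREFERRED` table are hypothetical and not part of charset_normalizer; the table is hand-maintained and covers only a few encodings for illustration.

```python
import codecs

# Hypothetical, hand-maintained table (an assumption, not a library API):
# maps the canonical name reported by Python's codec registry to the
# preferred MIME name listed in the IANA character-sets registry.
_IANA_PREFERRED = {
    "utf-8": "UTF-8",
    "ascii": "US-ASCII",
    "iso8859-1": "ISO-8859-1",
    "cp1252": "windows-1252",
}


def http_charset(name: str) -> str:
    """Return a charset label suitable for a Content-Type header.

    codecs.lookup() accepts spellings such as 'utf_8', 'UTF-8' or 'latin-1'
    and reports one canonical codec name; unknown names raise LookupError.
    Codecs missing from the table fall back to Python's own name, which may
    not be a registered IANA name.
    """
    codec_name = codecs.lookup(name).name
    return _IANA_PREFERRED.get(codec_name, codec_name)


header = "Content-Type: text/plain; charset=" + http_charset("utf_8")
```

The fallback branch is the weak spot of this approach: any codec not listed in the table still leaks a Python-specific spelling into the header.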

I've looked at the current implementation, which is based on encodings.aliases from the stdlib. That module, however, explicitly expects names to be normalized beforehand, because AFAIU it is meant to look up Python modules, whose syntax rules are quite different from those of IANA encoding names. So I'm not sure whether it is actually an appropriate data source for this use case, or am I completely misunderstanding something here? I'd be grateful for any light that someone could shed on this.
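The stdlib behaviour described above can be checked with a short sketch. It uses only `encodings.normalize_encoding` and the `encodings.aliases.aliases` dict; the variable names are illustrative, and it says nothing about how charset_normalizer itself consumes the table.

```python
import encodings
import importlib
from encodings.aliases import aliases

# Normalization turns the hyphen into an underscore, so the result is
# usable as part of a Python module name.
normalized = encodings.normalize_encoding("utf-8")

# The alias table maps normalized aliases to codec *module* names,
# not to names from the IANA registry.
module_name = aliases["utf8"]

# The value can be imported as a submodule of the encodings package.
codec_module = importlib.import_module("encodings." + module_name)
```

Here both `normalized` and `module_name` come out as the underscore spelling, which is a valid module name but not a registered charset label.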

Labels: enhancement (New feature or request)
