Proper detection report to UTF-8 with BOM #8

Closed

ghost commented Oct 20, 2013

Currently chardet returns 'UTF-8' for a UTF-8 file with a BOM.
However, opening a BOM'ed UTF-8 file with the UTF-8 codec leaves the BOM character '\ufeff' at the start of the file's first line:

import io

f = io.open('utf8bom.file', encoding='UTF-8')
l = f.readline()
l
# output: u'\ufeff FILE_CONTENTS...'

Opening the file with the correct codec, 'UTF-8-SIG', strips the BOM and saves us some trouble:

f = io.open('utf8bom.file', encoding='UTF-8-SIG')
l = f.readline()
l
# output: u' FILE_CONTENTS...'

Some references:
http://docs.python.org/2/library/codecs.html#encodings-and-unicode
http://stackoverflow.com/questions/17912307/u-ufeff-in-python-string
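
For illustration, the behaviour proposed here can be sketched as a BOM check on the raw bytes. This is only a minimal sketch, not chardet's actual code: `guess_utf8_variant` is a hypothetical helper that assumes the data is already known to be UTF-8, and the sample bytes are made up. Only `codecs.BOM_UTF8` comes from the standard library.

```python
import codecs


def guess_utf8_variant(data):
    """Hypothetical helper: choose a codec name for bytes already known to be UTF-8."""
    # codecs.BOM_UTF8 is the three-byte UTF-8 encoding of U+FEFF.
    if data.startswith(codecs.BOM_UTF8):
        return 'UTF-8-SIG'
    return 'UTF-8'


with_bom = codecs.BOM_UTF8 + u'FILE_CONTENTS'.encode('utf-8')
without_bom = u'FILE_CONTENTS'.encode('utf-8')

# Decoding with the reported codec name leaves no BOM character in the text.
text_with_bom = with_bom.decode(guess_utf8_variant(with_bom))
text_without_bom = without_bom.decode(guess_utf8_variant(without_bom))
```

With a name reported this way, a caller can pass the result straight to `io.open(..., encoding=...)` without stripping the BOM by hand.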


rayer4u commented Nov 13, 2013

According to the docs, UTF-8 seems to be an alias of utf_8_sig, but codecs.lookup('UTF-8') actually returns the utf_8 codec. I suggest returning the 'utf_8_sig' codec directly.
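
One way to check which codec a name resolves to is `codecs.lookup`, which returns a `CodecInfo` object whose `name` attribute holds the canonical codec name. The sketch below assumes a standard CPython install; the variable names are illustrative only.

```python
import codecs

plain = codecs.lookup('UTF-8')
signed = codecs.lookup('utf_8_sig')

# Collect the canonical names the registry reports for each lookup.
resolved_names = (plain.name, signed.name)

# The two lookups resolve to different codecs.
same_codec = plain.name == signed.name
```

If `same_codec` is false, the registry treats the two names as distinct codecs, which matches the observation above.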


ghost commented Nov 13, 2013

Hm, it looks like opening (reading) a UTF-8 file, with or without a BOM, using the utf-8-sig codec gives an acceptable result.
I took your suggestion and looked into the Python issue tracker. Found this: http://bugs.python.org/issue1328
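
The observation about utf-8-sig handling both kinds of input can be checked on in-memory bytes, without touching a file. This is a minimal sketch using only the standard library; the sample text and variable names are made up for the example.

```python
import codecs

payload = u'FILE_CONTENTS'.encode('utf-8')
with_bom = codecs.BOM_UTF8 + payload

# utf-8-sig strips a leading BOM if present and otherwise behaves like utf-8.
sig_with_bom = with_bom.decode('utf-8-sig')
sig_without_bom = payload.decode('utf-8-sig')

# Plain utf-8 keeps the BOM as a leading u'\ufeff' character.
plain_with_bom = with_bom.decode('utf-8')
```

The reverse direction is not symmetric: encoding with utf-8-sig always writes a BOM, so the codec is a safe default for reading but changes the output when writing.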

dan-blanchard pushed a commit that referenced this pull request Dec 15, 2013

ghost commented Dec 16, 2013

Hi @dan-blanchard! I think those commits resolve dcramer/chardet issue number 8 [0]. Mine is pull request number 8.
I think you misunderstood. :)

[0] dcramer/chardet#8

dan-blanchard commented

That actually wasn't anything I did on purpose. There are a lot of forks of chardet, and annoyingly GitHub doesn't know which issue numbers are for which fork.

dan-blanchard reopened this Dec 17, 2013
dan-blanchard added a commit that referenced this pull request Dec 17, 2013

dan-blanchard commented

Okay, I've made the same change on the develop branch. It seems reasonable to me.

dan-blanchard mentioned this pull request Oct 7, 2014