Proper detection report to UTF-8 with BOM #8

ghost · 2013-10-20T15:24:21Z

Currently chardet returns 'UTF-8' for a UTF-8 file with a BOM.
However, opening a BOM'ed UTF-8 file with the UTF-8 codec leaves the unwanted Unicode character '\ufeff' at the start of the file's first line:

import io

# Decode with the plain UTF-8 codec; the BOM is kept as U+FEFF.
f = io.open('utf8bom.file', encoding='UTF-8')
l = f.readline()
l
# output: u'\ufeff FILE_CONTENTS...'

Opening the file with the correct codec, 'UTF-8-SIG', avoids this problem:

f = io.open('utf8bom.file', encoding='UTF-8-SIG')
l = f.readline()
l
# output: u' FILE_CONTENTS...'
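
The same difference can be reproduced without a file on disk. This is only a minimal sketch: the sample bytes are made up for illustration, and `codecs.BOM_UTF8` is the standard-library constant for the three BOM bytes.

```python
import codecs

# Hypothetical sample: the UTF-8 BOM followed by some text.
data = codecs.BOM_UTF8 + u"FILE_CONTENTS".encode("utf-8")

# The plain codec keeps the BOM as a leading U+FEFF character.
plain = data.decode("utf-8")

# The -sig variant strips the BOM while decoding.
sig = data.decode("utf-8-sig")

print(repr(plain))
print(repr(sig))
```

Only the first character differs between the two results, which is why the problem tends to show up on the first line of a file.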

Some references:
http://docs.python.org/2/library/codecs.html#encodings-and-unicode
http://stackoverflow.com/questions/17912307/u-ufeff-in-python-string

rayer4u · 2013-11-13T10:53:46Z

UTF-8 seems to be an alias of utf_8_sig according to the docs, but codecs.lookup('UTF-8') actually returns the utf_8 codec. I suggest returning the 'utf_8_sig' codec directly.
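
A quick way to check which codec each name resolves to is to compare the `name` attribute of the `CodecInfo` objects returned by `codecs.lookup`. This is a small sketch using only the standard library; the variable names are just for illustration.

```python
import codecs

# Resolve both spellings through the codec registry.
utf8_name = codecs.lookup("UTF-8").name
sig_name = codecs.lookup("utf_8_sig").name

# The two names resolve to different codecs.
print(utf8_name, sig_name)
```

So a caller that receives 'UTF-8' from the detector gets the plain codec and has to deal with the BOM separately.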

ghost · 2013-11-13T14:03:22Z

Hm, it looks like opening (reading) a UTF-8 file, with or without a BOM, using the utf-8-sig codec gives an acceptable result.
I took your suggestion and looked into the Python issue tracker. Found this: http://bugs.python.org/issue1328
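
To illustrate the "with or without BOM" point, here is a small sketch on in-memory bytes (the sample text is hypothetical). It also shows one caveat worth keeping in mind: the same codec prepends a BOM when encoding.

```python
import codecs

text = u"FILE_CONTENTS"
without_bom = text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

# Reading: utf-8-sig accepts both inputs and yields the same text.
decoded_without = without_bom.decode("utf-8-sig")
decoded_with = with_bom.decode("utf-8-sig")

# Writing: utf-8-sig emits a BOM in front of the encoded text.
encoded = text.encode("utf-8-sig")
writes_bom = encoded.startswith(codecs.BOM_UTF8)
```

For reading, the -sig codec is therefore a safe choice for either kind of input; for writing, it changes the output bytes.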

ghost · 2013-12-16T22:30:57Z

Hi @dan-blanchard! I think those commits resolve dcramer/chardet issue number 8 [0]. Mine is pull request number 8.
I think you misunderstood. :)

[0] dcramer/chardet#8

dan-blanchard · 2013-12-17T15:13:16Z

That actually wasn't anything I did on purpose. There are a lot of forks of chardet, and annoyingly GitHub doesn't know which issue numbers are for which fork.

dan-blanchard · 2013-12-17T18:17:50Z

Okay, I've made the same change on the develop branch. It seems reasonable to me.
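
For readers who want to see the idea in isolation, the behaviour described in this pull request can be approximated as below. This is only a sketch under stated assumptions: `guess_utf8_variant` is a hypothetical helper, not part of chardet's API, and it is not the code from the referenced commits.

```python
import codecs


def guess_utf8_variant(data):
    """Hypothetical helper (not chardet's API): report 'UTF-8-SIG'
    when the byte string starts with the UTF-8 BOM, otherwise None."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-SIG"
    return None


sample = codecs.BOM_UTF8 + b"abc"
encoding = guess_utf8_variant(sample)

# The reported name can be passed straight to decode().
decoded = sample.decode(encoding)
```

Returning a name that Python's codec registry accepts means the caller can decode with it directly and never sees the leading '\ufeff'.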

Proper detection report to UTF-8 with BOM

cd6982c

dan-blanchard pushed a commit that referenced this pull request Dec 15, 2013

Actually fix #8 and release 1.0.3

f1d4e15

dan-blanchard closed this in 458c82e Dec 16, 2013

dan-blanchard reopened this Dec 17, 2013

dan-blanchard added a commit that referenced this pull request Dec 17, 2013

Fix #8 and return UTF-8-SIG as the codec when encountering strings that start with BOM instead of UTF-8.

179fad0

dan-blanchard closed this Dec 17, 2013

dan-blanchard mentioned this pull request Oct 7, 2014

Release 2.3.0 #35

Merged
