Turkish ISO-8859-9 detection support (New) #41

Merged
merged 5 commits into chardet:master
Jan 9, 2015

@queeup queeup commented Dec 2, 2014

Discussion is here: #21

queeup added 4 commits January 22, 2014 19:38
I have no idea about side effects. With this value it works good. More
then 0.6..... always detects Turkish(iso-8859-9) as iso-8859-2.
@dan-blanchard
Member

First of all, I'd like to apologize for my misguided discussion about mTypicalPositiveRatio in the previous PR you had, #21. I was very new to chardet then and didn't fully understand how it worked.

How everything works is actually documented fairly well here.

As you can see in the original Mozilla paper, mTypicalPositiveRatio is the ratio of "the number of occurrences of the 512 most frequently used characters divided by the number of occurrences of the rest of the characters" for a typical document written in the language in question. It is a poorly named variable for sure, but has little to do with the confidence (except that we can determine the confidence based on how far a document's ratio is from what we expect).

Anyway, the ratio has to be calculated from a "typical" Turkish document (and currently we don't have a tool to do the calculation for you).
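As a rough illustration of the definition quoted above, the ratio could be estimated from a sample text along these lines. This is only a sketch: the function name `typical_positive_ratio` and its `top_n` parameter are hypothetical, it follows the quoted wording literally, and it is not the tool chardet would eventually ship or necessarily how the existing model tables were produced.

```python
from collections import Counter
import math


def typical_positive_ratio(text, top_n=512):
    """Occurrences of the top_n most frequent characters divided by
    occurrences of all remaining characters (hypothetical helper)."""
    # Count every non-whitespace character in the sample.
    counts = Counter(ch for ch in text if not ch.isspace())
    # Frequencies sorted from most to least common.
    ranked = [n for _, n in counts.most_common()]
    frequent = sum(ranked[:top_n])
    rest = sum(ranked[top_n:])
    # With fewer than top_n distinct characters there is no "rest".
    if rest == 0:
        return math.inf
    return frequent / rest


# Tiny demo with top_n=2: 'a' and 'b' are "frequent", 'c' is "the rest".
demo_ratio = typical_positive_ratio("aaaabbbc", top_n=2)
```

A meaningful value would need a large, representative Turkish corpus rather than a toy string, which is exactly what the thread is asking about.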

Could you explain how you came up with all of the tables in your PR? What files did you analyze to determine the ratio?

In the future I would like to have tools available to make it much simpler for people to add new encodings, but it is currently pretty difficult.

@queeup
Author

queeup commented Jan 9, 2015

I found the tables in this project: https://github.com/PyYoshi/cChardet/blob/master/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangTurkishModel.cpp

I converted them to Python. I didn't analyze anything for the ratio. First I tried the same number used in that project, then I changed it because chardet was not detecting Turkish with that ratio.

@dan-blanchard
Member

Thanks for pointing out that you adapted it from cChardet. It took me until this moment to realize that cChardet is actually wrapping uchardet-enhanced, which is a much-improved version of the original Mozilla code chardet is currently based on.

The ratio used by cChardet is correct; the problem is with the rest of chardet. Our other probers are reporting confidences that are too high. We should switch to treating uchardet-enhanced as the upstream version and pull in all of the changes present there. That would make it even simpler to merge with cChardet as we've discussed before in #19.
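To make the failure mode concrete, here is a small sketch using only Python's built-in codecs. It shows that the same Turkish bytes decode without error as both ISO-8859-9 and ISO-8859-2, so the choice between them rests entirely on the reported confidences. The confidence numbers are hypothetical, chosen only for illustration, and picking the maximum is a simplification of what the detector does.

```python
# Turkish sample text using the letters that differ between the two encodings.
turkish = "ışığı gördüğümüz şehir"
raw = turkish.encode("iso-8859-9")

# Both codecs accept every byte, so neither can be ruled out by decoding alone.
as_latin5 = raw.decode("iso-8859-9")  # round-trips to the original text
as_latin2 = raw.decode("iso-8859-2")  # valid but wrong: ğ→đ, ş→ţ, ı→ý

print(as_latin5)
print(as_latin2)

# Hypothetical confidences (not measured): an over-confident ISO-8859-2
# prober outranks a correctly calibrated Turkish one.
hypothetical_confidences = {"ISO-8859-9": 0.55, "ISO-8859-2": 0.80}
best = max(hypothetical_confidences, key=hypothetical_confidences.get)
print(best)
```

Under these assumed numbers the wrong encoding wins even though the Turkish ratio is correct, which is why lowering the Turkish ratio only masked the problem.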

@queeup
Author

queeup commented Jan 9, 2015

OK. I can revert my ratio to the default. Should I revert it right away, or wait for your call?

@dan-blanchard
Member

I'd say revert it back to the default for now, and then I'll merge this in. Then we'll need to pull in the other updated tables and whatnot from cChardet/uchardet-enhanced.

… have no idea about side effects. With this value it works good. More then 0.6..... always detects Turkish(iso-8859-9) as iso-8859-2."

This reverts commit 1059680.
@queeup
Author

queeup commented Jan 9, 2015

Reverted. All yours now.

@dan-blanchard
Member

Thanks for the contribution.

dan-blanchard added a commit that referenced this pull request Jan 9, 2015
Turkish ISO-8859-9 detection support (New)
@dan-blanchard dan-blanchard merged commit 065b0a1 into chardet:master Jan 9, 2015
@dan-blanchard dan-blanchard mentioned this pull request Apr 11, 2017