Turkish ISO-8859-9 detection support (New) #41
I have no idea about side effects. With this value it works well. With anything more than 0.6, it always detects Turkish (ISO-8859-9) as ISO-8859-2.
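For context on why these two encodings are easy to confuse, here is a minimal sketch using only Python's standard codecs. The sample word is an assumption chosen for illustration and does not come from the PR. Any byte sequence that is valid ISO-8859-9 also decodes without error as ISO-8859-2, so a detector cannot reject either encoding on validity alone and has to rely on letter statistics.

```python
# Illustrative only: the sample word is an assumption, not taken from the PR.
sample = "ışığı"  # uses letters that ISO-8859-9 adds for Turkish

# Encode once as ISO-8859-9, then decode the same bytes both ways.
raw = sample.encode("iso-8859-9")
as_turkish = raw.decode("iso-8859-9")
as_latin2 = raw.decode("iso-8859-2")  # raises no error, yields different letters

print(as_turkish)
print(as_latin2)
```

Both decodes succeed and produce the same number of characters; only the identity of the non-ASCII letters differs.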
First of all, I'd like to apologize for my misguided discussion about How everything works is actually documented fairly well here. As you can see in the original Mozilla paper, Anyway, the ratio has to be calculated from a "typical" Turkish document (and currently we don't have a tool to do the calculation for you). Could you explain how you came up with all of the tables in your PR? What files did you analyze to determine the ratio? In the future I would like to have tools available to make it much simpler for people to add new encodings, but it is currently pretty difficult.
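As a rough illustration of what such a calculation could look like, the sketch below assumes the ratio means the share of two-letter sequences in a typical text that fall among the most frequent sequences for the language. The function names, the `top_n` default, and the simplified notion of a "letter pair" are all hypothetical; this is not a tool from chardet or uchardet, and the real model tables are built differently.

```python
from collections import Counter


def letter_pairs(text):
    """Return adjacent letter pairs, skipping pairs that contain non-letters."""
    text = text.lower()
    return [a + b for a, b in zip(text, text[1:]) if a.isalpha() and b.isalpha()]


def positive_ratio(text, top_n=512):
    """Fraction of pair occurrences covered by the top_n most frequent pairs.

    top_n is an arbitrary illustrative cutoff, not a value from the thread.
    """
    counts = Counter(letter_pairs(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total


# Hypothetical usage: in practice the input would be a large Turkish corpus.
print(positive_ratio("abab", top_n=1))
```

A meaningful figure would need a large, representative corpus, with the frequent-pair list ideally derived from separate training text rather than from the document being measured.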
I found the tables in this project: https://github.com/PyYoshi/cChardet/blob/master/src/ext/libcharsetdetect/mozilla/extensions/universalchardet/src/base/LangTurkishModel.cpp and converted them to Python. I didn't analyze anything for the ratio. First I tried the same number as that project, then I changed it because it wasn't detecting Turkish with that ratio.
Thanks for pointing out that you adapted it from cChardet. It took me until this moment to realize that cChardet is actually wrapping uchardet-enhanced, which is a much-improved version of the original Mozilla code chardet is currently based on. The ratio used by cChardet is correct; the problem is with the rest of chardet. Our other probers are reporting confidences that are too high. We should switch to treating uchardet-enhanced as the upstream version and pull in all of the changes present there. That would make it even simpler to merge with cChardet as we've discussed before in #19.
OK. I can revert my ratio back to the default. Should I revert it ASAP, or wait for your call?
I'd say revert it back to the default for now, and then I'll merge this in. Then we'll need to pull in the other updated tables and whatnot from cChardet/uchardet-enhanced.
… have no idea about side effects. With this value it works good. More then 0.6..... always detects Turkish(iso-8859-9) as iso-8859-2." This reverts commit 1059680.
Reverted. All yours now.
Thanks for the contribution.
Turkish ISO-8859-9 detection support (New)
Discussion is here: #21