-
Notifications
You must be signed in to change notification settings - Fork 138
Improve decoding of non-Unicode files [WIP] #77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Pre-Unicode (BIFF5) files include a CODEPAGE record. Attempt to use the information in this record to transcode strings via iconv. This might introduce a little bit of slowness - in the future it would be better to create a single iconv_t to pass around, but this will probably break the ABI and require a minor version bump. For starters I've hard-coded the major Windows code pages - more can be added later, but it would help to find a comprehensive list somewhere. One open question is whether ancient XLS files created on MacOS use Mac character sets (Mac Roman, Mac Arabic, etc).
Previously, every string conversion resulted in its own call to iconv_open and iconv_close. This seems pretty inefficient, so instead we cache the iconv_t objects in the workbook objects, and close them when the workbook is closed. For BIFF8 documents, up to two converters are needed: one for UTF-16LE conversions, and a second converter for "compressed" UTF-16 (in which all the high bytes are 0), which I understand to be the same as Latin-1. iconv is skipped for the latter strings in favor of an in-line implementation if the output encoding is set to UTF-8. For BIFF5 documents, a single converter from the code page specified by the document to the client's desired encoding is needed. Note to self: I am still not happy with the use of wcstombs as it sets the CTYPE locale. Need to eliminate this side effect at some point, particularly since this change always hits wcstombs during the OLE reading.
Probably won't work on Windows, but we'll see what Appveyor says
Invoke uselocale to set a thread-local locale, and restore the previous locale at the end of the UTF-16 conversion function. Hopefully fixes the build.
Supporting all platforms takes some finesse so try to hide it as best we can
After further work this PR also includes these changes:
|
evanmiller
added a commit
that referenced
this pull request
Aug 7, 2020
Pull request #77 introduced some ABI-breaking changes, so let's be polite and bump the minor version.
netbsd-srcmastr
pushed a commit
to NetBSD/pkgsrc
that referenced
this pull request
Oct 1, 2020
Changes in 1.6.0: - This release enables decoding of non-Unicode character sets in older (BIFF5) XLS files. In addition, it improves string-conversion performance in newer files. See: libxls/libxls#77 Changes in 1.6.1: - This release corrects the libtool version information found in libxls 1.6.0, and fixes undefined behavior for applications that were originally linked against the 1.5.x dynamic library. The C source code is unchanged from the last release. See: libxls/libxls#79
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Pre-Unicode (BIFF5) files include a
CODEPAGE
record. Attempt to use the information in this record to transcode strings via iconv. This might introduce a little bit of slowness - in the future it would be better to create a singleiconv_t
to pass around, but this will probably break the ABI and require a minor version bump.For starters I've hard-coded the major Windows code pages - more can be added later, but it would help to find a comprehensive list somewhere. One open question is whether ancient XLS files created on MacOS use Mac character sets (Mac Roman, Mac Arabic, etc).
Leaving open for now so that others can test this branch before merging it into dev.