Skip to content

Improve decoding of non-Unicode files [WIP] #77

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Aug 7, 2020
Merged

Improve decoding of non-Unicode files [WIP] #77

merged 19 commits into from
Aug 7, 2020

Conversation

evanmiller
Copy link
Collaborator

Pre-Unicode (BIFF5) files include a CODEPAGE record. Attempt to use the information in this record to transcode strings via iconv. This might introduce a little bit of slowness - in the future it would be better to create a single iconv_t to pass around, but this will probably break the ABI and require a minor version bump.

For starters I've hard-coded the major Windows code pages - more can be added later, but it would help to find a comprehensive list somewhere. One open question is whether ancient XLS files created on MacOS use Mac character sets (Mac Roman, Mac Arabic, etc).

Leaving open for now so that others can test this branch before merging it into dev.

Pre-Unicode (BIFF5) files include a CODEPAGE record. Attempt to use the
information in this record to transcode strings via iconv. This might
introduce a little bit of slowness - in the future it would be better
to create a single iconv_t to pass around, but this will probably break
the ABI and require a minor version bump.

For starters I've hard-coded the major Windows code pages - more can be
added later, but it would help to find a comprehensive list somewhere.
One open question is whether ancient XLS files created on MacOS use Mac
character sets (Mac Roman, Mac Arabic, etc).
evanmiller added 16 commits June 6, 2020 16:05
Previously, every string conversion resulted in its own call to
iconv_open and iconv_close. This seems pretty inefficient, so instead
we cache the iconv_t objects in the workbook objects, and close them
when the workbook is closed.

For BIFF8 documents, up to two converters are needed: one for UTF-16LE
conversions, and a second converter for "compressed" UTF-16 (in which
all the high bytes are 0), which I understand to be the same as Latin-1.
iconv is skipped for the latter strings in favor of an in-line
implementation if the output encoding is set to UTF-8.

For BIFF5 documents, a single converter from the code page specified by
the document to the client's desired encoding is needed.

Note to self: I am still not happy with the use of wcstombs as it sets
the CTYPE locale. Need to eliminate this side effect at some point,
particularly since this change always hits wcstombs during the OLE
reading.
Probably won't work on Windows, but we'll see what Appveyor says
Invoke uselocale to set a thread-local locale, and restore the previous
locale at the end of the UTF-16 conversion function.

Hopefully fixes the build.
Supporting all platforms takes some finesse so try to hide it as
best we can
@evanmiller
Copy link
Collaborator Author

After further work this PR also includes these changes:

  • Cache the iconv_t object in each workbook structure to reduce system calls

  • In the absence of iconv, call wcstombs_l or _wcstombs_l (locale-specific versions of wcstombs) when present, passing a UTF-8 locale.

  • Support more code pages, including the old Mac code pages, which I have witnessed in the wild.

@evanmiller evanmiller merged commit 6edb17d into dev Aug 7, 2020
evanmiller added a commit that referenced this pull request Aug 7, 2020
Pull request #77 introduced some ABI-breaking changes, so let's be
polite and bump the minor version.
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Oct 1, 2020
Changes in 1.6.0:

 - This release enables decoding of non-Unicode character sets in older
   (BIFF5) XLS files. In addition, it improves string-conversion
   performance in newer files.

   See: libxls/libxls#77

Changes in 1.6.1:

 - This release corrects the libtool version information found in libxls
   1.6.0, and fixes undefined behavior for applications that were
   originally linked against the 1.5.x dynamic library. The C source
   code is unchanged from the last release.

   See: libxls/libxls#79
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant