Skip to content

.xls support? And a weird header issue #259

@aodj

Description

@aodj

Firstly, thanks for roapi! I've been playing around with it for a little while and really like it.

I've been trying to use it to proxy some datafiles containing information on currency codes as published by the body behind the ISO 4217 standard. The data is available in .xls or .xml format, but I can't get roapi to parse the .xls file. Opening it in Excel shows that the file has a couple of leading rows of dud data, in a non-tabular format that's not relevant to the data to be loaded.

When running it with roapi I get the following errors:

$ RUST_LOG=debug roapi --table "https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list-one.xls,format=xlsx,sheet_name=Active"
[2023-02-13T11:26:06Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:26:06Z INFO  roapi::context] loading `uri(https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list-one.xls)` as table `list-one`
Error: Error loading Xlsx table: Failed to load .xlsx file

Downloading the file locally gives the same error:

$ RUST_LOG=debug roapi --table "list-one.xls,format=xlsx,sheet_name=Active"
[2023-02-13T11:27:03Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:27:03Z INFO  roapi::context] loading `uri(list-one.xls)` as table `list-one`
Error: Error loading Xlsx table: Failed to load .xlsx file

If I save it as an .xlsx file I get an error about unparseable data:

$ RUST_LOG=debug RUST_BACKTRACE=1 roapi --table "list-one.xlsx,format=xlsx,sheet_name=Active"
[2023-02-13T11:28:07Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:28:07Z INFO  roapi::context] loading `uri(list-one.xlsx)` as table `list-one`
Error: Arrow error: Invalid argument error: column types must match schema types, expected Null but found Utf8 at column index 2

Caused by:
    Invalid argument error: column types must match schema types, expected Null but found Utf8 at column index 2

Do you think there's anything that might be done to support the loading of this file? Something like skiprows or header in Pandas would work. I've worked around it by processing the data with Pandas and saving it as a .parquet file, but I would like the ability to reference the current upstream file without having to process it.

Thanks for any input you can provide. I look forward to any suggestions you might have.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions