
Conversation

@wardi (Contributor) commented Dec 2, 2016

This is a small improvement to the DataStore that allows large CSV dumps and shrinks the maximum memory used, by paging over the data and streaming the output CSV.
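
For context, this is roughly the shape of the change: fetch the rows page by page and yield CSV text as you go, so memory stays bounded by the page size rather than by the whole table. This is only a hedged sketch, not the patch itself; `fetch_page` and `PAGE_SIZE` are illustrative stand-ins.

```python
import csv
import io

PAGE_SIZE = 10000  # rows fetched per datastore_search call (assumed value)

def stream_csv(fetch_page, fields):
    """Yield CSV text chunk by chunk instead of building one big response.

    fetch_page(offset, limit) is a hypothetical callable wrapping a paged
    datastore_search call; fields is the ordered list of column ids.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)

    writer.writerow(fields)  # header row goes out first
    yield buf.getvalue()
    buf.seek(0)
    buf.truncate(0)

    offset = 0
    while True:
        records = fetch_page(offset, PAGE_SIZE)
        if not records:
            break  # past the last page: stop
        for record in records:
            writer.writerow([record.get(f, '') for f in fields])
        yield buf.getvalue()  # hand this page to the client before fetching more
        buf.seek(0)
        buf.truncate(0)
        offset += PAGE_SIZE
```

A WSGI app can return such a generator directly, so the next page is only queried once the client has consumed the previous one.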

@mattfullerton (Contributor) commented

👍 Thanks for this!

@amercader self-assigned this Dec 6, 2016
@amercader (Member) commented

In the original code it seems like people could define the limit on the request (e.g. I just want the first 10 rows). If I'm understanding your changes correctly, people will now always get all the rows regardless of limit, as it is used as the page size. So passing the limit parameter would just affect the internal paging size. Shouldn't we separate limit and page?

@wardi (Contributor, Author) commented Dec 6, 2016

You're right. Are those parameters documented? (Do we want to keep them?) The way they were written made it easy to consume all the memory on the web server if a user requested enough rows.

I can add them back but I would like to document them at least.

@amercader (Member) commented

There should probably be a hard limit either way, with params or without, even if it's a big one (1M?).
Your offset is not actually used in the following loops if an offset is passed as a param. I'm fine with dropping support for the params to simplify things.
What would probably be more useful for selective dumps is an option to download as CSV from datastore_search.

@wardi (Contributor, Author) commented Dec 6, 2016

I agree, a datastore_search CSV output option would be more flexible. It doesn't fit the action API protocol, though. Ideally we could get the data from any of our API calls (paginated or not) streamed back to the end user as CSV.

Let me just add those options back here in case someone is using them.

On another topic: should I include a UTF-8 BOM here so that Excel can open these files? It would be a user-visible change. Should I add an option for that at the same time?
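
For reference, the BOM would just be a one-off prefix on the stream; a hypothetical opt-in wrapper (names are illustrative, not from this patch) might look like:

```python
UTF8_BOM = u'\ufeff'  # encodes to b'\xef\xbb\xbf'; Excel uses it to detect UTF-8

def maybe_bom(chunks, include_bom=False):
    """Optionally prepend a UTF-8 BOM to a stream of text chunks."""
    if include_bom:
        yield UTF8_BOM
    for chunk in chunks:
        yield chunk
```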

@amercader (Member) commented

I personally would not add it by default. I'm happy with an option to enable it.

@wardi (Contributor, Author) commented Dec 6, 2016

There's no reason for a hard limit on this call. If the client keeps reading the data, this code will continue looping to the next 10k records until it reaches the end. There's no benefit in making the client send multiple requests.

@amercader (Member) commented

Fair enough, we'll assume there are no memory leaks there :)

@TkTech (Member) commented Dec 9, 2016

@wardi regarding incompatibility with the existing API, we could support Accept: text/csv with a utility wrapper around certain actions. A quick LazyCSVWriter that doesn't write the headers until the first dict/row is consumed, to get the structure? That makes the assumption that every row will be identical.
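
Something like the following, as a rough sketch of that idea (class and argument names are made up here, and it assumes dict rows with identical keys):

```python
import csv

class LazyCSVWriter(object):
    """Defers writing the CSV header until the first row reveals the columns."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self._writer = None

    def writerow(self, row):
        if self._writer is None:
            # First row seen: use its keys as the header, in their given order.
            self._writer = csv.DictWriter(self._fileobj, fieldnames=list(row))
            self._writer.writeheader()
        self._writer.writerow(row)
```

An Accept: text/csv wrapper around an action could then feed each result dict through a writer like this instead of serializing it to JSON.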

base.abort(404, p.toolkit._('DataStore resource not found'))

if not wr:
    pylons.response.headers['Content-Type'] = 'text/csv'
A reviewer (Member) commented on the Content-Type line:
Content-Type: text/csv; charset=utf-8 ?

@amercader merged commit ccb8dd5 into master on Dec 17, 2016.
@amercader deleted the 3344-big-dumps branch on Dec 17, 2016.
torfsen pushed commits to torfsen/ckan that referenced this pull request on Jan 19, 2017.
@wardi mentioned this pull request on Oct 31, 2023.