
Conversation

@wardi (Contributor) commented Dec 2, 2016

This is a small improvement to the DataStore that allows large CSV dumps and shrinks the maximum memory used, by paging over the data and streaming the output CSV.
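
For context, this is roughly the shape of the change: fetch the rows page by page and yield CSV text as you go, so memory stays bounded by the page size rather than by the whole table. This is only a hedged sketch, not the patch itself; `fetch_page` and `PAGE_SIZE` are illustrative stand-ins.

```python
import csv
import io

PAGE_SIZE = 10000  # rows fetched per datastore_search call (assumed value)

def stream_csv(fetch_page, fields):
    """Yield CSV text chunk by chunk instead of building one big response.

    fetch_page(offset, limit) is a hypothetical callable wrapping a paged
    datastore_search call; fields is the ordered list of column ids.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)

    writer.writerow(fields)  # header row goes out first
    yield buf.getvalue()
    buf.seek(0)
    buf.truncate(0)

    offset = 0
    while True:
        records = fetch_page(offset, PAGE_SIZE)
        if not records:
            break  # past the last page: stop
        for record in records:
            writer.writerow([record.get(f, '') for f in fields])
        yield buf.getvalue()  # hand this page to the client before fetching more
        buf.seek(0)
        buf.truncate(0)
        offset += PAGE_SIZE
```

A WSGI app can return such a generator directly, so the next page is only queried once the client has consumed the previous one.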

@mattfullerton (Contributor) commented

👍 Thanks for this!

@amercader self-assigned this Dec 6, 2016
@amercader (Member) commented

In the original code it seems like people could define the limit on the request (e.g. I just want the first 10 rows). If I'm understanding your changes correctly, people will now always get all the rows regardless of limit, as it is used as the page size. So passing the limit parameter would just affect the internal paging size. Shouldn't we separate limit and page?

@wardi (Contributor, Author) commented Dec 6, 2016

You're right. Are those parameters documented? (Do we want to keep them?) The way they were written made it easy to consume all the memory on the web server if a user requested enough rows.

I can add them back but I would like to document them at least.

@amercader (Member) commented

There should probably be a hard limit either way, with params or without, even if it's a big one (1M?).
Your offset is not actually used in the following loops if an offset is passed as a param. I'm fine with dropping support for the params to simplify things.
What would probably be more useful for selective dumps is an option to download as CSV from datastore_search.

@wardi (Contributor, Author) commented Dec 6, 2016

I agree, a datastore_search CSV output option would be more flexible. It doesn't fit the action API protocol, though. Ideally we could get the data from any of our API calls (paginated or not) streamed back to the end user as CSV.

Let me just add those options back here in case someone is using them.

On another topic: should I include a UTF-8 BOM here so that Excel can open these files? It would be a user-visible change. Should I add an option for that at the same time?
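
For reference, the BOM would just be a one-off prefix on the stream; a hypothetical opt-in wrapper (names are illustrative, not from this patch) might look like:

```python
UTF8_BOM = u'\ufeff'  # encodes to b'\xef\xbb\xbf'; Excel uses it to detect UTF-8

def maybe_bom(chunks, include_bom=False):
    """Optionally prepend a UTF-8 BOM to a stream of text chunks."""
    if include_bom:
        yield UTF8_BOM
    for chunk in chunks:
        yield chunk
```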

@amercader (Member) commented

I personally would not add it by default. I'm happy with an option to enable it.

@wardi (Contributor, Author) commented Dec 6, 2016

There's no reason for a hard limit on this call. If the client keeps reading the data, this code will continue looping to the next 10k records until it reaches the end. There's no benefit in making the client send multiple requests.

@amercader (Member) commented

Fair enough, we'll assume there are no memory leaks there :)

@TkTech (Member) commented Dec 9, 2016

@wardi regarding incompatibility with the existing API, we could support Accept: text/csv with a utility wrapper around certain actions. A quick LazyCSVWriter that doesn't write the headers until the first dict/row is consumed, to get the structure? That makes the assumption that every row will be identical.
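
Something like the following, as a rough sketch of that idea (class and argument names are made up here, and it assumes dict rows with identical keys):

```python
import csv

class LazyCSVWriter(object):
    """Defers writing the CSV header until the first row reveals the columns."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self._writer = None

    def writerow(self, row):
        if self._writer is None:
            # First row seen: use its keys as the header, in their given order.
            self._writer = csv.DictWriter(self._fileobj, fieldnames=list(row))
            self._writer.writeheader()
        self._writer.writerow(row)
```

An Accept: text/csv wrapper around an action could then feed each result dict through a writer like this instead of serializing it to JSON.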

base.abort(404, p.toolkit._('DataStore resource not found'))

if not wr:
    pylons.response.headers['Content-Type'] = 'text/csv'
A reviewer (Member) commented on the Content-Type line:
Content-Type: text/csv; charset=utf-8 ?

@amercader merged commit ccb8dd5 into master on Dec 17, 2016.
@amercader deleted the 3344-big-dumps branch on Dec 17, 2016.
torfsen pushed commits to torfsen/ckan that referenced this pull request on Jan 19, 2017.
@wardi mentioned this pull request on Oct 31, 2023.