
Conversation

@aabadie (Contributor) commented Aug 7, 2018

This PR is an attempt to fix #724.

The idea is simply to use fewer rows in the input dataset. Other changes are related to data displays.

This fixes the documentation build on ReadTheDocs; see the generated documentation here.

One remaining problem: the dataset is now very small, so there's a lot of noise in the measured durations. This causes large variations across runs, and sometimes LZ4 compression/decompression appears slightly slower than raw, which contradicts the example's comments.
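As a rough illustration of that noise (a minimal sketch, not the example's actual code; LZ4 support assumes the lz4 package is installed):

    import time
    import numpy as np
    import joblib

    # A small array, similar in size to the trimmed example dataset.
    data = np.random.randn(2000, 10)

    for compress in [0, ('lz4', 3)]:
        t0 = time.time()
        joblib.dump(data, '/tmp/data.pkl', compress=compress)
        print(compress, '-> %.4fs' % (time.time() - t0))

With such a small payload, both calls finish in a few milliseconds, so run-to-run noise can easily make the compressed call look slower than the raw one.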

@aabadie aabadie requested a review from GaelVaroquaux August 7, 2018 08:25
@aabadie (Contributor, Author) commented Aug 7, 2018

Another thing changed in the ReadTheDocs configuration: it now builds with CPython 3.5. This is required by the LZMA compression used in the compressor example, since the lzma module was only introduced in Python 3.3.
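For reference, a minimal sketch of that dependency (the compression level here is illustrative):

    import lzma  # stdlib module, only available on Python >= 3.3
    import joblib

    # LZMA-compressed dump, as used in the compressor example.
    joblib.dump(list(range(1000)), '/tmp/data.pkl.lzma', compress=('lzma', 3))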

@codecov codecov bot commented Aug 7, 2018

Codecov Report

Merging #735 into master will decrease coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #735      +/-   ##
==========================================
- Coverage   95.31%   95.28%   -0.04%     
==========================================
  Files          42       42              
  Lines        6128     6128              
==========================================
- Hits         5841     5839       -2     
- Misses        287      289       +2
Impacted Files              Coverage Δ
joblib/backports.py         93.75% <0%> (-2.09%) ⬇️
joblib/_store_backends.py   90.47% <0%> (-0.53%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3556ca4...ce6aa6c.

@GaelVaroquaux (Member)

That's great!! Thank you very much. Merging!

@GaelVaroquaux GaelVaroquaux merged commit daf7f8e into master Aug 7, 2018
@GaelVaroquaux (Member)

That worked, but the timings in the generated documentation are really strange and do not convey the ideas we were hoping for :(

@aabadie aabadie deleted the rtd_fix branch August 7, 2018 20:27
@aabadie (Contributor, Author) commented Aug 7, 2018

Thanks for merging @GaelVaroquaux !

the timings in the generated documentation are really strange and do not convey the ideas we were hoping for

Indeed, it's ugly.
I never got a plot like this during my tests. I just triggered a couple of builds and now it's a bit nicer, but the timings are too short (around 0.02s): the compression/decompression durations might be hidden by other random effects.
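One way to make such short timings more robust (a sketch of a standard technique, not what the example currently does) is to repeat each measurement and keep the best run:

    import timeit

    # Best-of-5 timing of a single dump call; setup cost is excluded.
    durations = timeit.repeat(
        "joblib.dump(data, '/tmp/data.pkl', compress=('lz4', 3))",
        setup="import joblib; import numpy as np; "
              "data = np.random.randn(100000, 4)",
        repeat=5, number=1)
    print('best of 5: %.4fs' % min(durations))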

@ogrisel (Contributor) commented Aug 21, 2018

Maybe try twice as many rows? Would that fit the RTD constraints?

@aabadie (Contributor, Author) commented Aug 22, 2018

Maybe try twice as many rows? Would that fit the RTD constraints?

I benchmarked the impact of the nrows value used in the compressor example's dataset on the memory consumed when building the documentation.
I used psrecord while building the docs from scratch (make doc-clean && make doc):

psrecord $(pgrep -f sphinx-build) --interval 1 --plot plot-500k.png

Here are some plots:

  • nrows=all, i.e. 4.9e6 lines => ~2.5GB of RAM [plot-all]
  • nrows=1e6 => ~800MB of RAM [plot-1e6]
  • nrows=500k => ~450MB of RAM [plot-500k]
  • nrows=100k => ~250MB of RAM [plot-100000]
  • nrows=2k => ~200MB of RAM [plot-2000]

I also tested the builds on RTD and they all pass up to 2e6 lines. The duration results only become really meaningful starting from 500k lines in the dataset. I ran a couple of builds on RTD with 1e6 lines (~800MB of RAM) and they passed.
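For context, the knob swept above is just the nrows argument used when loading the example's CSV dataset, along the lines of this hypothetical sketch (the file name is made up):

    import pandas as pd

    # Read only the first 1e6 rows of the dataset; in the measurements
    # above this kept the doc build at around ~800MB of RAM.
    data = pd.read_csv('dataset.csv', nrows=int(1e6))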

Conclusion:
We can definitely increase nrows. I don't understand why I had to go that low when testing the first time. I read somewhere that RTD VMs were limited to 500MB of RAM, but apparently this limit was increased (since builds with 1e6 lines now pass).
Are you OK with updating the example to use 1e6 lines in the dataset? (The result can be seen here.)

Development

Successfully merging this pull request may close these issues:

readthedocs still documents joblib 0.11