
Conversation

@aabadie (Contributor) commented Aug 7, 2018

This PR is an attempt to fix #724.

The idea is simply to use fewer rows in the input dataset. Other changes are related to data displays.

This fixes the documentation build on ReadTheDocs; see the generated documentation here.

One remaining problem: the dataset is now very small, so there's a lot of noise in the measured durations. This causes large variations across runs, and sometimes LZ4 compression/decompression appears slightly slower than raw, which contradicts the example's comments.
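As a rough illustration of that noise (a minimal sketch, not the example's actual code; LZ4 support assumes the lz4 package is installed):

    import time
    import numpy as np
    import joblib

    # A small array, similar in size to the trimmed example dataset.
    data = np.random.randn(2000, 10)

    for compress in [0, ('lz4', 3)]:
        t0 = time.time()
        joblib.dump(data, '/tmp/data.pkl', compress=compress)
        print(compress, '-> %.4fs' % (time.time() - t0))

With such a small payload, both calls finish in a few milliseconds, so run-to-run noise can easily make the compressed call look slower than the raw one.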

@aabadie aabadie requested a review from GaelVaroquaux August 7, 2018 08:25
@aabadie (Contributor, Author) commented Aug 7, 2018

Another thing changed in the ReadTheDocs configuration: it now builds with CPython 3.5. This is required by the LZMA compression used in the compressor example, since the lzma module was only introduced in Python 3.3.
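For reference, a minimal sketch of that dependency (the compression level here is illustrative):

    import lzma  # stdlib module, only available on Python >= 3.3
    import joblib

    # LZMA-compressed dump, as used in the compressor example.
    joblib.dump(list(range(1000)), '/tmp/data.pkl.lzma', compress=('lzma', 3))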

@codecov codecov bot commented Aug 7, 2018

Codecov Report

Merging #735 into master will decrease coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #735      +/-   ##
==========================================
- Coverage   95.31%   95.28%   -0.04%     
==========================================
  Files          42       42              
  Lines        6128     6128              
==========================================
- Hits         5841     5839       -2     
- Misses        287      289       +2
Impacted Files              Coverage Δ
joblib/backports.py         93.75% <0%> (-2.09%) ⬇️
joblib/_store_backends.py   90.47% <0%> (-0.53%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3556ca4...ce6aa6c.

@GaelVaroquaux (Member)

That's great!! Thank you very much. Merging!

@GaelVaroquaux GaelVaroquaux merged commit daf7f8e into master Aug 7, 2018
@GaelVaroquaux (Member)

That worked, but the timings in the generated documentation are really strange and do not convey the ideas we were hoping for :(

@aabadie aabadie deleted the rtd_fix branch August 7, 2018 20:27
@aabadie (Contributor, Author) commented Aug 7, 2018

Thanks for merging @GaelVaroquaux !

the timings in the generated documentation are really strange and do not convey the ideas we were hoping for

Indeed, it's ugly.
I never got a plot like this during my tests. I just triggered a couple of builds and now it's a bit nicer, but the timings are too short (around 0.02s): the compression/decompression durations might be hidden by other random effects.
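One way to make such short timings more robust (a sketch of a standard technique, not what the example currently does) is to repeat each measurement and keep the best run:

    import timeit

    # Best-of-5 timing of a single dump call; setup cost is excluded.
    durations = timeit.repeat(
        "joblib.dump(data, '/tmp/data.pkl', compress=('lz4', 3))",
        setup="import joblib; import numpy as np; "
              "data = np.random.randn(100000, 4)",
        repeat=5, number=1)
    print('best of 5: %.4fs' % min(durations))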

@ogrisel (Contributor) commented Aug 21, 2018

Maybe try twice as many rows? Would that fit the RTD constraints?

@aabadie (Contributor, Author) commented Aug 22, 2018

Maybe try twice as many rows? Would that fit the RTD constraints?

I benchmarked the impact of the nrows value used in the compressor example's dataset on the memory consumed when building the documentation.
I used psrecord while building the docs from scratch (make doc-clean && make doc):

psrecord $(pgrep -f sphinx-build) --interval 1 --plot plot-500k.png

Here are some plots:

  • nrows=all, i.e. 4.9e6 lines => ~2.5GB of RAM [plot-all]
  • nrows=1e6 => ~800MB of RAM [plot-1e6]
  • nrows=500k => ~450MB of RAM [plot-500k]
  • nrows=100k => ~250MB of RAM [plot-100000]
  • nrows=2k => ~200MB of RAM [plot-2000]

I also tested the builds on RTD and they all pass up to 2e6 lines. The duration results only become really meaningful starting from 500k lines in the dataset. I ran a couple of builds on RTD with 1e6 lines (~800MB of RAM) and they passed.
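For context, the knob swept above is just the nrows argument used when loading the example's CSV dataset, along the lines of this hypothetical sketch (the file name is made up):

    import pandas as pd

    # Read only the first 1e6 rows of the dataset; in the measurements
    # above this kept the doc build at around ~800MB of RAM.
    data = pd.read_csv('dataset.csv', nrows=int(1e6))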

Conclusion:
We can definitely increase nrows. I don't understand why I had to go that low when testing the first time. I read somewhere that RTD VMs were limited to 500MB of RAM, but apparently this limit was increased (since builds with 1e6 lines now pass).
Are you OK with updating the example to use 1e6 lines in the dataset? (The result can be seen here.)

Development

Successfully merging this pull request may close these issues:

readthedocs still documents joblib 0.11