
Conversation

lrq3000
Member

@lrq3000 lrq3000 commented Aug 26, 2016

Add tqdm module to predict time using machine learning polynomial regression.

Implementation of what was asked in #206.

For the record, here is the model used in this implementation to predict time using machine learning polynomial regression:

[image: polynomial regression model formula]

Canonical example:

import time
from tqdm import tsrange

plot = True
tstep = 0.1

tt = tsrange(100, leave=True, miniters=1, mininterval=0, algo='numpy', order=3, plot=plot)
#tt = tsrange(100, leave=True, miniters=1, mininterval=0, algo='descent', repeat=30, plot=plot)
#tt = tsrange(100, leave=True, miniters=1, mininterval=0, algo='sklearn', order=3, repeat=30, plot=plot)

for i in tt:
    time.sleep(tstep)
    #tstep = tstep * 1.1
    tstep = tstep + 0.01

    # DEBUG: compare the predicted end time with the model's current estimate
    predicted_endtime = float(tt.model.predict([tt.total]))
    current = float(tt.model.predict([tt.n]))
    tt.write('it: %i, Predicted endtime: %g vs current: %g' % (tt.n, predicted_endtime, current))

Uncomment the other tt = ... lines if you want to try the different algorithms provided.

There are three different algorithms provided:

  • Numpy: the analytical least-squares regression, so it always finds the best fit, but it can be slower when there are lots of samples to regress, and it requires numpy. This is the default, as it currently gives the best results; the number of samples used for fitting is capped, so dimensionality shouldn't be an issue.
  • Descent: a pure-Python implementation of minibatch gradient descent for polynomial regression. It was initially set to be the default, but the analytical approach works much better in our case.
  • Sklearn: uses scikit-learn to compute a solution iteratively (not analytically). It's similar to "descent", but it requires scikit-learn and should converge a bit faster and be more resilient to noisy samples (it uses the PassiveAggressive regressor by default, but can be switched to stochastic gradient descent).
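As a rough illustration of the "descent" variant, a pure-Python minibatch gradient descent polynomial fit might look like the sketch below. All names and hyperparameters here are illustrative assumptions, not the PR's actual implementation:

```python
import random

def fit_descent(samples, order=2, repeat=100, lr=0.05, batch=8):
    """Fit polynomial coefficients to (n, elapsed) samples by minibatch SGD."""
    coefs = [0.0] * (order + 1)
    for _ in range(repeat):
        minibatch = random.sample(samples, min(batch, len(samples)))
        for n, elapsed in minibatch:
            feats = [n ** k for k in range(order + 1)]
            err = sum(c * f for c, f in zip(coefs, feats)) - elapsed
            for k in range(order + 1):
                coefs[k] -= lr * err * feats[k]  # per-sample gradient step
    return coefs

def predict(coefs, n):
    """Evaluate the fitted polynomial at iteration n."""
    return sum(c * n ** k for k, c in enumerate(coefs))
```

With raw powers of n as features, the learning rate has to be small (or n normalized) for large iterables, which hints at why the analytical numpy path ended up working better.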

Ok so basically how it works:

  • At each iteration, a new sample (n=current iteration, elapsed) is generated.
  • This sample gets "fuzzily" added to a batch of memorized samples in the model. Why fuzzily? It isn't the best term, but new points are not necessarily memorized: there is a probability to store each one or to skip it. The goal is to remember old points as well as new ones, not only the newest, which should produce a better model. Why memorize a batch at all? With stochastic gradient descent on a single point, the fit goes completely wrong; several different samples are needed to fit correctly. Hence we memorize some points, but not all of them, otherwise we would run out of memory when the iterable's total exceeds 10^7.
  • The model learns in batch on all memorized samples at each iteration. So it adapts itself fully at each iteration to the new shape of the curve.
  • Ending time is predicted using the model.
  • We compute the expected average rate given the predicted ending time.
  • We feed everything to format_meter.
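The per-iteration loop above could be sketched like this. The function names, the bounded reservoir-style sampling, and the linear model are all illustrative assumptions, not the PR's actual code:

```python
import random

MAX_SAMPLES = 50  # bound memory even when total > 1e7 iterations

def remember(memory, sample):
    """Fuzzily keep a bounded, randomly thinned batch of (n, elapsed) samples."""
    if len(memory) < MAX_SAMPLES:
        memory.append(sample)
    elif random.random() < MAX_SAMPLES / (sample[0] + 1.0):
        # reservoir-style: overwrite a random old point with some probability
        memory[random.randrange(MAX_SAMPLES)] = sample

def fit_linear(memory):
    """Analytical least-squares line through the memorized samples."""
    m = len(memory)
    sx = sum(n for n, _ in memory)
    sy = sum(t for _, t in memory)
    sxx = sum(n * n for n, _ in memory)
    sxy = sum(n * t for n, t in memory)
    denom = m * sxx - sx * sx
    slope = (m * sxy - sx * sy) / denom if denom else 0.0
    return sy / m - slope * sx / m, slope  # (intercept, slope)

def predict_endtime(memory, total):
    """Extrapolate the fitted model out to the final iteration."""
    intercept, slope = fit_linear(memory)
    return intercept + slope * total
```

The predicted end time then yields the expected average rate (total / endtime), which is what gets fed to format_meter.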

Note on the implementation: all this is done in format_meter() for two reasons:

  1. Easier and cleaner to implement; otherwise we would need to copy the whole code of __iter__() and update().
  2. Lighter on CPU, because we add samples and fit only when printing, which is regulated by miniters and such. The CPU impact of the regression is thus greatly lowered, at the expense of missing a few points that could help converge faster.
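Point 2 amounts to a simple gate: the expensive refit only runs when a display refresh would fire anyway. A minimal sketch (illustrative names, not the real format_meter hook):

```python
class ThrottledFitter:
    """Run the regression only every `miniters` iterations."""

    def __init__(self, miniters=10):
        self.miniters = miniters
        self.last_fit = 0
        self.fits = 0

    def maybe_fit(self, n):
        """Return True (and refit) only when a refresh is due at iteration n."""
        if n - self.last_fit >= self.miniters:
            self.last_fit = n
            self.fits += 1  # the expensive regression would run here
            return True
        return False
```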

Please try it out and tell me if it fits your needs and works well (don't hesitate to play with the parameters, they are explained in the docstring of tqdm_regress). I will finish the TODO after receiving feedback.

TODO:

  • Flake8
  • Unit tests (use virtual timer and canonical cases: addition, multiplication, subtraction, etc.)
  • Readme

tqdm module to predict time using machine learning polynomial regression

Signed-off-by: Stephen L. <lrq3000@gmail.com>
@codecov-io

codecov-io commented Aug 26, 2016

Codecov Report

❗ No coverage uploaded for pull request base (master@1104d07).

@@            Coverage Diff            @@
##             master     #248   +/-   ##
=========================================
  Coverage          ?   64.29%           
=========================================
  Files             ?        8           
  Lines             ?      759           
  Branches          ?      148           
=========================================
  Hits              ?      488           
  Misses            ?      270           
  Partials          ?        1
Impacted Files          Coverage Δ
tqdm/__init__.py        100% <100%> (ø)
tqdm/_tqdm_regress.py   19.28% <19.28%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@coveralls

coveralls commented Aug 26, 2016

Coverage Status

Coverage decreased (-26.09%) to 64.683% when pulling a710e5d on regress into 4b3d916 on master.

Signed-off-by: Stephen L. <lrq3000@gmail.com>
@coveralls

Coverage Status

Coverage decreased (-26.09%) to 64.683% when pulling 95d001b on regress into 4b3d916 on master.

Signed-off-by: Stephen L. <lrq3000@gmail.com>
@coveralls

Coverage Status

Coverage decreased (-26.3%) to 64.427% when pulling 808787b on regress into 4b3d916 on master.

@lrq3000
Member Author

lrq3000 commented Aug 26, 2016

PS: if you are wondering whether we could implement the analytical regression without numpy, the answer is: not by ourselves, it's far too complicated because we would have to reimplement a lot of linear algebra on matrices. There is however a pure-Python library called PyLA that did all that and even features a least-squares linear regression function (a solve function, but it's basically the same), so we could use that. We would still rely on a third-party library (although pure Python), and it would be a lot slower than numpy.
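To give a sense of the linear algebra involved, here is a minimal pure-Python analytical polynomial least squares: build the normal equations X^T X c = X^T y and solve them by Gaussian elimination. This is only an illustration of the work a numpy-free implementation must carry, not PyLA's API:

```python
def polyfit_pure(xs, ys, order):
    """Analytical least-squares polynomial fit via the normal equations."""
    m = order + 1
    # A = X^T X and b = X^T y for the monomial design matrix
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution on the upper-triangular system
    coefs = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(A[r][c] * coefs[c] for c in range(r + 1, m))
        coefs[r] = (b[r] - s) / A[r][r]
    return coefs  # coefs[k] multiplies x**k
```

Even this toy version needs pivoting and back substitution; a robust general solver (conditioning, degenerate systems) is considerably more code, hence the appeal of numpy or PyLA.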

@lrq3000 lrq3000 changed the title Add tqdm_regress and tsrange Polynomial regression to predict time (tqdm_regress, tsrange) Aug 27, 2016
@lrq3000
Member Author

lrq3000 commented Aug 27, 2016

Ah maybe we could implement an additional algorithm (Moving average regression model) as pointed here: #48 (comment)

@CrazyPython
Contributor

@lrq3000 By default the "regression mode" is linear. This one implements polynomial - perhaps exponential should also be implemented?

@lrq3000 lrq3000 added the need-feedback 📢 We need your response (question) label Aug 28, 2016
@lrq3000
Member Author

lrq3000 commented Aug 28, 2016

@CrazyPython Exponential is implemented: just use order=1.5 and it will compute linear, exp, sqrt and log features (yes, because log can always be useful in machine learning ;) ). You can also set linear mode with order=1, but from my tests it's not very useful for predicting time (sometimes a little better than our current implementation, but still way off in exponentially increasing cases).
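The idea of order=1.5 can be sketched as a feature expansion: instead of pure powers of n, each iteration count is mapped to linear, sqrt and log terms, which a linear fit then combines. The exact basis used by the PR (and how exp enters it) is an assumption here:

```python
import math

def expand(n, order=1.5):
    """Map an iteration count to its feature vector."""
    if order == 1.5:
        # hypothetical mixed basis: constant, linear, sqrt, log
        return [1.0, n, math.sqrt(n), math.log(n + 1.0)]
    return [n ** k for k in range(int(order) + 1)]  # plain polynomial

def model_predict(coefs, n, order=1.5):
    """Dot product of fitted coefficients with the feature vector."""
    return sum(c * f for c, f in zip(coefs, expand(n, order)))
```

Any least-squares or gradient-descent fitter then works unchanged on these features, since the model stays linear in its coefficients.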

@CrazyPython
Contributor

@lrq3000 oh linear = regular tqdm. put it in README?

@lrq3000
Member Author

lrq3000 commented Aug 29, 2016

@CrazyPython Not exactly: normal tqdm uses an exponential moving average, so it cannot predict at all when iterations take longer and longer, because it assumes a constant rate. Linear regression is tqdm_regress(order=1); it can predict a bit more in this scenario, but generally the result is not much better than core tqdm. The prediction becomes more interesting with order=2, and I think best with order=3. I tried exp, sqrt and log but they usually do not help much and tend to overfit.
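A toy comparison makes the difference concrete: on a workload whose iterations keep slowing down, an EMA of the per-iteration duration systematically underestimates the remaining time, because it assumes the current rate holds. This is an illustrative sketch, not tqdm's actual EMA formula:

```python
def ema_rate(deltas, alpha=0.3):
    """Exponential moving average of per-iteration durations."""
    est = deltas[0]
    for d in deltas[1:]:
        est = alpha * d + (1 - alpha) * est
    return est

# iterations slow down linearly: step i takes 0.1 + 0.01*i seconds
deltas = [0.1 + 0.01 * i for i in range(50)]       # first 50 of 100 steps
remaining_true = sum(0.1 + 0.01 * i for i in range(50, 100))

# EMA extrapolation assumes the latest rate stays constant for 50 more steps
remaining_ema = ema_rate(deltas) * 50
```

Here remaining_ema falls well short of remaining_true; a regression of order >= 2 on (n, elapsed) can capture the accelerating trend instead of assuming it away.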

@CrazyPython
Contributor

Tag help-wanted for unit tests.

@lrq3000
Member Author

lrq3000 commented Sep 2, 2016

@CrazyPython No need, I can make them. I can recycle my testing code into unit tests, plus the unit tests already written for machine-learning libraries, so it won't be too hard. But it's time-consuming, so I'll only do it if this module has an audience.

@lrq3000 lrq3000 added the submodule ⊂ Periphery/subclasses label Sep 2, 2016
@lrq3000 lrq3000 mentioned this pull request Sep 6, 2016
@CrazyPython
Contributor

@lrq3000 aye, is linting really so hard that you have to put it on the TODO? My editor (PyCharm) has it built in with a simple keyboard shortcut.

@lrq3000
Member Author

lrq3000 commented Oct 4, 2016

I don't use an IDE but a simple text editor, Notepad++, and my default font is not monospace (I didn't have time to fix that). My main problem is usually with spaces.

@CrazyPython
Contributor

@lrq3000 Oh god you have no idea the joys of PyCharm. I used to use IDLE (notepad + indentation + shortcuts for running and embedded interactive shell), and looking back, it was pretty horrible.

PyCharm is a really "smart" IDE. If you use a __future__ statement, it'll ask whether you want to enable code-compatibility inspection. There are two-click operations for converting to list comprehensions and switching between argument types, and they are easy to access. PEP 8 violations are highlighted inline, and so are type violations. Ctrl+Shift+L lints your code, applying the necessary changes.

PyCharm Community Edition is free and open source. Inspections can be disabled/enabled at will. Easy to start using.

@casperdcl casperdcl force-pushed the master branch 4 times, most recently from 8cade97 to a65e347 Compare October 31, 2016 02:34
@lrq3000 lrq3000 added this to the >5 milestone Nov 14, 2016
@casperdcl casperdcl force-pushed the master branch 6 times, most recently from 6ec00f1 to 4b6476a Compare July 22, 2017 14:15
@casperdcl casperdcl added the p4-enhancement-future 🧨 On the back burner label Jan 24, 2019
@casperdcl
Member

fixes #206, should go into a contrib submodule as per #198 (comment)
