Conversation

@dagss dagss commented Apr 30, 2011

This is a pretty big refactor (based on 0.5.X), introducing job stores. I had to cut a few corners in the end and hack a few small things (commented in source), since I'm not sure when I'll be able to get back to this.

AFAIK it shouldn't change any behaviour; it is a pure refactor.

It may look a bit overdesigned with "attempt_compute_lock", which may return WAIT etc.; this is because it started out as my branch for more failsafe behaviour and pessimistic locking. That part is ripped out of this commit (because it has a bug, and to keep things isolated). I promise the design looks saner once pessimistic locking and more race-handling logic are embedded.

I will need the pessimistic locking (with WAIT etc.) in my own subclass of DirectoryStore, so it'd be good to have this API in even if it isn't used directly.
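For context, here is a minimal sketch of what an attempt_compute_lock-style API with a WAIT outcome could look like. The class and constant names are illustrative assumptions based on this discussion, not joblib's actual code; it targets modern Python rather than the 2.6 of this era.

```python
# Hypothetical sketch: a directory-backed store whose compute lock can
# report WAIT when another process already holds it (pessimistic locking).
import os
import tempfile

WAIT, COMPUTE, DONE = "wait", "compute", "done"

class DirectoryStoreSketch:
    """Minimal directory-backed job store with an advisory lock file."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def attempt_compute_lock(self, key):
        """Return DONE if output exists, COMPUTE if we won the lock,
        or WAIT if another process holds it."""
        out = os.path.join(self.path, key + ".out")
        lock = os.path.join(self.path, key + ".lock")
        if os.path.exists(out):
            return DONE
        try:
            # O_EXCL makes lock-file creation atomic across processes.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return COMPUTE
        except FileExistsError:
            return WAIT

store = DirectoryStoreSketch(tempfile.mkdtemp())
assert store.attempt_compute_lock("job1") == COMPUTE  # we hold the lock
assert store.attempt_compute_lock("job1") == WAIT     # someone else does
```

A subclass wanting to block instead of race could simply poll (or sleep) while attempt_compute_lock keeps returning WAIT.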

I really went through a lot of rounds with this refactoring, and in the end, I'm not 100% happy, but I submit it because I think it is a good start. The really important thing for me was to rip a lot of the logic out of MemoryFunc, so that it can also be used with caching futures-style executors (which don't use memory.cache). It could be argued that some of the logic should be further split up and refactored out of JobStore; I'm just saying that this Works For Me, and seems convenient enough to extend for custom purposes.

Tests pass for me, I haven't tested it with any production/research code.

@GaelVaroquaux

The really important thing for me was to rip a lot of the logic out of MemoryFunc, so that it can also be used with caching futures-style executors

Yes, I prefer this design too.

I can't do a real review right now, but a few remarks just by glancing at the code:

I'd like 'jobstore.py' to be named 'job_store.py'. I am an underscore freak: I want underscores to separate words, if they are not CamelCase.

I'd like a few unit tests of the job_store functionality in a tests/test_job_store.py file. It would help tease out failures in job_store from the bigger code base. Also, it would probably help increase the test coverage of job_store, which is at 88% currently.

I get some test failures, mostly in the documentation, but they worry me, as they mean that some functionality is no longer working the way it should. If you run 'nosetests' in the base directory, nose will automatically pick up the settings from setup.cfg and thus run all the tests and doctests.

======================================================================
ERROR: Failure: IndexError (list index out of range)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/nose/loader.py", line 224, in generate
    for test in g():
  File "/home/varoquau/dev/joblib/joblib/test/test_memory.py", line 246, in test_memory_warning_collision_detection
    "cannot detect" in str(w[-1].message).lower()
IndexError: list index out of range

======================================================================
FAIL: Doctest: memory.rst
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.6/doctest.py", line 2163, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for memory.rst
  File "/home/varoquau/dev/joblib/doc/memory.rst", line 0

----------------------------------------------------------------------
File "/home/varoquau/dev/joblib/doc/memory.rst", line 218, in memory.rst
Failed example:
    func2(1)
Expected:
    memory.rst:0: JobLibCollisionWarning: Cannot detect name collisions for function 'func'
    Running func(1)
Got:
    memory.py:0: JobLibCollisionWarning: Cannot detect name collisions for function 'func'
    Clearing cache /tmp/tmp6YyPJp/joblib/func-alias
    Running func(1)
----------------------------------------------------------------------
File "/home/varoquau/dev/joblib/doc/memory.rst", line 221, in memory.rst
Failed example:
    func(1)
Expected:
    Running a different func(1)
Got:
    Clearing cache /tmp/tmp6YyPJp/joblib/func-alias
    Running a different func(1)
----------------------------------------------------------------------
File "/home/varoquau/dev/joblib/doc/memory.rst", line 223, in memory.rst
Failed example:
    func2(1)
Expected:
    Running func(1)
Got:
    Clearing cache /tmp/tmp6YyPJp/joblib/func-alias
    Running func(1)
----------------------------------------------------------------------
File "/home/varoquau/dev/joblib/doc/memory.rst", line 237, in memory.rst
Failed example:
    g()
Expected:
    memory.rst:0: JobLibCollisionWarning: Cannot detect name collisions for function ''
    2
Got:
    memory.py:0: JobLibCollisionWarning: Cannot detect name collisions for function ''
    Clearing cache /tmp/tmp6YyPJp/joblib/
    2
----------------------------------------------------------------------
File "/home/varoquau/dev/joblib/doc/memory.rst", line 241, in memory.rst
Failed example:
    f()
Expected:
    1
Got:
    Clearing cache /tmp/tmp6YyPJp/joblib/
    1

>>  raise self.failureException(self.format_failure(new.getvalue()))
    

======================================================================
FAIL: Check that collisions impossible to detect will raise appropriate
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/nose/case.py", line 183, in runTest
    self.test(*self.arg)
AssertionError: 0 != 1
>>  raise self.failureException, \
          (None or '%r != %r' % (0, 1))

@dagss

dagss commented May 2, 2011

  1. Question: The part in job_store.py that deals with not being able to unpickle output.pkl -- which of the following is the motivation? a) Guard against corruption due to race conditions during pickling, b) not "get in the way" in the face of unpickling errors, c) not "get in the way" in the face of disk errors.

Depending on your answer I'd like to change the logic a bit (would that be a new pull request?). Here's the deal:

  • I think c) is dangerous and that one certainly shouldn't overlook output.pkl getting corrupted due to a failing disk or similar (that is, one shouldn't transparently overwrite it, but leave the corrupted file in place and repeatedly issue the warning, to make the user aware of the fact)
  • When it comes to b), I'll accept not getting in the way, but if unpickling is broken, there's no need to re-store the result.

So what I'd like to do is:

  • After attempt_compute_lock, you're working with a temporary directory
  • persist_... actually does the pickling, and raises errors
  • commit() does an atomic rename
  • get_output -> load_output, and does the actual work

This has a better feel IMO. But if we want to transparently re-store result on unpickling errors, attempt_compute_lock must keep doing the unpickling.
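The proposed flow could be sketched roughly as follows. All names here (JobSketch, persist_output, commit, load_output) are illustrative assumptions drawn from this discussion, not the actual implementation:

```python
# Sketch of the proposed transactional flow: work in a temporary
# directory, persist raises pickling errors directly, and commit()
# publishes the result with a single atomic rename.
import os
import pickle
import tempfile

class JobSketch:
    def __init__(self, store_dir, key):
        self.final_dir = os.path.join(store_dir, key)
        # Temporary sibling directory so the rename stays on one filesystem.
        self.work_dir = tempfile.mkdtemp(prefix=key + ".tmp-", dir=store_dir)

    def persist_output(self, value):
        # Pickling happens here, so errors surface at the expected spot.
        with open(os.path.join(self.work_dir, "output.pkl"), "wb") as f:
            pickle.dump(value, f)

    def commit(self):
        # A directory rename is atomic on POSIX filesystems: readers see
        # either no result or a complete one, never a partial write.
        os.rename(self.work_dir, self.final_dir)

    def load_output(self):
        with open(os.path.join(self.final_dir, "output.pkl"), "rb") as f:
            return pickle.load(f)
```

The key property is that a crash before commit() leaves only a clearly named temporary directory behind, never a half-written final one.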

  2. When it comes to the missing warning in test_memory_warning_collision_detection, I agree it should be fixed, but I'm a bit confused: When run from the joblib/test directory, it doesn't trigger (which is why I didn't see it before pushing). Only from the root-level.

Yes, I tried with a new cachedir in that single test, and also with -vv, but the difference is still there. And the cache dir looks the same in both cases AFAICT.

Can you think of anything?

@GaelVaroquaux

a), b) and c): the main motivation is that joblib should never fail. I
don't want to come back from holiday and see that a computation has
failed because of a pickle error (for instance due to the fact that I was
using a different version of a library at pickle time).

  • I think c) is dangerous and that one certainly shouldn't overlook output.pkl getting corrupted due to a failing disk or similar (that is, shouldn't transparently overwrite it, but leave the corrupted file in place and repeatedly issue the warning, to make user aware of fact)

No. Not in the default behavior. The number 1 problem with joblib in its
current usage is the fact that it blows people's disks up. I don't see
which problem you are solving by leaving the corrupted file in place, but
I do see that you are making this problem worse. If you really want to
tackle better error management, then we need an error level (as in a
logger level), with behavior depending on that level.

In my view, the job store is not an ACID database, and is simply a
key-value storage that offers no guarantees on the retrieval of
previously stored info. This is what I evolved to, to favor performance
of lookup, and robustness. I know that you would like it to be different,
and I think that it is an option, but keep in mind that some people use
joblib in situations where a job takes one minute to complete, and they do
10,000 job retrievals a minute afterwards.

  • When it comes to b), I'll accept not getting in the way, but if unpickling is broken, there's no need to re-store the result.

So what I'd like to do is:

  • After attempt_compute_lock, you're working with a temporary directory
  • persist_... actually does the pickling, and raises errors
  • commit() does an atomic rename
  • get_output -> load_output, and does the actual work

This does seem better. I am a bit worried about the disk usage. How do
you handle disk full errors? How do you deal with capped cache size? We
are not there yet, but I do have a branch that tries to implement a size
limit for the cache, and a cache replacement policy. It failed because it
was slowing down the lookup too much, and is being rewritten. I am not
sure that I want to make controlling the disk usage even harder when we
haven't been able to control it in the current situation.

  1. When it comes to the missing warning in test_memory_warning_collision_detection, I agree it should be fixed, but I'm a bit confused: When run from the joblib/test directory, it doesn't trigger (which is why I didn't see it before pushing). Only from the root-level.

Can you think of anything?

nose does magic with import paths. joblib uses import paths to detect
collisions. That's all I can think of, sorry.

@dagss

dagss commented May 2, 2011

OK, in light of your comments I pushed a change that is more explicit about leaving transactional behaviour optional.

  • It is now "job.load_or_lock", so that the API explicitly encompasses the case "load attempt failed so you must recompute"
  • persist_input/persist_output does the operation, so any exceptions from these are raised at the expected spot
  • commit/rollback then become essentially no-ops

All callers should still use commit/rollback because subclasses/other configurations can support being transactional.

@dagss

dagss commented May 3, 2011

[For some reason my email reply yesterday did not show up in the web interface, reposting through web interface.].

On 05/02/2011 11:55 AM, GaelVaroquaux wrote:

> a), b) and c): the main motivation is that joblib should never fail. I
> don't want to come back from holiday and see that a computation has
> failed because of a pickle error (for instance due to the fact that I was
> using a different version of a library at pickle time).

Ahh. Good example.

> • I think c) is dangerous and that one certainly shouldn't overlook output.pkl getting corrupted due to a failing disk or similar (that is, one shouldn't transparently overwrite it, but leave the corrupted file in place and repeatedly issue the warning, to make the user aware of the fact)
>
> No. Not in the default behavior. The number 1 problem with joblib in its
> current usage is the fact that it blows people's disks up. I don't see
> which problem you are solving by leaving the corrupted file in place, but
> I do see that you are making this problem worse. If you really want to
> tackle better error management, then we need an error level (as in a
> logger level), with behavior depending on that level.
>
> In my view, the job store is not an ACID database, and is simply a
> key-value storage that offers no guarantees on the retrieval of
> previously stored info. This is what I evolved to, to favor performance
> of lookup, and robustness. I know that you would like it to be different,
> and I think that it is an option, but keep in mind that some people use
> joblib in situations where a job takes one minute to complete, and they do
> 10,000 job retrievals a minute afterwards.
>
> • When it comes to b), I'll accept not getting in the way, but if unpickling is broken, there's no need to re-store the result.
>
> So what I'd like to do is:
>
> • After attempt_compute_lock, you're working with a temporary directory
> • persist_... actually does the pickling, and raises errors
> • commit() does an atomic rename
> • get_output -> load_output, and does the actual work
>
> This does seem better. I am a bit worried about the disk usage. How do
> you handle disk full errors? How do you deal with capped cache size?

I'm not sure if error handling would be different than in 0.5.X, which seems to do no exception catching in the pickling stage.

The difference in disk usage between the scheme I want and what we have today is:

  • If two jobs compute the same result simultaneously, you need temporary space to store both results. I actually don't know what happens today w.r.t. npy files during a race, i.e. whether the worst case already uses twice as much space.

Worst case, both jobs fail because of an out-of-space error when one could have succeeded.

(Really long-term, this can be mitigated with pessimistic locking.)

  • If a process segfaults during pickling, it simply leaves garbage that must be cleaned up. Today, that garbage would be overwritten at the next call.

Of course, we can make an effort to clean the garbage at the next call, to get the same behaviour as today.

In return:

  • Processes don't step on one another's toes when computing the same job, and the commit/final directory rename is atomic.
  • Simpler JobDirectory API (the I/O is done directly in the calls, rather than deferred)

Conclusion:

Given your input I'd be a bit uncomfortable myself making this change at this time. But it does depend on the current disk usage behaviour during a race condition. Do you happen to know?

If pessimistic locking proves itself down the line, it should be possible to mitigate all ill effects caused by this change.

Dag Sverre Seljebotn added 2 commits May 5, 2011 11:50
Also fix some minor issues with how the store argument was interpreted.
fcharras added a commit to fcharras/joblib that referenced this pull request Apr 29, 2022