Conversation

@dagss dagss commented Apr 30, 2011

This is a pretty big refactor (based on 0.5.X), introducing job stores. I had to cut a few corners in the end and hack a few small things (commented in source), since I'm not sure when I'll be able to get back to this.

AFAIK it shouldn't change any behaviour; it is a pure refactor.

It may look a bit overdesigned with "attempt_compute_lock", which may return WAIT etc.; this is because it started out as my branch for more failsafe behaviour and pessimistic locking. That part is ripped out of this commit (because it has a bug, and to keep things isolated). I promise the design looks saner once pessimistic locking and more race-handling logic are embedded.

I will need the pessimistic locking (with WAIT etc.) in my own subclass of DirectoryStore, so it'd be good to have this API in even if it isn't used directly.
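For context, here is a minimal sketch of what an attempt_compute_lock-style API with a WAIT outcome could look like. The class and constant names are illustrative assumptions based on this discussion, not joblib's actual code; it targets modern Python rather than the 2.6 of this era.

```python
# Hypothetical sketch: a directory-backed store whose compute lock can
# report WAIT when another process already holds it (pessimistic locking).
import os
import tempfile

WAIT, COMPUTE, DONE = "wait", "compute", "done"

class DirectoryStoreSketch:
    """Minimal directory-backed job store with an advisory lock file."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def attempt_compute_lock(self, key):
        """Return DONE if output exists, COMPUTE if we won the lock,
        or WAIT if another process holds it."""
        out = os.path.join(self.path, key + ".out")
        lock = os.path.join(self.path, key + ".lock")
        if os.path.exists(out):
            return DONE
        try:
            # O_EXCL makes lock-file creation atomic across processes.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return COMPUTE
        except FileExistsError:
            return WAIT

store = DirectoryStoreSketch(tempfile.mkdtemp())
assert store.attempt_compute_lock("job1") == COMPUTE  # we hold the lock
assert store.attempt_compute_lock("job1") == WAIT     # someone else does
```

A subclass wanting to block instead of race could simply poll (or sleep) while attempt_compute_lock keeps returning WAIT.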

I really went through a lot of rounds with this refactoring, and in the end, I'm not 100% happy, but I submit it because I think it is a good start. The really important thing for me was to rip a lot of the logic out of MemoryFunc, so that it can also be used with caching futures-style executors (which don't use memory.cache). It could be argued that some of the logic should be further split up and refactored out of JobStore; I'm just saying that this Works For Me, and seems convenient enough to extend for custom purposes.

Tests pass for me, I haven't tested it with any production/research code.

@GaelVaroquaux

The really important thing for me was to rip a lot of the logic out of MemoryFunc, so that it can also be used with caching futures-style executors

Yes, I prefer this design too.

I can't do a real review right now, but a few remarks just by glancing at the code:

I'd like 'jobstore.py' to be named 'job_store.py'. I am an underscore freak: I want underscores to separate words, if they are not CamelCase.

I'd like a few unit tests of the job_store functionality in a tests/test_job_store.py file. It would help tease out failures in job_store from the bigger code base. Also, it would probably help increase the test coverage of job_store, which is at 88% currently.

I get some test failures, mostly in the documentation, but they worry me, as they mean that some functionality is no longer working the way it should. If you run 'nosetests' in the base directory, nose will automatically pick up the settings from setup.cfg and thus run all the tests and doctests.

======================================================================
ERROR: Failure: IndexError (list index out of range)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/nose/loader.py", line 224, in generate
    for test in g():
  File "/home/varoquau/dev/joblib/joblib/test/test_memory.py", line 246, in test_memory_warning_collision_detection
    "cannot detect" in str(w[-1].message).lower()
IndexError: list index out of range

======================================================================
FAIL: Doctest: memory.rst
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.6/doctest.py", line 2163, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for memory.rst
  File "/home/varoquau/dev/joblib/doc/memory.rst", line 0

----------------------------------------------------------------------
File "/home/varoquau/dev/joblib/doc/memory.rst", line 218, in memory.rst
Failed example:
    func2(1)
Expected:
    memory.rst:0: JobLibCollisionWarning: Cannot detect name collisions for function 'func'
    Running func(1)
Got:
    memory.py:0: JobLibCollisionWarning: Cannot detect name collisions for function 'func'
    Clearing cache /tmp/tmp6YyPJp/joblib/func-alias
    Running func(1)
----------------------------------------------------------------------
File "/home/varoquau/dev/joblib/doc/memory.rst", line 221, in memory.rst
Failed example:
    func(1)
Expected:
    Running a different func(1)
Got:
    Clearing cache /tmp/tmp6YyPJp/joblib/func-alias
    Running a different func(1)
----------------------------------------------------------------------
File "/home/varoquau/dev/joblib/doc/memory.rst", line 223, in memory.rst
Failed example:
    func2(1)
Expected:
    Running func(1)
Got:
    Clearing cache /tmp/tmp6YyPJp/joblib/func-alias
    Running func(1)
----------------------------------------------------------------------
File "/home/varoquau/dev/joblib/doc/memory.rst", line 237, in memory.rst
Failed example:
    g()
Expected:
    memory.rst:0: JobLibCollisionWarning: Cannot detect name collisions for function ''
    2
Got:
    memory.py:0: JobLibCollisionWarning: Cannot detect name collisions for function ''
    Clearing cache /tmp/tmp6YyPJp/joblib/
    2
----------------------------------------------------------------------
File "/home/varoquau/dev/joblib/doc/memory.rst", line 241, in memory.rst
Failed example:
    f()
Expected:
    1
Got:
    Clearing cache /tmp/tmp6YyPJp/joblib/
    1

>>  raise self.failureException(self.format_failure(new.getvalue()))
    

======================================================================
FAIL: Check that collisions impossible to detect will raise appropriate
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/nose/case.py", line 183, in runTest
    self.test(*self.arg)
AssertionError: 0 != 1
>>  raise self.failureException, \
          (None or '%r != %r' % (0, 1))

@dagss

dagss commented May 2, 2011

  1. Question: The part in job_store.py that deals with not being able to unpickle output.pkl -- which of the following is the motivation? a) Guard against corruption due to race conditions during pickling, b) not "get in the way" in the face of unpickling errors, c) not "get in the way" in the face of disk errors.

Depending on your answer I'd like to change the logic a bit (would that be a new pull request?). Here's the deal:

  • I think c) is dangerous and that one certainly shouldn't overlook output.pkl getting corrupted due to a failing disk or similar (that is, one shouldn't transparently overwrite it, but leave the corrupted file in place and repeatedly issue the warning, to make the user aware of the fact)
  • When it comes to b), I'll accept not getting in the way, but if unpickling is broken, there's no need to re-store the result.

So what I'd like to do is:

  • After attempt_compute_lock, you're working with a temporary directory
  • persist_... actually does the pickling, and raises errors
  • commit() does an atomic rename
  • get_output -> load_output, and does the actual work

This has a better feel IMO. But if we want to transparently re-store result on unpickling errors, attempt_compute_lock must keep doing the unpickling.
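The proposed flow could be sketched roughly as follows. All names here (JobSketch, persist_output, commit, load_output) are illustrative assumptions drawn from this discussion, not the actual implementation:

```python
# Sketch of the proposed transactional flow: work in a temporary
# directory, persist raises pickling errors directly, and commit()
# publishes the result with a single atomic rename.
import os
import pickle
import tempfile

class JobSketch:
    def __init__(self, store_dir, key):
        self.final_dir = os.path.join(store_dir, key)
        # Temporary sibling directory so the rename stays on one filesystem.
        self.work_dir = tempfile.mkdtemp(prefix=key + ".tmp-", dir=store_dir)

    def persist_output(self, value):
        # Pickling happens here, so errors surface at the expected spot.
        with open(os.path.join(self.work_dir, "output.pkl"), "wb") as f:
            pickle.dump(value, f)

    def commit(self):
        # A directory rename is atomic on POSIX filesystems: readers see
        # either no result or a complete one, never a partial write.
        os.rename(self.work_dir, self.final_dir)

    def load_output(self):
        with open(os.path.join(self.final_dir, "output.pkl"), "rb") as f:
            return pickle.load(f)
```

The key property is that a crash before commit() leaves only a clearly named temporary directory behind, never a half-written final one.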

  2. When it comes to the missing warning in test_memory_warning_collision_detection, I agree it should be fixed, but I'm a bit confused: When run from the joblib/test directory, it doesn't trigger (which is why I didn't see it before pushing). Only from the root-level.

Yes, I tried with a new cachedir in that single test, and also with -vv, but the difference is still there. And the cache dir looks the same in both cases AFAICT.

Can you think of anything?

@GaelVaroquaux

a), b) and c): the main motivation is that joblib should never fail. I
don't want to come back from holiday and see that a computation has
failed because of a pickle error (for instance due to the fact that I was
using a different version of a library at pickle time).

  • I think c) is dangerous and that one certainly shouldn't overlook output.pkl getting corrupted due to a failing disk or similar (that is, shouldn't transparently overwrite it, but leave the corrupted file in place and repeatedly issue the warning, to make user aware of fact)

No. Not in the default behavior. The number 1 problem with joblib in its
current usage is the fact that it blows people's disks up. I don't see
which problem you are solving by leaving the corrupted file in place, but
I do see that you are making this problem worse. If you really want to
tackle better error management, then we need an error level (as in a
logger level), with behavior depending on that level.

In my view, the job store is not an ACID database, and is simply a
key-value storage that offers no guarantees on the retrieval of
previously stored info. This is what I evolved to, to favor performance
of lookup, and robustness. I know that you would like it to be different,
and I think that it is an option, but keep in mind that some people use
joblib in situations where a job takes one minute to complete, and they do
10,000 job retrievals a minute afterwards.

  • When it comes to b), I'll accept not getting in the way, but if unpickling is broken, there's no need to re-store the result.

So what I'd like to do is:

  • After attempt_compute_lock, you're working with a temporary directory
  • persist_... actually does the pickling, and raises errors
  • commit() does an atomic rename
  • get_output -> load_output, and does the actual work

This does seem better. I am a bit worried about the disk usage. How do
you handle disk full errors? How do you deal with capped cache size? We
are not there yet, but I do have a branch that tries to implement a size
limit for the cache, and a cache replacement policy. It failed because it
was slowing down the lookup too much, and is being rewritten. I am not
sure that I want to make controlling the disk usage even harder when we
haven't been able to control it in the current situation.

  1. When it comes to the missing warning in test_memory_warning_collision_detection, I agree it should be fixed, but I'm a bit confused: When run from the joblib/test directory, it doesn't trigger (which is why I didn't see it before pushing). Only from the root-level.

Can you think of anything?

nose does magic with import paths. joblib uses import paths to detect
collisions. That's all I can think of, sorry.

@dagss

dagss commented May 2, 2011

OK, in light of your comments I pushed a change that is more explicit about leaving transactional behaviour optional.

  • It is now "job.load_or_lock", so that the API explicitly encompasses the case "load attempt failed so you must recompute"
  • persist_input/persist_output does the operation, so any exceptions from these are raised at the expected spot
  • commit/rollback then become essentially no-ops

All callers should still use commit/rollback because subclasses/other configurations can support being transactional.

@dagss

dagss commented May 3, 2011

[For some reason my email reply yesterday did not show up in the web interface, reposting through web interface.].

On 05/02/2011 11:55 AM, GaelVaroquaux wrote:

> a), b) and c): the main motivation is that joblib should never fail. I
> don't want to come back from holiday and see that a computation has
> failed because of a pickle error (for instance due to the fact that I was
> using a different version of a library at pickle time).

Ahh. Good example.

> • I think c) is dangerous and that one certainly shouldn't overlook output.pkl getting corrupted due to a failing disk or similar (that is, one shouldn't transparently overwrite it, but leave the corrupted file in place and repeatedly issue the warning, to make the user aware of the fact)
>
> No. Not in the default behavior. The number 1 problem with joblib in its
> current usage is the fact that it blows people's disks up. I don't see
> which problem you are solving by leaving the corrupted file in place, but
> I do see that you are making this problem worse. If you really want to
> tackle better error management, then we need an error level (as in a
> logger level), with behavior depending on that level.
>
> In my view, the job store is not an ACID database, and is simply a
> key-value storage that offers no guarantees on the retrieval of
> previously stored info. This is what I evolved to, to favor performance
> of lookup, and robustness. I know that you would like it to be different,
> and I think that it is an option, but keep in mind that some people use
> joblib in situations where a job takes one minute to complete, and they do
> 10,000 job retrievals a minute afterwards.
>
> • When it comes to b), I'll accept not getting in the way, but if unpickling is broken, there's no need to re-store the result.
>
> So what I'd like to do is:
>
> • After attempt_compute_lock, you're working with a temporary directory
> • persist_... actually does the pickling, and raises errors
> • commit() does an atomic rename
> • get_output -> load_output, and does the actual work
>
> This does seem better. I am a bit worried about the disk usage. How do
> you handle disk full errors? How do you deal with capped cache size?

I'm not sure if error handling would be different than in 0.5.X, which seems to do no exception catching in the pickling stage.

The difference in disk usage between the scheme I want and what we have today is:

  • If two jobs compute the same result simultaneously, you need temporary space to store both results. I actually don't know what happens today w.r.t. npy files during a race, i.e. whether the worst case already uses twice as much space.

Worst case, both jobs fail because of an out-of-space error when one could have succeeded.

(Really long-term, this can be mitigated with pessimistic locking.)

  • If a process segfaults during pickling, it simply leaves garbage that must be cleaned up. Today, that garbage would be overwritten at the next call.

Of course, we can make an effort to clean the garbage at the next call, to get the same behaviour as today.

In return:

  • Processes don't step on one another's toes when computing the same job, and the commit/final directory rename is atomic.
  • Simpler JobDirectory API (the I/O is done directly in the calls, rather than deferred)

Conclusion:

Given your input I'd be a bit uncomfortable myself making this change at this time. But it does depend on the current disk usage behaviour during a race condition. Do you happen to know?

If pessimistic locking proves itself down the line, it should be possible to mitigate all ill effects caused by this change.

Dag Sverre Seljebotn added 2 commits May 5, 2011 11:50
Also fix some minor issues with how the store argument was interpreted.
fcharras added a commit to fcharras/joblib that referenced this pull request Apr 29, 2022