Skip to content

mgr: load modules in separate python sub-interpreters #14971

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 30, 2017

Conversation

tserong
Copy link
Contributor

@tserong tserong commented May 5, 2017

This provides a reasonable amount of isolation between mgr
modules. Notably, with this change, it's possible to have more
than one mgr module use cherrypy.

I consider this implementation to be a bit messy; I think it'd
be neater if the sub-interpreter was created inside MgrPyModule,
rather than in PyModules::init(), but see the comment in that
function for more details.

Also, due to the lack of documentation on python sub-interpreters,
I'm not completely confident that I'm doing everything correctly.
Notably, there's still calls to PyGILState_Ensure() and
PyEval_ReleaseThread() in PyState.cc, and apparently those two
are either not-compatible with sub-interpreters, or at least need
to be handled delicately.

TL;DR: Everything seems to work for me so far, but should be
considered experimental until further notice.

Signed-off-by: Tim Serong tserong@suse.com

@tserong
Copy link
Contributor Author

tserong commented May 5, 2017

Ping @jcsp, @tchaikov, @dmick. I tested this by applying it on top of #14946, and both 'rest' and 'dashboard' modules were able to load and run without stomping on each other.

@liewegas liewegas added the mgr label May 5, 2017
@tchaikov tchaikov self-requested a review May 5, 2017 15:25
@tserong
Copy link
Contributor Author

tserong commented May 5, 2017

One notable known problem: if a module fails to load because it can't be found (e.g. sys.path is wrong, which shouldn't happen in real life, or a module name is mis-typed in ceph.conf, which might), the python bit of mgr locks up somehow. The loop in PyModules::init() just kinda stops somehow.

@tserong tserong force-pushed the wip-mgr-py-sub-interpreter branch from c4cd997 to 8b5ceca Compare May 5, 2017 23:10
@tserong
Copy link
Contributor Author

tserong commented May 5, 2017

(That last was just a rebase)

@tserong tserong force-pushed the wip-mgr-py-sub-interpreter branch from 8b5ceca to f6a135a Compare May 10, 2017 08:49
@tserong tserong changed the title [DNM] mgr: load modules in separate python sub-interpreters mgr: load modules in separate python sub-interpreters May 10, 2017
@tserong
Copy link
Contributor Author

tserong commented May 10, 2017

OK, I'm confident in this one now.

Each MgrPyModule gets its own python sub-interpreter. The logger, the ceph_state module and sys.path need to be set up separately for each sub-interpreter, so all that happens in MgrPyModule's constructor (previously this was done on the main interpreter in PyModules::init()).

Each sub-interpreter has its own python thread state. The main interpreter also has a python thread state, but that is almost unused except for during setup and teardown of the whole beast.

Some care needs to be taken to ensure that the right thread state is active at the right time; note how the call to handle_pyerror() in PyModules::init() had to be moved inside MgrPyModle::load(). It's not too bad though, just remember that PyModules should be using the main thread state, and MgrPyModule should always be using its sub-interpreter thread state, except for very early in its constructor and very late in its destructor.

The ceph_state module (PyState.cc) naturally has no idea what context it's being run in, so uses PyThreadState_Get() when it needs to know the python thread state.

I've also added a little Gil class to make it easy to acquire and release the GIL for whichever thread state you're in.

@tserong tserong changed the title mgr: load modules in separate python sub-interpreters [DNM] mgr: load modules in separate python sub-interpreters May 11, 2017
@tserong
Copy link
Contributor Author

tserong commented May 11, 2017

Dammit, I just tried using mgr built with this changeset to load the restful module from #14457. It loads fine, but segfaults on shutdown...

@jcsp
Copy link
Contributor

jcsp commented May 11, 2017

This is a tough one to review because the subinterpreter functionality in python itself is so obscure. It would be nice to spin the Gil class out into a separate commit.

Since PyFinalize allegedly (https://docs.python.org/2.7/c-api/init.html#c.Py_EndInterpreter) will clean up all un-terminated sub interpreters, I wonder if the stuff in ~MgrPyModule could just go away -- it doesn't seem that useful to maybe terminate sub interpreters (obviously not your fault, quirk of the framework). When taking http://tracker.ceph.com/issues/19549 into account, we never truly need to do completely clean teardown, it just needs to be torn down far enough for the process to end cleanly.

@tserong
Copy link
Contributor Author

tserong commented May 11, 2017

Ok, I'll split this up into separate commits. And yeah, I wondered the same about getting rid of the attempted termination in ~MgrPyModule, given the quirks. Now that there's two of us, I'll do that as well (but leave some sort of comment).

@tserong
Copy link
Contributor Author

tserong commented May 12, 2017

Splitting out the Gil change was a really good idea. Turns out that change by itself exhibited the segfault on shutdown, even without the sub-interpreters, and now I've learned even more than I ever wanted to know about embedding Python :-)

Here's what I was missing: not only do we need separate sub-interpreters, we need to manually create an additional python thread state for each OS thread that runs the module's serve function. The PyGILState_*() APIs did that bit automatically, but because I removed those (as they're not supported with sub-interpreters), that extra thread state creation wasn't happening, which resulted in that weird segfault.

@tserong tserong force-pushed the wip-mgr-py-sub-interpreter branch 2 times, most recently from f3b82bc to 1f9b93d Compare May 12, 2017 14:43
@tserong
Copy link
Contributor Author

tserong commented May 12, 2017

OK, that's better. No weird segfaults. I think I'll avoid using the word "confident" this time and instead go with cautious optimism (and flashbacks to when I drew this thing when I first started working on HA).

My one outstanding question now, I think, is: are there any other OS threads that need another manually created python thread state?

@tserong tserong changed the title [DNM] mgr: load modules in separate python sub-interpreters mgr: load modules in separate python sub-interpreters May 12, 2017
@tserong tserong force-pushed the wip-mgr-py-sub-interpreter branch from 1f9b93d to f21ba89 Compare May 15, 2017 10:57
@tserong
Copy link
Contributor Author

tserong commented May 15, 2017

I've moved the additional thread state creation inside the Gil class, but the caller has to pass new_thread == true when we know we're in a new OS thread . IMO it'd be better to do this automatically, which we can do, if we don't mind using something that is theoretically a private member of PyThreadState (see the comments in Gil.h for details). What do you think @jcsp?

src/mgr/Gil.h Outdated
: pThreadState(ts), pNewThreadState(nullptr)
{
// Acquire the GIL, set the current thread state
PyEval_RestoreThread(pThreadState);
Copy link
Contributor

@chardan chardan May 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if pThreadState (ts) is nullptr? The documentation for PyEvalRestoreThread() says that it must not be called on a NULL pointer. Similarly, as I'm reading it, before we do this we have to be sure that we haven't already been called: what happens if this ctor is invoked twice?
https://docs.python.org/2/c-api/init.html#c.PyEval_RestoreThread

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If pThreadState is null it probably dies horribly. I've added an assert (so it'll still die horribly, but on our terms).

As for ensuring we haven't already been called, that's (sadly) up to the caller to be aware of. I've updated the comment to hopefully clarify this a bit more. If you invoke this ctor twice in the same thread it'll deadlock, e.g. do not do this:

void func1() {
  Gil gil(pMyThreadState);
  // Do some stuff that needs the GIL here

  func2();  // this will deadlock
}
void func2() {
  Gil gil(pMyThreadState);
  // Do some stuff that needs the GIL
}

You can avoid this deadlock like so:

void func1() {
  {  // extra scope
    Gil gil(pMyThreadState);
    // Do some stuff that needs the GIL here
  }

  func2();  // this won't deadlock
}

In summary, you have to be just as careful using this class as you need to be when using PyEvalRestoreThread(). AFAICT there's no public API for figuring out whether or not the GIL is already held.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

// something that's meant to be a black box.
//
if (new_thread) {
pNewThreadState = PyThreadState_New(pThreadState->interp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the old state need to be released in some way before re-assigning pNewThreadState?

Copy link
Contributor Author

@tserong tserong May 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. At this point, pNewThreadState is NULL, and PyThreadState_New creates a new PyThreadState attached to the same interpreter that the existing pThreadState is attached to. It's fine for the two PyThreadState objects to exist at the same time; only one is ever active (the PyThreadState_Swap() calls activate one or the other).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we're now sure pNewThreadState is nullptr, this sounds good. Thanks!

src/mgr/Gil.h Outdated
Gil(PyThreadState *ts, bool new_thread = false)
: pThreadState(ts), pNewThreadState(nullptr)
{
assert(pThreadState != nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider-also throwing an exception instead.

src/mgr/Gil.h Outdated

private:
PyThreadState *pThreadState;
PyThreadState *pNewThreadState;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think initializing these in the ctor's initializer-list is fine, but since we're using C++ it's also possible to write:
PyThreadState *pNewThreadState = nullptr;
(And/or use std::unique_ptr<>.)
...I'm fine either way, but figured I should point it out.

m[len-1] = '\0';
}
dout(4) << m << dendl;
}
Copy link
Contributor

@chardan chardan May 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If PyArg_ParseTuple() has returned false, we're swallowing the returned error here. Certainly, fine if we didn't want to take any action, but is this intentional?

auto len = strlen(m);
if (len && m[len-1] == '\n') {
m[len-1] = '\0';
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If len is one, this assumes not-an-error (m[len-1] won't explode because len > 0 on account of the first clause). I don't see in the Python documentation whether or not that's possible, but if it is it might be worth checking for.

{
PyGILState_STATE gstate;
gstate = PyGILState_Ensure();
assert(pMainThreadState != nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider throwing an exception.

pMyThreadState = Py_NewInterpreter();
if (pMyThreadState == nullptr) {
derr << "Failed to create python sub-interpreter for '" << module_name << '"' << dendl;
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If pMyThreadState == nullptr, the program continues, leaving the object constructed. This seems like a critical failure to me? If so, I think it would be better to throw an exception than to leave the object in an invalid state.

Copy link
Contributor Author

@tserong tserong May 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered about that, but couldn't see any other exceptions being thrown anywhere in the mgr codebase, so decided not to go there.

The way these things get constructed (in PyModules::init()), the next thing that's called is MgrPyModule::load() which will return with an error if the object is in that invalid state. So it's handled, if not exactly clean.

PySys_SetObject("stderr", py_logger);
PySys_SetObject("stdout", py_logger);
#else
PySys_SetObject(const_cast<char*>("stderr"), py_logger);
Copy link
Contributor

@chardan chardan May 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the first #ifdef needed? (eg. will the version here with the cast always work anyway?)

#endif
}
// Populate python namespace with callable hooks
Py_InitModule("ceph_state", CephStateMethods);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the return value of this function be checked?


Py_XDECREF(pClassInstance);
PySys_SetPath((char*)(sys_path.c_str()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer C++ style casts. Even though they're annoying to type. :-)

return -EINVAL;
}

Gil gil(pMyThreadState);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to wonder if it's worth calling type Gil something like "Gil_scope_guard"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably a good idea - it'd make it more obvious to the reader WTF it was for.

}

Gil gil(pMyThreadState);

// Load the module
PyObject *pName = PyString_FromString(module_name.c_str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're guaranteed that module_name.c_str() will not be NULL, but PyString_FromString() is going to make a copy, which could fail. We may want to check pName for nullptr.

gstate = PyGILState_Ensure();

// Don't need a Gil here -- this is called from MgrPyModule::load(),
// which already has one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is really too bad that there's no way to detect if two are the same! :-\

derr << "Error loading module '" << module_name << "': "
<< cpp_strerror(r) << dendl;
derr << handle_pyerror() << dendl;
// Don't drop out here, load the other modules
} else {
// Success!
modules[module_name] = std::move(mod);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::move() is fine here, but since it's a unique_ptr<> you can also just assign it. (No harm done.)


~Gil()
{
if (pNewThreadState != nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you've written is fine, but another option to consider might be a std::unique_ptr<> with a deleter, so this explicit checking wouldn't be necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd like to stick with the explicit check, just because what I'm doing here is small, self contained and (hopefully) pretty obvious.

pClassInstance(nullptr),
pMainThreadState(main_ts_)
{
assert(pMainThreadState != nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(As above-- considering throwing.)

auto rtn = PyObject_CallObject(set_fn, args);
if (rtn != nullptr) {
Py_DECREF(rtn);
dout(10) << "MonCommandCompletion::finish()" << dendl;
Copy link
Contributor

@chardan chardan May 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit curious: couldn't we use the function's scope if we were willing to put this dout() line just before the Gil construction? Is there a reason we don't? (I'm not fully initiated into the Great Mystery of dout().)

Py_DECREF(args);

PyGILState_Release(gstate);

global_handle->notify_all("command", tag);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...ah-hah, this may be the reason for the extra scope?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

auto set_fn = PyObject_GetAttrString(python_completion, "complete");
assert(set_fn != nullptr);

auto pyR = PyInt_FromLong(r);
Copy link
Contributor

@chardan chardan May 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these calls can fail. Granted, the usual reason will be memory errors which can be hard to handle, but I figured I'd point it out. (Whether or not you want to check them does depend on your level of pedanticism, but I personally would say that you might as well. YYMV.)

//
// See the comment below for when to set new_thread == true
//
class Gil {
Copy link
Contributor

@chardan chardan May 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing I just thought of. It might also be worth disabling the copy ctor and copy assignment operators, eg.
Gil(const Gil&) = delete;
Gil& operator=(const Gil&) = delete;

@tserong
Copy link
Contributor Author

tserong commented May 18, 2017

Thanks for the detailed review @chardan. I've done a couple of tiny bits of cleanup (3e0fb3e). Your remaining comments that I haven't either made a change for, or responded to, apply to existing code (which I might have moved around for this PR). I'd like to hang back on doing anything about those until after @tchaikov and/or @jcsp are able to do some review of this PR as well.

@chardan
Copy link
Contributor

chardan commented May 18, 2017

You're very welcome! Nice work on this change. I didn't really see any serious problems, so just about all of my suggestions are... suggestions. :-) Nice job, @tserong!

tserong added 2 commits May 22, 2017 16:52
Prep work for loading modules in separate python sub-interpreters.
According to the Python C API docs, mixing the PyGILState_*() API
with sub-interpreters is unsupported, so I've replaced these calls
with a Gil class, which will acquire and release the GIL against
a specific python thread state.

Note the manual python thread state creation in MgrPyModule::serve().
The PyGILState_*() APIs would have done that for us, but we're not
using them anymore, so have to do it by hand (see
https://docs.python.org/2/c-api/init.html#non-python-created-threads
for some description of this).

Signed-off-by: Tim Serong <tserong@suse.com>
This provides a reasonable amount of isolation between mgr
modules.  Notably, with this change, it's possible to have more
than one mgr module use cherrypy without conflicts.

Each MgrPyModule gets its own python sub-interpreter.  The logger,
the ceph_state module and sys.path need to be set up separately
for each sub-interpreter, so all that happens in MgrPyModule's
constructor (previously this was done on the main interpreter
in PyModules::init()).

Each sub-interpreter has its own python thread state.  The main
interpreter also has a python thread state, but that is almost
unused except for during setup and teardown of the whole beast.

On top of that, an additional python thread state is necessary
for each OS thread that calls into python code (e.g. the serve()
method).

Some care needs to be taken to ensure that the right thread state
is active at the right time; note how the call to handle_pyerror()
in PyModules::init() had to be moved inside MgrPyModle::load().
PyModules should be using the main thread state, and MgrPyModule
should usually be using its sub-interpreter thread state, except
for very early in its constructor (before the sub-interpreter has
been created), and in the serve() method where another python
thread state is created to map to the separate OS thread that is
responsible for invoking this method.

The ceph_state module (PyState.cc) naturally has no idea what
context it's being run in, so uses PyThreadState_Get() when it
needs to know the python thread state.

Signed-off-by: Tim Serong <tserong@suse.com>
@tserong tserong force-pushed the wip-mgr-py-sub-interpreter branch from 3e0fb3e to 3e0d149 Compare May 22, 2017 07:30
@tserong
Copy link
Contributor Author

tserong commented May 22, 2017

Rebased & squashed some cleanup commits.

@tserong
Copy link
Contributor Author

tserong commented May 24, 2017

Now that #14946 is merged, this kinda needs to go in too in case someone enables the "rest" and "dashboard" modules at the same time (if both are enabled, neither will work, because of the two cherrypy's stomping on each other).

@tserong
Copy link
Contributor Author

tserong commented May 24, 2017

retest this please

@jcsp jcsp self-requested a review May 24, 2017 10:24
@liewegas liewegas merged commit 3d7a3d7 into ceph:master May 30, 2017
@tserong tserong deleted the wip-mgr-py-sub-interpreter branch June 5, 2017 06:39
@tserong tserong restored the wip-mgr-py-sub-interpreter branch January 13, 2018 09:40
@tserong tserong deleted the wip-mgr-py-sub-interpreter branch January 13, 2018 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants