Restrict server to one client #804
Conversation
Force-pushed from f0b8783 to be955db
Thanks for pushing a fix @derekhiggins. Do you mind also adding a note to the README?
Force-pushed from be955db to 8b3dd02
Force-pushed from 8b3dd02 to 924116a
Added
An error I got when running multiple chat sessions and generate with this PR:
Another error I got when running multiple chat sessions and generate with this PR:
I'm still looking into this, but other than some README nits and some errors I encountered (see comments), I have one bigger concern...
So are we really sure that we should automatically just keep firing up more servers when we've seen memory-pressure / GPU-fit issues? My M3 is doing OK with this so far, but I would not be surprised if low-memory laptops reboot (rumors?).
Also, these temporary servers don't log, so I'm just guessing they use some GPU but eventually don't fit, and then what?
I haven't quite figured out if this is safe and desirable yet. Not sure it isn't, but I'm concerned about what the OOM experience might be.
Also, I was going to look into the backlog feature of uvicorn (maybe someone already did?)
[update: I think backlog is not useful here. Also, it's looking like most of my OOM concerns are from back when the model was much bigger. It's looking reasonable, but I'm still testing w/ 64G of memory].
README.md
Outdated
@@ -194,6 +194,8 @@ The full process is described graphically in the [workflow diagram](./docs/workf
Press CTRL+C to shut down the server.
```

> **NOTE:** The `ilab` server can only server a single client. If two `ilab` clients try to connect to the same ilab server at the same time the second one will fail and instead attempt to start its own temporary server. This will require additional resources on the host machine.
typo: s/can only server/can only serve/
grammarly... probably a comma after "at the same time"
Needs rewording. The first sentence isn't needed, since this case is now handled automatically.
Next, "will fail and ...start...": I think that's okay because it's aligned with the logged messaging, but maybe "will fail" and "instead attempt to" can be omitted. Say what it will do with fewer wishy-washy details (just an opinionated edit attempt).
Also, if you can find the right words, it isn't "the second", it's all subsequent clients (but I don't like that word).
will fix when updating
This looks like the error case I called out in the PR description, i.e.
I haven't seen this happen; did you do anything special to provoke it?
I did check how much RAM was being used when proposing this, and it seemed reasonable (about 2G), but I also have lots of RAM. An alternative could be to just output a message from the chat client to try again later if it gets a 503. I honestly have no idea what happens if 2 servers are run on the same GPU (and I don't have access to one to check).
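The fallback being discussed here (the client gets an HTTP 503 from the busy single-worker server and starts its own temporary one) can be sketched roughly as below. This is a minimal sketch, not the actual ilab code; `probe` and `start_local` are hypothetical stand-ins for the real health check and server startup:

```python
def connect_with_fallback(probe, start_local, api_base="http://localhost:8000"):
    """Decide which server a client should talk to.

    probe(url) returns the HTTP status code of a health check against
    the shared server; 503 means its single uvicorn worker is busy,
    so the client falls back to starting its own temporary server.
    """
    status = probe(api_base)
    if status == 503:
        # Shared server exists but is occupied: spin up a local one.
        return start_local()
    if 200 <= status < 300:
        return api_base
    # Any other status: also fall back rather than fail outright.
    return start_local()
```

A chat-client alternative, as suggested above, would be to print "server busy, try again later" on the 503 branch instead of calling `start_local()`.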
Another error... Not sure what happened. This was an ilab generate that was running along with other things. It looks like this one was NOT using a temporary server, so somehow at 68% it got bumped (maybe I ran another generate or chat that stole its server?). This doesn't seem like a big issue considering I'd say I was abusing concurrency, but it does seem contrary to the objective of not failing an ilab generate at 68% completion. Makes me wonder about the keep-alive setting.
Not specific to this PR, but here's a thought... Instead of doing a sneaky ensure_server() to create a secret "temporary" server with no logging, we should either never do that, or always do that (but with logging). I.e., given that we are doing this stuff, why not just have chat and generate always fire up a temporary server. Alternatively, throw an error and never do it (perhaps explicitly do it by force with a flag). To me it currently looks like we can do this always if we fix the logging (e.g. to a file). The best alternative would be to fix concurrency to actually work with multiple generate/chat clients (but that's tricky w/ current limitations). This place in the middle where we are currently investing is not a good place to be. That said, I'm still trying to convince myself that this PR is better than the current state. Trying to be harsh and doing more testing because of the timing.
OK, I think I see what's going on here: the generate command has the single worker taken while it's talking to the server, but generate also does some work between requests. If another client hits the server during this window, then it's not available to the generate command. The best we can do in this case is to increase the number of retries.
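The "increase retries" idea can be sketched like this; `send` is a hypothetical callable wrapping the actual request, and the retry count and backoff values are made up for illustration, not taken from the PR:

```python
import time

def request_with_retries(send, retries=5, backoff=0.1):
    """Retry a request that may land in the brief window in which
    another client has grabbed the server's only worker (HTTP 503).

    send() returns a (status_code, body) tuple for one request attempt.
    """
    status, body = send()
    for attempt in range(1, retries + 1):
        if status != 503:
            break
        time.sleep(backoff * attempt)  # give the other client time to finish
        status, body = send()
    return status, body
```

With enough retries, generate rides out another client briefly tying up the worker, instead of failing an hour into a run.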
Maybe the way to go would be to do this PR (with or without the increased retries for generation), and also ensure we fix the remaining failure case. Or perhaps just abandon this PR and fix #156, so that even if generation crashes, all is not lost as it can be resumed.
OK, so to summarize: I think this PR is an improvement to the current situation, but it's not perfect, so we have options.
I think my preference is to fix #156, and then discuss the future of local servers in another forum.
Another interesting behavior. Chat decides to connect to the server (not temporary) and then can't:
Yeah, regarding the 1, 2, 3 options: 1 - Probably. I'm happy to see that running a bunch of these temporary servers isn't so bad on an M1. They coexist better than I thought. Kind of a bummer that this doesn't exactly fix everything.
One more data point: the code to resume a partially done generate is still present. Simply copy a generated_merlinite....json file into regen.json and restart 'ilab generate'.
So a crashed generate doesn't mean all is lost. I can push up a PR to add it back into the docs if that's what you want.
When chat starts, if any of the checks to the server succeed, it will not start a local server, and then it fails later when generate is using the single worker (this is the 0.1 second window I mentioned above).
@markstur I've been wondering why you are hitting more tracebacks than me, and now I see it. See "Assessing generated samples took 1.61s"
vs mine.
For you, the server isn't being used by "generate" for long windows, giving the chat client more of a chance to tie it up. I've got lots of CPU cores.
happy to merge when the README is fixed as Mark suggested
Force-pushed from 924116a to c1620fa
Done. I couldn't think of a good word to avoid "subsequent", so reworded a little.
The server frequently crashes when dealing with 2 clients simultaneously, so limit it to a single worker. If a user tries to connect a second ilab client to the server, the second command will start its own local server (as it gets an HTTP 503).

We also need to disable keep-alive to prevent consecutive calls in the same client from holding the only worker open and DOS'ing itself.

There is still one error case: if the chat client is started but not actively chatting when generate is started, chat will fail when the model is called. Not ideal, but not as bad as an hour-long "ilab generate" failing.

Fixes #346

Signed-off-by: Derek Higgins <derekh@redhat.com>
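The single-worker plus disabled-keep-alive setup described in the commit message corresponds, as far as I can tell, to a uvicorn configuration along these lines. This is a sketch only; the app path and port are placeholders, not the actual ilab values:

```python
import uvicorn

# workers=1: the model can only serve one client at a time, so a second
# concurrent request gets rejected (HTTP 503) instead of crashing the server.
# timeout_keep_alive=0: close connections right after each response, so a
# single client's consecutive calls can't hold the only worker open and
# starve itself (the DOS'ing-itself case mentioned above).
uvicorn.run(
    "ilab_server:app",   # placeholder ASGI app path
    host="127.0.0.1",
    port=8000,
    workers=1,
    timeout_keep_alive=0,
)
```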
Force-pushed from c1620fa to 479b8a4
Thanks for the rigorous testing Mark! And thanks for taking the time to engage in the discussion Derek! This is really helpful. There's a lot going on here obviously, but it looks like the "happy path" of "generate and then chat" is fairly stable for now, we've added a note to the README, and we've made a modification to safeguard an active generation from crashing. I have also been thinking about reassessing