Skip to content

Switch from node.js domains to child process #258

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 51 commits into from
Jul 23, 2020

Conversation

silverbucket
Copy link
Member

@silverbucket silverbucket commented Mar 30, 2020

Due to issues with some modules when used within a node.js domain, I've switched to running the platform activity in it's own child process. This creates an extra layer of abstraction, child process rpc communication, but overall gives us more security in the safety of the main sockethub thread, while the child processes can be destroyed and restarted whenever we get an uncaught error, or any unexpected behavior.

  • New child process for each unique process (a non-persistent platform like feeds which one be one generic child process anyone can use, or a persistent platform like xmpp or irc which would have a new child process for each user[actor])
  • Introduction of PlatformInstance class, of which each instance contains references and relevant data to the child process.
  • A platform.js file which is executed for each child process and is responsible for initializing the module and interfacing between it, the job queue, and the parent sockethub process.
  • Lots of code cleanup and improved typing

Some outstanding things to resolve:

  • Handle platform.js updateCredential messages (when username or password are changed from within a sockethub session, those new credentials need to be updated for that sessions verification process.
  • non-persistent threads (eg. feeds) need to verify they are started up correctly and managed efficiently. Some work still needs to be done here.
  • When the main sockethub thread throws an exception and dies, the children will still be running. Not sure if there's a way to ensure the children die, but needs to be investigated a bit more.

Some other things of note:

This actually fixes another issue where a crashed platform would disconnect the client socket. Additionally, the handling of any failure outside of the context of a job (AS command sent from client) are also reported back to the client as part of the cleanup of that thread. This means that in some cases, the client can receive an error message twice.

  • A job failure that also causes an uncaught exception will send a message back about the job completion, and the uncaught exception.
  • An uncaught exception that does not happen during a job execution will send a message back about the uncaught exception.

In neither case is the client socket disconnected, and the next time a command is sent to that platform a new process will be spawned if it doesn't already exist.

@silverbucket silverbucket added type:feature kredits-3 Large contribution labels Mar 30, 2020
@silverbucket silverbucket requested review from raucao and galfert March 30, 2020 18:00
@silverbucket silverbucket self-assigned this Mar 30, 2020
 - sockethub-platform-xmpp@3.2.2-alpha.0
 - sockethub@3.2.0-alpha.0
@silverbucket
Copy link
Member Author

published to npm: sockethub@3.2.0-alpha.0

@raucao
Copy link
Contributor

raucao commented Mar 30, 2020

First try in Hyperchannel: Everything works now! 🎉

@raucao
Copy link
Contributor

raucao commented Mar 30, 2020

Just to document it here: memory usage of the XMPP platform worker process is ~14MB on my machine just now.

So that would be about 72 platform instances per GB of memory (not counting inevitable size increase from adding platform features). Depending on how many platforms we want to spin up per Kosmos Chat user, this can quickly add up, of course. But memory is also cheap nowadays. You can get 64GB machines at Hetzner for about 40 EUR/month for example. So I wouldn't worry about this at all for our own infra. But you would obviously hit limits on cheapo cloud VMs much sooner (for people self-hosting for their family or company e.g.).

Copy link
Contributor

@raucao raucao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice implementation! Just added a couple of nitpicks.

import randToken from 'rand-token';
import Store from 'secure-store-redis';
import hash from 'object-hash';
const { fork } = require('child_process');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use an import here, too?


import crypto from './crypto';
import services from './services';
import SharedResources from './shared-resources';
import config from './config';
import {platform} from "os";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's always use spaces between curly braces. Should probably be enforced by ESLint at some point soon.

@raucao
Copy link
Contributor

raucao commented Mar 30, 2020

Actually, after reloading Hyperchannel, I wasn't able to connect at all anymore. And it looks like the platform instance isn't terminated, even though the log says so. I can still see the process running, after Sockethub has reported that it terminated it.

Edit: I killed the process manually then, and the next connection worked fine again.
Edit 2: Just to confirm, it is reproducible. Every reload of the Web client will cause this to happen.

@silverbucket
Copy link
Member Author

Actually, after reloading Hyperchannel, I wasn't able to connect at all anymore. And it looks like the platform instance isn't terminated, even though the log says so. I can still see the process running, after Sockethub has reported that it terminated it.

Edit: I killed the process manually then, and the next connection worked fine again.
Edit 2: Just to confirm, it is reproducible. Every reload of the Web client will cause this to happen.

Yeah I'm seeing it as well. This sent me down a path toward considering some modifications to the worker as it might be adding unnecessary complexity after the refactor to processes. I'm not totally sure yet, need to think about it some more.

@raucao
Copy link
Contributor

raucao commented May 5, 2020

Could you maybe post an update on what the recent commits mean? Would be cool if someone could help out here...

@silverbucket
Copy link
Member Author

@galfert thanks for catching these, they are fixed.

@galfert
Copy link
Contributor

galfert commented Jun 15, 2020

Cool, the second error is gone now. But I'm still seeing the invalid property: "sessionSecret" warning for a couple of other event messages. E.g. for completed events where the @type is either join or observe.

Also, at least for XMPP, when the connection fails (e.g. because of invalid credentials), the whole Sockethub server crashes:

UNCAUGHT EXCEPTION IN PLATFORM

Error: failed connecting galfert@kosmos.org/hyperchannel : XMPP authentication failure
    at EventEmitter.error (/sockethub/packages/sockethub-platform-xmpp/index.js:556:15)
    at EventEmitter.emit (events.js:210:5)
    at Client.<anonymous> (/sockethub/packages/sockethub-platform-xmpp/node_modules/simple-xmpp/lib/simple-xmpp.js:477:20)
    at Client.emit (events.js:210:5)
    at Client._handleAuthState (/sockethub/packages/sockethub-platform-xmpp/node_modules/node-xmpp-client/lib/Client.js:295:10)
    at Client._handleStanza (/sockethub/packages/sockethub-platform-xmpp/node_modules/node-xmpp-client/lib/Client.js:233:12)
    at Client.onStanza (/sockethub/packages/sockethub-platform-xmpp/node_modules/node-xmpp-client/lib/Client.js:221:8)
    at Connection.emit (events.js:210:5)
    at Connection.onStanza (/sockethub/packages/sockethub-platform-xmpp/node_modules/node-xmpp-core/lib/Connection.js:377:10)
    at StreamParser.<anonymous> (/sockethub/packages/sockethub-platform-xmpp/node_modules/node-xmpp-core/lib/Connection.js:231:10)
    at StreamParser.emit (events.js:210:5)
    at SaxLtx.<anonymous> (/sockethub/packages/sockethub-platform-xmpp/node_modules/@xmpp/streamparser/index.js:69:14)
    at SaxLtx.emit (events.js:210:5)
    at SaxLtx._handleTagOpening (/sockethub/packages/sockethub-platform-xmpp/node_modules/ltx/lib/parsers/ltx.js:40:12)
    at SaxLtx.write (/sockethub/packages/sockethub-platform-xmpp/node_modules/ltx/lib/parsers/ltx.js:159:18)
    at StreamParser.write (/sockethub/packages/sockethub-platform-xmpp/node_modules/@xmpp/streamparser/index.js:134:17)
    at Connection.onData (/sockethub/packages/sockethub-platform-xmpp/node_modules/node-xmpp-core/lib/Connection.js:310:17)
    at TLSSocket.emit (events.js:215:7)
    at addChunk (_stream_readable.js:309:12)
    at readableAddChunk (_stream_readable.js:290:11)
    at TLSSocket.Readable.push (_stream_readable.js:224:10)
    at TLSWrap.onStreamRead (internal/stream_base_commons.js:182:23)

UNCAUGHT EXCEPTION

TypeError: Cannot create property 'context' on string 'Error: failed connecting galfert@kosmos.org/hyperchannel : XMPP authentication failure'
    at PlatformInstance.sendToClient (/sockethub/packages/sockethub/dist/platform-instance.js:39:25)
    at ChildProcess.message (/sockethub/packages/sockethub/dist/platform-instance.js:78:26)
    at ChildProcess.emit (events.js:210:5)
    at emit (internal/child_process.js:876:12)
    at processTicksAndRejections (internal/process/task_queues.js:81:21)
Terminating platform threads...
close even triggered e54b0b9
platforms:  undefined
TypeError: Cannot read property 'values' of undefined
    at Timeout._onTimeout (/sockethub/packages/sockethub/bin/sockethub:14:62)
    at listOnTimeout (internal/timers.js:531:17)
    at processTimers (internal/timers.js:475:7)
Unable to reliably terminate platforms, please check your process list for orphaned `platform.js` threads.

@silverbucket
Copy link
Member Author

@galfert thanks for catching these, can you try again? should be fixed this time.

@galfert
Copy link
Contributor

galfert commented Jun 16, 2020

Yes, warnings are gone and server doesn't crash anymore on auth failures 👍

I can also connect to both IRC and XMPP at the same time 🎉

Sending messages also works, but the client doesn't seem to receive messages anymore. E.g. the list of users of a room when asking for attendance or just normal chat messages.

When sending a message from another IRC client, I can see this in the SH logs:

sockethub:platform:irc:6ad516a calling sendToClient for galfert@irc.freenode.net [ 'moxego@irc.freenode.net' ] +51s
sockethub:platform sending to client +51s

But in Hyperchannel, I don't see any incoming message.

I'm not 100% sure, but I think yesterday I received the room attendance messages at least. Hadn't tried chat messages yet, because of the errors.

@raucao
Copy link
Contributor

raucao commented Jun 30, 2020

@silverbucket Did you see @galfert's comment?

@silverbucket
Copy link
Member Author

Yeah, I saw the comment, and talked with @galfert about it on a call a few weeks ago, I just haven't had any time to sit down and work on anything, but hopefully very soon.

@silverbucket
Copy link
Member Author

Alright, got the incoming messages (platform to client direction) working. Things should work fine, but the checkboxes in the PR desc still need to be addressed (or in the case of the 3rd one, decided upon). I will work on 1 & 2 very soon, would love to have some help/suggestions on 3, it's a standard parent/process issue and nothing sockethub specific.

@galfert
Copy link
Contributor

galfert commented Jul 9, 2020

Yay, I can confirm that I can send and receive messages with Hyperchannel now, in both IRC and XMPP, without problems.

Great work! 👍

@raucao
Copy link
Contributor

raucao commented Jul 22, 2020

Some outstanding things to resolve:

@silverbucket Are those really important to solve in this same PR, instead of smaller, new ones? I don't see anything in the list that seems like a real blocker for us to deploy a potential new master from this for Kosmos Chat.

(Merging to master doesn't have to mean the same as tagging a release.)

@silverbucket
Copy link
Member Author

@skddc the first and second issues (updating changed credentials and handling single-instance platforms) should be fixed as this PR breaks that functionality. The second is almost complete. I will hopefully have time to complete these two issues next week as I'm out of town this weekend.

@raucao
Copy link
Contributor

raucao commented Jul 22, 2020

Why not just create issues for those and document them as known issues in the README? As long as you don't tag a new release, it won't impact anyone using that functionality at the moment.

In fact, kredits naturally incentivizes smaller, more focused PRs (that are easier to review and merge as well), by limiting the largest label to 5000. Smaller PRs also make it much easier to work with more people in parallel, as anyone wanting to use this branch for feature development, would now have to branch off an unstable branch.

@galfert
Copy link
Contributor

galfert commented Jul 23, 2020

I would also prefer to merge this PR now, because it's working much better than the current master already.

Also, I think I would like to give task number 3 (killing the child processes when the main process dies) a try tomorrow. And I would rather create a new PR against master than against this PR.

@silverbucket
Copy link
Member Author

silverbucket commented Jul 23, 2020

Sounds good to me! Going to merge this PR, I created those 3 issues and linked them to a v3.2 project, along with other related issues. Thanks guys for all the help with this!

@silverbucket silverbucket merged commit 2fd7f73 into master Jul 23, 2020
@galfert galfert deleted the feature/domain_to_process branch July 23, 2020 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants