Skip to content

Race condition in onSnapshot code crashes the process #2280

@jerzozwierz

Description

@jerzozwierz

Bottom line up front

When the conditions are just right, it's possible for a grpc stream (underlying an onSnapshot query) to emit an error event while not having any error handlers, killing the node process. It's rare, but our server's heavy traffic causes it to happen multiple times per day.

  1. Is this a client library issue or a product issue?
    Client library. A transient error isn't properly handled by the library under certain conditions.

  2. Did someone already solve this?
    None of my searches have yielded results.

  3. Do you have a support contract?
    I don't

Environment details

  • OS: Debian GNU/Linux 12 (inside Docker)
  • Node.js version: 20.18.0
  • npm version: 10.8.2
  • @google-cloud/firestore version: 7.4.0

Steps to reproduce

While I can't write code that reliably reproduces the problem (it's a low probability race condition), I can point to a code path that is hit and support my hypothesis with logs emitted by our server.

https://github.com/googleapis/nodejs-firestore/blob/main/dev/src/watch.ts#L483-L513

// inside Watch.initStream() method
...
        return this.firestore
          .requestStream( // (1)
            'listen',
            /* bidirectional= */ true,
            request,
            this.requestTag
          )
          .then(backendStream => {
            if (!this.isActive) { // (2)
              logger(
                'Watch.initStream',
                this.requestTag,
                'Closing inactive stream'
              );
              backendStream.emit('end');
              return; // (3)
            }
            logger('Watch.initStream', this.requestTag, 'Opened new stream');
            this.currentStream = backendStream;
            this.resetIdleTimeout();

            this.currentStream!.on('data', (proto: api.IListenResponse) => {
              this.resetIdleTimeout();
              this.onData(proto);
            })
              .on('error', err => { // (4)
                if (this.currentStream === backendStream) {
                  this.currentStream = null;
                  this.maybeReopenStream(err);
                }
              })
  1. First thing that must happen is a query being closed very soon after opening. This way, before the async call (1), the .isActive field is flipped back to false. Subsequently, the code enters the if clause (2). Note that, in this path, we exit the function at (3) and backendStream doesn't get an error handler like it does in (4).
  2. Secondly, the newly opened backendStream must emit an error (it can be anything, like a transient networking issue). The fact that it receives an end event earlier doesn't preclude it from doing so.
  3. Consequently, that emitted error crashes the node process.

Logs (I have selectively enabled certain firestore's internal logs to be formatted and logged to console):

T21:39:11.502801006Z Sending request: { [redacted] } { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T21:39:11.503399310Z Marking stream as healthy { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T21:39:11.503738877Z Stream ready { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T21:39:11.504049930Z Closing inactive stream { requestTag: 'rvKwm', fun: '[Watch.initStream]' }
T21:39:11.504533921Z Received stream end { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T22:39:12.017741217Z Received stream error: Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)
T22:39:12.017790903Z     at callErrorFromStatus (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
T22:39:12.017796528Z     at Object.onReceiveStatus (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:420:73)
T22:39:12.017799765Z     at Object.onReceiveStatus (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
T22:39:12.017802736Z     at /usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78
T22:39:12.017805701Z     at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
T22:39:12.017808736Z for call at
T22:39:12.017812711Z     at ServiceClientImpl.makeBidiStreamRequest (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:404:32)
T22:39:12.017815664Z     at ServiceClientImpl.<anonymous> (/usr/src/app/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
T22:39:12.017819232Z     at /usr/src/app/node_modules/@google-cloud/firestore/build/src/v1/firestore_client.js:239:29
T22:39:12.017839883Z     at /usr/src/app/node_modules/google-gax/build/src/streamingCalls/streamingApiCaller.js:46:28
T22:39:12.017842602Z     at /usr/src/app/node_modules/google-gax/build/src/normalCalls/timeout.js:44:16
T22:39:12.017845234Z     at StreamProxy.setStream (/usr/src/app/node_modules/google-gax/build/src/streamingCalls/streaming.js:419:24)
T22:39:12.017847960Z     at StreamingApiCaller.call (/usr/src/app/node_modules/google-gax/build/src/streamingCalls/streamingApiCaller.js:54:16)
T22:39:12.017863772Z     at /usr/src/app/node_modules/google-gax/build/src/createApiCall.js:112:30
T22:39:12.017866637Z     at runNextTicks (node:internal/process/task_queues:60:5)
T22:39:12.017869664Z     at listOnTimeout (node:internal/timers:545:9)
T22:39:12.017872719Z     at process.processTimers (node:internal/timers:519:7) {
T22:39:12.017875762Z        code: 13,
T22:39:12.017878562Z        details: 'Received RST_STREAM with code 2 (Internal server error)',
T22:39:12.017881255Z        metadata: Metadata { internalRepr: Map(0) {}, options: {} }
T22:39:12.017883953Z } { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T22:39:12.017886915Z Received stream end { requestTag: 'rvKwm', fun: '[Firestore._initializeStream]' }
T22:39:12.019839360Z node:events:496
T22:39:12.019872407Z     throw er; // Unhandled 'error' event
T22:39:12.019876840Z ^
T22:39:12.019879811Z
T22:39:12.019882613Z Error: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)

Metadata

Metadata

Assignees

Labels

api: firestoreIssues related to the googleapis/nodejs-firestore API.priority: p2Moderately-important priority. Fix may not be included in next release.type: bugError or flaw in code with unintended results or allowing sub-optimal usage patterns.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions