Skip to content

Can't use/pickle datastore Clients on google dataflow (apache beam) #3191

@mikelambert

Description

@mikelambert

I am attempting to use datastore.Client() from within a google cloud dataflow (apache beam) pipeline.

It attempts to pickle objects being passed around (lexically or arguments) to processing stages, but unfortunately the Client is not pickleable:

  File "lib/apache_beam/transforms/ptransform.py", line 474, in __init__
    self.args = pickler.loads(pickler.dumps(self.args))
  File "lib/apache_beam/internal/pickler.py", line 212, in loads
    return dill.loads(s)
  File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1089, in load_newobj
    obj = cls.__new__(cls, *args)
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 35, in grpc._cython.cygrpc.Channel.__cinit__ (src/python/grpcio/grpc/_cython/cygrpc.c:4022)
TypeError: __cinit__() takes at least 2 positional arguments (0 given)

I believe the correct fix is to discard the Connection when serializing, and rebuild it when deserialized.

I could attempt to recreate the Client within each processing pipeline, but that can cause O(Records) Client creations...and since in my testing I see:

DEBUG:google_auth_httplib2:Making request: POST https://accounts.google.com/o/oauth2/token

printed on each creation, then I imagine Google SRE would really prefer we not do this O(N) times.

This is a tricky cross-team interaction issue (only occurs for those pickling Clients, in my case: google-cloud-datastore and apache-beam google-dataflow), so not sure the proper place to file this. I've cross-posted it to the apache beam JIRA as well https://issues.apache.org/jira/browse/BEAM-1788, though the issue is in the google cloud datastore code.

Mac 10.12.3, Python 2.7.12, google-cloud-dataflow 0.23.0

Metadata

Metadata

Labels

api: datastoreIssues related to the Datastore API.status: will not fixInvalid (untrue/unsound/erroneous), inconsistent with product, not on roadmap.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions