Description
I am attempting to use datastore.Client() from within a Google Cloud Dataflow (Apache Beam) pipeline.
Beam pickles the objects handed to processing stages (whether captured lexically or passed as arguments), but unfortunately the Client is not picklable:
File "lib/apache_beam/transforms/ptransform.py", line 474, in __init__
self.args = pickler.loads(pickler.dumps(self.args))
File "lib/apache_beam/internal/pickler.py", line 212, in loads
return dill.loads(s)
File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 277, in loads
return load(file)
File "/Users/me/Library/Python/2.7/lib/python/site-packages/dill/dill.py", line 266, in load
obj = pik.load()
File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/local/Cellar/python/2.7.12_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1089, in load_newobj
obj = cls.__new__(cls, *args)
File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 35, in grpc._cython.cygrpc.Channel.__cinit__ (src/python/grpcio/grpc/_cython/cygrpc.c:4022)
TypeError: __cinit__() takes at least 2 positional arguments (0 given)
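To make the failure mode concrete, here is a rough sketch (the project ID and entity kind are made up, not from my actual pipeline); passing the Client to a transform is enough to trigger the pickling above:

```python
# Rough sketch of the failure mode; project ID and entity kind are
# illustrative. Passing the Client to a transform (here as an extra
# argument to Map) makes Beam pickle it, which raises the TypeError above.
import apache_beam as beam
from google.cloud import datastore

client = datastore.Client(project='example-project')

p = beam.Pipeline()
(p
 | beam.Create(['id-1', 'id-2'])
 # The extra positional argument (the Client) ends up in self.args, which
 # ptransform.py round-trips through pickler.dumps()/loads() at construction.
 | beam.Map(lambda name, c: c.get(c.key('Kind', name)), client))
```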
I believe the correct fix is to discard the Connection when serializing and rebuild it when deserializing.
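Roughly, something like this (not the library's actual code; the _connection attribute and the _rebuild_connection() helper are placeholders for whatever the real Client uses internally):

```python
# Rough sketch of the suggested fix: drop the non-picklable connection when
# serializing and rebuild it on deserialization. Attribute and helper names
# here are assumptions, not real google-cloud-datastore internals.
class Client(object):
    def __init__(self, project=None, credentials=None):
        self.project = project
        self._credentials = credentials
        self._connection = self._rebuild_connection()

    def _rebuild_connection(self):
        # Placeholder: build the gRPC/HTTP connection from the stored
        # project and credentials.
        return object()

    def __getstate__(self):
        state = self.__dict__.copy()
        # The gRPC channel inside the connection is not picklable; drop it.
        state.pop('_connection', None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Rebuild the connection from the remaining (picklable) state.
        self._connection = self._rebuild_connection()
```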
I could instead recreate the Client inside each processing step, but that can cause O(records) Client creations... and since in my testing I see:
DEBUG:google_auth_httplib2:Making request: POST https://accounts.google.com/o/oauth2/token
printed on each creation, I imagine Google SRE would really prefer we not do this O(N) times.
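For reference, a sketch of constructing the Client lazily inside the DoFn, once per bundle rather than once per record (the entity kind is illustrative and the DoFn hooks follow current Beam conventions):

```python
# Per-bundle workaround sketch: nothing non-picklable is stored on the DoFn,
# and the Client (plus its token fetch) is created once per bundle.
import apache_beam as beam
from google.cloud import datastore


class LookupFn(beam.DoFn):
    def __init__(self, project):
        self._project = project  # a plain string pickles fine
        self._client = None

    def start_bundle(self):
        # One Client (and one token fetch) per bundle, not per record.
        if self._client is None:
            self._client = datastore.Client(project=self._project)

    def process(self, element):
        yield self._client.get(self._client.key('Kind', element))
```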
This is a tricky cross-team interaction issue (it only occurs when pickling Clients; in my case the combination of google-cloud-datastore and Apache Beam / Google Cloud Dataflow), so I'm not sure of the proper place to file it. I've cross-posted it to the Apache Beam JIRA as well (https://issues.apache.org/jira/browse/BEAM-1788), though the issue itself is in the google-cloud-datastore code.
macOS 10.12.3, Python 2.7.12, google-cloud-dataflow 0.23.0