
Conversation

SemyonSinchenko (Collaborator)

Work is in progress, but the overall design is in its final state:

  • subproject graphframes-connect
  • GraphFrames API in protobuf: graphframes-connect/src/main/protobuf/graphframes.proto
  • Proto->GraphFrames->Proto logic: graphframes-connect/src/main/scala/org/apache/spark/sql/graphframes/GraphFramesConnectUtils.scala
  • Connect Relation Plugin: graphframes-connect/src/main/scala/org/apache/spark/sql/graphframes/GraphFramesConnect.scala

For the JVM part, code generation is built into build.sbt; for Python (and possibly other clients) it is done with buf (buf.yaml, buf.gen.yaml).
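For reference, a minimal buf.gen.yaml for generating the Python stubs might look like the sketch below; the plugin choice and the output directory are illustrative assumptions, not necessarily what this PR uses.

```yaml
# Hypothetical buf.gen.yaml: generate Python protobuf stubs with the
# BSR-hosted protoc Python plugin into an assumed package directory.
version: v1
plugins:
  - plugin: buf.build/protocolbuffers/python
    out: python/graphframes/connect/proto
```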

Close #447

codecov-commenter commented Feb 8, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 42.85714% with 4 lines in your changes missing coverage. Please review.

Project coverage is 90.35%. Comparing base (bc487ef) to head (fc8ebae).
Report is 7 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/main/scala/org/graphframes/lib/Pregel.scala | 25.00% | 3 Missing ⚠️ |
| ...cala/org/graphframes/lib/ConnectedComponents.scala | 66.66% | 1 Missing ⚠️ |

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #506      +/-   ##
==========================================
- Coverage   91.43%   90.35%   -1.09%     
==========================================
  Files          18       18              
  Lines         829      902      +73     
  Branches       52       96      +44     
==========================================
+ Hits          758      815      +57     
- Misses         71       87      +16     

☔ View full report in Codecov by Sentry.

SemyonSinchenko changed the title from "[WIP] feat: SparkConnect support" to "feat: SparkConnect support" on Feb 8, 2025
SemyonSinchenko marked this pull request as ready for review on February 8, 2025 at 18:54
SemyonSinchenko (Collaborator, Author)

@rjurney Hello!

At the moment the PR provides a (mostly) working version of the GraphFrames API for PySpark Connect. There are a lot of open questions from my side:

  • I would like to change the tests to work for both Connect and classic, but it can be tricky because the current tests are very old; for now I just copied the whole test suite and slightly modified it;
  • I do not know the best way to integrate the Connect API into the current PySpark API: at the moment I implemented everything as a separate GraphFrameConnect class that has (almost) the same API as GraphFrame;

How to try it: run the script in the dev folder, connect to the Spark Connect Server with SparkSession.builder.remote("sc://localhost:15002").getOrCreate(), and use graphframes.connect.graphframes_client.GraphFrameConnect.
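As a minimal end-to-end sketch of those steps (assuming GraphFrameConnect takes vertices and edges DataFrames like the classic GraphFrame, and that degree properties mirror the classic API):

```python
from pyspark.sql import SparkSession

from graphframes.connect.graphframes_client import GraphFrameConnect

# Connect to the Spark Connect Server started by the dev script.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

# Assumed constructor: mirrors the classic GraphFrame(vertices, edges).
g = GraphFrameConnect(vertices, edges)
g.inDegrees.show()
```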

rjurney (Collaborator) commented Feb 9, 2025

> @rjurney Hello!
>
> At the moment the PR provides a (mostly) working version of the GraphFrames API for PySpark Connect. There are a lot of open questions from my side:
>
>   • I would like to change the tests to work for both Connect and classic, but it can be tricky because the current tests are very old; for now I just copied the whole test suite and slightly modified it;
>   • I do not know the best way to integrate the Connect API into the current PySpark API: at the moment I implemented everything as a separate GraphFrameConnect class that has (almost) the same API as GraphFrame;
>
> How to try it: run the script in the dev folder, connect to the Spark Connect Server with SparkSession.builder.remote("sc://localhost:15002").getOrCreate(), and use graphframes.connect.graphframes_client.GraphFrameConnect.

My PR #473 converts the unit tests to pytest tests. That should help there. My PR is a monster, I will work on breaking it up into pieces tomorrow and that should give you the ability to pull the test-related code into your PR.

It would be nice if there wasn't a completely different API for Spark Connect. How different are the implementations? I'll take a look at the PR tomorrow and see.

This is very cool work, thanks for it!

SemyonSinchenko (Collaborator, Author)

> It would be nice if there wasn't a completely different API for Spark Connect. How different are the implementations? I'll take a look at the PR tomorrow and see.

It is the same, just implemented as a separate class (python/graphframes/connect/graphframe_client.py).

rjurney (Collaborator) commented Feb 11, 2025

Is it feasible to add connect support to GraphFrame? Especially if it could detect a Connect connection and 'just work' or take an argument?

SemyonSinchenko (Collaborator, Author)

> Is it feasible to add connect support to GraphFrame? Especially if it could detect a Connect connection and 'just work' or take an argument?

Of course. That is the idea, but the question is how to do it. In Spark 3.x the devs decided to have two versions of DataFrame (pyspark.sql.DataFrame and pyspark.sql.connect.dataframe.DataFrame). In Spark 4.x this changed, and there are three versions at the moment: the common pyspark.sql.DataFrame base plus the classic and connect implementations.

I did not want to touch the existing code in this PR. I would like to finalize the implementation as a standalone GraphFrameConnect class that has exactly the same API as the existing GraphFrame, and we can deal with integration in the next iteration.

What do you think about it? @rjurney

SemyonSinchenko (Collaborator, Author)

To be honest, I like the idea of dispatch like in PySpark, but it will require big changes in the current GraphFrame, tests, build, etc. I think it would be better to do it in GraphFrames 1.0, because I would like to slightly change the public API.

For example, due to Spark Connect limitations it may be tricky to return both the loss and a DataFrame from svdPlusPlus. I would like to change the signature and not return the loss by default. The same applies to PageRank: I would like to make the computation of edge weights optional, defaulting to False.
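For context, the PySpark-style dispatch mentioned above could look roughly like this sketch; is_remote is an existing PySpark helper, while the factory function itself is illustrative, not this PR's actual code:

```python
from pyspark.sql.utils import is_remote


def make_graph(vertices, edges):
    """Hypothetical factory: route to the Connect or classic implementation
    depending on the kind of active session."""
    if is_remote():
        from graphframes.connect.graphframes_client import GraphFrameConnect

        return GraphFrameConnect(vertices, edges)
    from graphframes import GraphFrame

    return GraphFrame(vertices, edges)
```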

rjurney (Collaborator) commented Feb 11, 2025

> > Is it feasible to add connect support to GraphFrame? Especially if it could detect a Connect connection and 'just work' or take an argument?
>
> Of course. That is the idea, but the question is how to do it. In Spark 3.x the devs decided to have two versions of DataFrame (pyspark.sql.DataFrame and pyspark.sql.connect.dataframe.DataFrame). In Spark 4.x this changed, and there are three versions at the moment: the common pyspark.sql.DataFrame base plus the classic and connect implementations.
>
> I did not want to touch the existing code in this PR. I would like to finalize the implementation as a standalone GraphFrameConnect class that has exactly the same API as the existing GraphFrame, and we can deal with integration in the next iteration.
>
> What do you think about it? @rjurney

Ohhhh wow, okay. Now this all makes sense :) I would not have thought that. Everything makes sense now. I think you're doing what makes sense.

SemyonSinchenko (Collaborator, Author)

Questions / Topics

  1. Should we store the generated Python code in the repository or not? Apache Spark stores the Python code, but the JVM code is generated during the build. At the moment my PR follows the Apache Spark way.
  2. Should we have a separate GraphFramesConnect, or should we use runtime dispatch instead? At the moment my implementation offers a separate class, but Apache Spark uses dispatch.
  3. Should GraphFrames Connect be part of the graphframes distribution, or should we provide a separate JAR instead?
  4. Should the Python Connect client be part of the graphframes distribution, or should it be an extra (like pip install graphframes[connect])?

SemyonSinchenko changed the title from "feat: SparkConnect support" to "[WIP-DO-NOT-MERGE] feat: SparkConnect support" on Feb 23, 2025
rjurney (Collaborator) commented Feb 25, 2025

@SemyonSinchenko thanks for your work here, this is just awesome. I am going to give it a thorough review this weekend, sooner if I can. Will have questions borne of my lack of Scala knowledge as much as anything :)

SemyonSinchenko (Collaborator, Author)

@rjurney What is the plan for the review? Will you review these changes, or should I try to ask someone else to review them?

rjurney (Collaborator) commented Mar 4, 2025

@SemyonSinchenko TODAY

rjurney (Collaborator) commented Mar 4, 2025

@SemyonSinchenko can you describe any changes to the user experience of GraphFrames? Today I am using databricks-connect via VSCode for PySpark on Databricks and I can't use GraphFrame.pageRank because I am on Connect. Will it work with this PR? What about checkpointing code? That doesn't work either. Just wondering.

SemyonSinchenko (Collaborator, Author)

> @SemyonSinchenko can you describe any changes to the user experience of GraphFrames? Today I am using databricks-connect via VSCode for PySpark on Databricks and I can't use GraphFrame.pageRank because I am on Connect. Will it work with this PR? What about checkpointing code? That doesn't work either. Just wondering.

@rjurney

For classic users nothing should change, apart from some minor / questionable things. For example, I added building the JAR to the build process of the graphframes PySpark package. Maybe I should remove it. Should I?

For Connect users it is different. They should add the graphframes-connect JAR to the Spark Connect Server side and add a Spark conf like spark.connect.extensions.relation.classes=org.apache.spark.sql.graphframes.GraphFramesConnect to the configuration of their Spark Connect Server. That is how the Connect plugin system is supposed to be used.
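For illustration, starting a stock Apache Spark Connect server with the plugin might look like this; the JAR path is an assumption, while the conf key and class name are the ones quoted above:

```bash
# start-connect-server.sh ships with Apache Spark 3.4+; the JAR location
# below is illustrative, not a path from this PR.
./sbin/start-connect-server.sh \
  --jars /path/to/graphframes-connect-assembly.jar \
  --conf spark.connect.extensions.relation.classes=org.apache.spark.sql.graphframes.GraphFramesConnect
```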

Answers:

  1. Will it work on Databricks? Most probably not, but I cannot be 100% sure. The story is that Databricks Spark is a fork of Apache Spark and does not match it 100%. As far as I can tell from the delta-io code for PySpark Connect, the folks at Databricks backported the plugin signature from Spark 4.0 to their branches for DBR 14.x and 15.x. There is a strong reason for that, the same reason it was changed from 3.x to 4.x. But my code targets Apache Spark 3.4.x and 3.5.x, with a workaround related to the shading rules to avoid the problem with plugins. So, in my understanding, even with my plugin, GraphFrames won't work on DBR 14.x and DBR 15.x (Spark 3.5.x), but I'm not 100% sure. My question is: should we maintain DBR compatibility? And if so, how do we do it without access to the source code of the DBRs? At the moment there is not even documentation on how to add Spark Connect plugins to Databricks clusters, and you have to provide the right configuration before the runtime starts (I assume the only way is via init scripts, which have terrible documentation).
  2. I worked around the checkpoint problem by using the spark.checkpoint.dir configuration variable (see the sketch below). You can see it, for example, in these lines. For classic the behavior is the same.
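A minimal client-side sketch of that workaround, assuming the conf is read on the server side by the implementation and that the session allows setting it (the directory path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# There is no py4j access to sc.setCheckpointDir over Connect, so the
# checkpoint location is communicated via a conf instead.
spark.conf.set("spark.checkpoint.dir", "/tmp/graphframes-checkpoints")
```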

SemyonSinchenko (Collaborator, Author)

@rjurney JFYI. Because I know about the breaking change in the Connect plugin system from Apache Spark 3.5.x to 4.0.x (a change that, in my understanding, is already in DBR 14.x and 15.x), I separated all the plugin logic from the plugin itself: the plugin itself is fewer than five logical lines of code, to simplify the migration (and a DBR shim) as much as possible. In Apache Spark 4.0.x the signature changes to Optional<LogicalPlan> transform(byte[] relation, SparkConnectPlanner planner), and my workaround with shading of the GraphFrames proto messages is not needed anymore.

rjurney (Collaborator) commented Mar 10, 2025

@SemyonSinchenko so is it ready for review then?

SemyonSinchenko (Collaborator, Author)

> @SemyonSinchenko so is it ready for review then?

@rjurney Yes, it is. I still need to resolve merge conflicts, but 99% of them are related to the tests, not to the logic. Also, one additional method will be introduced to wrap the recently added power iteration clustering as well.

SemyonSinchenko changed the title from "[DO-NOT-MERGE] feat: SparkConnect support" to "feat: SparkConnect support" on Mar 10, 2025
SemyonSinchenko (Collaborator, Author)

@rjurney I resolved all the conflicts.

Important changes you should check:

  • I added the GraphFrames JAR to the Python package to avoid JVM errors;
  • I moved the tests to python/tests to avoid storing them inside the package;
  • I significantly simplified the tests by removing all the unittest artifacts;

At the moment only part of the tests is enabled for Connect, because the rest are based on examples that are called through py4j.

rjurney (Collaborator) commented Mar 10, 2025

Wow, awesome! I'll review it now.

rjurney (Collaborator) left a comment


@SemyonSinchenko I'm going ahead and approving this, although I would appreciate a brief explanation of where dispatch is implemented, because I can't tell what the user experience of this new approach is, i.e. normal vs. connect.

```python
_sym_db = _symbol_database.Default()


DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(
```
rjurney (Collaborator)

@SemyonSinchenko what is this?

SemyonSinchenko (Collaborator, Author)

Wait, do not look at this. This code is generated automatically by protoc (buf)!

```diff
     return _from_java_gf(jdf, self._spark)

-    def filterEdges(self, condition: Union[str, Column]) -> "GraphFrame":
+    def filterEdges(self, condition: str | Column) -> "GraphFrame":
```
rjurney (Collaborator)

What version of Python introduced this? Ahhh, 3.10. Cool, I learned about it :)

SemyonSinchenko (Collaborator, Author)

I think it is quite new, but it works in annotations on all supported Python versions, so all you need to do is add the line:

from __future__ import annotations

https://peps.python.org/pep-0563/
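To make the point concrete: the X | Y union syntax itself (PEP 604) is Python 3.10+, but with the PEP 563 future import annotations stay unevaluated strings, so such signatures parse on older interpreters too. A small self-contained illustration (the function and class here are made up for the example):

```python
from __future__ import annotations  # PEP 563: postpone annotation evaluation


class Column:
    """Stand-in for pyspark.sql.Column, just for this illustration."""


def filter_edges(condition: str | Column) -> None:
    # The annotation above is stored as a string and never evaluated at
    # definition time, so PEP 604 unions work even before Python 3.10.
    print(type(condition).__name__)


filter_edges("src > dst")  # prints: str
```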

```diff
-        raise TypeError("condition should be string or Column")
-        return _from_java_gf(jdf, self._spark)
+        return GraphFrame._from_impl(self._impl.filterEdges(condition=condition))
```
rjurney (Collaborator)

Is this exception handling handled elsewhere?

SemyonSinchenko (Collaborator, Author)

In the proposed design, all the exceptions are the responsibility of the implementation. So the answer is yes, it is handled in the impl.
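A tiny sketch of the facade-plus-implementation split being described, assuming the shape visible in the diff above (everything except filterEdges is illustrative):

```python
class GraphFrame:
    """Thin public facade; validation and error raising live in the impl."""

    def __init__(self, impl):
        self._impl = impl

    @classmethod
    def _from_impl(cls, impl):
        return cls(impl)

    def filterEdges(self, condition):
        # No isinstance checks here: a classic impl or a Connect impl is
        # equally responsible for raising TypeError on a bad condition.
        return GraphFrame._from_impl(self._impl.filterEdges(condition=condition))
```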

rjurney (Collaborator) commented Mar 16, 2025

@SemyonSinchenko regarding the JAR, I think it is right to include it, and we should add it to MANIFEST.in.

rjurney (Collaborator) left a comment

lgtm again

rjurney merged commit 1e702c2 into graphframes:master on Mar 17, 2025
5 checks passed
SemyonSinchenko deleted the 447-spark-connect branch on April 6, 2025 at 09:14