
Conversation

SemyonSinchenko
Collaborator

What changes were proposed in this pull request?

  • cbt-ci-release
  • updates in build.sbt
  • new databricks-connect subproject

Why are the changes needed?

Part of #510

@codecov-commenter


Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.85%. Comparing base (bc487ef) to head (71040e0).
Report is 31 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #598      +/-   ##
==========================================
- Coverage   91.43%   89.85%   -1.58%     
==========================================
  Files          18       21       +3     
  Lines         829     1065     +236     
  Branches       52      119      +67     
==========================================
+ Hits          758      957     +199     
- Misses         71      108      +37     


@SemyonSinchenko
Collaborator Author

To check the result:

  1. ./build/sbt
  2. root / publishLocal
  3. connect / publishLocal
  4. databricksConnect / publishLocal

@SemyonSinchenko SemyonSinchenko requested a review from rjurney May 27, 2025 12:36
@SemyonSinchenko
Collaborator Author

With a single command:

./build/sbt "root/publishLocal; connect/publishLocal; databricksConnect/publishLocal"

@james-willis
Collaborator

Is there an issue explaining why we need Databricks-specific logic? Should we link that explanation here in the PR?

@SemyonSinchenko
Collaborator Author

SemyonSinchenko commented May 27, 2025

Is there an issue explaining why we need Databricks-specific logic? Should we link that explanation here in the PR?

mrpowers-io/tsumugi-spark#55 (comment)

For a reason unknown to me, Databricks engineers use their own shading rule, which is different from the OSS one.
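To illustrate why differing shading rules break binary compatibility: shading rewrites fully qualified class names at build time, so if the plugin jar and the runtime shade protobuf under different prefixes, they no longer refer to the same classes at runtime. A minimal, self-contained sketch (both prefixes below are illustrative, not the actual OSS or Databricks values):

```scala
// Shading rewrites class names at build time. If GraphFrames Connect is
// shaded with one prefix and the Databricks runtime uses another, the same
// logical class ends up with two different runtime names, so linking fails.
object ShadingMismatch {
  // Mimics sbt-assembly's ShadeRule.rename("com.google.protobuf.**" -> s"$prefix.@1")
  def shade(className: String, prefix: String): String =
    if (className.startsWith("com.google.protobuf"))
      className.replace("com.google.protobuf", prefix)
    else className

  def main(args: Array[String]): Unit = {
    val cls = "com.google.protobuf.Any"
    // Hypothetical prefixes for illustration only.
    val ossName = shade(cls, "org.graphframes.shaded.protobuf")
    val dbxName = shade(cls, "com.databricks.shaded.protobuf")
    println(ossName)            // org.graphframes.shaded.protobuf.Any
    println(dbxName)            // com.databricks.shaded.protobuf.Any
    println(ossName == dbxName) // false: same logical class, incompatible names
  }
}
```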

Collaborator

@james-willis james-willis left a comment


I really don't like the idea of having a Databricks-specific distribution, but I don't have a better option here. I asked around a bit to see whether there is something vendor-agnostic we could do instead.

I will approve once the Databricks typo is fixed.

@SemyonSinchenko
Collaborator Author

@WeichenXu123 Hello! Do you know whether we can use the word "databricks" in the name of published artifacts (graphframes-databricks-connect) without violating the trademark? Is there a better way to let Databricks users run GraphFrames on "shared access" clusters?

cc: @rjurney

@MrPowers

LGTM!

Collaborator

@rjurney rjurney left a comment


lgtm

@SemyonSinchenko
Collaborator Author

I'm waiting for approval from @dmatrix on whether we can use the word "databricks" in the names of released artifacts, in code, and in documentation.

@WeichenXu123
Contributor

@WeichenXu123 Hello! Do you know whether we can use the word "databricks" in the name of published artifacts (graphframes-databricks-connect) without violating the trademark? Is there a better way to let Databricks users run GraphFrames on "shared access" clusters?

cc: @rjurney

I suggest avoiding "databricks" in the name and using a name like graphframes-spark-connect instead.

But I think supporting Spark Connect is not sufficient to make it work on Databricks shared clusters; we would also need to whitelist a bunch of JVM methods in the lib for running on a Databricks shared cluster (and that needs our security team's approval).

@Kimahriman
Contributor

Does the API change from protobuf.Any to Array[Byte] in Spark 4 fix the shading/binary-compatibility issue, at least on Spark 4+? If it does, it might be a good compromise to avoid a dedicated artifact for a closed-source fork.

@SemyonSinchenko
Collaborator Author

Does the API change from protobuf.Any to Array[Byte] in Spark 4 fix the shading/binary-compatibility issue, at least on Spark 4+? If it does, it might be a good compromise to avoid a dedicated artifact for a closed-source fork.

I haven't tried it, but it looks to me like this change in Spark 4.0.x should resolve the mentioned problem.

1. Dropped the databricksConnect subproject
2. The shading rule is now a variable, defaulting to the OSS value
3. Building for Databricks can be done by overriding the variable
@SemyonSinchenko
Collaborator Author

SemyonSinchenko commented Jun 6, 2025

@james-willis @Kimahriman @rjurney @dmatrix

Hello! After some discussions with engineers from Databricks, and based on your comments, I made the following updates:

  1. I removed databricksConnect as a separate artifact
  2. I changed the proto shading pattern from a literal to a variable
  3. Anyone who wants binary compatibility with Databricks Runtimes can build GraphFrames Connect with the command ./build/sbt -Dvendor.name=dbx connect/assembly
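The vendor-configurable shading in point 2 can be sketched in build.sbt roughly like this (a sketch under assumptions: the shade prefixes and the `"dbx"` branch values are illustrative, not the PR's actual contents):

```scala
// build.sbt sketch: derive the proto shade prefix from -Dvendor.name,
// defaulting to the OSS rule.
val vendorName = sys.props.getOrElse("vendor.name", "oss")

// Hypothetical prefixes; the real values live in the PR's build.sbt.
val protoShadePrefix = vendorName match {
  case "dbx" => "com.databricks.shaded.protobuf"
  case _     => "org.graphframes.shaded.protobuf"
}

assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> s"$protoShadePrefix.@1").inAll
)
```

With this shape, `./build/sbt -Dvendor.name=dbx connect/assembly` picks the Databricks-compatible prefix, while a plain `connect/assembly` keeps the OSS default.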

Please take a look at the PR when you have time.
P.S. As I already mentioned, I have no prior experience publishing anything to Sonatype Central; this is my first time doing it. Please keep this in mind during the review. Thanks!

cc: @WeichenXu123

@james-willis
Collaborator

I looked into what Sedona is doing here, and it seems it shades in its own versions of dependencies when Databricks has some kind of conflict:

https://github.com/apache/sedona/blob/e8c0d5af41670757c6d319d9c0d393b0154ec929/common/pom.xml#L143

Collaborator

@james-willis james-willis left a comment


Will making users compile from source for dbx impede adoption?

@SemyonSinchenko
Collaborator Author

Will making users compile from source for dbx impede adoption?

Running Spark Connect plugins on Databricks is not an easy task anyway, so I think we should be fine. Spark 4.0 should resolve this.

@SemyonSinchenko SemyonSinchenko merged commit 06ee372 into graphframes:master Jun 9, 2025
5 checks passed
@SemyonSinchenko SemyonSinchenko deleted the 510-scala-publish-ci branch July 19, 2025 11:56