
Conversation

SemyonSinchenko
Collaborator

What changes were proposed in this pull request?

  • Refactor scala-publish.yml to clarify release and snapshot publishing conditions.
  • Adjust docs.yml trigger to specifically include the main branch.
  • Remove unused Sonatype import from build.sbt.
  • Enhance developer metadata and maintainers list in build.sbt.
  • Update dependencies and assembly configuration to address shading and exclude non-connect classes for the Uber JAR.
  • Introduce custom POM post-processing for correct dependency scope adjustments.

Why are the changes needed?

Close #614

@SemyonSinchenko
Collaborator Author

@Kimahriman sorry for tagging, but can I ask you to take a look? You helped a lot in identifying all the problems this PR aims to fix. Thanks in advance!

@Kimahriman
Contributor

Can you describe exactly what you're trying to achieve with the assembly/shading?

@SemyonSinchenko
Collaborator Author

Can you describe exactly what you're trying to achieve with the assembly/shading?

@Kimahriman Only one thing: renaming `com.google.protobuf.*` to `org.sparkproject.protobuf.*`.
Otherwise the SparkConnect plugin won't work in Spark 3.5.x.

Last time you found that graphframes was included in the assembly, as well as slf4j. In this PR I'm trying to fix that.

I ran publishLocal and inspected the resulting JAR and POM:

  • the JAR contains only connect classes (not graphframes or slf4j); if I decompile the plugin, it has the right signature (with `org.sparkproject.protobuf.Any`)
  • the POM has graphframes as a runtime dependency
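For reference, the relocation described above is what sbt-assembly's shade rules express. A minimal sketch (illustrative, not a quote from this PR's `build.sbt`; the rename target matches Spark's own shading prefix):

```scala
// build.sbt fragment, assuming the sbt-assembly plugin is enabled.
// Relocate all protobuf classes so the published plugin references
// org.sparkproject.protobuf.*, matching what Spark 3.5.x ships.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "org.sparkproject.protobuf.@1").inAll
)
```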

@Kimahriman
Contributor

Ok, that's what I thought, just wanted to make sure. What was your experience with sbt-shading? Another option to simplify a lot of things would be to only support the connect stuff in Spark 4; that's what delta is doing with their connect support.

It seems like there should be a simpler way to do this, but I don't know what it would be.

@SemyonSinchenko
Collaborator Author

Ok, that's what I thought, just wanted to make sure. What was your experience with sbt-shading? Another option to simplify a lot of things would be to only support the connect stuff in Spark 4; that's what delta is doing with their connect support.

It seems like there should be a simpler way to do this, but I don't know what it would be.

I tried sbt-shading yesterday and failed -- no docs, no issues, the code is opaque and contains zero comments. The plugin itself looks semi-abandoned, and I couldn't figure out how to make it publish a JAR containing the renamed classes.

that's what delta is doing with their connect support

Actually, I spent some time today trying to figure out how `build.sbt` is done in delta, and it looks like they are using assembly too for publishing: they overwrite `Compile / packageBin := assembly.value` in the same way.

Based on what I found on different resources, it looks like most projects use assembly for shading, not the sbt-shading plugin.

@Kimahriman
Contributor

Ok, I think I figured out the graphframes shading issue. If you print out the classpath for the jars to exclude (using the assembly classpath):

    assembly / assemblyExcludedJars := {
      val cp = (assembly / fullClasspath).value
      val allowedPrefixes = Set("protobuf-java")
      cp.filter { f =>
        println(f)
        !allowedPrefixes.exists(prefix => f.data.getName.startsWith(prefix))
      }
    },

you get

Attributed(/.../graphframes/graphframes-connect/target/scala-2.12/classes)
Attributed(/.../graphframes/target/scala-2.12/classes)
Attributed(/.../Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.18/scala-library-2.12.18.jar)
Attributed(/.../Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/com/google/protobuf/protobuf-java/3.24.4/protobuf-java-3.24.4.jar)

The root project is included directly via classes instead of a jar, so you can't exclude the jar. But you can set

exportJars := true,

on the root project, and the connect project will use the jar instead, which lets you exclude it in assembly:

Attributed(/.../graphframes/graphframes-connect/target/scala-2.12/classes)
Attributed(/.../graphframes/target/scala-2.12/graphframes-spark3_2.12-0.9.0-SNAPSHOT.jar)
Attributed(/.../Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.18/scala-library-2.12.18.jar)
Attributed(/.../Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/com/google/protobuf/protobuf-java/3.24.4/protobuf-java-3.24.4.jar)

So that should fix/remove the need for the custom POM handling.
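Put together, the fix described above could look roughly like this in `build.sbt` (project names and layout are illustrative, not the PR's actual definitions):

```scala
// Sketch: root project exports a JAR so downstream assembly can exclude it.
lazy val graphframes = (project in file("."))
  .settings(
    // Put graphframes on dependents' classpaths as a JAR instead of a
    // classes directory, so assemblyExcludedJars can match it by file name.
    exportJars := true
  )

lazy val graphframesConnect = (project in file("graphframes-connect"))
  .dependsOn(graphframes)
  .settings(
    assembly / assemblyExcludedJars := {
      val cp = (assembly / fullClasspath).value
      val allowedPrefixes = Set("protobuf-java")
      // Keep only protobuf-java in the uber JAR; exclude everything else,
      // including the now-JAR-shaped graphframes root artifact.
      cp.filter(f => !allowedPrefixes.exists(p => f.data.getName.startsWith(p)))
    }
  )
```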

@Kimahriman
Contributor

Actually it's still not perfect because the sbt-protoc plugin adds protobuf-java as a compile dependency, so you end up with that in addition to the shaded jars in the graphframes connect package:

After running publishM2:

 % coursier resolve org.graphframes:graphframes-connect-spark3_2.12:0.9.0-SNAPSHOT
com.google.protobuf:protobuf-java:3.24.4:default
org.graphframes:graphframes-connect-spark3_2.12:0.9.0-SNAPSHOT:runtime
org.graphframes:graphframes-spark3_2.12:0.9.0-SNAPSHOT:runtime
org.scala-lang:scala-library:2.12.18:default
org.slf4j:slf4j-api:2.0.16:default

@SemyonSinchenko
Collaborator Author

After running publishLocal I checked the POM manually; the dependencies look like this at the moment:

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.18</version>
        </dependency>
        <dependency>
            <groupId>org.graphframes</groupId>
            <artifactId>graphframes-spark3_2.12</artifactId>
            <version>0.9.0-SNAPSHOT</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>com.google.protobuf</groupId>
            <artifactId>protobuf-java</artifactId>
            <version>3.24.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.12</artifactId>
            <version>3.5.5</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.5.5</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>3.5.5</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>2.0.16</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.12</artifactId>
            <version>3.0.8</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.github.zafarkhaja</groupId>
            <artifactId>java-semver</artifactId>
            <version>0.10.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-connect_2.12</artifactId>
            <version>3.5.5</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

I think that with `pomPostProcess` I can actually drop `com.google.protobuf:protobuf-java` entirely, because:

  • we are shading it in the code anyway, so there is no usage of `com.google.protobuf`;
  • the shaded dependency will be on the classpath anyway, simply because we use the same shading pattern as Spark itself (or vendors).
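Dropping the dependency via POM post-processing could be sketched like this (illustrative fragment, not the PR's actual code; `pomPostProcess` and `scala.xml` are available inside an sbt build):

```scala
// build.sbt fragment: strip the protobuf-java <dependency> element from the
// published POM, since its classes are shaded into the JAR under a renamed
// package and the original coordinates are never needed at runtime.
import scala.xml.{Elem, Node, NodeSeq}
import scala.xml.transform.{RewriteRule, RuleTransformer}

pomPostProcess := { node: Node =>
  new RuleTransformer(new RewriteRule {
    override def transform(n: Node): NodeSeq = n match {
      case e: Elem
          if e.label == "dependency" &&
            (e \ "artifactId").text == "protobuf-java" =>
        NodeSeq.Empty // drop the shaded dependency entirely
      case other => other
    }
  }).transform(node).head
}
```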

@SemyonSinchenko
Collaborator Author

Tbh I can't figure out why it puts slf4j as default while it should be provided, like in the POM...

 % coursier resolve org.graphframes:graphframes-connect-spark3_2.12:0.9.0-SNAPSHOT
com.google.protobuf:protobuf-java:3.24.4:default
org.graphframes:graphframes-connect-spark3_2.12:0.9.0-SNAPSHOT:runtime
org.graphframes:graphframes-spark3_2.12:0.9.0-SNAPSHOT:runtime
org.scala-lang:scala-library:2.12.18:default
org.slf4j:slf4j-api:2.0.16:default

@Kimahriman
Contributor

Yeah not sure why that is. Turns out Delta's use of assembly isn't actually right either. They use it to shade jackson, but the Kernel modules that do the shading still have it as a runtime dependency as well:

% coursier resolve io.delta:delta-kernel-api:4.0.0                                                                                      
com.fasterxml.jackson.core:jackson-annotations:2.13.5:default
com.fasterxml.jackson.core:jackson-core:2.13.5:default
com.fasterxml.jackson.core:jackson-databind:2.13.5:default
com.fasterxml.jackson.datatype:jackson-datatype-jdk8:2.13.5:default
io.delta:delta-kernel-api:4.0.0:default
org.roaringbitmap:RoaringBitmap:0.9.25:default
org.roaringbitmap:shims:0.9.25:default
org.slf4j:slf4j-api:1.7.36:default

I found this old post that I think describes the correct way to use the assembly plugin for shading: basically, build an intermediate project just for the assembly jar, and then a separate project to actually publish with the right dependencies. I think I have it working if you want me to make a PR to compare.
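The two-project pattern described above could be sketched like this (project names are hypothetical; this is an outline of the idea, not the linked post's exact code):

```scala
// Sketch: an intermediate, never-published project produces the shaded
// assembly JAR; a thin project publishes that JAR with a clean POM.
lazy val connectShaded = (project in file("graphframes-connect"))
  .settings(
    publish / skip := true, // internal only, never published directly
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("com.google.protobuf.**" -> "org.sparkproject.protobuf.@1").inAll
    )
  )

lazy val connectPublished = (project in file("graphframes-connect-publish"))
  .settings(
    // Ship the shaded assembly JAR as this project's main artifact, while
    // this project's own (minimal) libraryDependencies drive the POM.
    Compile / packageBin := (connectShaded / assembly).value
  )
```

The advantage over overriding `packageBin` in the shaded project itself is that the published POM is derived from a project that never saw protobuf-java or the aggregated graphframes classes, so no post-processing is needed.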

Added post-processing to mark the protobuf scope as "provided", because it is part of Apache Spark itself.
@SemyonSinchenko
Collaborator Author

Meanwhile, I fixed the scope of protobuf-java. At the moment it looks like this after publishLocal:

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.18</version>
        </dependency>
        <dependency>
            <groupId>org.graphframes</groupId>
            <artifactId>graphframes-spark3_2.12</artifactId>
            <version>0.9.0-SNAPSHOT</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>com.google.protobuf</groupId>
            <artifactId>protobuf-java</artifactId>
            <version>3.24.4</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.12</artifactId>
            <version>3.5.5</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.5.5</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>3.5.5</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>2.0.16</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.12</artifactId>
            <version>3.0.8</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.github.zafarkhaja</groupId>
            <artifactId>java-semver</artifactId>
            <version>0.10.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-connect_2.12</artifactId>
            <version>3.5.5</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

The size of the output JAR is 224.2 KB.

I always forget that GF uses master as the default branch...
@SemyonSinchenko
Collaborator Author

I will merge this to enable SNAPSHOT publishing. The root problem was fixed in #617.

@SemyonSinchenko SemyonSinchenko merged commit 066260f into graphframes:master Jul 2, 2025
5 checks passed
@SemyonSinchenko SemyonSinchenko deleted the 614-sbt-shading-plugin branch July 19, 2025 11:56
Successfully merging this pull request may close these issues.

feat: use shade plugin for publishing of GFConnect instead of assembly plugin