chore: Cleanup assembly and shading #617

Merged
merged 7 commits into from
Jul 2, 2025

Conversation

Kimahriman
Contributor

What changes were proposed in this pull request?

Resolves #614

The sbt-assembly plugin is meant for creating fat/uber JARs, so it does not adjust the POM of a published library to account for what has been shaded. This PR therefore creates an intermediate project that performs the connect shading, plus a final project that carries the correct dependencies and the shaded JAR for actual publishing.
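
A minimal sketch of the two-project layout described above, assuming the sbt-assembly plugin is enabled. The project names `connectShaded` and `connectPublished`, their directories, and the `com.google.protobuf.**` source pattern are illustrative assumptions, not the PR's actual build definition. The relocation target is the `org.sparkproject.connect.protobuf` package that appears later in this thread.

```scala
// build.sbt (sketch only; project names and paths are hypothetical)

// Intermediate project: compiles the connect code and builds a shaded JAR.
lazy val connectShaded = (project in file("graphframes-connect"))
  .settings(
    // Relocate protobuf references to the package Spark uses for its shaded copy.
    assembly / assemblyShadeRules := Seq(
      ShadeRule
        .rename("com.google.protobuf.**" -> "org.sparkproject.connect.protobuf.@1")
        .inAll
    ),
    // This project exists only to produce the shaded JAR, so it is not published.
    publish / skip := true
  )

// Final project: has no sources of its own, only the dependencies that should
// appear in the published POM, and reuses the shaded JAR as its main artifact.
lazy val connectPublished = (project in file("graphframes-connect-publish"))
  .settings(
    name := "graphframes-connect",
    libraryDependencies += "org.apache.spark" %% "spark-connect" % "4.0.0" % Provided,
    Compile / packageBin := (connectShaded / assembly).value
  )
```

With a layout like this, `connectPublished / publishM2` would publish the shaded JAR together with a POM derived from the final project's own `libraryDependencies`.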

Why are the changes needed?

Fix connect artifact so only protobuf is shaded.

Comment on lines -119 to -128
// Assembly settings
assembly / test := {}, // No tests in assembly
assemblyPackageScala / assembleArtifact := false,
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x if x.endsWith("module-info.class") => MergeStrategy.discard
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
},
Contributor Author

I removed this because I don't think there's any need to run assembly on the root project, unless you want to keep the ability to manually build a fat JAR?

@Kimahriman
Contributor Author

POM from publishM2:

<?xml version='1.0' encoding='UTF-8'?>
<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.graphframes</groupId>
    <artifactId>graphframes-connect-spark4_2.13</artifactId>
    <packaging>jar</packaging>
    <description>graphframes-connect</description>
    <url>https://graphframes.io/</url>
    <version>0.9.0-SNAPSHOT</version>
    <licenses>
        <license>
            <name>Apache-2.0</name>
            <url>https://opensource.org/licenses/Apache-2.0</url>
            <distribution>repo</distribution>
        </license>
    </licenses>
    <name>graphframes-connect</name>
    <organization>
        <name>org.graphframes</name>
        <url>https://graphframes.io/</url>
    </organization>
    <scm>
        <url>https://github.com/graphframes/graphframes</url>
        <connection>scm:git@github.com:graphframes/graphframes.git</connection>
    </scm>
    <developers>
        <developer>
            <id>rjurney</id>
            <name>Russell Jurney</name>
            <url>https://github.com/rjurney</url>
            <email>russell.jurney@gmail.com</email>
        </developer>
        <developer>
            <id>SemyonSinchenko</id>
            <name>Sem</name>
            <url>https://github.com/SemyonSinchenko</url>
            <email>ssinchenko@apache.org</email>
        </developer>
    </developers>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.13.12</version>
        </dependency>
        <dependency>
            <groupId>org.graphframes</groupId>
            <artifactId>graphframes-spark4_2.13</artifactId>
            <version>0.9.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.13</artifactId>
            <version>4.0.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.13</artifactId>
            <version>4.0.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.13</artifactId>
            <version>4.0.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>2.0.16</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.13</artifactId>
            <version>3.0.8</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.github.zafarkhaja</groupId>
            <artifactId>java-semver</artifactId>
            <version>0.10.2</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>

@Kimahriman
Contributor Author

% jar -tvf /.../.m2/repository/org/graphframes/graphframes-connect-spark4_2.13/0.9.0-SNAPSHOT/graphframes-connect-spark4_2.13-0.9.0-SNAPSHOT.jar
   374 Fri Jan 01 00:00:00 EST 2010 META-INF/MANIFEST.MF
     0 Fri Jan 01 00:00:00 EST 2010 META-INF/
  3042 Fri Jan 01 00:00:00 EST 2010 graphframes.proto
     0 Fri Jan 01 00:00:00 EST 2010 org/
     0 Fri Jan 01 00:00:00 EST 2010 org/apache/
     0 Fri Jan 01 00:00:00 EST 2010 org/apache/spark/
     0 Fri Jan 01 00:00:00 EST 2010 org/apache/spark/sql/
     0 Fri Jan 01 00:00:00 EST 2010 org/apache/spark/sql/graphframes/
  2740 Fri Jan 01 00:00:00 EST 2010 org/apache/spark/sql/graphframes/GraphFramesConnect.class
 19689 Fri Jan 01 00:00:00 EST 2010 org/apache/spark/sql/graphframes/GraphFramesConnectUtils$.class
  1205 Fri Jan 01 00:00:00 EST 2010 org/apache/spark/sql/graphframes/GraphFramesConnectUtils.class
     0 Fri Jan 01 00:00:00 EST 2010 org/graphframes/
     0 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/
     0 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/
  2172 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/AggregateMessages$1.class
 13940 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/AggregateMessages$Builder.class
 11513 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/AggregateMessages.class
   571 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/AggregateMessagesOrBuilder.class
  2046 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/BFS$1.class
 14180 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/BFS$Builder.class
 11509 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/BFS.class
   576 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/BFSOrBuilder.class
  2181 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ColumnOrExpression$1.class
   947 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ColumnOrExpression$2.class
  9927 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ColumnOrExpression$Builder.class
  2252 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ColumnOrExpression$ColOrExprCase.class
 11755 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ColumnOrExpression.class
   661 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ColumnOrExpressionOrBuilder.class
  2190 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ConnectedComponents$1.class
  9521 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ConnectedComponents$Builder.class
 11235 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ConnectedComponents.class
   420 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ConnectedComponentsOrBuilder.class
  2199 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/DropIsolatedVertices$1.class
  6977 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/DropIsolatedVertices$Builder.class
  9410 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/DropIsolatedVertices.class
   227 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/DropIsolatedVerticesOrBuilder.class
  2118 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/FilterEdges$1.class
 10904 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/FilterEdges$Builder.class
 10359 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/FilterEdges.class
   412 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/FilterEdgesOrBuilder.class
  2145 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/FilterVertices$1.class
 10955 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/FilterVertices$Builder.class
 10434 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/FilterVertices.class
   418 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/FilterVerticesOrBuilder.class
  2055 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/Find$1.class
  8319 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/Find$Builder.class
 10139 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/Find.class
   316 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/FindOrBuilder.class
  2145 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/GraphFramesAPI$1.class
  1783 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/GraphFramesAPI$2.class
 55038 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/GraphFramesAPI$Builder.class
  3407 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/GraphFramesAPI$MethodCase.class
 22310 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/GraphFramesAPI.class
  4042 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/GraphFramesAPIOrBuilder.class
 12110 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/Graphframes.class
  2163 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/LabelPropagation$1.class
  7598 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/LabelPropagation$Builder.class
  9704 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/LabelPropagation.class
   246 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/LabelPropagationOrBuilder.class
  2091 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/PageRank$1.class
 12293 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/PageRank$Builder.class
 11699 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/PageRank.class
   513 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/PageRankOrBuilder.class
  2271 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ParallelPersonalizedPageRank$1.class
 15629 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ParallelPersonalizedPageRank$Builder.class
 12120 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ParallelPersonalizedPageRank.class
   762 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ParallelPersonalizedPageRankOrBuilder.class
  2235 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/PowerIterationClustering$1.class
  9601 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/PowerIterationClustering$Builder.class
 11412 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/PowerIterationClustering.class
   431 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/PowerIterationClusteringOrBuilder.class
  2073 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/Pregel$1.class
 26478 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/Pregel$Builder.class
 15724 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/Pregel.class
  1524 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/PregelOrBuilder.class
  2118 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/SVDPlusPlus$1.class
 10279 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/SVDPlusPlus$Builder.class
 11876 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/SVDPlusPlus.class
   384 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/SVDPlusPlusOrBuilder.class
  2136 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ShortestPaths$1.class
 14361 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ShortestPaths$Builder.class
 10842 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ShortestPaths.class
   675 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/ShortestPathsOrBuilder.class
  2145 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/StringOrLongID$1.class
   886 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/StringOrLongID$2.class
  9877 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/StringOrLongID$Builder.class
  2162 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/StringOrLongID$IdCase.class
 11599 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/StringOrLongID.class
   637 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/StringOrLongIDOrBuilder.class
  2262 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/StronglyConnectedComponents$1.class
  7774 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/StronglyConnectedComponents$Builder.class
  9979 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/StronglyConnectedComponents.class
   268 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/StronglyConnectedComponentsOrBuilder.class
  2136 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/TriangleCount$1.class
  6879 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/TriangleCount$Builder.class
  9235 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/TriangleCount.class
   213 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/TriangleCountOrBuilder.class
  2091 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/Triplets$1.class
  6809 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/Triplets$Builder.class
  9110 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/Triplets.class
   203 Fri Jan 01 00:00:00 EST 2010 org/graphframes/connect/proto/TripletsOrBuilder.class
     0 Fri Jan 01 00:00:00 EST 2010 org/sparkproject/
     0 Fri Jan 01 00:00:00 EST 2010 org/sparkproject/connect/
     0 Fri Jan 01 00:00:00 EST 2010 org/sparkproject/connect/protobuf/

@SemyonSinchenko
Collaborator

Can we still exclude org.sparkproject.connect.protobuf from the output JAR somehow? It increases the size of the JAR from ~400 KB (in my PR) to ~2 MB (in this PR).

And, most importantly, it will always be on the classpath of any application running on top of Spark, because it is part of Spark itself...

@Kimahriman
Contributor Author

Oh yeah, I was able to get that working by just excluding all JARs. It's annoying that Spark shades this: normally you could just use import org.sparkproject.connect.protobuf.* directly, but obviously you can't in the sources generated by protoc.
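
One way to express "excluding all JARs" with sbt-assembly is to put the whole assembly classpath into `assemblyExcludedJars`. This is a hedged sketch of that idea and may differ from what the PR actually does:

```scala
// build.sbt fragment (sketch, assuming the sbt-assembly plugin is enabled)

// Exclude every dependency JAR on the assembly classpath, so that only the
// project's own compiled classes end up in the assembled JAR.
assembly / assemblyExcludedJars := (assembly / fullClasspath).value

// Keep the Scala library out of the assembled JAR as well
// (the same setting appears in the snippet quoted earlier in this thread).
assemblyPackageScala / assembleArtifact := false
```

The design choice here is that the relocated protobuf classes are expected to be provided by Spark at runtime, so the JAR only needs the project's own classes with their references rewritten.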

@Kimahriman
Contributor Author

% ll -h ../graphframes-connect/target/scala-2.13/graphframes-connect-spark4_2.13-0.9.0-SNAPSHOT.jar
-rw-r--r--  1 abinford  staff   210K Jul  2 11:20 ../graphframes-connect/target/scala-2.13/graphframes-connect-spark4_2.13-0.9.0-SNAPSHOT.jar

@Kimahriman
Contributor Author

Actually, I figured out how to simplify this even more: the extra project isn't needed.

@Kimahriman
Contributor Author

<?xml version='1.0' encoding='UTF-8'?>
<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.graphframes</groupId>
    <artifactId>graphframes-connect-spark4_2.13</artifactId>
    <packaging>jar</packaging>
    <description>graphframes-connect</description>
    <url>https://graphframes.io/</url>
    <version>0.9.0-SNAPSHOT</version>
    <licenses>
        <license>
            <name>Apache-2.0</name>
            <url>https://opensource.org/licenses/Apache-2.0</url>
            <distribution>repo</distribution>
        </license>
    </licenses>
    <name>graphframes-connect</name>
    <organization>
        <name>org.graphframes</name>
        <url>https://graphframes.io/</url>
    </organization>
    <scm>
        <url>https://github.com/graphframes/graphframes</url>
        <connection>scm:git@github.com:graphframes/graphframes.git</connection>
    </scm>
    <developers>
        <developer>
            <id>rjurney</id>
            <name>Russell Jurney</name>
            <url>https://github.com/rjurney</url>
            <email>russell.jurney@gmail.com</email>
        </developer>
        <developer>
            <id>SemyonSinchenko</id>
            <name>Sem</name>
            <url>https://github.com/SemyonSinchenko</url>
            <email>ssinchenko@apache.org</email>
        </developer>
    </developers>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.13.12</version>
        </dependency>
        <dependency>
            <groupId>org.graphframes</groupId>
            <artifactId>graphframes-spark4_2.13</artifactId>
            <version>0.9.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.13</artifactId>
            <version>4.0.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.13</artifactId>
            <version>4.0.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.13</artifactId>
            <version>4.0.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>2.0.16</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.13</artifactId>
            <version>3.0.8</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.github.zafarkhaja</groupId>
            <artifactId>java-semver</artifactId>
            <version>0.10.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-connect_2.13</artifactId>
            <version>4.0.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
</project>

Collaborator

@SemyonSinchenko SemyonSinchenko left a comment

Fantastic! Thanks a lot @Kimahriman !!!

@SemyonSinchenko SemyonSinchenko merged commit e0180ba into graphframes:master Jul 2, 2025
5 checks passed
SemyonSinchenko added a commit to SemyonSinchenko/graphframes that referenced this pull request Jul 2, 2025
SemyonSinchenko added a commit that referenced this pull request Jul 2, 2025
* **Update Scala CI workflows and build configurations**

- Refactor `scala-publish.yml` to clarify release and snapshot publishing conditions.
- Adjust `docs.yml` trigger to specifically include the `main` branch.
- Remove unused Sonatype import from `build.sbt`.
- Enhance developer metadata and maintainers list in `build.sbt`.
- Update dependencies and assembly configuration to address shading and exclude non-connect classes for the Uber JAR.
- Introduce custom POM post-processing for correct dependency scope adjustments.

* Add missing developer email

* Specify the scope for protobuf-java

Added a post-processing step that sets the protobuf scope to "provided", because protobuf is part of Apache Spark itself.

* Take everything from #617

* main -> master

I always forget that GF uses master as its default branch...
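
The POM post-processing mentioned in the commit message above (marking `protobuf-java` as "provided") could look roughly like the following. This is a minimal sketch built on sbt's `pomPostProcess` setting and scala-xml's `RuleTransformer`; it is not the code from the referenced commit, and matching on the `protobuf-java` artifact id is an assumption taken from the commit message.

```scala
// build.sbt fragment (sketch only, not the actual commit)
import scala.xml.{Elem, Node => XmlNode}
import scala.xml.transform.{RewriteRule, RuleTransformer}

pomPostProcess := { (pom: XmlNode) =>
  val markProtobufProvided = new RewriteRule {
    override def transform(n: XmlNode): Seq[XmlNode] = n match {
      // Match <dependency> elements whose artifactId is protobuf-java.
      case e: Elem
          if e.label == "dependency" && (e \ "artifactId").text == "protobuf-java" =>
        // Drop any existing <scope> child, then append <scope>provided</scope>.
        val withoutScope = e.child.filterNot(_.label == "scope")
        e.copy(child = withoutScope :+ <scope>provided</scope>)
      case other => other
    }
  }
  // The rule is applied recursively; the root <project> element is returned.
  new RuleTransformer(markProtobufProvided).transform(pom).head
}
```

The result can be checked by running `makePom` (or `publishM2`, as earlier in this thread) and inspecting the generated `<dependencies>` section.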
Development

Successfully merging this pull request may close these issues.

feat: use shade plugin for publishing of GFConnect instead of assembly plugin
2 participants