james-willis
Collaborator

@james-willis james-willis commented Jul 15, 2025

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

  • Bump graphframes version to 0.9.0
  • Handle the change in how graphframes assigns connected component IDs
  • Fix Geostats tests that have always been broken

How was this patch tested?

unit tests

Did this PR include necessary documentation updates?

  • Yes, I have updated the documentation.

@james-willis james-willis changed the title [GH-2085] use graphframes 0.9.0 for spark 3.5+ [WIP] [GH-2085] use graphframes 0.9.0 for spark 3.5+ Jul 15, 2025
@james-willis james-willis force-pushed the graphframes-0.9.0 branch 2 times, most recently from 7012ad7 to ad55a66 July 15, 2025 02:20
@james-willis james-willis changed the title [WIP] [GH-2085] use graphframes 0.9.0 for spark 3.5+ [GH-2085] use graphframes 0.9.0 for spark 3.5+ Jul 15, 2025
@github-actions github-actions bot added the docs label Jul 15, 2025
@james-willis james-willis marked this pull request as ready for review July 15, 2025 02:31
@james-willis james-willis requested a review from jiayuasu as a code owner July 15, 2025 02:31
@james-willis james-willis marked this pull request as draft July 15, 2025 02:33
@james-willis
Collaborator Author

We should wait until 0.9.0 is released and we can point to the non-snapshot version before merging this.

If we disagree with the change in behavior, this PR gives us a chance to discuss before 0.9.0 is released.

@Kimahriman
Contributor

You don't just want to set the config to maintain the old behavior? It seems odd for the output of the dbscan function to depend on a config that a user could set for a transitive dependency they don't even know dbscan uses.

@james-willis
Collaborator Author

I really dislike that Spark configs are global, so modifying one changes the state of the whole cluster.

I'm thinking of adding a setter to graphframes 0.9.0 so I can set this value on the connectedComponents instance without changing spark configs.

Then the output schema of DBSCAN does not depend on the user's setting of the graphframes Spark config (one way or the other).
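
To make the trade-off in this exchange concrete, here is a minimal Python sketch. It is not graphframes, Spark, or Sedona code: `GLOBAL_CONF`, `ConnectedComponentsSketch`, the `use_vertex_ids` flag, and `set_use_vertex_ids` are all hypothetical names invented for illustration, and the flag only stands in for "some setting that changes how component IDs are assigned". The point is that an instance with its own setter keeps a stable output no matter what the shared config says.

```python
# Hypothetical stand-in for a session-wide config (NOT a real Spark or graphframes key).
GLOBAL_CONF = {"use_vertex_ids": False}


class ConnectedComponentsSketch:
    """Toy connected-components runner; all names here are illustrative assumptions."""

    def __init__(self, edges):
        self.edges = list(edges)
        self._use_vertex_ids = None  # None means "fall back to the global config"

    def set_use_vertex_ids(self, value):
        # Per-instance override: affects only this object, not other callers.
        self._use_vertex_ids = value
        return self

    def run(self):
        use_vertex_ids = (
            GLOBAL_CONF["use_vertex_ids"]
            if self._use_vertex_ids is None
            else self._use_vertex_ids
        )

        parent = {}

        def find(x):
            # Union-find lookup with path halving.
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for a, b in self.edges:
            ra, rb = find(a), find(b)
            if ra != rb:
                # Keep the smaller vertex as root so results are deterministic.
                parent[max(ra, rb)] = min(ra, rb)

        roots = {v: find(v) for v in list(parent)}
        if use_vertex_ids:
            # Component ID is the smallest vertex in the component.
            return roots
        # Otherwise assign dense integer IDs, ordered by root vertex.
        ids = {r: i for i, r in enumerate(sorted(set(roots.values())))}
        return {v: ids[r] for v, r in roots.items()}


edges = [("a", "b"), ("c", "d")]
shared = ConnectedComponentsSketch(edges)                            # follows the global config
pinned = ConnectedComponentsSketch(edges).set_use_vertex_ids(False)  # ignores it

before = shared.run()
GLOBAL_CONF["use_vertex_ids"] = True  # someone flips the global setting
after = shared.run()                  # output type of `shared` changes
stable = pinned.run()                 # output of `pinned` does not
```

In this sketch `shared` returns integer IDs before the flip and vertex-valued IDs after it, while `pinned` returns integer IDs throughout, which is the property being asked for when a caller such as DBSCAN wants a fixed output schema.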

@Kimahriman
Contributor

Yeah, it's definitely not the cleanest thing, and I was thinking the same thing: a setting on the connected components object would be much cleaner.

@github-actions github-actions bot removed the docs label Jul 15, 2025
@james-willis james-willis changed the title [GH-2085] use graphframes 0.9.0 for spark 3.5+ [GH-2085] use graphframes 0.9.0 Jul 17, 2025
Contributor

@Kimahriman Kimahriman left a comment

Would be great to get this in and a 1.8.0 release with Spark 4 support!

@james-willis james-willis marked this pull request as ready for review July 20, 2025 17:15
@jiayuasu jiayuasu added this to the sedona-1.8.0 milestone Jul 21, 2025
@jiayuasu jiayuasu merged commit 7a79925 into apache:master Jul 21, 2025
35 checks passed
Development

Successfully merging this pull request may close these issues.

Bump graphframes version to 0.9.0
3 participants