feat: add an experimental implementation of SP in GF #558

SemyonSinchenko · 2025-03-22T19:50:16Z

What changes were proposed in this pull request?

An experimental no-GraphX implementation if the ShortestPaths
Corresponding tests
Small changes in .gitignore

Why are the changes needed?

GraphX is deprecated in Spark 4.0; close #557

SemyonSinchenko · 2025-03-22T20:00:53Z

@SauronShepherd @rjurney Hi! Take a look when you have time! Thanks!

SauronShepherd · 2025-03-23T03:51:27Z

Some points:

Is it right to use the Experimental Spark annotation? Isn't it something internal of Spark? (only wondering)
check(lmarks, "landmarks") shouldn't be calculated first and then used in the match statement?
In the ConnectedComponents class, constants are defined at the begining of the class while in the ShortesPaths are located at the end.
Differences in comments between ConnectedComponents and ShortesPaths:
- No comment in the supportedAlgorithms definition.
- No comment in the setAlgorithm method.
- No comment in the run method.
toSeq is redundant in landmarks(value.asScala.toSeq) (not one of your changes, but now that you're updating the class ...).
Why isn't the maximum number of iterations left to the user to configure?
earlyStopping = true could have performance implications when the number of iterations is well known. Besides, couldn't that be useful for some users, to set a maximum threshold for finding shortest paths?
Pregel.src(DISTANCE_ID) and Pregel.dst(DISTANCE_ID) appear three times each; it might be clearer to define two constants and reuse them.
aggregateArrayOfDistanceMaps is a one-line method that's used only in one place. Is it necessary?
lit(landmarks.head) is used twice in the same line.
I don't understand the logic behind the initDistancesMap method: why split the landmarks to end up applying the same logic to both parts? (I'm probably missing something)
Word typos:
- sended
- undireceted
- non null => non-null
Instead of repeating most of the two GraphX unit tests for the new algorithm, it would be clearer and simpler to reuse the same code for both algorithms as much as possible.

SemyonSinchenko · 2025-03-23T16:31:21Z

Is it right to use the Experimental Spark annotation? Isn't it something internal of Spark? (only wondering)

Deleted. I just wanted to mark somehow the method as exprimental.

check(lmarks, "landmarks") shouldn't be calculated first and then used in the match statement?

Fixed

In the ConnectedComponents class, constants are defined at the begining of the class while in the ShortesPaths are located at the end.

Differences in comments between ConnectedComponents and ShortesPaths:

I moved it to the trait.

toSeq is redundant in landmarks(value.asScala.toSeq) (not one of your changes, but now that you're updating the class ...).

Without it I'm getting error (see CI fail). May be fixed by importing javaConversions, but let's leave it for the next PRs? There are a lot of such a things inside the code, it may be a good first issue.

Why isn't the maximum number of iterations left to the user to configure?

That is how GraphX impl works at the moment. It may be hard to expect from user a good estimation, it is not something like PageRank. A wrong choice of maxIter will tend that you will get a valid result and you cannot even check the correctness. For example, algo returns you that one of vertices has distance 5 to one of landmakrs, how would you check that it is correct? Maybe on the next iteration it would be 4...

earlyStopping = true could have performance implications when the number of iterations is well known. Besides, couldn't that be useful for some users, to set a maximum threshold for finding shortest paths?

See above. I think shortestPaths is not the algorithm when we can leave it to user.

Pregel.src(DISTANCE_ID) and Pregel.dst(DISTANCE_ID) appear three times each; it might be clearer to define two constants and reuse them.

Fixed

aggregateArrayOfDistanceMaps is a one-line method that's used only in one place. Is it necessary?

I do not like long one-liners tbh. Btw, fixed.

lit(landmarks.head) is used twice in the same line.

Fixed

I don't understand the logic behind the initDistancesMap method: why split the landmarks to end up applying the same logic to both parts? (I'm probably missing something)

I added an explanation as a comment, please see in the code.

Instead of repeating most of the two GraphX unit tests for the new algorithm, it would be clearer and simpler to reuse the same code for both algorithms as much as possible.

Tbh, I want to update tests. I'm going to do it as part as introduction of LDBC checks because for LDBC graphs we have ground truth. Does it sound good to you?

@SauronShepherd

SauronShepherd · 2025-03-23T16:53:37Z

I just wanted to mark somehow the method as exprimental.

I liked the idea and didn't know such exprimental annotation existed in Spark. It's only that ... it seemed somehow odd to me using it, because it's like something internal of Spark. That's all.

toSeq is redundant

My IDE (IntelliJ) highlighted that. I drop the toSeq and no error was thrown, but maybe there's some problem ... don't worry about this.

A wrong choice of maxIter will tend that you will get a valid result and you cannot even check the correctness. For example, algo returns you that one of vertices has distance 5 to one of landmakrs, how would you check that it is correct? Maybe on the next iteration it would be 4...

I thought this was like the feature in Neo4J, that you could set the number of maximum "jumps".

See above. I think shortestPaths is not the algorithm when we can leave it to user.

I still think it would be nice for the the user to be able to define a threshold. But i'm not an expert on graphs.

aggregateArrayOfDistanceMaps is a one-line method that's used only in one place. Is it necessary?

I do not like long one-liners tbh. Btw, fixed.

I do not like using single-line methods that are only used in one place. I think they make the code harder to read, but maybe this is more a matter of personal preference than anything else.

I'm going to do it as part as introduction of LDBC checks because for LDBC graphs we have ground truth. Does it sound good to you?

Yes, of course. We don't need to improve everything in one go, do we?

I've seen that you've introduced some changes in the ConnectedComponents class. I wasn't expecting you to update something functionally completely out of the scope of this PR. Besides, I was waiting for my last changes (the removal of the counts) to be merged before introducing more changes to this class. Was this necessary for this PR to pass?

SauronShepherd · 2025-03-23T16:56:40Z

I've just had a look at your changes in ConnectedComponents and the new mixins.scala. Cool! 👍

@rjurney, could you please merge this PR when you get a chance? (once the tests get fixed)

SemyonSinchenko · 2025-03-27T18:06:12Z

@rjurney What do you think about this PR?

…aths-no-graphx

SemyonSinchenko · 2025-04-01T16:13:27Z

@SauronShepherd @rjurney Hi! This one is ready for review, the only change from the last round is adding early stopping. This implementation should be considered as experimental because it is still slower compared to the GraphX-based one. I'm going to continue my work on the performance improvement as part of overall Pregel revisiting.

rjurney · 2025-04-02T02:57:54Z

@SemyonSinchenko you’re so productive on the code base that I think I’m going to continue the documentation work I did with motifs by starting a Pregel tutorial to go along with the motif finding one. It is a really powerful feature that I’ve been trying to learn to mentally use for a few years now but haven’t done much with it. This could open up the API to a lot more people and give them another reason to use GraphFrames and your improvements. Thoughts?

created #562

rjurney

lgtm

src/main/scala/org/graphframes/lib/ConnectedComponents.scala

.gitignore

src/main/scala/org/graphframes/lib/ShortestPaths.scala

SemyonSinchenko · 2025-04-02T05:57:50Z

I'm going to merge this one after #370

rjurney · 2025-04-02T16:34:42Z

cool

…aths-no-graphx

Add an experimental implementation of SP in GF

2bca985

SemyonSinchenko mentioned this pull request Mar 22, 2025

[EPIC] Alternative implementations of GraphX based algorithms #556

Open

6 tasks

SemyonSinchenko added 2 commits March 23, 2025 17:19

From comments

0c5c539

Revert removing toSeq and fix head

106d450

Fix failing tests

f1a3d1c

Fix failing tests

f4eebe8

SemyonSinchenko requested a review from rjurney March 27, 2025 18:06

SemyonSinchenko added 2 commits April 1, 2025 18:05

Merge remote-tracking branch 'graphframes/master' into 557-shortest-p…

16d7671

…aths-no-graphx

Add early stopping

e91d14f

rjurney approved these changes Apr 2, 2025

View reviewed changes

src/main/scala/org/graphframes/lib/ConnectedComponents.scala Show resolved Hide resolved

.gitignore Show resolved Hide resolved

src/main/scala/org/graphframes/lib/ShortestPaths.scala Show resolved Hide resolved

Merge remote-tracking branch 'graphframes/master' into 557-shortest-p…

847680c

…aths-no-graphx

SemyonSinchenko merged commit dd21ef6 into graphframes:master Apr 3, 2025
5 checks passed

SemyonSinchenko deleted the 557-shortest-paths-no-graphx branch April 6, 2025 09:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add an experimental implementation of SP in GF #558

feat: add an experimental implementation of SP in GF #558

Uh oh!

SemyonSinchenko commented Mar 22, 2025

Uh oh!

SemyonSinchenko commented Mar 22, 2025

Uh oh!

SauronShepherd commented Mar 23, 2025

Uh oh!

SemyonSinchenko commented Mar 23, 2025 •

edited

Loading

Uh oh!

SauronShepherd commented Mar 23, 2025

Uh oh!

SauronShepherd commented Mar 23, 2025 •

edited

Loading

Uh oh!

SemyonSinchenko commented Mar 27, 2025

Uh oh!

SemyonSinchenko commented Apr 1, 2025

Uh oh!

rjurney commented Apr 2, 2025 •

edited

Loading

Uh oh!

rjurney left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SemyonSinchenko commented Apr 2, 2025

Uh oh!

rjurney commented Apr 2, 2025

Uh oh!

Uh oh!

Uh oh!

feat: add an experimental implementation of SP in GF #558

feat: add an experimental implementation of SP in GF #558

Uh oh!

Conversation

SemyonSinchenko commented Mar 22, 2025

What changes were proposed in this pull request?

Why are the changes needed?

Uh oh!

SemyonSinchenko commented Mar 22, 2025

Uh oh!

SauronShepherd commented Mar 23, 2025

Uh oh!

SemyonSinchenko commented Mar 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SauronShepherd commented Mar 23, 2025

Uh oh!

SauronShepherd commented Mar 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SemyonSinchenko commented Mar 27, 2025

Uh oh!

SemyonSinchenko commented Apr 1, 2025

Uh oh!

rjurney commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rjurney left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SemyonSinchenko commented Apr 2, 2025

Uh oh!

rjurney commented Apr 2, 2025

Uh oh!

Uh oh!

Uh oh!

SemyonSinchenko commented Mar 23, 2025 •

edited

Loading

SauronShepherd commented Mar 23, 2025 •

edited

Loading

rjurney commented Apr 2, 2025 •

edited

Loading