Skip to content

Conversation

SemyonSinchenko
Copy link
Collaborator

What changes were proposed in this pull request?

  • An experimental no-GraphX implementation if the ShortestPaths
  • Corresponding tests
  • Small changes in .gitignore

Why are the changes needed?

GraphX is deprecated in Spark 4.0; close #557

@SemyonSinchenko
Copy link
Collaborator Author

@SauronShepherd @rjurney Hi! Take a look when you have time! Thanks!

@SauronShepherd
Copy link
Contributor

Some points:

  1. Is it right to use the Experimental Spark annotation? Isn't it something internal of Spark? (only wondering)
  2. check(lmarks, "landmarks") shouldn't be calculated first and then used in the match statement?
  3. In the ConnectedComponents class, constants are defined at the begining of the class while in the ShortesPaths are located at the end.
  4. Differences in comments between ConnectedComponents and ShortesPaths:
    • No comment in the supportedAlgorithms definition.
    • No comment in the setAlgorithm method.
    • No comment in the run method.
  5. toSeq is redundant in landmarks(value.asScala.toSeq) (not one of your changes, but now that you're updating the class ...).
  6. Why isn't the maximum number of iterations left to the user to configure?
  7. earlyStopping = true could have performance implications when the number of iterations is well known. Besides, couldn't that be useful for some users, to set a maximum threshold for finding shortest paths?
  8. Pregel.src(DISTANCE_ID) and Pregel.dst(DISTANCE_ID) appear three times each; it might be clearer to define two constants and reuse them.
  9. aggregateArrayOfDistanceMaps is a one-line method that's used only in one place. Is it necessary?
  10. lit(landmarks.head) is used twice in the same line.
  11. I don't understand the logic behind the initDistancesMap method: why split the landmarks to end up applying the same logic to both parts? (I'm probably missing something)
  12. Word typos:
    • sended
    • undireceted
    • non null => non-null
  13. Instead of repeating most of the two GraphX unit tests for the new algorithm, it would be clearer and simpler to reuse the same code for both algorithms as much as possible.

@SemyonSinchenko
Copy link
Collaborator Author

SemyonSinchenko commented Mar 23, 2025

Is it right to use the Experimental Spark annotation? Isn't it something internal of Spark? (only wondering)

Deleted. I just wanted to mark somehow the method as exprimental.

check(lmarks, "landmarks") shouldn't be calculated first and then used in the match statement?

Fixed

In the ConnectedComponents class, constants are defined at the begining of the class while in the ShortesPaths are located at the end.

Differences in comments between ConnectedComponents and ShortesPaths:

I moved it to the trait.

toSeq is redundant in landmarks(value.asScala.toSeq) (not one of your changes, but now that you're updating the class ...).

Without it I'm getting error (see CI fail). May be fixed by importing javaConversions, but let's leave it for the next PRs? There are a lot of such a things inside the code, it may be a good first issue.

Why isn't the maximum number of iterations left to the user to configure?

That is how GraphX impl works at the moment. It may be hard to expect from user a good estimation, it is not something like PageRank. A wrong choice of maxIter will tend that you will get a valid result and you cannot even check the correctness. For example, algo returns you that one of vertices has distance 5 to one of landmakrs, how would you check that it is correct? Maybe on the next iteration it would be 4...

earlyStopping = true could have performance implications when the number of iterations is well known. Besides, couldn't that be useful for some users, to set a maximum threshold for finding shortest paths?

See above. I think shortestPaths is not the algorithm when we can leave it to user.

Pregel.src(DISTANCE_ID) and Pregel.dst(DISTANCE_ID) appear three times each; it might be clearer to define two constants and reuse them.

Fixed

aggregateArrayOfDistanceMaps is a one-line method that's used only in one place. Is it necessary?

I do not like long one-liners tbh. Btw, fixed.

lit(landmarks.head) is used twice in the same line.

Fixed

I don't understand the logic behind the initDistancesMap method: why split the landmarks to end up applying the same logic to both parts? (I'm probably missing something)

I added an explanation as a comment, please see in the code.

Instead of repeating most of the two GraphX unit tests for the new algorithm, it would be clearer and simpler to reuse the same code for both algorithms as much as possible.

Tbh, I want to update tests. I'm going to do it as part as introduction of LDBC checks because for LDBC graphs we have ground truth. Does it sound good to you?

@SauronShepherd

@SauronShepherd
Copy link
Contributor

I just wanted to mark somehow the method as exprimental.

I liked the idea and didn't know such exprimental annotation existed in Spark. It's only that ... it seemed somehow odd to me using it, because it's like something internal of Spark. That's all.

toSeq is redundant

My IDE (IntelliJ) highlighted that. I drop the toSeq and no error was thrown, but maybe there's some problem ... don't worry about this.

A wrong choice of maxIter will tend that you will get a valid result and you cannot even check the correctness. For example, algo returns you that one of vertices has distance 5 to one of landmakrs, how would you check that it is correct? Maybe on the next iteration it would be 4...

I thought this was like the feature in Neo4J, that you could set the number of maximum "jumps".

See above. I think shortestPaths is not the algorithm when we can leave it to user.

I still think it would be nice for the the user to be able to define a threshold. But i'm not an expert on graphs.

aggregateArrayOfDistanceMaps is a one-line method that's used only in one place. Is it necessary?

I do not like long one-liners tbh. Btw, fixed.

I do not like using single-line methods that are only used in one place. I think they make the code harder to read, but maybe this is more a matter of personal preference than anything else.

I'm going to do it as part as introduction of LDBC checks because for LDBC graphs we have ground truth. Does it sound good to you?

Yes, of course. We don't need to improve everything in one go, do we?

I've seen that you've introduced some changes in the ConnectedComponents class. I wasn't expecting you to update something functionally completely out of the scope of this PR. Besides, I was waiting for my last changes (the removal of the counts) to be merged before introducing more changes to this class. Was this necessary for this PR to pass?

@SauronShepherd
Copy link
Contributor

SauronShepherd commented Mar 23, 2025

I've just had a look at your changes in ConnectedComponents and the new mixins.scala. Cool! 👍

@rjurney, could you please merge this PR when you get a chance? (once the tests get fixed)

@SemyonSinchenko
Copy link
Collaborator Author

@rjurney What do you think about this PR?

@SemyonSinchenko SemyonSinchenko requested a review from rjurney March 27, 2025 18:06
@SemyonSinchenko
Copy link
Collaborator Author

@SauronShepherd @rjurney Hi! This one is ready for review, the only change from the last round is adding early stopping. This implementation should be considered as experimental because it is still slower compared to the GraphX-based one. I'm going to continue my work on the performance improvement as part of overall Pregel revisiting.

@rjurney
Copy link
Collaborator

rjurney commented Apr 2, 2025

@SemyonSinchenko you’re so productive on the code base that I think I’m going to continue the documentation work I did with motifs by starting a Pregel tutorial to go along with the motif finding one. It is a really powerful feature that I’ve been trying to learn to mentally use for a few years now but haven’t done much with it. This could open up the API to a lot more people and give them another reason to use GraphFrames and your improvements. Thoughts?

created #562

Copy link
Collaborator

@rjurney rjurney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@SemyonSinchenko
Copy link
Collaborator Author

I'm going to merge this one after #370

@rjurney
Copy link
Collaborator

rjurney commented Apr 2, 2025

cool

@SemyonSinchenko SemyonSinchenko merged commit dd21ef6 into graphframes:master Apr 3, 2025
5 checks passed
@SemyonSinchenko SemyonSinchenko deleted the 557-shortest-paths-no-graphx branch April 6, 2025 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: ShortestPaths without GraphX
3 participants