Skip to content

Conversation

SemyonSinchenko
Copy link
Collaborator

What changes were proposed in this pull request?

  • LDBC testing utilities
  • LDBC suites for undirected graphs

Why are the changes needed?

Close #568

WARNING This PR contains failed tests! It looks like I found bugs in implementations of PR and CDLP, but fixing this should be done as separate PRs!!!

@SemyonSinchenko
Copy link
Collaborator Author

@rjurney @SauronShepherd @james-willis Hi guys! What do you think about it? Should it be merged with creating a follow-up tickets, or should it be left as pending until all the tests are passed? At the moment there are problems in:

  • PageRank
  • LabelPropagation

Alternatively, I can disable these tests for now to allow contributors enable them manually.

@SemyonSinchenko SemyonSinchenko changed the title feat: ddd LDBC tests feat: add LDBC tests Apr 9, 2025
@SemyonSinchenko
Copy link
Collaborator Author

And, of course, the most interesting thing, is that with scala 2.13 CDLP test passed, with 2.12 failed.

@SemyonSinchenko
Copy link
Collaborator Author

See #572 and #571 :

  • GrpahX PR is manually normalized
  • GraphX CDLP is skipped for 2.12

@SauronShepherd
Copy link
Contributor

I'd like to have a look at it next weekend. I love challenging mysteries!!!

@SemyonSinchenko
Copy link
Collaborator Author

I'd like to have a look at it next weekend. I love challenging mysteries!!!

But what about this PR? If you will work one of found issues, it would be much easier to have these tests in main.

For example, I want to work on the GraphX-free implementation of LabelPropagation and I want to have these tests too.

If you are fine with this PR overall, let's merge it? cc: @rjurney

@SauronShepherd
Copy link
Contributor

But what about this PR? If you will work one of found issues, it would be much easier to have these tests in main.

Yes, of course. I think I've mixed up this new feature with the open issue.

@SemyonSinchenko
Copy link
Collaborator Author

I fixed small things (access modifiers, typo, etc.) and this PR is ready for review.

Copy link
Collaborator

@rjurney rjurney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@SemyonSinchenko SemyonSinchenko merged commit 45f9d8c into graphframes:master Apr 11, 2025
5 checks passed
@SauronShepherd
Copy link
Contributor

SauronShepherd commented Apr 12, 2025

Why is the new test suite named TestLDBCCases? This breaks the existing naming convention in GraphFrames, where none of the test classes include the words 'Test' or 'Cases' and they all end with 'Suite'.

@SauronShepherd
Copy link
Contributor

SauronShepherd commented Apr 12, 2025

Why is the test internally downloading the datasets? Why are we forcing the user to install a command-line tool like zstd? If we don’t want to include the datasets directly in GraphFrames, wouldn’t it be better to simply provide instructions for users to download the data themselves?

Also, why are the datasets being placed in the target folder? I’d prefer to locate them in the project root for better visibility and consistency. Additionally, rebuilding the project or changing the Scala version could trigger unnecessary re-downloads of the datasets.

@rjurney
Copy link
Collaborator

rjurney commented Apr 12, 2025

@SauronShepherd these are good points

@SauronShepherd
Copy link
Contributor

Instead of explicitly defining the algorithms to traverse in each test, wouldn’t it be better to retrieve them from the algorithm itself? After all, what could be better than the algorithm itself in determining its own supported types?

I mean this ...

Seq("graphframes", "graphx").foreach { algo =>

@SemyonSinchenko
Copy link
Collaborator Author

After all, what could be better than the algorithm itself in determining its own supported types?

GraphX is deprecated in Apache Spark 4.0 and it is buggy (#571); it is very hard to support for GraphX implementations such things like weights, directed/undirected, properties, etc. To be honest I was hoping that after finishing #556 with a similar like GraphX performance, we can dpeprecate all the GraphX-related things here. For example, we can mark it deprecated in GraphFrames 1.0. In that case, is there a reason to make tests overcomplicated? What is bad with Seq("graphframes", "graphx").foreach { algo =>? At the moment it allows, for example, to disable LP-test for GraphX in Scala 2.12. If there won't be algo, how else can it be done?

Why is the test internally downloading the datasets? Why are we forcing the user to install a command-line tool like zstd? If we don’t want to include the datasets directly in GraphFrames, wouldn’t it be better to simply provide instructions for users to download the data themselves?

Also, why are the datasets being placed in the target folder? I’d prefer to locate them in the project root for better visibility and consistency. Additionally, rebuilding the project or changing the Scala version could trigger unnecessary re-downloads of the datasets.

Overall answer is the following: LDBC contains not only toy datasets, but also graphs of different size. And even LDBC XS is already bigger than 100 Mb and cannot fit into GitHub limits (that are about 50 Mb as I remember). Of course we can store small files in resources and download big ones in runtime if you prefer that way. But because the structure of all the LDBC datasets is the same, I made a decision to do it like it is done. zstd barely needs to be downloaded, it should be a part of almost all the POSIX-compatible distributions... Instead of making a 50 lines README about how to download, unpack and read LDBC cases, I decided to write 20 lines of scala code. If you want to see it in a different way, could you please create a follow-up ticket and describe how you see it? I will be happy to implement it. The only reason this PR was merged so fast is to allow me, you and other works on tasks, related to this PR.

Why is the new test suite named TestLDBCCases? This breaks the existing naming convention in GraphFrames, where none of the test classes include the words 'Test' or 'Cases' and they all end with 'Suite'.

I can fix it. Please, could you create a follow-up ticket? Thanks!

@rjurney
Copy link
Collaborator

rjurney commented Apr 12, 2025

It is definitely desirable to download these datasets at runtime. I do this for the motif finding tutorial.

@SauronShepherd
Copy link
Contributor

SauronShepherd commented Apr 12, 2025

It is definitely desirable to download these datasets at runtime. I do this for the motif finding tutorial.

We're talking about <2KB files. Do we really want to force users to install a command tool (zstd) for that? Besides, it doesn't work on Windows. Maybe we should provide the "manual way" to download the files if some error is thrown, what do you think?

GraphX is deprecated in Apache Spark 4.0 and it is buggy (#571); it is very hard to support for GraphX implementations such things like weights, directed/undirected, properties, etc.

What that has to do with the fact that the tests had not only to know but also enumerate the types of implementations used in the algorithm they're testing?

is there a reason to make tests overcomplicated?

Adding a common method in the algorithm classes that simply returns a list with the types of implementations each algorithm class supports, so the tests traverses these lists and not an enumeration... is really that overcomplicated? We already have a getAlgorithm in the WithAlgorithmChoice trait ... why not adding a getSupportedAlgorithms?

Besides, imagine we add a new type of implementation in one of the algorithms ... do we also have to update the enumerated list in the test?

zstd barely needs to be downloaded, it should be a part of almost all the POSIX-compatible distributions

It's not available on Windows and the Ubuntu WSL I have installed.

@SemyonSinchenko
Copy link
Collaborator Author

We already have a getAlgorithm in the WithAlgorithmChoice trait ... why not adding a getSupportedAlgorithms?

I'm fine with the idea, can you open a PR?

@SauronShepherd
Copy link
Contributor

lgtm

@SauronShepherd
Copy link
Contributor

We already have a getAlgorithm in the WithAlgorithmChoice trait ... why not adding a getSupportedAlgorithms?

I'm fine with the idea, can you open a PR?

I'm investigating the scala 2.12 issue. Let's wait for any changes to fix that and, if there's no changes in GraphFrames, I'll open a PR with the getSupportedAlgorithms new method. Okay?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: use LDBC test data (small)
3 participants