@rjurney rjurney commented Feb 21, 2025

Do-over of #518 since I mistakenly merged it to rjurney/build-upgrades rather than master.

This is a sub-PR of the monster PR #473 and contains the code that corresponds to the docs PR #511. It needs to be merged after #512 and before #511.

These changes do the following:

  1. Allow users to download the Stack Exchange data dump via a CLI at graphframes.tutorials.download. [Thought: this Click usage could be the basis for future CLI commands under a graphframes command? Just an idea.]
  2. Convert the XML to a Parquet file in graphframes.tutorials.stackexchange.
  3. Build a test knowledge graph out of the data dump, also in graphframes.tutorials.stackexchange. The id / Id fields no longer need to match case.
  4. Run some motifs on that knowledge graph in graphframes.tutorials.motif. This is here to make the tutorials unit-testable in the future and to serve as a GitHub-browsable reference that matches the Motif Finding tutorial in #511 ("3 of 3: Documentation cleanup and update. Added a motif finding tutorial.").
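For reviewers unfamiliar with the dump format: Stack Exchange dumps ship as XML files in which each record is a `<row ... />` element with its fields as attributes. Below is a minimal plain-Python sketch of the case-insensitive id lookup mentioned in item 3. It is not the Spark code in this PR; the helper names `parse_rows` and `get_id` and the sample XML are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of a Stack Exchange dump file.
# The second row deliberately uses a lowercase "id" attribute.
SAMPLE_XML = """<posts>
  <row Id="1" PostTypeId="1" Title="What is a motif?" />
  <row id="2" PostTypeId="2" ParentId="1" />
</posts>"""


def parse_rows(xml_text):
    """Return one dict of attributes per <row> element."""
    root = ET.fromstring(xml_text)
    return [dict(row.attrib) for row in root.iter("row")]


def get_id(record):
    """Look up the id field regardless of case ("Id", "id", "ID")."""
    for key, value in record.items():
        if key.lower() == "id":
            return value
    raise KeyError("record has no id field")


records = parse_rows(SAMPLE_XML)
ids = [get_id(r) for r in records]
```

Matching on the full lowercased key means fields such as `ParentId` or `PostTypeId` are not mistaken for the id.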

In addition:

  1. The Stack Exchange knowledge graph dataset this PR creates (Nodes.parquet and Edges.parquet) can be wired into the unit tests for a more realistic setting; I plan to do that in a near-future PR. We could move the python/graphframes/tutorials/data/ folder under python/data or python/graphframes/data to accommodate this. We need a real dataset for our unit tests: I don't have confidence in changes to algorithms like connected components or PageRank without real data and known outcomes.
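To make "known outcomes" concrete, here is a rough sketch of the kind of check I have in mind: a small pure-Python union-find that computes reference components, which a test could compare against the output of the algorithm under test. Everything here is an assumption for illustration; the function names and the toy edge list are hypothetical, and a real fixture would come from Edges.parquet.

```python
def connected_components(edges):
    """Map each node to a component label using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src

    return {node: find(node) for node in parent}


def component_sets(labels):
    """Group nodes by label so results compare independently of label values."""
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, set()).add(node)
    return {frozenset(members) for members in groups.values()}


# Hypothetical toy edge list standing in for a real dataset.
EDGES = [("a", "b"), ("b", "c"), ("d", "e")]
expected = {frozenset({"a", "b", "c"}), frozenset({"d", "e"})}
actual = component_sets(connected_components(EDGES))
```

Comparing sets of node groups rather than raw labels keeps the check valid even if an implementation changes how it names components.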

Why are the changes needed?

The docs in #511 depend on these changes; without them, that PR's new Motif Finding Tutorial won't work. Merge me first :)

…s.txt and split out requirements-dev.txt. Version bumps.
…who just pasted or tried to run the code without a new SparkSession.
…m SparkSession.sparkContext. Use click.echo instead of print
@rjurney rjurney merged commit 16be614 into master Feb 21, 2025
9 checks passed
@rjurney rjurney deleted the rjurney/motif-tutorial-code-min branch April 15, 2025 00:32