Quote column names with backticks #370

EnricoMi · 2020-08-17T16:25:03Z

This guards against columns that contain dots. These column names must be quoted to avoid that Spark parses them again.

Fixes following exception when your vertices DataFrame contains columns with dots (e.g. "a.name"):

Cannot resolve column name "a.name" among (id, a.name); did you mean to quote the `a.name` column?;
org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.name" among (id, a.name); did you mean to quote the `a.name` column?;
	at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:270)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
	at org.apache.spark.sql.Dataset.col(Dataset.scala:1353)
	at org.apache.spark.sql.Dataset.apply(Dataset.scala:1320)
	at org.graphframes.GraphFrame$.$anonfun$nestAsCol$1(GraphFrame.scala:820)
	...

Fixes this issue also in other places.

codecov-commenter · 2020-08-17T16:37:40Z

Codecov Report

Merging #370 into master will increase coverage by 0.02%.
The diff coverage is 76.92%.

@@            Coverage Diff             @@
##           master     #370      +/-   ##
==========================================
+ Coverage   91.26%   91.28%   +0.02%     
==========================================
  Files          18       18              
  Lines         870      872       +2     
  Branches       48       56       +8     
==========================================
+ Hits          794      796       +2     
  Misses         76       76

Impacted Files	Coverage Δ
src/main/scala/org/graphframes/GraphFrame.scala	`89.27% <71.42%> (+0.09%)`	⬆️
.../scala/org/graphframes/lib/GraphXConversions.scala	`84.61% <75.00%> (ø)`
...main/scala/org/graphframes/lib/ShortestPaths.scala	`96.00% <100.00%> (ø)`
...main/scala/org/graphframes/lib/TriangleCount.scala	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7209013...32e45a5. Read the comment docs.

EnricoMi · 2020-11-16T19:47:03Z

@mengxr @felixcheung how do you feel about this fix?

codecov-io · 2020-11-16T20:59:37Z

Codecov Report

Merging #370 (b6d5812) into master (2a0eeb0) will increase coverage by 0.02%.
The diff coverage is 76.92%.

@@            Coverage Diff             @@
##           master     #370      +/-   ##
==========================================
+ Coverage   91.26%   91.28%   +0.02%     
==========================================
  Files          18       18              
  Lines         870      872       +2     
  Branches       48       56       +8     
==========================================
+ Hits          794      796       +2     
  Misses         76       76

Impacted Files	Coverage Δ
src/main/scala/org/graphframes/GraphFrame.scala	`89.27% <71.42%> (+0.09%)`	⬆️
.../scala/org/graphframes/lib/GraphXConversions.scala	`84.61% <75.00%> (ø)`
...main/scala/org/graphframes/lib/ShortestPaths.scala	`96.00% <100.00%> (ø)`
...main/scala/org/graphframes/lib/TriangleCount.scala	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a0eeb0...b6d5812. Read the comment docs.

EnricoMi · 2021-02-19T10:55:41Z

@mengxr @felixcheung @WeichenXu123 ping

rjurney · 2023-09-07T15:12:59Z

Oh thank you, I hate transforming to underscores!

SemyonSinchenko · 2025-02-27T12:45:29Z

@EnricoMi Hello! Is there any chance you can update the PR and resolve conflicts? It would be a nice feature for GraphFrames!

EnricoMi · 2025-02-27T14:41:00Z

Rebased. Thanks for looking into this.

SemyonSinchenko · 2025-02-27T16:03:32Z

@rjurney @bjornjorgensen Hello! What do you think about this one? For me it looks very cool, I think it is very nice to have feature for GraphFrames!

bjornjorgensen · 2025-02-27T16:18:48Z

cool thing.. but its something that should have been in spark not here..

bjornjorgensen · 2025-02-27T16:25:52Z

.. and it is a breaking change to.

SemyonSinchenko · 2025-02-27T16:43:38Z

.. and it is a breaking change to.

To be honest I don't see why is it a breaking change? For end users everything should be exactly the same, until they are doing something similar on their side. From the other side complex pipelines with graphframes with column-names generation and similar things it should be much more smooth with that change.

bjornjorgensen · 2025-02-27T20:23:03Z

.. and it is a breaking change to.

To be honest I don't see why is it a breaking change? For end users everything should be exactly the same, until they are doing something similar on their side. From the other side complex pipelines with graphframes with column-names generation and similar things it should be much more smooth with that change.

What you don't see any braking changes, but you explain them in the 3 next lines!

I don't understand what this have to do with graphframes as this project code..

EnricoMi · 2025-02-27T21:03:03Z

.. and it is a breaking change to.

You mean if column names are backticked already? That should be doable.

Any other use case that breaks due to this change?

bjornjorgensen · 2025-02-27T21:17:53Z

.. and it is a breaking change to.

You mean if column names are backticked already? That should be doable.

yes, and I think that it's very common to have.

Any other use case that breaks due to this change?

Not that I can think of right now.

bjornjorgensen · 2025-02-27T21:49:18Z

like pandas json_normalize uses sep='.' as default. https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html

EnricoMi · 2025-02-28T09:36:26Z

but its something that should have been in spark not here..
.. and it is a breaking change to.

Looking more closely, this is not a breaking change. And Spark API expects the calling site to properly escape column names if they contain dots (unless they reference nested columns). Spark won't change that, otherwise, that would be a breaking change on their side. Since Spark 4.0.0, they even provide methods in QuotingUtils that can be used in user code (similar to quote implemented here).

Backticks are only required when referencing a column by its name (select, col, ...). A column name in the schema is not escaped by backticks. This pull requests only touches code where Graphframes references columns by their name, and that name is taken from the schema (not the user calling Graphframes API). Hence, the user cannot provide backticed column names, Graphframes is responsible for properly escaping the column names.

Without this fix, users of Graphframes have to workaround this by renaming affected columns:
https://github.com/G-Research/spark-dgraph-connector/blob/09dacbf/src/main/scala/uk/co/gresearch/spark/dgraph/graphframes/package.scala#L42-L44

SemyonSinchenko · 2025-03-04T07:30:58Z

@bjornjorgensen Did you have a chance to check the latest commits? It seems to me that it should cover all of your concerns and nothing else is blocking merging. Thanks in advance!

bjornjorgensen · 2025-03-04T18:08:45Z

hm... well I think its ok..

If we did have branches for graphframes we didn't needed this for branch 4

I think we can have it, but if there are some user problems with it we will need to remove it.

SemyonSinchenko · 2025-03-04T18:16:39Z

@rjurney What do you think about it? For me it is ready to merge.

EnricoMi · 2025-03-04T19:09:02Z

If we did have branches for graphframes we didn't needed this for branch 4

With "branch 4" I presume you refer to Spark4. What is different there? Do you imply backticks are not needed with Spark4? How are ambiguous column names resolved there?

bjornjorgensen · 2025-03-05T11:15:47Z

If we did have branches for graphframes we didn't needed this for branch 4

With "branch 4" I presume you refer to Spark4. What is different there? Do you imply backticks are not needed with Spark4? How are ambiguous column names resolved there?

I was thinking about this "Note: This can be replaced with org.apache.spark.sql.catalyst.util.QuotingUtils.quoteIfNeeded

once support for Spark 3 has been dropped"
and yes, branch 4 is a branch for spark 4

EnricoMi · 2025-03-05T11:23:10Z

Since Spark 4.0.0, they even provide methods in QuotingUtils that can be used in user code

With "user code" I meant code that uses the Spark API, which in our case is Graphframes code. So once Spark3 support is dropped by Graphframes, the quote implementation of this PR can be removed and replaced with calling into QuotingUtils. So most of this PR remains.

SemyonSinchenko · 2025-03-28T17:31:12Z

@bjornjorgensen Were your concerns resolved by the latest commits? If so, could you approve the PR?

cc: @rjurney @SauronShepherd could you take a look too? If you are fine with it, I would like to resolve conflicts and merge it. I see it as a very important feature.

bjornjorgensen · 2025-03-28T17:53:08Z

LGTM

This guards against columns that contain dots. These column names must be quoted to avoid that Spark parses them again.

EnricoMi · 2025-03-28T18:42:35Z

Conflict resolved.

SauronShepherd · 2025-03-28T20:10:02Z

Shouldn't we add some tests to check that GraphFrames actually works fine with nested columns?

SemyonSinchenko · 2025-03-28T20:49:48Z

Shouldn't we add some tests to check that GraphFrames actually works fine with nested columns?

There are two tests in this PR already, do you think there should be more?

SauronShepherd · 2025-03-28T20:55:09Z

Sorry, I looked for the word "test" in the comments, didn't find anything and assumed there weren't any tests. Let me check out the PR and I'll confirm you tomorrow, but 2 tests seems reasonsable. El vie, 28 mar 2025, 21:50, Sem ***@***.***> escribió:

…

Shouldn't we add some tests to check that GraphFrames actually works fine with nested columns? There are two tests in this PR already, do you think there should be more? — Reply to this email directly, view it on GitHub <#370 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACCN675GZIR6IC5ADWZJPFT2WWRXFAVCNFSM6AAAAABX7XQB3KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDONRSGQ3DQMRWGU> . You are receiving this because you were mentioned.Message ID: ***@***.***> [image: SemyonSinchenko]*SemyonSinchenko* left a comment (graphframes/graphframes#370) <#370 (comment)> Shouldn't we add some tests to check that GraphFrames actually works fine with nested columns? There are two tests in this PR already, do you think there should be more? — Reply to this email directly, view it on GitHub <#370 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACCN675GZIR6IC5ADWZJPFT2WWRXFAVCNFSM6AAAAABX7XQB3KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDONRSGQ3DQMRWGU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

SemyonSinchenko · 2025-04-01T16:16:44Z

@SauronShepherd Hi! Were your concerns resolved? I would like to merge this one to avoid even more merge conflicts, because this PR changes a lot of files.

SauronShepherd · 2025-04-02T07:46:58Z

Sorry, I didn't have time to look at it. Go ahead.

SemyonSinchenko · 2025-04-02T08:36:49Z

Thanks for the contribution @EnricoMi !

EnricoMi · 2025-04-02T08:44:30Z

Thanks for merging!

EnricoMi mentioned this pull request Aug 18, 2020

Make GraphFrames work with columns with dots G-Research/spark-dgraph-connector#41

Open

EnricoMi force-pushed the backtick-quote-column-names branch from 32e45a5 to b6d5812 Compare November 16, 2020 19:44

EnricoMi force-pushed the backtick-quote-column-names branch from b6d5812 to ca9fd56 Compare February 19, 2021 11:04

EnricoMi force-pushed the backtick-quote-column-names branch 2 times, most recently from 5084ca8 to 104b030 Compare February 27, 2025 14:39

SemyonSinchenko approved these changes Feb 27, 2025

View reviewed changes

EnricoMi force-pushed the backtick-quote-column-names branch from 104b030 to 5103ed5 Compare February 28, 2025 09:17

EnricoMi added 4 commits March 28, 2025 19:41

Quote column names with backticks

c0fbfae

This guards against columns that contain dots. These column names must be quoted to avoid that Spark parses them again.

Add links to Spark 4 QuotingTools

25f09fc

Support columns containing backquotes

dca1427

Remove 'GraphX with dots' test as not needed

7a60f7a

EnricoMi force-pushed the backtick-quote-column-names branch from 5103ed5 to 7a60f7a Compare March 28, 2025 18:42

Simplify map(col)

ddeb397

SemyonSinchenko self-requested a review March 28, 2025 18:50

SemyonSinchenko approved these changes Mar 28, 2025

View reviewed changes

SemyonSinchenko mentioned this pull request Apr 2, 2025

feat: add an experimental implementation of SP in GF #558

Merged

SemyonSinchenko merged commit 2f3db30 into graphframes:master Apr 2, 2025
5 checks passed

EnricoMi deleted the backtick-quote-column-names branch April 2, 2025 08:44

Quote column names with backticks #370

Quote column names with backticks #370

Uh oh!

Conversation

EnricoMi commented Aug 17, 2020

Uh oh!

codecov-commenter commented Aug 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

EnricoMi commented Nov 16, 2020

Uh oh!

codecov-io commented Nov 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

EnricoMi commented Feb 19, 2021

Uh oh!

rjurney commented Sep 7, 2023

Uh oh!

SemyonSinchenko commented Feb 27, 2025

Uh oh!

EnricoMi commented Feb 27, 2025

Uh oh!

SemyonSinchenko commented Feb 27, 2025

Uh oh!

bjornjorgensen commented Feb 27, 2025

Uh oh!

bjornjorgensen commented Feb 27, 2025

Uh oh!

SemyonSinchenko commented Feb 27, 2025

Uh oh!

bjornjorgensen commented Feb 27, 2025

Uh oh!

EnricoMi commented Feb 27, 2025

Uh oh!

bjornjorgensen commented Feb 27, 2025

Uh oh!

bjornjorgensen commented Feb 27, 2025

Uh oh!

EnricoMi commented Feb 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SemyonSinchenko commented Mar 4, 2025

Uh oh!

bjornjorgensen commented Mar 4, 2025

Uh oh!

SemyonSinchenko commented Mar 4, 2025

Uh oh!

EnricoMi commented Mar 4, 2025

Uh oh!

bjornjorgensen commented Mar 5, 2025

Uh oh!

EnricoMi commented Mar 5, 2025

Uh oh!

SemyonSinchenko commented Mar 28, 2025

Uh oh!

bjornjorgensen commented Mar 28, 2025

Uh oh!

EnricoMi commented Mar 28, 2025

Uh oh!

SauronShepherd commented Mar 28, 2025

Uh oh!

SemyonSinchenko commented Mar 28, 2025

Uh oh!

SauronShepherd commented Mar 28, 2025 via email

Uh oh!

SemyonSinchenko commented Apr 1, 2025

Uh oh!

SauronShepherd commented Apr 2, 2025

Uh oh!

Uh oh!

SemyonSinchenko commented Apr 2, 2025

Uh oh!

EnricoMi commented Apr 2, 2025

Uh oh!

Uh oh!

codecov-commenter commented Aug 17, 2020 •

edited

Loading

codecov-io commented Nov 16, 2020 •

edited

Loading

EnricoMi commented Feb 28, 2025 •

edited

Loading