Skip to content

Formatter removes parts of the grammar and is not idempotent #25

@matthiasbalke

Description

@matthiasbalke

I just applied a local build of my spotless antlr4 formatter on antlr/grammars-v4 several times in a row and everytime the sources changed. To me It looks like the formatter is removing stuff. So I did the same using the standalone JAR to check if it is a bug in my spotless integration, but the result was the same.

Here's what I did:

  1. checkout https://github.com/antlr/grammars-v4
  2. downloaded the released standalone formatter
  3. move the jar into checkout directory
  4. apply formatter by calling java -jar antlr4-formatter-standalone-1.1.0.jar -dir .
  5. commit the changes
  6. apply formatter again java -jar antlr4-formatter-standalone-1.1.0.jar -dir .
  7. expect to have no modified files, as the sources are already formatted

But afterwards the following files where modified:

  • antlr4/ANTLRv4LexerPythonTarget.g4
  • antlr4/examples/twocomments.g4
  • apex/apex.g4
  • c/C.g4
  • csharp/CSharpPreprocessorParser.g4
  • csharp/CSharpSharwell/CSharpPreprocessorParser.g4
  • java8-js/Java8.JavaScriptTarget.g4
  • java8-ts/Java8.TypeScriptTarget.g4
  • java8/Java8.g4
  • java9/Java9.g4
  • javascript/JavaScriptParser.g4
  • php/Python/PhpLexer.g4
  • plsql/PlSqlParser.g4
  • python2-js/Python2.g4
  • python2/Python2.g4
  • python3-py/Python3.g4
  • rexx/RexxLexer.g4
  • swift3/Swift3.g4
  • system_verilog/SysVerilogHDL.g4
  • tsql/TSqlParser.g4
  • ucb-logo/UCBLogo.g4
  • z/ZLexer.g4

Taking a look at java8/Java8.g4 the first formatting removes some comments (which I think it shouldn't)

Shortened Diff:

  // §3.10.6 Escape Sequences for Character and String Literals
 
-fragment
-EscapeSequence
-	:	'\\' [btnfr"'\\]
-	|	OctalEscape
-    |   UnicodeEscape // This is not in the spec but prevents having to preprocess the input
-	;
-
-fragment
-OctalEscape
-	:	'\\' OctalDigit
-	|	'\\' OctalDigit OctalDigit
-	|	'\\' ZeroToThree OctalDigit OctalDigit
-	;
-
-fragment
-ZeroToThree
-	:	[0-3]
-	;
+fragment EscapeSequence
+   : '\\' [btnfr"'\\] | OctalEscape | UnicodeEscape
+   ;
+
+
+fragment OctalEscape
+   : '\\' OctalDigit | '\\' OctalDigit OctalDigit | '\\' ZeroToThree OctalDigit OctalDigit
+   ;
+
+
+fragment ZeroToThree
+   : [0-3]
+   ;

but the second formatting removes parts of the grammar:
Complete Diff:

diff --git a/java8/Java8.g4 b/java8/Java8.g4
index 0596456..e16ba95 100644
--- a/java8/Java8.g4
+++ b/java8/Java8.g4
@@ -2047,20 +2047,13 @@ fragment JavaLetter
    // covers all characters above 0x7F which are not a surrogate ~ [\u0000-\u007F\uD800-\uDBFF]{Character.isJavaIdentifierStart(_input.LA(-1))}?|
    // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF[\uD800-\uDBFF][\uDC00-\uDFFF]{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?;

-
    fragment JavaLetterOrDigit
       : [a-zA-Z0-9$_] |
       // covers all characters above 0x7F which are not a surrogate ~ [\u0000-\u007F\uD800-\uDBFF]{Character.isJavaIdentifierPart(_input.LA(-1))}?|
       // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF[\uD800-\uDBFF][\uDC00-\uDFFF]{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?;
-
       //
       // Additional symbols not defined in the lexical specification
-      //
-
-      AT
-         : '@'
-         ;
-
+      // AT

       ELLIPSIS
          : '...'

I think we should extract the non working grammar parts and write unit tests for each. This way we are testing Antlr4ParseTreeListenerImpl.java which is anyway very complex and could use some refactoring afterwards.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions