Straighten out handling of char and string constants #2990

ojwb · 2024-08-08T02:52:54Z

Fixes #904
Fixes #1907
Fixes #2579

(Yes I know there's still a placeholder date in CHANGES.current...)

@wsfulton I'd welcome your input on this.

The basic idea is that .val in the parser and the value attribute are always the C/C++ source code for an expression, so if the target language just wants to emit that in the generated C/C++ wrapper then it can just write it out without needing to do anything special for T_STRING, T_CHAR, etc.

Then .stringval in the parser and the stringval attribute are always the actual value of the string or character in a DOH String. If a backend wants to emit the string or character as a string or character in the target language it can take this and apply appropriate escaping and quoting.

I've split %(hexescape)s into %(csharpescape)s for C# and %(goescape)s for Go as both need hex escaping but with different rules. Maybe the language-specific escaping should actually move to the relevant language module instead - we'd probably have to use something like

Printv(f, "\"", csharp_escape(stringval), "\"", NIL);

instead of

Printf(f, "\"%(csharpescape)s\"", stringval);

but that doesn't actually seem worse.

Fixes #904 Fixes #1907 Fixes #2579

ojwb · 2024-08-08T02:55:16Z

CI passed when I pushed the branch. Then it started to run it all again when I opened the PR, which seems a huge waste of resources so I've killed those jobs, but that means it wrongly shows as failing CI.

wsfulton · 2024-08-14T07:41:54Z

CI passed when I pushed the branch. Then it started to run it all again when I opened the PR, which seems a huge waste of resources so I've killed those jobs, but that means it wrongly shows as failing CI.

but necessary as it's testing different things. The Difference Between Push and Pull Request Triggers in Github should help understand. Perhaps only noticeable when developing in branches in the SWIG repo, which is not the norm of developing/pushing in your own repo.

I have just started digesting the pull request, but it would help with a clearer definition and expanded examples of what stringval is. Looking at the docs/comments for it in parser.y in the Define struct, the given examples show the quotes stripped off, but you mentioned elsewhere it is not just this. It still looks like it just contains the quotes stripped off and each C escape sequence converted into an ASCII code. I think some more examples/explanations might help here for example:

C escape sequences that are 'viewable' ASCII such as \" \' \\
C escape sequences such as \n \r - maybe not a good idea as the values are not viewable and cannot easily be shown in your table as is
a C++11 raw string literal
a C++11 string literal such as L and u8 prefixes
more unicode examples using various C escape sequences quoting

Beginning the table with these 3 lines in the table would help drive it home:

    //  val      stringval
    // --------- -----------
    // "bar"     bar
    // "b\x61r"  bar
    // "b\141r"  bar

wsfulton · 2024-08-14T18:54:02Z

Maybe the language-specific escaping should actually move to the relevant language module instead ...

I think these escapings you have can be used from within typemaps??? If so, might be useful, so I'd leave it as is. Also it makes sense to leave these similar functions in the one file (misc.c).

Source/Modules/go.cxx

Source/Swig/scanner.c

Source/Modules/lang.cxx

wsfulton · 2024-08-14T19:00:40Z

I started comparing the test-suite before and after the changes. Quick observations:

MzScheme - missing new lines in lots of tests, such as in abstract_typedef2.
C# - escaping changes in C/C++ files, eg constants_wrap.c. The generated values no longer match those in Java/Python, so seem unnecessary.

If those get fixed, I'll start another big comparison of the outputs review.

ojwb · 2024-08-14T21:15:59Z

but necessary as it's testing different things. The Difference Between Push and Pull Request Triggers in Github should help understand. Perhaps only noticeable when developing in branches in the SWIG repo, which is not the norm of developing/pushing in your own repo.

I understand the difference, but in this particular case the branch merged to current master would be exactly the same tree of files (because the branch was one commit on top of current master) and SWIG's CI doesn't do anything significantly different for the branch push and PR cases so it didn't seem worth burning hours of CPU time on.

wsfulton · 2024-08-14T21:28:13Z

but in this particular case the branch merged to current master would be exactly the same tree of files (because the branch was one commit on top of current master)

From what I can see expression-value-overhaul branched off master at 74a8aaf and master has 4 commits not on the expression-value-overhaul branch. So adding the single commit on expression-value-overhaul onto the head of master is not the same as building expression-value-overhaul.

ojwb · 2024-08-14T21:34:28Z

Those 4 commits were all pushed AFTER I pushed my branch though, so the PR CI run I cancelled would not have included them.

ojwb · 2024-08-14T22:16:43Z

I have just started digesting the pull request, but it would help with a clearer definition and expanded examples of what stringval is. Looking at the docs/comments for it in parser.y in the Define struct, the given examples show the quotes stripped off, but you mentioned elsewhere it is not just this. It still looks like it just contains the quotes stripped off and each C escape sequence converted into an ASCII code.

That's approximately it. I think the most helpful way to think of stringval is that it's the actual value of the string literal, and what the compiler does to get that is essentially to take the characters of the source code from between the quotes and interpret C escape sequences. There's also string literal concatenation which means the first step is not as simple as just "quotes stripped off", though the value in val is also after string literal concatenation too, so that doesn't show in the current table. And there's also the preprocessor stringification operator, so the quotes may not even appear in the source code. Thinking of stringval in terms of "stripping off quotes" seems unhelpful to me (really val is stringval with escaping and quotes added!) I think I should add a C/C++ source code column to the table so I can try to cover these nuances with examples (also because I just realised that the current "b\x61r" example is a bad one because val can't actually ever be that because it's created by escaping stringval, which would be bar and SWIG won't escape an a so val would be "bar" - effectively escape sequences from the source code get normalised by the process that creates the value for val).

I feel I have a concept in my head but I haven't completely cracked how to explain it in a clear enough way, but I'm not sure quite what I'm failing to convey. Sorry - I suspect this is frustrating for both of us.

Possibly a helpful way to think of stringval is that it is the byte sequence you'll get on stdout if you compile and run the following program (for the string literal case):

#include <unistd.h>
#define LITERAL "b\x61r"
int main() { return write(1, LITERAL, sizeof(LITERAL) - 1) < sizeof(LITERAL) - 1; }

Outputs bar (with no newline). Here stringval is bar, val is "bar" (because val is actually created by escaping and quoting the value from stringval).

And showing stringification and string concatenation:

#include <unistd.h>
#define S(X) #X
#define LITERAL S(b\x61) S(r)
int main() { return write(1, LITERAL, sizeof(LITERAL) - 1) < sizeof(LITERAL) - 1; }

Outputs bar (with no newline). Again stringval is bar, val is "bar".

I think some more examples/explanations might help here for example:
* C escape sequences that are 'viewable' ASCII such as `\" \' \\`

OK.

* C escape sequences such as `\n \r` - maybe not a good idea as the values are not viewable and cannot easily be shown in your table as is

Yes, I avoided examples such as \n as it's hard to write the stringval value without some sort of escaping, at which point it seems more confusing than helpful.

* a C++11 raw string literal

* a C++11 string literal such as L and u8 prefixes

OK.

* more unicode examples using various C escape sequences quoting

Are you thinking of something like a "plain" C string with UTF-8 sequence byte values individually encoded as \x escapes?

That's a bit of a tricky case as we have no character set encoding information (I don't think we're particularly clever in handling wide string literals or u8 literals either currently, but this patch shouldn't make handling of those cases any worse, it improves some cases, and it lays the foundations for doing better with them - I'm also wary of scope creep as this patch already touches quite a lot of code).

Beginning the table with these 3 lines in the table would help drive it home:
    //  val      stringval
    // --------- -----------
    // "bar"     bar
    // "b\x61r"  bar
    // "b\141r"  bar

OK.

ojwb · 2024-08-14T23:51:50Z

I started comparing the test-suite before and after the changes. Quick observations:
* MzScheme - missing new lines in lots of tests, such as in abstract_typedef2.

All due to one missing \n - will be fixed by cf0ab23.

* C# - escaping changes in C/C++ files, eg constants_wrap.c.

The C# change is such as \x1 -> \x0001 - the \x escape in C# takes up to 4 contiguous following characters so this change is to avoid mishandling a case such as (in C/C++ source) "\1a" which SWIG currently mishandles (with %csconstit currently generates C# code"\x1a"which is a string containing byte 26, when it should be byte 1 followed by ana- this branch generates"\x0001a"` instead, which works correctly).

For %(escape)s I added a fix for a similar bug with octal escaping, but there I check if the next character is an octal digit and only emit 3 octal digits if it is, which is admittedly inconsistent in approach. I can do something similar for %(csescape)s if you want, but to me emitting "\x000face" vs "\xfile" seems less good than simply always emitting 4 character hex escapes for C#, even if that results in different-but-equivalent generated C# code to what SWIG currently emits (for cases that already work that is - for broken cases the generated code does need to change). Also happy to go the other way and always emit 3 octal digits in %(escape)s.

I see I added a regression test for the octal case but not the hex one - I'll add one.

I also see SWIG currently mishandles e.g. "\18" - it looks like it incorrectly allows digits 8 and 9 in octal values and also doesn't appear to limit the octal sequence to 3 digits either. Seems to also be the case for octal integer literals from a quick look at the code. I'll try to fix.

The generated values no longer match those in Java/Python, so seem unnecessary.

Different target languages have different string literal escape sequences, so in general I wouldn't expect to always get the exact same escaping across target languages (to achieve that there would need to be a common subset that (a) has the same interpretation in every target language and (b) is capable of representing every possible string literal value, but I don't think such a subset actually exists...)

ojwb · 2024-08-15T00:23:08Z

Outputs bar (with no newline). Here stringval is bar, val is "bar" (because val is actually created by escaping and quoting the value from stringval).

I've noted this point about val being created from stringval explicitly now. Will be in a322735.

I think some more examples/explanations might help here for example:
* C escape sequences that are 'viewable' ASCII such as `\" \' \\`

Added.

* C escape sequences such as `\n \r` - maybe not a good idea as the values are not viewable and cannot easily be shown in your table as is
Yes, I avoided examples such as \n as it's hard to write the stringval value without some sort of escaping, at which point it seems more confusing than helpful.

Leaving this.

* a C++11 raw string literal

* a C++11 string literal such as L and u8 prefixes

I've added examples of the L prefix.

I need to do some checking with C++11 raw string literals and the u8 prefix to make sure the examples I give are actually correct (and that they work and we have suitable test coverage for them).

* more unicode examples using various C escape sequences quoting
Are you thinking of something like a "plain" C string with UTF-8 sequence byte values individually encoded as \x escapes?

That's a bit of a tricky case as we have no character set encoding information (I don't think we're particularly clever in handling wide string literals of u8 literals either currently, but this patch shouldn't make handling of those cases any worse, it improves some cases, and it lays the foundations for doing better with them - I'm also wary of scope creep as this patch already touches quite a lot of code).

Leaving this for a follow-up change if that's OK.

Beginning the table with these 3 lines in the table would help drive it home:
    //  val      stringval
    // --------- -----------
    // "bar"     bar
    // "b\x61r"  bar
    // "b\141r"  bar

Added.

ojwb · 2024-08-15T02:42:14Z

I also see SWIG currently mishandles e.g. "\18" - it looks like it incorrectly allows digits 8 and 9 in octal values and also doesn't appear to limit the octal sequence to 3 digits either. Seems to also be the case for octal integer literals from a quick look at the code. I'll try to fix.

I've pushed fixes to git master for these bugs since they're only tangentially related to the changes here.

This can be handled separately.

Also note that there may currently be bugs handling embedded zero bytes.

ojwb · 2024-08-15T05:04:28Z

I've backed out the nextchar() change as the more I think about it, the more sense it makes to handle that separately.

I'll shortly push that plus changes to add C++11 u8 and raw string examples to the stringval comment (a side note is that u8 string literals are const char8_t[N] as of C++20 but SWIG currently implements the C++11 version of them so I just treated them that way in the comment for now).

I've also merged master to the branch.

wsfulton · 2024-08-15T07:57:19Z

* C# - escaping changes in C/C++ files, eg constants_wrap.c.
The C# change is such as \x1 -> \x0001 - the \x escape in C# takes up to 4 contiguous following characters so this change is to avoid mishandling a case such as (in C/C++ source) "\1a" which SWIG currently mishandles (with %csconstit currently generates C# code"\x1a"which is a string containing byte 26, when it should be byte 1 followed by ana- this branch generates"\x0001a"` instead, which works correctly).

Motivation for the changes in the .cs files is great. I just don't think the change in the constants_wrap.c file, such as:
result = (char)('\1'); to result = (char)('\x0001'); are correct. The C code generated for C# should match all the other languages as it used to.

Regarding the explanations for stringval, much better now thanks. Also since I made my comments I've reviewed the code and worked out that stringval is the byte sequence for the string as you also subsequently suggested. So I see how the term 'raw' came about, it is the real raw string contents. But the term 'string' is quite overloaded when once start to think beyond ASCII. There are different types of unicode strings where the byte sequence is interpreted in different ways depending on the string encoding. I think what we have in stringval is the bytes sequence parsed and the new table in parser.y is great for ASCII characters but does not really address unicode (which is in its own right a minefield in my opinion, and probably left for another day).

Rather than responding to all your other comments, in summary all good.

I've still to do some more checking before and after the change. Python and MzScheme look okay now. Besides the odd testcase bug fix due to changes in the branch, this is a clean refactor and it's not hard to compare using the test-suite from the expression-value-overhaul branch and builds from master and the branch (using make partialcheck-test-suite).

ojwb · 2024-08-15T20:52:57Z

The C# change is such as \x1 -> \x0001 - the \x escape in C# takes up to 4 contiguous following characters so this change is to avoid mishandling a case such as (in C/C++ source) "\1a" which SWIG currently mishandles (with %csconstit currently generates C# code"\x1a"which is a string containing byte 26, when it should be byte 1 followed by ana- this branch generates"\x0001a"` instead, which works correctly).

Motivation for the changes in the .cs files is great. I just don't think the change in the constants_wrap.c file, such as: result = (char)('\1'); to result = (char)('\x0001'); are correct. The C code generated for C# should match all the other languages as it used to.

Oh sorry, I'd missed this was in the generated C/C++ code. That definitely does seem wrong - presumably I'm using %(csescape)s somewhere I shouldn't.

Regarding the explanations for stringval, much better now thanks. Also since I made my comments I've reviewed the code and worked out that stringval is the byte sequence for the string as you also subsequently suggested. So I see how the term 'raw' came about, it is the real raw string contents

That would have made sense, except before my changes it was actually val that held the byte sequence for string and character types, while rawval held the value turned back into C/C++ source code (but with quirky differences between T_STRING and the other string and char types). So on the node, the value attribute was C/C++ source code except for string and char types where it was the byte sequence and so every target language ended up with code to conditionally use rawval or value depending on either checking the type or just whether rawval was set.

But the term 'string' is quite overloaded when once start to think beyond ASCII. There are different types of unicode strings where the byte sequence is interpreted in different ways depending on the string encoding. I think what we have in stringval is the bytes sequence parsed and the new table in parser.y is great for ASCII characters but does not really address unicode (which is in its own right a minefield in my opinion, and probably left for another day).

Yes, I'm not totally happy with stringval but I really think we need to move away from rawval because otherwise we are liable to trip ourselves up with the essentially now reversed semantics compared to before, especially when working with patches from before this change. I also don't think rawval is a great name, as what's "raw" here is open to interpretation (what appears in the parsed input is "raw" as it's not been processed by SWIG, but the byte-level representation of the string is also "raw" as it's not been escaped or quoted).

I've not thought a lot about improving handling of Unicode strings, but we could eventually have stringval plus stringenc, or perhaps split stringval into variants for different encodings.

Rather than responding to all your other comments, in summary all good.

Great.

I've still to do some more checking before and after the change. Python and MzScheme look okay now. Besides the odd testcase bug fix due to changes in the branch, this is a clean refactor and it's not hard to compare using the test-suite from the expression-value-overhaul branch and builds from master and the branch (using make partialcheck-test-suite).

OK (and thanks - I did try hard to keep this clean).

Re your earlier comment:

I think these escapings you have can be used from within typemaps??? If so, might be useful, so I'd leave it as is. Also it makes sense to leave these similar functions in the one file (misc.c).

I'm not sure if they are useful in typemaps, but if we expect them to be used in this way perhaps we need a further tweak to the octal escaping in %(escape)s - currently this always emits a 3 digit sequence if the next character is an octal digit, but there's a corner case if you use e.g. %(escape)s%(escape)s and the first string ends with a character which gets escaped as one or two octal digits and the second string starts with an octal digit. This isn't a problem for any code in SWIG (currently at least), but %(escape)s%(escape)s could potentially be used in this way in a typemap.

Perhaps we should also use a 3 digit sequence for the last character in the string (if it needs octal escaping). However that will mean we always do this in the character constant case (e.g. '\001' instead of '\1' in the C/C++ wrapper code).

It could also just be noted as a limitation.

Thoughts?

The value attribute is now always meant to be the value as valid C/C++ code so don't update it to use C# escaping.

ojwb · 2024-08-16T00:14:14Z

Oh sorry, I'd missed this was in the generated C/C++ code. That definitely does seem wrong - presumably I'm using %(csescape)s somewhere I shouldn't.

We were updating the value attribute with a value passed through %(csescape)s. I think we need to keep the value attribute as the C/C++ encoding of the value, so I've adjusted to just keep the C# encoded version in a local variable. This fixes the cases you highlighted and doesn't affect any others.

I'll push in a bit once I've checked for any other similar changes to the value attribute.

Update: All other places that Setattr(..., "value", ...) are OK so I've pushed.

ojwb · 2024-08-16T01:52:55Z

Reviewing all the comments, I think what's not been completely resolved in the branch now is:

Handling of non-ASCII strings - this patch should make the situation no worse and a little better (it fixes some issues with handling wide string and wide character literals; also it implements C++11 concatenation of adjacent wide and normal string literals) and we both seem to think this is something to handle as a separate change.
Changing nextchar() to return EOF so we can distinguish a zero byte from the end of the string - handle separately (also string literals containing embedded zero bytes are buggy in more ways than just this currently, and are a rarer case).
Should %(escape)s always use 3 digits when emitting an octal escape at the end of the string?
wsfulton to do more checking of output.

BTW, what's your workflow for comparing partialcheck output before and after? I've yet to find a way that doesn't feel clunky to work with, but maybe I'm missing something obvious. Do you just use a second SWIG source tree (as a git worktree maybe)?

wsfulton · 2024-08-16T20:14:00Z

In Java (and C#, maybe others) for the string_constants.i testcase I see

  public final static String ZS1 = "\0";

is now

  public final static String ZS1 = "";

when wrapping

#define ZS1 "\0"

Maybe an esoteric corner case, but this might be a regression as they aren't quite the same thing. But on the other hand, there is a good change going from:

  public final static String ES2 = "\0";

to now

  public final static String ES2 = "";

for

%constant ES2="";

wsfulton · 2024-08-16T21:19:57Z

The Go output is quite different for constants. It looks like the value is obtained from the C/C++ layer instead of outputting a constant in Go code. This is fine and probably should always have been the default. Is there an equivalent to %javaconst to go back to what was generated before? Was this intentional? If so, it probably needs documenting in Go.html.

Otherwise, I have finished my regression checks. Once my comments from today are addressed, feel free to merge to master.

I see I added a regression test for the octal case but not the hex one - I'll add one.

Reminder, I can't see this yet.

Regarding unicode, I don't enjoy this area nor feel particularly competent in it too, so don't have much to add. It does seem to be an area in general that could do with some focus and more testing so would encourage anyone keen on it to do so.

Should %(escape)s always use 3 digits when emitting an octal escape at the end of the string?

Probably not too important. We could have a compromise and if there is an octal escape at the end of the string, always emit 3 digits. Or add some more syntax something like %(escape)3s to force 3 digits or add an alternative implementation with a new name like %(octal3escape)s or %(escape3)s.

BTW, what's your workflow for comparing partialcheck output before and after? I've yet to find a way that doesn't feel clunky to work with, but maybe I'm missing something obvious. Do you just use a second SWIG source tree (as a git worktree maybe)?

It's a little clunky, but not hard and varies depending on what I'm doing. Often I check for just a one language for example, java:

make clean-java-test-suite && make partialcheck-java-test-suite
rm -rf ~/swig/partialgoodts/java
cp -rf Examples/test-suite/java ~/swig/partialgoodts/java # to grab a new baseline
# make some changes/switch branch or whatever to get a new version, possibly build with make
make clean-java-test-suite && make partialcheck-java-test-suite
meld Examples/test-suite/java ~/swig/partialgoodts/java &

Once a baseline has been taken, it is then easy to keep repeating the comparison by running the test-suite and using meld.

ojwb · 2024-08-16T21:31:39Z

I think getting all the cases involving embedded zero bytes to work correctly is going to require going through the code handling these values to ensure it doesn't trip over these bytes (it's easy to truncate the string at them by treating it as a C string).

For example, we can't do Printf(out, "%s%s", in1, in2) to concatenate strings which may contain zero bytes (which is effectively what currently happens when doing a concatenation of adjacent string literals). Instead we need to do Append(out, in1); Append(out, in2); or something equivalent (or maybe adjust %s on a DOH String to append the full string rather than stopping at an embedded zero byte).

Getting "" right in exchange for getting "\0" wrong seems an improvement overall because the former is a much more common case, even if the latter case has regressed compare to previous releases (I think the problem here is that both end up internally as an empty string and so getting both right requires sorting out clean handling of embedded zero bytes in string literals all the way from the scanner to the parser, and ultimately also through to the target languages which use stringval).

Are you happy to leave this as it now is for now and I'll address it with a follow-up patch?

ojwb · 2024-08-16T22:07:06Z

The Go output is quite different for constants. It looks like the value is obtained from the C/C++ layer instead of outputting a constant in Go code. This is fine and probably should always have been the default. Is there an equivalent to %javaconst to go back to what was generated before? Was this intentional? If so, it probably needs documenting in Go.html.

I didn't intend to change that. I'll investigate.

Otherwise, I have finished my regression checks. Once my comments from today are addressed, feel free to merge to master.

OK.

I see I added a regression test for the octal case but not the hex one - I'll add one.

Reminder, I can't see this yet.

Oh yes, will do.

Regarding unicode, I don't enjoy this area nor feel particularly competent in it too, so don't have much to add. It does seem to be an area in general that could do with some focus and more testing so would encourage anyone keen on it to do so.

I find it hard to get enthused about dealing with wide character Unicode, but would like to have UTF-8 handled well at least. Perhaps just converting all string and character literal values to be UTF-8 internally would make most sense (and the target language can then map them back if it wants wide characters, with shared helper functions to do that). If we try to handle different representations inside SWIG then the target languages all have to deal with all the representations which actually seems more complex overall.

Should %(escape)s always use 3 digits when emitting an octal escape at the end of the string?

Probably not too important. We could have a compromise and if there is an octal escape at the end of the string, always emit 3 digits.

Um, that's exactly what I was suggesting as an option above...

Or add some more syntax something like %(escape)3s to force 3 digits or add an alternative implementation with a new name like %(octal3escape)s or %(escape3)s.

%(escape)3s already has a meaning though - it's like %3s (so left pads the string with spaces to width 3 if shorter) but escaping the string as well. I didn't test this actually works, but there's code in SWIG already doing similar stuff:

Source/Swig/tree.c:     Printf(stdout, "%-12s - \"%(escape)-0.80s%s\"\n", k, o, trunc);

I'm leaning towards just noting %(escape)s%(escape)s as something to beware of. We don't document %(escape)s for use in user typemaps or elsewhere in interface files currently, and I'm not really sure if they'd actually be useful to users, and even if they are, if you'd ever want to escape one string after another like that (and it'd be as easy to add a special case for octal escaping the final character or a new escaping function later if we find a motivating case).

BTW, what's your workflow for comparing partialcheck output before and after? I've yet to find a way that doesn't feel clunky to work with, but maybe I'm missing something obvious. Do you just use a second SWIG source tree (as a git worktree maybe)?

It's a little clunky, but not hard and varies depending on what I'm doing. Often I check for just a one language for example, java:

Thanks - that's pretty much what I've been doing, but feeling I was missing a simpler way. Perhaps it's worth a make rule to automate.

This was causing most non-integer constants not to be wrapped using a native Go constant.

In Go, \' isn't valid in a double quoted string, while \" isn't valid in a single quoted rune, so to avoid needing two different escaping functions we always escape both using hex escapes.

ojwb · 2024-08-17T03:16:42Z

The Go wrapping change was a bug - fixed by 5c6704f, and now it wraps slightly more cases using Go constants than before. I've also add a tweak to make it use casts more like it did before (5c6704f) - I don't think this matters, but keeping it more like before seems helpful.

C# hex escaping regression test added by 846340a.

There's also a bug with string escaping in Go as \' isn't valid in a double-quoted string while \" isn't valid in a single-quoted string. To avoid needing two different go-escape functions I've changed %(goescape)s to use hex escaping for these as that should work for both contexts (8a642a3).

wsfulton · 2024-08-17T05:59:17Z

Makes sense to address DOH's String embedded \0 separately in the issue you raised above. Great to see this get over the line too.

ojwb · 2024-09-02T19:33:16Z

The handling of string literals containing zero bytes was addressed by #2997 (which fixed the ZS1 testcase to work like it did before the changes here).

The nextchar() change I dropped from this is now #3010, along with some other fixes so that zero bytes in source files work better (not a common case, but SWIG's existing behaviour in this situation seems problematic).

Straighten out handling of char and string constants

b0a85fd

Fixes #904 Fixes #1907 Fixes #2579

ojwb mentioned this pull request Aug 12, 2024

Straighten out handling of numeric and boolean literals #2995

Closed

wsfulton reviewed Aug 14, 2024

View reviewed changes

Source/Modules/go.cxx Outdated Show resolved Hide resolved

Source/Swig/scanner.c Outdated Show resolved Hide resolved

Source/Modules/lang.cxx Outdated Show resolved Hide resolved

ojwb added 2 commits August 15, 2024 10:49

Fix (existing) leak of attribute value

09cecc3

[mzscheme] Add missing newlines to C/C++ wrapper

cf0ab23

ojwb added 2 commits August 15, 2024 11:55

Resolve "FIXME leak" comments

7a388e3

Expand on stringval documentation

a322735

ojwb added 3 commits August 15, 2024 14:50

Merge branch 'master' into expression-value-overhaul

b7a637c

Back out clean-up of nextchar() return value

c8f4106

This can be handled separately.

Add u8 and raw string examples for stringval

dc987e2

Also note that there may currently be bugs handling embedded zero bytes.

ojwb added 2 commits August 16, 2024 09:26

Avoid reading attribute twice

a21415d

[csharp] Fix char encoding in C/C++ wrapper

29965c7

The value attribute is now always meant to be the value as valid C/C++ code so don't update it to use C# escaping.

This was referenced Aug 16, 2024

C backend #2086

Merged

Fortran-2003 target language #1195

Closed

ojwb mentioned this pull request Aug 16, 2024

CI builds for CFFI (as of 9 May 2021) #2200

Draft

ojwb added 4 commits August 17, 2024 14:08

Add regression test for C# hex escaping bug

846340a

[go] Fix constant wrapping bug

5c6704f

This was causing most non-integer constants not to be wrapped using a native Go constant.

[go] Resolve remaining FIXME comment

e2d5350

[go] Use hex escapes for quotes in %(goescape)s

8a642a3

In Go, \' isn't valid in a double quoted string, while \" isn't valid in a single quoted rune, so to avoid needing two different escaping functions we always escape both using hex escapes.

ojwb closed this in 7a8c9fd Aug 17, 2024

ojwb mentioned this pull request Aug 17, 2024

Sort out handling of string literals with embedded zero bytes #2996

Closed

ojwb deleted the expression-value-overhaul branch September 1, 2024 04:10

ojwb added a commit to vadz/swig that referenced this pull request Sep 11, 2024

Update C backend for swig#2990

6848b3e

Straighten out handling of char and string constants #2990

Straighten out handling of char and string constants #2990

Uh oh!

Conversation

ojwb commented Aug 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ojwb commented Aug 8, 2024

Uh oh!

wsfulton commented Aug 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wsfulton commented Aug 14, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wsfulton commented Aug 14, 2024

Uh oh!

ojwb commented Aug 14, 2024

Uh oh!

wsfulton commented Aug 14, 2024

Uh oh!

ojwb commented Aug 14, 2024

Uh oh!

ojwb commented Aug 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ojwb commented Aug 14, 2024

Uh oh!

ojwb commented Aug 15, 2024

Uh oh!

ojwb commented Aug 15, 2024

Uh oh!

ojwb commented Aug 15, 2024

Uh oh!

wsfulton commented Aug 15, 2024

Uh oh!

ojwb commented Aug 15, 2024

Uh oh!

ojwb commented Aug 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ojwb commented Aug 16, 2024

Uh oh!

wsfulton commented Aug 16, 2024

Uh oh!

wsfulton commented Aug 16, 2024

Uh oh!

ojwb commented Aug 16, 2024

Uh oh!

ojwb commented Aug 16, 2024

Uh oh!

ojwb commented Aug 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wsfulton commented Aug 17, 2024

Uh oh!

ojwb commented Sep 2, 2024

Uh oh!

Uh oh!

ojwb commented Aug 8, 2024 •

edited

Loading

wsfulton commented Aug 14, 2024 •

edited

Loading

ojwb commented Aug 14, 2024 •

edited

Loading

ojwb commented Aug 16, 2024 •

edited

Loading

ojwb commented Aug 17, 2024 •

edited

Loading