-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Straighten out handling of char and string constants #2990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CI passed when I pushed the branch. Then it started to run it all again when I opened the PR, which seems a huge waste of resources so I've killed those jobs, but that means it wrongly shows as failing CI. |
but necessary as it's testing different things. The Difference Between Push and Pull Request Triggers in Github should help understand. Perhaps only noticeable when developing in branches in the SWIG repo, which is not the norm of developing/pushing in your own repo. I have just started digesting the pull request, but it would help with a clearer definition and expanded examples of what stringval is. Looking at the docs/comments for it in parser.y in the
Beginning the table with these 3 lines in the table would help drive it home: // val stringval
// --------- -----------
// "bar" bar
// "b\x61r" bar
// "b\141r" bar |
I think these escapings you have can be used from within typemaps??? If so, might be useful, so I'd leave it as is. Also it makes sense to leave these similar functions in the one file (misc.c). |
I started comparing the test-suite before and after the changes. Quick observations:
If those get fixed, I'll start another big comparison of the outputs review. |
I understand the difference, but in this particular case the branch merged to current master would be exactly the same tree of files (because the branch was one commit on top of current master) and SWIG's CI doesn't do anything significantly different for the branch push and PR cases so it didn't seem worth burning hours of CPU time on. |
From what I can see expression-value-overhaul branched off master at 74a8aaf and master has 4 commits not on the expression-value-overhaul branch. So adding the single commit on expression-value-overhaul onto the head of master is not the same as building expression-value-overhaul. |
Those 4 commits were all pushed AFTER I pushed my branch though, so the PR CI run I cancelled would not have included them. |
That's approximately it. I think the most helpful way to think of I feel I have a concept in my head but I haven't completely cracked how to explain it in a clear enough way, but I'm not sure quite what I'm failing to convey. Sorry - I suspect this is frustrating for both of us. Possibly a helpful way to think of #include <unistd.h>
#define LITERAL "b\x61r"
int main() { return write(1, LITERAL, sizeof(LITERAL) - 1) < sizeof(LITERAL) - 1; } Outputs And showing stringification and string concatenation: #include <unistd.h>
#define S(X) #X
#define LITERAL S(b\x61) S(r)
int main() { return write(1, LITERAL, sizeof(LITERAL) - 1) < sizeof(LITERAL) - 1; } Outputs
OK.
Yes, I avoided examples such as
OK.
Are you thinking of something like a "plain" C string with UTF-8 sequence byte values individually encoded as That's a bit of a tricky case as we have no character set encoding information (I don't think we're particularly clever in handling wide string literals or
OK. |
All due to one missing
The C# change is such as For I see I added a regression test for the octal case but not the hex one - I'll add one. I also see SWIG currently mishandles e.g.
Different target languages have different string literal escape sequences, so in general I wouldn't expect to always get the exact same escaping across target languages (to achieve that there would need to be a common subset that (a) has the same interpretation in every target language and (b) is capable of representing every possible string literal value, but I don't think such a subset actually exists...) |
I've noted this point about
Added.
Leaving this.
I've added examples of the I need to do some checking with C++11 raw string literals and the
Leaving this for a follow-up change if that's OK.
Added. |
I've pushed fixes to git master for these bugs since they're only tangentially related to the changes here. |
This can be handled separately.
Also note that there may currently be bugs handling embedded zero bytes.
I've backed out the I'll shortly push that plus changes to add C++11 u8 and raw string examples to the stringval comment (a side note is that I've also merged master to the branch. |
Motivation for the changes in the .cs files is great. I just don't think the change in the constants_wrap.c file, such as: Regarding the explanations for Rather than responding to all your other comments, in summary all good. I've still to do some more checking before and after the change. Python and MzScheme look okay now. Besides the odd testcase bug fix due to changes in the branch, this is a clean refactor and it's not hard to compare using the test-suite from the expression-value-overhaul branch and builds from master and the branch (using make partialcheck-test-suite). |
Oh sorry, I'd missed this was in the generated C/C++ code. That definitely does seem wrong - presumably I'm using
That would have made sense, except before my changes it was actually
Yes, I'm not totally happy with I've not thought a lot about improving handling of Unicode strings, but we could eventually have
Great.
OK (and thanks - I did try hard to keep this clean). Re your earlier comment:
I'm not sure if they are useful in typemaps, but if we expect them to be used in this way perhaps we need a further tweak to the octal escaping in Perhaps we should also use a 3 digit sequence for the last character in the string (if it needs octal escaping). However that will mean we always do this in the character constant case (e.g. It could also just be noted as a limitation. Thoughts? |
The value attribute is now always meant to be the value as valid C/C++ code so don't update it to use C# escaping.
We were updating the I'll push in a bit once I've checked for any other similar changes to the Update: All other places that |
Reviewing all the comments, I think what's not been completely resolved in the branch now is:
BTW, what's your workflow for comparing partialcheck output before and after? I've yet to find a way that doesn't feel clunky to work with, but maybe I'm missing something obvious. Do you just use a second SWIG source tree (as a git worktree maybe)? |
In Java (and C#, maybe others) for the string_constants.i testcase I see public final static String ZS1 = "\0"; is now public final static String ZS1 = ""; when wrapping #define ZS1 "\0" Maybe an esoteric corner case, but this might be a regression as they aren't quite the same thing. But on the other hand, there is a good change going from: public final static String ES2 = "\0"; to now public final static String ES2 = ""; for %constant ES2=""; |
The Go output is quite different for constants. It looks like the value is obtained from the C/C++ layer instead of outputting a constant in Go code. This is fine and probably should always have been the default. Is there an equivalent to Otherwise, I have finished my regression checks. Once my comments from today are addressed, feel free to merge to master.
Reminder, I can't see this yet. Regarding unicode, I don't enjoy this area nor feel particularly competent in it too, so don't have much to add. It does seem to be an area in general that could do with some focus and more testing so would encourage anyone keen on it to do so.
Probably not too important. We could have a compromise and if there is an octal escape at the end of the string, always emit 3 digits. Or add some more syntax something like
It's a little clunky, but not hard and varies depending on what I'm doing. Often I check for just a one language for example, java: make clean-java-test-suite && make partialcheck-java-test-suite
rm -rf ~/swig/partialgoodts/java
cp -rf Examples/test-suite/java ~/swig/partialgoodts/java # to grab a new baseline
# make some changes/switch branch or whatever to get a new version, possibly build with make
make clean-java-test-suite && make partialcheck-java-test-suite
meld Examples/test-suite/java ~/swig/partialgoodts/java & Once a baseline has been taken, it is then easy to keep repeating the comparison by running the test-suite and using meld. |
I think getting all the cases involving embedded zero bytes to work correctly is going to require going through the code handling these values to ensure it doesn't trip over these bytes (it's easy to truncate the string at them by treating it as a C string). For example, we can't do Getting Are you happy to leave this as it now is for now and I'll address it with a follow-up patch? |
I didn't intend to change that. I'll investigate.
OK.
Oh yes, will do.
I find it hard to get enthused about dealing with wide character Unicode, but would like to have UTF-8 handled well at least. Perhaps just converting all string and character literal values to be UTF-8 internally would make most sense (and the target language can then map them back if it wants wide characters, with shared helper functions to do that). If we try to handle different representations inside SWIG then the target languages all have to deal with all the representations which actually seems more complex overall.
Um, that's exactly what I was suggesting as an option above...
I'm leaning towards just noting
Thanks - that's pretty much what I've been doing, but feeling I was missing a simpler way. Perhaps it's worth a make rule to automate. |
This was causing most non-integer constants not to be wrapped using a native Go constant.
In Go, \' isn't valid in a double quoted string, while \" isn't valid in a single quoted rune, so to avoid needing two different escaping functions we always escape both using hex escapes.
The Go wrapping change was a bug - fixed by 5c6704f, and now it wraps slightly more cases using Go constants than before. I've also add a tweak to make it use casts more like it did before (5c6704f) - I don't think this matters, but keeping it more like before seems helpful. C# hex escaping regression test added by 846340a. There's also a bug with string escaping in Go as |
Makes sense to address DOH's String embedded |
The handling of string literals containing zero bytes was addressed by #2997 (which fixed the The |
Fixes #904
Fixes #1907
Fixes #2579
(Yes I know there's still a placeholder date in CHANGES.current...)
@wsfulton I'd welcome your input on this.
The basic idea is that
.val
in the parser and thevalue
attribute are always the C/C++ source code for an expression, so if the target language just wants to emit that in the generated C/C++ wrapper then it can just write it out without needing to do anything special forT_STRING
,T_CHAR
, etc.Then
.stringval
in the parser and thestringval
attribute are always the actual value of the string or character in a DOHString
. If a backend wants to emit the string or character as a string or character in the target language it can take this and apply appropriate escaping and quoting.I've split
%(hexescape)s
into%(csharpescape)s
for C# and%(goescape)s
for Go as both need hex escaping but with different rules. Maybe the language-specific escaping should actually move to the relevant language module instead - we'd probably have to use something likeinstead of
but that doesn't actually seem worse.