test: Correctly decode UTF-8 literal string paths #24469
Call fs::u8path to convert some UTF-8 string literals to paths, instead of relying on implicit conversions. The implicit conversions incorrectly decode const char* paths using the current Windows code page, instead of treating them as UTF-8. This could cause test failures depending on the environment in which the Windows tests are run. Issue was reported by MarcoFalke <falke.marco@gmail.com> in bitcoin#24306 (comment)
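The difference between the two conversions can be sketched with the standard library alone. This is a minimal illustration, assuming `fs::u8path` behaves like `std::filesystem::u8path`; the helper names `PathFromUtf8`, `PathToUtf8`, `ExplicitPath` and `ImplicitPath` are hypothetical and do not come from the codebase.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: decode the bytes as UTF-8 on every platform.
// (std::filesystem::u8path is deprecated in C++20 but still available.)
inline std::filesystem::path PathFromUtf8(const std::string& utf8)
{
    return std::filesystem::u8path(utf8);
}

// Hypothetical helper: encode the path back to UTF-8 bytes.
// u8string() returns std::string in C++17 and std::u8string in C++20,
// so the bytes are copied out to keep a single return type.
inline std::string PathToUtf8(const std::filesystem::path& p)
{
    const auto u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
}

// Explicit conversion: the literal is always treated as UTF-8.
inline std::filesystem::path ExplicitPath()
{
    return PathFromUtf8("f\xc3\xb6\xc3\xb6");
}

// Implicit conversion: the literal is decoded with the native narrow
// encoding, which on Windows is the current code page, not UTF-8.
inline std::filesystem::path ImplicitPath()
{
    return std::filesystem::path("f\xc3\xb6\xc3\xb6");
}
```

On POSIX systems both functions typically yield the same path, because narrow strings are stored without conversion. The results can differ on Windows, which is the environment this change is concerned with.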
crACK 2f5fd3c
I think the second option is good. While it requires more verbosity from callers, it also makes it explicit that the object must be an fs::path instance, not a string.
Concept ACK. The only concern is the future maintainability of the codebase: the suggested changes, while correct, are not enforced by a test or by the fs::path interface. So I lean toward option 2.
Concept ACK
- I think the potential risk of inconsistent behavior of fs_tests across different environments is a bug severe enough to be fixed. Hence I think options 0 and 1 are not feasible.
- At first glance, I found option 3 more appealing, as it reduces the need for verbose arguments. However, after reading the argument against option 3, I am convinced it is not the way forward. We do not want to encourage representing paths as strings.
- I think option 2 is the way to go. Though it makes arguments verbose, it ensures there is no risk of a non-ASCII character being misinterpreted as an ASCII one. It would also encourage developers to convert strings to paths as soon as possible, without relying on internal conversions.
There are also other followup options beyond what's listed above. Since the goal of providing |
Or something like |
I'm not sure non-ASCII path literals are something that needs a lot of special thought (or a special syntax). It's not something that we're likely to do except for testing. Unicode paths will generally come from the system or from the configuration, not our code. This PR looks fine to me.
2f5fd3c test: Correctly decode UTF-8 literal string paths (Ryan Ofsky)

Pull request description:

Call `fs::u8path()` to convert some UTF-8 string literals to paths, instead of relying on the implicit conversion. Fake Macro pointed out in bitcoin#24306 (comment) that `fs_tests` are incorrectly decoding some literal UTF-8 paths using the current Windows code page, instead of treating them as UTF-8. This could cause test failures depending on the environment in which the Windows tests are run.

The `fs::path` class exists to avoid problems like this, but because it is lenient with `const char*` conversions, under the assumption that they are ["safe as long as the literals are ASCII"](https://github.com/bitcoin/bitcoin/blob/727b0cb59259ac63c627b09b503faada1a89bfb8/src/fs.h#L39), bugs like this are still possible.

If we think this is a concern, followup options to try to prevent this bug in the future are:

0. Do nothing.
1. Improve the "safe as long as the literals are ASCII" comment. Make it clear that non-ASCII strings are invalid.
2. Drop the implicit `const char*` conversion functions. This would be nice because it would simplify the `fs::path` class a little, while making it safer. The drawback is that it would require some more verbosity from callers. For example, instead of `GetDataDirNet() / "mempool.dat"` they would have to write `GetDataDirNet() / fs::u8path("mempool.dat")`.
3. Keep the implicit `const char*` conversion functions, but make them call `fs::u8path()` internally. Change the "safe as long as the literals are *ASCII*" comment to "safe as long as the literals are *UTF-8*".

I'd be happy with 0, 1, or 2. I'd be a little resistant to 3 even though it would add more safety, because it would slightly increase complexity, and because I think it would encourage representing paths as strings. There are so many footguns associated with paths as strings that it's best to convert strings to paths at the earliest point possible, and convert paths to strings at the latest point possible.

ACKs for top commit:
laanwj: Code review ACK 2f5fd3c
w0xlt: crACK 2f5fd3c

Tree-SHA512: 9c56714744592094d873b79843b526d20f31ed05eff957d698368d66025764eae8bfd5305d5f7b6cc38803f0d85fa5552003e5c6cacf1e076ea6d313bcbc960c
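Option 2 from the description could be sketched roughly as follows. This is not the actual `fs::path` class from `src/fs.h`: `sketch::StrictPath`, `sketch::u8path` and `std_path()` are hypothetical names, and the wrapper uses composition only to keep the example short.

```cpp
#include <filesystem>
#include <string>
#include <type_traits>
#include <utility>

namespace sketch {

// Hypothetical strict wrapper illustrating option 2.
class StrictPath
{
public:
    StrictPath() = default;
    // Wrapping an existing std::filesystem::path is always allowed.
    explicit StrictPath(std::filesystem::path p) : m_path(std::move(p)) {}
    // Narrow strings never convert to a path, implicitly or explicitly.
    StrictPath(const char*) = delete;
    StrictPath(const std::string&) = delete;

    const std::filesystem::path& std_path() const { return m_path; }

private:
    std::filesystem::path m_path;
};

// The only way in from a string: the caller states that it is UTF-8.
inline StrictPath u8path(const std::string& utf8)
{
    return StrictPath(std::filesystem::u8path(utf8));
}

// Appending is defined for two paths only.
inline StrictPath operator/(const StrictPath& lhs, const StrictPath& rhs)
{
    return StrictPath(lhs.std_path() / rhs.std_path());
}

// The compiler now rejects the conversions that option 2 removes.
static_assert(!std::is_convertible_v<const char*, StrictPath>);
static_assert(!std::is_convertible_v<std::string, StrictPath>);
static_assert(!std::is_constructible_v<StrictPath, std::string>);

} // namespace sketch
```

With this shape, `dir / "mempool.dat"` no longer compiles and the caller has to write `dir / sketch::u8path("mempool.dat")`, which is the extra verbosity the description mentions.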
…lowed by path append operators

f64aa9c Disallow more unsafe string->path conversions allowed by path append operators (Ryan Ofsky)

Pull request description:

Add more `fs::path` `operator/` and `operator+` overloads to prevent unsafe string->path conversions on Windows that would cause strings to be decoded according to the current Windows locale & code page instead of the correct string encoding.

Update application code to deal with loss of implicit string->path conversions by calling `fs::u8path` or `fs::PathFromString` explicitly, or by just changing variable types from `std::string` to `fs::path` to avoid conversions altogether, or to make them happen earlier.

In all cases, there's no change in behavior, either (1) because strings only contained ASCII characters and would be decoded the same regardless of what encoding was used, or (2) because of the 1:1 mapping between paths and strings using the `PathToString` and `PathFromString` functions.

Motivation for this PR was just that I was experimenting with bitcoin#24469 and noticed that operations like `fs::path / std::string` were allowed, and I thought it would be better not to allow them.

ACKs for top commit:
hebasto: ACK f64aa9c

Tree-SHA512: 944cce49ed51537ee7a35ea4ea7f5feaf0c8fff2fa67ee81ec5adebfd3dcbaf41b73eb35e49973d5f852620367f13506fd12a7a9b5ae3a7a0007414d5c9df50f
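One way to disallow `path / std::string` while still allowing literals is a deleted catch-all overload. The sketch below shows that general technique and is not the code from the referenced commit; `sketch::Path`, `std_path()` and `CanAppend` are hypothetical names.

```cpp
#include <filesystem>
#include <string>
#include <type_traits>
#include <utility>

namespace sketch {

// Hypothetical wrapper standing in for a project-specific path type.
class Path
{
public:
    Path() = default;
    explicit Path(std::filesystem::path p) : m_path(std::move(p)) {}
    const std::filesystem::path& std_path() const { return m_path; }

private:
    std::filesystem::path m_path;
};

// Allowed: appending another Path.
inline Path operator/(const Path& lhs, const Path& rhs)
{
    return Path(lhs.std_path() / rhs.std_path());
}

// Allowed: appending a string literal, assumed to be ASCII.
inline Path operator/(const Path& lhs, const char* rhs)
{
    return Path(lhs.std_path() / rhs);
}

// Disallowed: any other right-hand type, including std::string.
// Non-template overloads win for exact matches, so this only catches
// the types that are not listed above.
template <typename T>
Path operator/(const Path& lhs, T rhs) = delete;

// Compile-time check of which append expressions are well-formed.
template <typename L, typename R, typename = void>
struct CanAppend : std::false_type {};

template <typename L, typename R>
struct CanAppend<L, R, std::void_t<decltype(std::declval<L>() / std::declval<R>())>>
    : std::true_type {};

static_assert(CanAppend<Path, Path>::value);
static_assert(CanAppend<Path, const char*>::value);
static_assert(!CanAppend<Path, std::string>::value);

} // namespace sketch
```

A caller holding a `std::string` then has to convert it to a path explicitly before appending, which moves the encoding decision to the point where the string's origin is known.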
Call `fs::u8path()` to convert some UTF-8 string literals to paths, instead of relying on the implicit conversion. Fake Macro pointed out in #24306 (comment) that `fs_tests` are incorrectly decoding some literal UTF-8 paths using the current Windows code page, instead of treating them as UTF-8. This could cause test failures depending on the environment in which the Windows tests are run.

The `fs::path` class exists to avoid problems like this, but because it is lenient with `const char*` conversions, under the assumption that they are "safe as long as the literals are ASCII", bugs like this are still possible.

If we think this is a concern, followup options to try to prevent this bug in the future are:

0. Do nothing.
1. Improve the "safe as long as the literals are ASCII" comment. Make it clear that non-ASCII strings are invalid.
2. Drop the implicit `const char*` conversion functions. This would be nice because it would simplify the `fs::path` class a little, while making it safer. The drawback is that it would require some more verbosity from callers. For example, instead of `GetDataDirNet() / "mempool.dat"` they would have to write `GetDataDirNet() / fs::u8path("mempool.dat")`.
3. Keep the implicit `const char*` conversion functions, but make them call `fs::u8path()` internally. Change the "safe as long as the literals are ASCII" comment to "safe as long as the literals are UTF-8".

I'd be happy with 0, 1, or 2. I'd be a little resistant to 3 even though it would add more safety, because it would slightly increase complexity, and because I think it would encourage representing paths as strings. There are so many footguns associated with paths as strings that it's best to convert strings to paths at the earliest point possible, and convert paths to strings at the latest point possible.
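For comparison, option 3 could look roughly like the sketch below, where the implicit `const char*` constructor is kept but decodes its argument as UTF-8. `sketch::Utf8Path`, `std_path()` and `utf8()` are hypothetical names and this is not the project's `fs::path` class.

```cpp
#include <filesystem>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical class illustrating option 3.
class Utf8Path
{
public:
    Utf8Path() = default;
    explicit Utf8Path(std::filesystem::path p) : m_path(std::move(p)) {}
    // Implicit on purpose: literals are decoded as UTF-8 instead of
    // with the native narrow encoding.
    Utf8Path(const char* utf8) : m_path(std::filesystem::u8path(utf8)) {}

    const std::filesystem::path& std_path() const { return m_path; }

    // UTF-8 bytes of the path. u8string() returns std::u8string in
    // C++20, so the bytes are copied into a std::string.
    std::string utf8() const
    {
        const auto u8 = m_path.u8string();
        return std::string(u8.begin(), u8.end());
    }

private:
    std::filesystem::path m_path;
};

// Literals on either side convert through the UTF-8 constructor.
inline Utf8Path operator/(const Utf8Path& lhs, const Utf8Path& rhs)
{
    return Utf8Path(lhs.std_path() / rhs.std_path());
}

} // namespace sketch
```

Call sites stay as short as `dir / "mempool.dat"`, but the string-to-path conversion remains invisible at the call site, which is the property the description argues against.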