Description
WSLgit is forcing a conversion of data from git (diffs, key/values, files, status, binary, ...everything) from unknown encodings into UTF-8. This can result in corruption of data.
As per the Git documentation:
- Git at the core level treats path names simply as sequences of non-NUL bytes, there are no path name encoding conversions.
- Commit log messages are typically encoded in UTF-8, but other extended ASCII encodings and codepages are also supported. This includes ISO-8859-x, CP125x and many others, but not UTF-16/32, EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5, EUC-x, CP9xx etc.).
- Repositories created on such [non UTF-8] systems will not work properly on UTF-8-based systems (e.g. Linux, Mac, Windows) and vice versa.
- Line endings can be LF, CRLF, or any native system line ending. Line endings can change per-file or be across an entire repository.
Today with WSLgit v0.50, data that is not UTF-8 is already being corrupted by WSLgit. Several charsets like ASCII, Windows-1251, and ISO-8859-1 share the same byte encoding for commonly used characters. When WSL, the git settings, and the user's Windows character set are all within this limited set of charsets, users are unlikely to notice corruption. This is the often-seen scenario where almost all the characters are ok except for the occasional ü or Ó that doesn't look correct.
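To make the failure mode concrete, here is a minimal sketch of what a lossy conversion does to ISO-8859-1 bytes compared with plain ASCII. The helper names `lossy_round_trip` and `passthrough` are hypothetical and are not taken from the WSLgit source; only the standard-library call `String::from_utf8_lossy` is real.

```rust
/// Decode bytes the lossy way and re-encode them, as a forced UTF-8
/// conversion would. Invalid sequences are replaced with U+FFFD.
/// (Hypothetical helper for illustration.)
fn lossy_round_trip(raw: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(raw).into_owned().into_bytes()
}

/// Leave the bytes exactly as they were received.
/// (Hypothetical helper for illustration.)
fn passthrough(raw: &[u8]) -> Vec<u8> {
    raw.to_vec()
}
```

With the input `b"Gr\xFC\xDFe"` ("Grüße" in ISO-8859-1), `lossy_round_trip` turns each of the two non-ASCII bytes into the three-byte sequence `EF BF BD`, so the original characters cannot be recovered. Pure ASCII input such as `b"hello"` comes back unchanged, which is why the corruption only shows up on the occasional accented character.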
This leads me to some recommendations:
- I do not recommend WSLgit have comprehensive character-set support. Why? Because git itself has some limitations as written above. And, WSLgit is not a tool that needs to solve all problems.
- One of the problems not to fix is line endings. Line endings are not just LF and CRLF. However, since Git has some limitations here and we don't have to solve all problems, I recommend supporting only LF and CRLF. References: https://en.wikipedia.org/wiki/Newline and https://docs.microsoft.com/en-us/visualstudio/ide/encodings-and-line-breaks
- Do not use conversion or validation APIs like `from_utf8_lossy()` or `from_utf8()`. These APIs will corrupt or reject data that is not valid UTF-8: `from_utf8_lossy()` substitutes U+FFFD for invalid bytes, and `from_utf8()` returns an error. Instead, transparently accept all data with APIs like `from_utf8_unchecked()` or leave it in byte buffers. This is ok. Why? Because there are two cases today:
  - UTF-8 data -- this data works today and will continue to work with the unchecked APIs because it is already valid UTF-8. A side benefit is that the unchecked APIs will be faster because they don't need to validate anything, since the data is already valid UTF-8.
  - not UTF-8 data -- this data is already being corrupted today. Therefore, it can only get better. By using the unchecked APIs, it is possible for the data to still be workable/usable. All of this data is just bits in groups of bytes. The bytes should not change. For commands like `git show` this is essential -- the data must not be altered in any way. For commands like `git rev-parse` the data might still work since the byte values can be transparently passed through and, when they are captured by tools like VS Code, those tools may correctly apply the user's alternate character set because the bytes reached them unchanged. Yes, there might be a problem if the data uses a path with non UTF-8 data in it and one or more of the `/\:mnt` characters have a different byte encoding. However, this is no worse than the corruption of that same data that already happens today. And it might be better if those 6 characters are encoded with the same bytes, because the other characters may be transparently passed through.
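The pass-through idea can be sketched on a path translation. The following is an illustration only, not the code from the linked branch: `win_to_wsl_path` is a hypothetical function, and it assumes the six `/\:mnt` characters and the drive letter use their ASCII byte values. It takes the byte-buffer route rather than `from_utf8_unchecked()`, because that function is `unsafe` and its documentation requires the input to be valid UTF-8.

```rust
/// Translate a Windows drive path such as `C:\dir\file` into
/// `/mnt/c/dir/file`, working on raw bytes only.
/// (Hypothetical helper; assumes ASCII byte values for `/ \ : m n t`.)
fn win_to_wsl_path(path: &[u8]) -> Vec<u8> {
    // Match `<ASCII letter>:` followed by `\` or `/`.
    let is_drive_path = path.len() >= 3
        && path[0].is_ascii_alphabetic()
        && path[1] == b':'
        && (path[2] == b'\\' || path[2] == b'/');

    if !is_drive_path {
        // Not a drive path: hand the bytes back untouched.
        return path.to_vec();
    }

    let mut out = Vec::with_capacity(path.len() + 4);
    out.extend_from_slice(b"/mnt/");
    out.push(path[0].to_ascii_lowercase());
    // Swap backslashes for forward slashes; every other byte,
    // valid UTF-8 or not, is copied through unchanged.
    out.extend(path[2..].iter().map(|&b| if b == b'\\' { b'/' } else { b }));
    out
}
```

Given `b"C:\\caf\xE9\\notes.txt"` (an ISO-8859-1 `é`), the result is `b"/mnt/c/caf\xE9/notes.txt"`: the separators and prefix are rewritten while the `0xE9` byte survives, so a consumer that knows the user's character set can still interpret it.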
I have code which uses the above recommendations at https://github.com/diablodale/wslgit/tree/charset_trans