Data can be corrupted by WSLgit assuming everything is UTF-8

WSLgit forces a conversion of all data coming from git (diffs, key/values, files, status, binary, ...everything) from unknown encodings into UTF-8. This can corrupt the data.

Per the Git documentation:
* Git at the core level treats path names simply as sequences of non-NUL bytes, there are no path name encoding conversions.
* Commit log messages are typically encoded in UTF-8, but other extended ASCII encodings and codepages are also supported. This includes ISO-8859-x, CP125x and many others, but not UTF-16/32, EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5, EUC-x, CP9xx etc.).
* Repositories created on such [non UTF-8] systems will not work properly on UTF-8-based systems (e.g. Linux, Mac, Windows) and vice versa.
* Line endings can be LF, CRLF, or any native system line ending. Line endings can differ per file or be the same across an entire repository.

Today, with WSLgit v0.50, data that is not UTF-8 is already being corrupted. Several charsets like ASCII, Windows-1251, and ISO-8859-1 have the [same byte encoding for commonly used characters](https://en.wikipedia.org/wiki/Character_encoding): the ASCII range (bytes 0x00 to 0x7F) is encoded identically in all of them and in UTF-8. When WSL, the git settings, and the user's Windows character set all stay within this limited set of charsets, users are unlikely to notice corruption. This is the often-seen scenario where almost all characters are fine except for the occasional ü or Ó that doesn't look correct.
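To make the failure mode concrete, here is a minimal sketch of what a lossy conversion does to one non-UTF-8 byte. The helper name `round_trip_lossy` is made up for illustration and is not taken from WSLgit; only the standard library's `String::from_utf8_lossy` is used.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decode `raw` with the lossy API, then return the
/// bytes that would be written back out after that conversion.
fn round_trip_lossy(raw: &[u8]) -> Vec<u8> {
    let text: Cow<str> = String::from_utf8_lossy(raw);
    text.into_owned().into_bytes()
}

fn main() {
    // "für" encoded as ISO-8859-1: 'ü' is the single byte 0xFC.
    let latin1: &[u8] = b"f\xFCr";
    // The same word encoded as UTF-8: 'ü' is the two bytes 0xC3 0xBC.
    let utf8: &[u8] = "f\u{FC}r".as_bytes();

    // The ISO-8859-1 input comes back as [66, EF, BF, BD, 72]:
    // the 0xFC byte has been replaced by U+FFFD and cannot be recovered.
    println!("{:02X?}", round_trip_lossy(latin1));
    // The UTF-8 input comes back unchanged: [66, C3, BC, 72].
    println!("{:02X?}", round_trip_lossy(utf8));
}
```

The ASCII bytes `f` and `r` survive in both cases, which matches the "mostly fine, except for the occasional ü" symptom described above.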

This leads me to some recommendations:

1. I do not recommend that WSLgit have comprehensive character-set support. Why? Because git itself has some limitations, as written above. And WSLgit is not a tool that needs to solve all problems.
2. One problem not to fix is line endings. Line endings are not just LF and CRLF. However, since Git has some limitations here and we don't have to solve all problems, I recommend supporting only LF and CRLF. References: https://en.wikipedia.org/wiki/Newline and https://docs.microsoft.com/en-us/visualstudio/ide/encodings-and-line-breaks
3. Do not use conversion or validation APIs like `from_utf8_lossy()` or `from_utf8()`. Data that is not valid UTF-8 does not survive them: `from_utf8_lossy()` replaces every invalid sequence with U+FFFD, and `from_utf8()` rejects the input with an error. Instead, transparently accept all data with APIs like `from_utf8_unchecked()` or leave it in byte buffers. (Note that `from_utf8_unchecked()` is `unsafe` and Rust documents its input as required to be valid UTF-8, so byte buffers are the safer of the two.) This is ok. Why? Because there are two cases today:
    1. UTF-8 data: this data works today and will continue to work with the unchecked APIs because it is already valid UTF-8. A side benefit is that the unchecked APIs are faster, since they skip validation of data that is already valid.
    2. Non-UTF-8 data: this data is already being corrupted today, so it can only get better. With the unchecked APIs, it is _possible_ for the data to remain usable. All of this data is just bits in groups of bytes, and the bytes should not change. For commands like `git show` this is essential: the data must not be altered in any way. For commands like `git rev-parse` the data might still work, because the byte values are passed through transparently, and tools that capture them, like VS Code, may then interpret them correctly in the user's alternate character set. Yes, there might be a problem if a path contains non-UTF-8 data and one or more of the `/\:mnt` characters have a different byte encoding in that charset. However, that is no worse than today's corruption of the same data. And it may be better: if those 6 characters are encoded with the same bytes, the other characters can pass through transparently.
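As an illustration of recommendation 3, the following sketch takes the byte-buffer route. It avoids `from_utf8_unchecked()` because that function is `unsafe` and the Rust documentation requires its input to be valid UTF-8. The function `wsl_to_windows_path` and the exact `/mnt/<drive>/` rule are assumptions made for this example; this is not the code in the linked branch. Only the ASCII bytes of the prefix and the `/` separator are inspected, and every other byte is copied as-is.

```rust
use std::io::{self, Write};

/// Hypothetical helper: rewrite a leading `/mnt/<drive>/` into `<DRIVE>:\`
/// and treat the rest of the path as opaque bytes, never as text.
fn wsl_to_windows_path(path: &[u8]) -> Vec<u8> {
    match path {
        // Match the prefix byte by byte; no decoding takes place.
        [b'/', b'm', b'n', b't', b'/', drive, b'/', rest @ ..]
            if drive.is_ascii_alphabetic() =>
        {
            let mut out = vec![drive.to_ascii_uppercase(), b':', b'\\'];
            // Swap separators; every other byte is copied unchanged.
            out.extend(rest.iter().map(|&b| if b == b'/' { b'\\' } else { b }));
            out
        }
        // Anything that is not under /mnt/<drive>/ passes through untouched.
        _ => path.to_vec(),
    }
}

fn main() -> io::Result<()> {
    // A path whose file name contains the ISO-8859-1 byte for 'ü' (0xFC).
    let input: &[u8] = b"/mnt/c/src/f\xFCr.txt";
    let translated = wsl_to_windows_path(input);

    // Write raw bytes to stdout so the consumer can apply its own charset.
    let mut stdout = io::stdout();
    stdout.write_all(&translated)?;
    stdout.write_all(b"\n")
}
```

Because the 0xFC byte is never decoded, it reaches the caller unchanged, which is the behavior described in case 2 above.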

I have code which uses the above recommendations at https://github.com/diablodale/wslgit/tree/charset_trans


