
stb_textedit.h: support for UTF-8 #188

@ocornut


It looks like we could modify stb_textedit.h to let the user keep a UTF-8 underlying representation while avoiding random lookup. If we treat all indices and num_chars values as byte indices (not character indices), we'd only need two extra functions:

  • STB_TEXTEDIT_GETPREVIOUSCHARINDEX(obj, int idx)

Defaults to idx-1.
For UTF-8, the user would need to backtrack in the stream to the first byte that is not a continuation byte (i.e. not of the form 10xxxxxx, 0x80 to 0xBF). That would work and be efficient enough. However, it means the user's own handling of malformed UTF-8 (for which there is no standard convention, AFAIK) would have to perform the reverse operation to stay compatible with rewinding. Editing malformed UTF-8 is an edge case that is reasonable to avoid or catch earlier, and it wouldn't affect people not using UTF-8.

  • STB_TEXTEDIT_GETNEXTCHARINDEX(obj, int idx)

Defaults to idx+1.
The name gives a nice symmetry with the previous function. It could also be turned into STB_TEXTEDIT_GETNUMINDICESFORCHAR() / STB_TEXTEDIT_GETBYTECOUNTFORCHAR(), defaulting to 1.

  • A common pattern in stb_textedit.h would be to call STB_TEXTEDIT_GETWIDTH() or STB_TEXTEDIT_GETCHAR() together with one of those functions, so we could offer a way for the user to do both at once, but we don't have to. Offering that may just add unnecessary complexity.
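To make the two hooks concrete, here is a minimal sketch of what a user might plug in for a UTF-8 buffer. The function names and the explicit `len` parameter are hypothetical (nothing like this exists in stb_textedit.h today), and the code assumes well-formed UTF-8; malformed input is simply walked byte-by-byte past continuation bytes.

```c
/* Hypothetical user-side helpers, not part of stb_textedit.h.
   `s` is a UTF-8 buffer; every index is a byte offset into it. */

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Candidate body for STB_TEXTEDIT_GETPREVIOUSCHARINDEX:
   step back one byte, then keep stepping back over continuation bytes. */
int utf8_prev_char_index(const char *s, int idx)
{
    if (idx <= 0)
        return 0;
    --idx;
    while (idx > 0 && utf8_is_continuation((unsigned char)s[idx]))
        --idx;
    return idx;
}

/* Candidate body for STB_TEXTEDIT_GETNEXTCHARINDEX:
   step forward one byte, then skip any continuation bytes that follow. */
int utf8_next_char_index(const char *s, int len, int idx)
{
    if (idx >= len)
        return len;
    ++idx;
    while (idx < len && utf8_is_continuation((unsigned char)s[idx]))
        ++idx;
    return idx;
}
```

Because both directions use the same "skip continuation bytes" rule, stepping forward and then backward lands on the same index for well-formed input, which is the compatibility property mentioned above.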

With this scheme a typical loop such as

   for (i=0; first+i < n; ++i)
      find->x += STB_TEXTEDIT_GETWIDTH(str, first, i);

Would become

   for (i=0; first+i < n; i = STB_TEXTEDIT_GETNEXTCHARINDEX(str, first+i) - first)
      find->x += STB_TEXTEDIT_GETWIDTH(str, first, i);
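For comparison, the same loop could be written against the byte-count variant of the hook. This is only a sketch: `utf8_byte_count`, `demo_char_width` and `demo_row_width` are made-up names standing in for STB_TEXTEDIT_GETBYTECOUNTFORCHAR, STB_TEXTEDIT_GETWIDTH and the surrounding stb_textedit code, and the fixed width of 1.0 per character is an assumption for illustration.

```c
/* Hypothetical stand-in for STB_TEXTEDIT_GETBYTECOUNTFORCHAR: length in bytes
   of the UTF-8 sequence whose lead byte is s[idx]. An unexpected lead byte is
   counted as 1 so that the caller's loop always advances (an assumption). */
int utf8_byte_count(const char *s, int idx)
{
    unsigned char b = (unsigned char)s[idx];
    if (b < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;                          /* stray continuation or invalid byte */
}

/* Hypothetical stand-in for STB_TEXTEDIT_GETWIDTH: every character is 1.0 wide. */
float demo_char_width(const char *s, int first, int i)
{
    (void)s; (void)first; (void)i;
    return 1.0f;
}

/* Sum the widths of the characters in the byte range [first, n).
   `i` is a byte offset relative to `first` and advances one character at a time. */
float demo_row_width(const char *s, int first, int n)
{
    float x = 0.0f;
    int i;
    for (i = 0; first + i < n; i += utf8_byte_count(s, first + i))
        x += demo_char_width(s, first, i);
    return x;
}
```

With the default of 1 byte per character this collapses back to the original `++i` loop, so non-UTF-8 users would see no change in behavior.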

There are probably a few other things to solve and clarify, but that's the gist of it.
Do you think you would take such a patch?

(
This is merely me dumping some ideas, as I'm not yet sure I want to undertake this modification. It has been niggling me that my text input widget has to do back-and-forth UTF-8 / wchar conversions. As I'm trying to handle large amounts of text reasonably in an imgui context without making multiple passes over the data, the code is already quite complex and would benefit from dealing with only a single UTF-8 buffer. However, for interactive performance with large amounts of text, I may have to rewrite something anyway, because stb_textedit is not designed for large text.

So I can either:
a) Add this UTF-8 support to stb_textedit. It would simplify my code a lot (my primary focus), make it a little faster, and may generally be useful to have in stb_textedit. The downside is that the code in stb_textedit.h will look a little heavier.
b) Rewrite something custom and more stateful to handle interacting with large text. More effort. I don't absolutely need the perf, but it'd be nice. Nobody else would benefit from the improvement. I'd prefer to avoid this path.
)
