Skip to content

Conversation

chrisbra
Copy link
Member

Problem: search_stat not reset when pattern differs in case
Solution: use STRNCMP instead of MB_STRNICMP macro

There was a long standing todo comment, that using MB_STRNICMP is wrong. So let's change it to STRNCMP() instead. Even if it not handle multi-byte characters correctly, then Vim will rather recompute the search stat, instead of re-using the old (and possibly wrong) value.

fixes: #17312

Problem:  search_stat not reset when pattern differs in case
Solution: use STRNCMP instead of MB_STRNICMP macro

There was a long standing todo comment, that using MB_STRNICMP is wrong.
So let's change it to STRNCMP() instead. Even if it not handle
multi-byte characters correctly, then Vim will rather recompute the
search stat, instead of re-using the old (and possibly wrong) value.

fixes: vim#17312

Signed-off-by: Christian Brabandt <cb@256bit.org>
@girishji
Copy link
Contributor

@chrisbra, will this work for mb_strncmp()?

/*
 * Compare two multibyte strings up to "n" characters.
 * Case-sensitive version of mb_strnicmp().
 */
int mb_strncmp(char_u *s1, char_u *s2, size_t n)
{
    int c1, c2;
    size_t i = 0;

    while (i < n)
    {
        c1 = mb_ptr2char_adv(&s1);
        c2 = mb_ptr2char_adv(&s2);

        if (c1 != c2)
            return c1 - c2;
        if (c1 == NUL)  // end of string
            break;
        ++i;
    }
    return 0;
}

It should work for utf8 and DBCS. But, the comment says:

 * Needed for Big5, Shift-JIS and UTF-8 encoding.  Other DBCS encodings can
 * probably use strnicmp(), because there are no ASCII characters in the...

So, the above version may not work for Big5, etc. An alternative approach is to implement a utf_strncmp() function (below), and use it in mb_strncmp() modeled after mb_strnicmp(), but without the MB_TOLOWER() step.

/*
 * Compare two UTF-8 strings up to "n" Unicode characters.
 * Case-sensitive.
 */
int utf_strncmp(char_u *s1, char_u *s2, size_t n)
{
    int c1, c2;
    size_t i = 0;

    while (i < n)
    {
        if (*s1 == NUL && *s2 == NUL)
            return 0;

        c1 = utf_ptr2char((char_u *)s1);
        c2 = utf_ptr2char((char_u *)s2);

        if (c1 != c2)
            return c1 - c2;

        s1 += utf_ptr2len(s1);
        s2 += utf_ptr2len(s2);
        ++i;
    }

    return 0;
}

@chrisbra
Copy link
Member Author

Makes sense. I'll include your version then

@zeertzjq
Copy link
Member

zeertzjq commented May 14, 2025

I'm confused. How does STRNCMP() not handle multi-byte correctly and why is utf_strncmp() needed? And using this utf_strncmp() here is wrong, as lastpatlen is the number of bytes, not the number of characters.

@girishji
Copy link
Contributor

I'm confused. How does STRNCMP() not handle multi-byte correctly and why is utf_strncmp() needed? And using this utf_strncmp() here is wrong, as lastpatlen is the number of bytes, not the number of characters.

I think I was not clear. I meant using utf_strncmp() inside mb_strncmp:

    int
mb_strncmp(char_u *s1, char_u *s2, size_t nn)
{
...
    if (enc_utf8)
    {
	return utf_strncmp(...);
    }
    else
    {
	return STRNCMP(...)
   }
    return 0;
}

I assumed non-utf8 multi-code chars are different from utf8 multi-code (from reading the comments) and that strncmp() will not work for them. I assumed that is why strncmp() was not used earlier. Let me verify this with some tests.

@girishji
Copy link
Contributor

girishji commented May 15, 2025

@zeertzjq

STRNCMP() and STRCMP() do not appear to handle multi-byte characters correctly. Here's an example:

STRNCMP("é", "é", 2)

This returns 0, indicating the strings are equal—but they are not. The first string is a single character "é" (U+00E9), while the second is composed of two code points: "e" (U+0065) followed by a combining acute accent (U+0301).

In digraph notation:

  • The first string is entered as Ctrl-K ' e
  • The second string as e Ctrl-V u 0301

Since STRNCMP() compares raw bytes, it happens that the first two bytes match—even though the second string consists of three bytes in total. STRCMP() also gives an incorrect result for similar reasons.

EDIT: strcmp()/strncmp() also does not handle DBCS (Shift-JIS, Big5) correctly

@zeertzjq
Copy link
Member

STRNCMP() and STRCMP() do not appear to handle multi-byte characters correctly. Here's an example:

STRNCMP("é", "é", 2)

This case is not relevant here, since lastpatlen is the full length of lastpat, so it'll not compare only a part of a multibyte character.

STRCMP() also gives an incorrect result for similar reasons.

Example?

@girishji
Copy link
Contributor

girishji commented May 15, 2025

STRCMP() does handle multi-byte characters correctly—there was a bug in my test. The actual issue lies with STRNCMP(), which compares a fixed number of bytes and can misinterpret multi-byte sequences.

The correct solution seems to be to use STRCMP() or STRNCMP() (since lastpatlen will not split multibyte char).

Apologies for the earlier confusion.

@chrisbra
Copy link
Member Author

That's why I initially used STRNCMP. Because even if it does not handle multi-byte chars correctly, it would cause the search indicator to be updated additionally. Okay, I'll revert it then. Thanks both

@girishji
Copy link
Contributor

Thanks.

@chrisbra chrisbra closed this in 670d0c1 May 16, 2025
zeertzjq added a commit to zeertzjq/neovim that referenced this pull request May 16, 2025
Problem:  search_stat not reset when pattern differs in case
          (tahzibijafar)
Solution: use STRNCMP instead of MB_STRNICMP macro

There was a long standing todo comment, that using MB_STRNICMP is wrong.
So let's change it to STRNCMP() instead. Even if it not handle
multi-byte characters correctly, then Vim will rather recompute the
search stat, instead of re-using the old (and possibly wrong) value.

fixes: vim/vim#17312
closes: vim/vim#17314

vim/vim@670d0c1

Co-authored-by: Christian Brabandt <cb@256bit.org>
zeertzjq added a commit to neovim/neovim that referenced this pull request May 16, 2025
#34058)

Problem:  search_stat not reset when pattern differs in case
          (tahzibijafar)
Solution: use STRNCMP instead of MB_STRNICMP macro

There was a long standing todo comment, that using MB_STRNICMP is wrong.
So let's change it to STRNCMP() instead. Even if it not handle
multi-byte characters correctly, then Vim will rather recompute the
search stat, instead of re-using the old (and possibly wrong) value.

fixes: vim/vim#17312
closes: vim/vim#17314

vim/vim@670d0c1

Co-authored-by: Christian Brabandt <cb@256bit.org>
github-actions bot pushed a commit to neovim/neovim that referenced this pull request May 16, 2025
#34058)

Problem:  search_stat not reset when pattern differs in case
          (tahzibijafar)
Solution: use STRNCMP instead of MB_STRNICMP macro

There was a long standing todo comment, that using MB_STRNICMP is wrong.
So let's change it to STRNCMP() instead. Even if it not handle
multi-byte characters correctly, then Vim will rather recompute the
search stat, instead of re-using the old (and possibly wrong) value.

fixes: vim/vim#17312
closes: vim/vim#17314

vim/vim@670d0c1

Co-authored-by: Christian Brabandt <cb@256bit.org>
(cherry picked from commit ec5f054)
github-actions bot pushed a commit to neovim/neovim that referenced this pull request May 17, 2025
#34058)

Problem:  search_stat not reset when pattern differs in case
          (tahzibijafar)
Solution: use STRNCMP instead of MB_STRNICMP macro

There was a long standing todo comment, that using MB_STRNICMP is wrong.
So let's change it to STRNCMP() instead. Even if it not handle
multi-byte characters correctly, then Vim will rather recompute the
search stat, instead of re-using the old (and possibly wrong) value.

fixes: vim/vim#17312
closes: vim/vim#17314

vim/vim@670d0c1

Co-authored-by: Christian Brabandt <cb@256bit.org>
(cherry picked from commit ec5f054)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect search result
3 participants