Error with overlapping token definitions #420

@ccleve

I'm getting a strange error that seems to occur when one regex can match a prefix of another, though I'm not sure that is the cause. Here's a simplified case:


use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r".|[\r\n]")] // skip everything not recognized
pub enum LogosToken {
    // any letter except capital Z
    #[regex(r"[a-zA-Y]+", priority = 3)]
    WordExceptZ,

    // any number
    #[regex(r"[0-9]+", priority = 3)]
    Number,

    /*
    This expression is:
    (letter or number)* [Z] (letter or number)*
    In other words, a token with any number of letters or numbers,
    including at least one capital Z.
     */
    #[regex(r"[a-zA-Z0-9]*[Z][a-zA-Z0-9]*", priority = 3)]
    TermWithZ,
}

#[pg_extern]
fn test_logos() {
    let mut lex = LogosToken::lexer("hello 42world fooZfoo");
    while let Some(result) = lex.next() {
        let slice = lex.slice();
        println!("{:?} {:?}", slice, result);
    }
}

This generates:

"hello" Ok(WordExceptZ)
"42world" Err(())
"fooZfoo" Ok(TermWithZ)

If I replace the regex on TermWithZ with #[regex(r"Z", priority = 3)], I get:

"hello" Ok(WordExceptZ)
"42" Ok(Number)
"world" Ok(WordExceptZ)
"foo" Ok(WordExceptZ)
"Z" Ok(TermWithZ)
"foo" Ok(WordExceptZ)

With this change, "42world" is recognized correctly as a number followed by a word.

What I don't understand is, why does the first TermWithZ regex mess up the recognition of "42world"? It doesn't contain a Z, so TermWithZ should ignore it completely and let the first two variants do their job.
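
To pin down the expected behavior, here is a minimal hand-written sketch of longest-match tokenization for the same three rules, using only the standard library. It is not how Logos works internally, and the names `Kind`, `run_len`, `longest_match`, and `tokenize` are hypothetical helpers introduced for illustration. The assumption it encodes is that a rule which cannot complete (TermWithZ on a run with no `Z`) contributes no match, so the lexer falls back to the longest rule that did match.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    WordExceptZ,
    Number,
    TermWithZ,
}

/// Length of the longest prefix of `s` whose bytes all satisfy `pred`.
fn run_len(s: &[u8], pred: impl Fn(u8) -> bool) -> usize {
    s.iter().take_while(|&&b| pred(b)).count()
}

/// Longest match among the three rules at the start of `s`, if any.
fn longest_match(s: &[u8]) -> Option<(Kind, usize)> {
    // [a-zA-Y]+
    let word = run_len(s, |b| b.is_ascii_alphabetic() && b != b'Z');
    // [0-9]+
    let number = run_len(s, |b| b.is_ascii_digit());
    // [a-zA-Z0-9]*[Z][a-zA-Z0-9]*: the whole alphanumeric run,
    // but only when that run contains at least one 'Z'.
    let alnum = run_len(s, |b| b.is_ascii_alphanumeric());
    let term = if s[..alnum].contains(&b'Z') { alnum } else { 0 };

    let candidates = [
        (Kind::TermWithZ, term),
        (Kind::WordExceptZ, word),
        (Kind::Number, number),
    ];
    candidates
        .iter()
        .copied()
        .filter(|&(_, len)| len > 0) // a zero-length match is no match
        .max_by_key(|&(_, len)| len) // prefer the longest match
}

/// Splits `input` into tokens, dropping unrecognized characters.
fn tokenize(input: &str) -> Vec<(Kind, &str)> {
    let bytes = input.as_bytes();
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < bytes.len() {
        match longest_match(&bytes[pos..]) {
            Some((kind, len)) => {
                out.push((kind, &input[pos..pos + len]));
                pos += len;
            }
            // Stand-in for the skip rule: drop one unrecognized character.
            None => pos += input[pos..].chars().next().map_or(1, char::len_utf8),
        }
    }
    out
}
```

Calling `tokenize("hello 42world fooZfoo")` on this sketch yields `hello` as WordExceptZ, `42` as Number, `world` as WordExceptZ, and `fooZfoo` as TermWithZ, which is the output I would expect from the lexer above.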
