Skip to content

Lexer produces wrong tokens when more input is provided #265

@MalteJanz

Description

@MalteJanz

My logos lexer implementation somehow does not match the TK_NOT token when there is more input (like a whitespace) after it. Instead it matches the TK_WORD token in that case, which should be wrong when it has a lower priority.

Reproducible example with tests:

use logos::Logos;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Logos)]
#[allow(non_camel_case_types)]
pub enum SyntaxKind {
    #[regex(r"[ \t]+", priority = 1)]
    TK_WHITESPACE = 0,
    #[regex(r"[a-zA-Z][a-zA-Z0-9]*", priority = 1)]
    TK_WORD,

    #[token("not", priority = 50)]
    TK_NOT,
    #[token("not in", priority = 60)]
    TK_NOT_IN,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn single_not_works() {
        let mut lexer = SyntaxKind::lexer("not");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
    }

    #[test]
    fn word_then_not_works() {
        let mut lexer = SyntaxKind::lexer("word not");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WORD)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WHITESPACE)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
    }

    #[test]
    fn but_this_does_not_work() {
        let mut lexer = SyntaxKind::lexer("not word");
        // FAILED because
        // Left:  Some(Ok(TK_WORD)
        // Right: Some(Ok(TK_NOT)
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WHITESPACE)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WORD)));
    }

    #[test]
    fn this_is_fine() {
        let mut lexer = SyntaxKind::lexer("not in ");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT_IN)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WHITESPACE)));
    }
}

I know that the situation with TK_NOT and TK_NOT_IN is maybe not ideal (if I remove the latter it works again). But for my parser it would be way better to have these tokens rather than two separate TK_NOT and TK_IN tokens. I would be thankfully for any suggestions that don't require me to remove either of TK_WORD or TK_NOT_IN to make the test but_this_does_not_work run.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinghelp wantedExtra attention is needed

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions