Strange behaviour when matching 'else' / 'else if' #160

@irh

I'm working on a lexer for a language where I'd like to have else and else if lexed as separate tokens, but I'm running into surprising behaviour.

In the following example you can see that else has been lexed as Other:

mod else_if {
    use logos::Logos;

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex(r"[ ]+", logos::skip)]
        #[error]
        Error,
        #[token("else")]
        Else,
        #[token("else if")]
        ElseIf,
        #[regex(r"[a-z]*")]
        Other,
    }

    #[test]
    fn else_x_else_if_y() {
        let mut lexer = Token::lexer("else x else if y");

        // Expected: assert_eq!(lexer.next().unwrap(), Token::Else);
        assert_eq!(lexer.next().unwrap(), Token::Other);

        assert_eq!(lexer.next().unwrap(), Token::Other);
        assert_eq!(lexer.next().unwrap(), Token::ElseIf);
        assert_eq!(lexer.next().unwrap(), Token::Other);
    }
}

Removing the space from else if allows else to be parsed as Else:

mod else_if_2 {
    use logos::Logos;

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex(r"[ ]+", logos::skip)]
        #[error]
        Error,
        #[token("else")]
        Else,
        #[token("elseif")]
        ElseIf,
        #[regex(r"[a-z]*")]
        Other,
    }

    #[test]
    fn else_x_else_if_y() {
        let mut lexer = Token::lexer("else x elseif y");

        assert_eq!(lexer.next().unwrap(), Token::Else);
        assert_eq!(lexer.next().unwrap(), Token::Other);
        assert_eq!(lexer.next().unwrap(), Token::ElseIf);
        assert_eq!(lexer.next().unwrap(), Token::Other);
    }
}

Keeping the space in else if, but shortening the Else pattern to "e", causes Else to be matched unexpectedly:

mod else_if_3 {
    use logos::Logos;

    #[derive(Logos, Debug, PartialEq)]
    enum Token {
        #[regex(r"[ ]+", logos::skip)]
        #[error]
        Error,
        #[token("e")]
        Else,
        #[token("else if")]
        ElseIf,
        #[regex(r"[a-z]*")]
        Other,
    }

    #[test]
    fn else_x_else_if_y() {
        let mut lexer = Token::lexer("else x else if y");

        // Expected: assert_eq!(lexer.next().unwrap(), Token::Other);
        assert_eq!(lexer.next().unwrap(), Token::Else);

        assert_eq!(lexer.next().unwrap(), Token::Other);
        assert_eq!(lexer.next().unwrap(), Token::ElseIf);
        assert_eq!(lexer.next().unwrap(), Token::Other);
    }
}

My understanding of the token disambiguation documentation is that the first example should work as I'd expect, with Else and ElseIf being matched independently, with higher priority than Other. Do I have that wrong? And is the last example exposing a bug?
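To make the expectation concrete, here is a minimal standalone sketch (plain Rust, no logos) of the longest-match-with-priority behaviour I would expect for the first example. The `longest_match` and `lex` helpers are hypothetical names used only for illustration. They are not part of the logos API and are not meant to describe how logos works internally.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Token {
    Else,
    ElseIf,
    Other,
}

/// Returns the token and byte length of the longest match at the start of
/// `input`. When two candidates match the same length, the literal tokens
/// win over `Other` (an assumption about the intended priority).
fn longest_match(input: &str) -> Option<(Token, usize)> {
    let mut best: Option<(Token, usize)> = None;

    // Lowest priority: a run of lowercase ASCII letters.
    // Unlike the `[a-z]*` regex above, an empty run is not treated as a match.
    let word_len = input.bytes().take_while(|b| b.is_ascii_lowercase()).count();
    if word_len > 0 {
        best = Some((Token::Other, word_len));
    }

    // Literal tokens replace the current best when they are at least as long.
    for (literal, token) in [("else", Token::Else), ("else if", Token::ElseIf)] {
        if input.starts_with(literal) && best.map_or(true, |(_, len)| literal.len() >= len) {
            best = Some((token, literal.len()));
        }
    }

    best
}

/// Splits `input` into tokens, skipping runs of spaces between them.
fn lex(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = input;
    loop {
        rest = rest.trim_start_matches(' ');
        if rest.is_empty() {
            break;
        }
        match longest_match(rest) {
            Some((token, len)) => {
                tokens.push(token);
                rest = &rest[len..];
            }
            // Unrecognised input: a real lexer would emit an error token here.
            None => break,
        }
    }
    tokens
}
```

Under these rules, `lex("else x else if y")` yields Else, Other, ElseIf, Other, which is the sequence I expected from the first example, and an identifier such as `elsewhere` still lexes as a single Other.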

Thanks for your time and the great library!

Labels: bug (Something isn't working)
