The raw lexer is a transformation from UTF-8 strings to "raw" lexical classes, along with the length of the token belonging to that class. The lexer and lexical classes are called "raw" because they are not final; to finish lexing Rust source, the raw tokens must be "cooked" into proper tokens.
The most obvious difference is that the raw lexer makes no provision for floating point literals. Instead, the cooking step is responsible for gluing raw tokens together to form them.
Each raw lexical class is defined by a regular expression that matches it. When two classes both match a prefix of a given string, ties are explicitly broken by always preferring one class over the other.
There are three lexical classes (`block_comment`, `raw_string`, and `raw_byte_string`) which are not regular, and thus cannot be specified with just a regular expression.
Each of these classes still has a regular prefix, which is used to decide, between the regular classes and the nonregular classes, which lexical class a string prefix belongs to.
If the regular prefix of a nonregular class is chosen, the nonregular grammar is then used to recognize the entire token.
The regular expressions used in this specification are as defined by the Rust regex crate, which is confined to a fully regular subset of modern extended regex syntax. All regexes are anchored, meaning they will only match a prefix of the input string.
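As a concrete illustration of anchoring, here is a minimal sketch (not part of the specification): `match_whitespace` stands in for the anchored regex `\p{Pattern_White_Space}+`, approximated with `char::is_whitespace` for simplicity.

```rust
// Sketch: an anchored matcher examines only the start of its input and
// returns the matched prefix length, or None if the match fails at offset 0.
// `char::is_whitespace` approximates \p{Pattern_White_Space} here.
fn match_whitespace(s: &str) -> Option<usize> {
    let len: usize = s
        .chars()
        .take_while(|c| c.is_whitespace())
        .map(char::len_utf8)
        .sum();
    if len > 0 { Some(len) } else { None }
}

fn main() {
    assert_eq!(match_whitespace("  \tfn main"), Some(3));
    // Anchored: whitespace later in the string is never found.
    assert_eq!(match_whitespace("fn  main"), None);
}
```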
- `line_comment`: `//[^\n]*`
- `whitespace`: `\p{Pattern_White_Space}+`
- `identifier`: `[_\p{XID_Start}]\p{XID_Continue}*`
- `raw_identifier`: `r#[_\p{XID_Start}]\p{XID_Continue}*`
- `lifetime`: `'[_\p{XID_Start}]\p{XID_Continue}*`
- `binary_integer`: `0b_*[01][_01]*`
- `octal_integer`: `0o_*[0-7][_0-7]*`
- `hexadecimal_integer`: `0x_*[0-9a-fA-F][_0-9a-fA-F]*`
- `decimal_integer`: `[0-9][_0-9]*`
- `character`: `'(?:[^\t\n\r\\']|\\['"]|\\[nrt\\0]|\\x[0-7][0-9a-fA-F]|\\u\{(?:10|[0-9])[0-9a-fA-F]{0,4}})'`
- `byte`: `b'(?:[^\t\n\r\\']|\\['"]|\\[nrt\\0]|\\x[0-9a-fA-F]{2})'`
- `string`: `"(?:[^\r\\"]|\\['"]|\\[nrt\\0]|\\x[0-7][0-9a-fA-F]|\\u\{(?:10|[0-9])[0-9a-fA-F]{0,4}})*"`
- `byte_string`: `b"(?:[^\r\\"]|\\['"]|\\[nrt\\0]|\\x[0-9a-fA-F]{2})*"`
- `exclamation`: `!`
- `pound`: `#`
- `dollar`: `\$`
- `percent`: `%`
- `ampersand`: `&`
- `open_parenthesis`: `\(`
- `close_parenthesis`: `\)`
- `star`: `\*`
- `plus`: `\+`
- `comma`: `,`
- `minus`: `-`
- `dot`: `\.`
- `slash`: `/`
- `colon`: `:`
- `semicolon`: `;`
- `less`: `<`
- `equal`: `=`
- `greater`: `>`
- `question`: `\?`
- `at`: `@`
- `open_bracket`: `\[`
- `close_bracket`: `\]`
- `circumflex`: `\^`
- `open_brace`: `\{`
- `bar`: `\|`
- `close_brace`: `\}`
- `tilde`: `~`
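For illustration, here is a hedged, ASCII-only sketch of the `identifier` class; the real definition is the Unicode regex above, and `match_identifier_ascii` is a name invented for this sketch, not part of the specification.

```rust
// ASCII-only approximation of the identifier class, for illustration.
// The real classes use Unicode XID_Start / XID_Continue (available via the
// unicode-xid crate); std has no built-in XID queries.
fn match_identifier_ascii(s: &str) -> Option<usize> {
    let mut bytes = s.bytes();
    match bytes.next() {
        Some(b'_' | b'a'..=b'z' | b'A'..=b'Z') => {}
        _ => return None,
    }
    Some(1 + bytes.take_while(|b| b.is_ascii_alphanumeric() || *b == b'_').count())
}

fn main() {
    assert_eq!(match_identifier_ascii("foo_bar baz"), Some(7));
    assert_eq!(match_identifier_ascii("_1"), Some(2));
    // A leading digit is not an identifier; it lexes as decimal_integer.
    assert_eq!(match_identifier_ascii("1abc"), None);
}
```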
- `binary_integer` always ties with `decimal_integer`. Prefer `binary_integer`, which is always the longer match.
- `octal_integer` always ties with `decimal_integer`. Prefer `octal_integer`, which is always the longer match.
- `hexadecimal_integer` always ties with `decimal_integer`. Prefer `hexadecimal_integer`, which is always the longer match.
- `character` can tie with `lifetime`. Prefer `character`, which is always the longer match.
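To make the tie-breaking concrete, here is a small sketch with hand-rolled stand-ins for the anchored regexes (not the specification's implementation): on `0b10`, `decimal_integer` matches only the prefix `0`, while `binary_integer` matches all four bytes, so preferring `binary_integer` coincides with taking the longer match.

```rust
// Stand-in for the anchored regex [0-9][_0-9]*.
fn match_decimal_integer(s: &str) -> Option<usize> {
    let bytes = s.as_bytes();
    match bytes.first().copied() {
        Some(b'0'..=b'9') => Some(
            1 + bytes[1..]
                .iter()
                .take_while(|&&b| b.is_ascii_digit() || b == b'_')
                .count(),
        ),
        _ => None,
    }
}

// Stand-in for the anchored regex 0b_*[01][_01]*.
fn match_binary_integer(s: &str) -> Option<usize> {
    let rest = s.strip_prefix("0b")?;
    let underscores = rest.bytes().take_while(|&b| b == b'_').count();
    let digits = rest[underscores..]
        .bytes()
        .take_while(|&b| matches!(b, b'0' | b'1' | b'_'))
        .count();
    if digits == 0 {
        None
    } else {
        Some(2 + underscores + digits)
    }
}

fn main() {
    // Both classes match a prefix of "0b10"; the tie goes to binary_integer.
    assert_eq!(match_decimal_integer("0b10"), Some(1));
    assert_eq!(match_binary_integer("0b10"), Some(4));
    // "0" alone only matches decimal_integer.
    assert_eq!(match_binary_integer("0"), None);
}
```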
- `block_comment`: `/\*`
- `raw_string`: `r[#"]`
- `raw_byte_string`: `br[#"]`
- `block_comment` always ties with `slash`. Prefer `block_comment`, which is always the longer match.
- `raw_string` always ties with `identifier`. Prefer `raw_string`, which is always the longer match.
- `raw_byte_string` always ties with `identifier`. Prefer `raw_byte_string`, which is always the longer match.
The `block_comment`, `raw_string`, and `raw_byte_string` tokens are not regular, and so cannot be fully specified by a regular expression.
Instead, the regular expression recognizes a prefix of the token, which is used to decide when to match a nonregular token instead of a regular one.
After the regular prefix is recognized, a further recognizer recognizes the full token. Like the regular expressions, these grammars should be considered anchored, matching a prefix. The recognizers are specified in PCRE, in EBNF, and as Rust code returning the prefix length.
If a full recognizer fails after the regular prefix has been recognized, this is a lexing error.
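The prefix dispatch can be sketched as follows (a simplified illustration, not the specification's implementation; `has_nonregular_prefix` is an invented name). The helper performs only the regular prefix test; once it succeeds, the appropriate full recognizer must consume the whole token.

```rust
// Regular prefix tests for the three nonregular classes:
//   block_comment    /\*
//   raw_string       r[#"]
//   raw_byte_string  br[#"]
fn has_nonregular_prefix(s: &str) -> bool {
    let b = s.as_bytes();
    b.starts_with(b"/*")
        || (b.starts_with(b"r") && matches!(b.get(1).copied(), Some(b'#' | b'"')))
        || (b.starts_with(b"br") && matches!(b.get(2).copied(), Some(b'#' | b'"')))
}

fn main() {
    assert!(has_nonregular_prefix("/* comment */"));
    assert!(has_nonregular_prefix("r#\"raw\"#"));
    assert!(has_nonregular_prefix("br\"bytes\""));
    // These fall through to the regular classes (slash, identifier):
    assert!(!has_nonregular_prefix("// line comment"));
    assert!(!has_nonregular_prefix("raw"));
}
```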
`block_comment` is a context-free grammar. Informally: matched pairs of opening `/*` and closing `*/`.
```
(?xs)
(?(DEFINE)(?P<inner> (?:(?! /\* | \*/ ) . )*))
/\*
(?P>inner)
(?: (?R) (?P>inner) )*
\*/
```
```
block comment = open
              , { ( {"/"} | {"*"} ) , char
                | ( {"/"} | {"*"} ) , block comment
                }
              , {"*"}
              , close
              ;
open  = "/*" ;
close = "*/" ;
char  = ? any character ? - "/" - "*" ;
```
```rust
pub fn parse_block_comment(s: &str) -> usize {
    let mut chars = s.chars().peekable();
    assert_eq!(chars.next(), Some('/'));
    assert_eq!(chars.next(), Some('*'));
    let mut depth: usize = 1;
    let mut len: usize = 2;
    while depth > 0 {
        match chars.next() {
            Some('/') if matches!(chars.peek(), Some('*')) => {
                chars.next();
                depth += 1;
                len += 2;
            }
            Some('*') if matches!(chars.peek(), Some('/')) => {
                chars.next();
                depth -= 1;
                len += 2;
            }
            Some(c) => len += c.len_utf8(),
            None => panic!("exhausted source in block comment"),
        }
    }
    len
}
```
`raw_string` is a context-sensitive grammar.
Informally: opened by `r#*"`, closed by the first instance of `"#*` with the same number of `#`.
Note: an implementation is allowed to put a limit on the number of `#` allowed in a raw string (thus making the grammar regular), but at least 2^16 hashes must be supported. `rustc` uses a limit of 2^16 at the time of writing this specification.
The number of states required to create a DFA matching a raw string with up to n hashes can be approximated by the function 0.5n^2 + 2.5n.
For scale, that's on the order of 2^15 states to support 2^8 hashes, and 2^31 states to support 2^16 hashes.
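The arithmetic in that note can be checked directly; the formula below is the approximation from this document, while `approx_dfa_states` is a name invented for the sketch.

```rust
// Evaluating the approximation 0.5·n^2 + 2.5·n at the two scales mentioned.
fn approx_dfa_states(n: u64) -> u64 {
    n * n / 2 + 5 * n / 2
}

fn main() {
    // 2^8 hashes: on the order of 2^15 states.
    assert_eq!(approx_dfa_states(1u64 << 8), 33_408);
    assert!(approx_dfa_states(1u64 << 8) > (1u64 << 15));
    // 2^16 hashes: on the order of 2^31 states.
    assert!(approx_dfa_states(1u64 << 16) > (1u64 << 31));
}
```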
```
(?xs)
r(?P<hashes>\#*)"
[^\r]*?
"(?P=hashes)
```
Note that because `raw_string` is context-sensitive, this EBNF has the additional restriction that only the shortest possible match is produced in ambiguous cases.
```
raw string = "r", inner ;
inner = "#" , inner , "#"
      | "\"" , { char } , "\""
      ;
char = ? any character ? - "\r" ;
```
```rust
fn parse_raw_string(s: &str) -> usize {
    let mut chars = s.chars().peekable();
    assert_eq!(chars.next(), Some('r'));
    let mut hashes: usize = 0;
    let mut len: usize = 1;
    loop {
        match chars.next() {
            Some('#') => {
                len += 1;
                hashes += 1;
            }
            Some('"') => {
                len += 1;
                break;
            }
            Some(c) => panic!("invalid char {:?} in raw string opening fence", c),
            None => panic!("exhausted source in raw string"),
        }
    }
    loop {
        match chars.next() {
            Some('"') => {
                len += 1;
                let mut hashes_seen: usize = 0;
                // Check before scanning so that zero-hash raw strings (r"...")
                // terminate at the first quote.
                loop {
                    if hashes_seen == hashes {
                        return len;
                    }
                    match chars.peek() {
                        Some('#') => {
                            chars.next();
                            len += 1;
                            hashes_seen += 1;
                        }
                        _ => break,
                    }
                }
            }
            Some('\r') => panic!("bare CR not allowed in raw string"),
            Some(c) => len += c.len_utf8(),
            None => panic!("exhausted source in raw string"),
        }
    }
}
```
`raw_byte_string` uses the same grammar as `raw_string`, just prefixed with a `b`.
These are notes about cooking taken while writing this document that should not be in the final document.

- Floating point
  - oh boy floating point
  - https://github.com/rust-lang/rust/blob/master/src/librustc_lexer/src/lib.rs#L502
  - `binary_integer dot (?! dot | identifier)` is an error
  - `octal_integer dot (?! dot | identifier)` is an error
  - `hexadecimal_integer dot (?! dot | identifier)` is an error
  - To be a float, `decimal_integer` may be followed by:
    - `dot (?! dot | identifier)`
    - `dot decimal_integer (?! identifier)`
    - `dot decimal_integer identifier`
    - `dot binary_integer`
    - `dot octal_integer`
    - `dot hexadecimal_integer`
  - oh boy `e`/`E` is annoying - why is floating point so hard
- I would love to specify the grammar directly in terms of the micro lexer rather than deal with lookaround cooking floating point like this
  - But suffixes and `lifetime lifetime` should be handled by some stage of the lexer
  - Because the grammar has points where it accepts raw token streams
    - Namely, macros
- Suffixed literals
  - `binary_integer identifier` (error if suffix starts with `e` or `E`)
  - `octal_integer identifier` (error if suffix starts with `e` or `E`)
  - `hexadecimal_integer identifier` (error if suffix starts with `e` or `E`)
  - `decimal_integer identifier` (unless suffix starts with `e` or `E`)
  - `character identifier`
  - `byte identifier`
  - `string identifier`
  - `byte_string identifier`
  - `raw_string identifier`
  - `raw_byte_string identifier`
- Late errors
  - `lifetime lifetime` cooks to an error
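The simplest gluing shapes from the notes above can be sketched as follows. This is a hedged illustration, not rustc's implementation: the `Raw` enum and `glue_float` are invented for this sketch, and exponents and suffixed floats are deliberately out of scope.

```rust
// Hypothetical raw token classes, reduced to the three that matter here.
#[derive(Debug, Clone, PartialEq)]
enum Raw {
    DecimalInteger,
    Dot,
    Identifier,
}

// How many leading raw tokens glue into one float token, if any.
fn glue_float(toks: &[Raw]) -> Option<usize> {
    // `1.5` shape: integer, dot, integer, with no suffix identifier after.
    if let [Raw::DecimalInteger, Raw::Dot, Raw::DecimalInteger, rest @ ..] = toks {
        return (rest.first() != Some(&Raw::Identifier)).then_some(3);
    }
    // `1.` shape: integer, dot, with neither `..` (range) nor a field name after.
    if let [Raw::DecimalInteger, Raw::Dot, rest @ ..] = toks {
        return (!matches!(rest.first(), Some(Raw::Dot | Raw::Identifier))).then_some(2);
    }
    None
}

fn main() {
    use Raw::*;
    assert_eq!(glue_float(&[DecimalInteger, Dot, DecimalInteger]), Some(3)); // 1.5
    assert_eq!(glue_float(&[DecimalInteger, Dot]), Some(2)); // 1.
    assert_eq!(glue_float(&[DecimalInteger, Dot, Identifier]), None); // 1.foo (field access)
    assert_eq!(glue_float(&[DecimalInteger, Dot, Dot, DecimalInteger]), None); // 1..2 (range)
}
```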