Raw lexer

CAD97, created April 6, 2020

The raw lexer is a transformation from UTF-8 strings to "raw" lexical classes, each paired with the length of the token that has that class. The lexer and lexical classes are called "raw" because they are not final; in order to finish lexing Rust source, the tokens must be "cooked" into proper tokens.
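
As a non-normative illustration, the raw lexer's output can be pictured as a (class, length) pair per token; the type and variant names below are hypothetical and not part of the specification.

// Hypothetical shape of the raw lexer's output, for illustration only.
pub struct RawToken {
    pub class: RawClass,
    /// Length of the token in bytes.
    pub len: usize,
}

pub enum RawClass {
    LineComment,
    Whitespace,
    Identifier,
    DecimalInteger,
    // ... one variant per raw lexical class defined below
}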

The most obvious difference is that the raw lexer makes no provision for floating point literals. Instead, the cooking step is responsible for gluing together raw tokens to make them.

Each raw lexical class is defined by a regular expression that matches that lexical class. In the case that two classes both match a prefix of a given string, ties are explicitly broken by always preferring one class over the other.

Three lexical classes are not regular and thus cannot be specified with just a regular expression: block_comment, raw_string, and raw_byte_string. Each of these classes still has a regular prefix, which competes with the regular classes when determining the lexical class of a string prefix. If the regular prefix of a nonregular class is chosen, the nonregular grammar is then used to recognize the entire token.

The regular expression syntax used in this specification is that of the Rust regex crate, which is confined to a fully regular subset of modern extended regex syntax. All regexes are anchored, meaning they will only match a prefix of the input string.
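
As a sketch of how these anchored regexes might be driven (assuming the regex crate and a hypothetical longest-match driver; this is not part of the specification):

// [dependencies] regex = "1"
use regex::Regex;

fn main() {
    // Anchored forms of two of the regular classes below: the leading `^`
    // restricts each regex to matching a prefix of the input.
    let line_comment = Regex::new(r"^//[^\n]*").unwrap();
    let whitespace = Regex::new(r"^\p{Pattern_White_Space}+").unwrap();

    let input = "// a comment\nfn main() {}";

    // A driver would try every class and keep the longest match, falling
    // back to the explicit tie-breaks when two classes match the same length.
    let len = [&line_comment, &whitespace]
        .iter()
        .filter_map(|re| re.find(input))
        .map(|m| m.end())
        .max()
        .unwrap_or(0);

    assert_eq!(len, "// a comment".len()); // line_comment wins here
}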

Regular tokens

line_comment
//[^\n]*
whitespace
\p{Pattern_White_Space}+
identifier
[_\p{XID_Start}]\p{XID_Continue}*
raw_identifier
r#[_\p{XID_Start}]\p{XID_Continue}*
lifetime
'\p{XID_Start}\p{XID_Continue}*
binary_integer
0b_*[01][_01]*
octal_integer
0o_*[0-7][_0-7]*
hexadecimal_integer
0x_*[0-9a-fA-F][_0-9a-fA-F]*
decimal_integer
[0-9][_0-9]*
character
'(?:[^\t\n\r\\']|\\['"]|\\[nrt\\0]|\\x[0-7][0-9a-fA-F]|\\u\{(?:10|[0-9])[0-9a-fA-F]{0,4}})'
byte
'(?:[^\t\n\r\\']|\\['"]|\\[nrt\\0]|\\x[0-9a-fA-F]{2})'
string
"(?:[^\r\\']|\\['"]|\\[nrt\\0]|\\x[0-7][0-9a-fA-F]|\\u\{(?:10|[0-9])[0-9a-fA-F]{0,4}})*"
byte_string
b"(?:[^\r\\']|\\['"]|\\[nrt\\0]|\\x[0-9a-fA-F]{2})"
exclamation
!
pound
#
dollar
\$
percent
%
ampersand
&
open_parenthesis
\(
close_parenthesis
\)
star
\*
plus
\+
comma
,
minus
-
dot
\.
slash
/
colon
:
semicolon
;
less
<
equal
=
greater
>
question
\?
at
@
open_bracket
\[
close_bracket
\]
circumflex
\^
open_brace
\{
bar
\|
close_brace
\}
tilde
~

Ties

binary_integer always ties with decimal_integer.
Prefer binary_integer, which is always the longer match.
octal_integer always ties with decimal_integer.
Prefer octal_integer, which is always the longer match.
hexadecimal_integer always ties with decimal_integer.
Prefer hexadecimal_integer, which is always the longer match.
character can tie with lifetime.
Prefer character, which is always the longer match.
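
To make the integer ties concrete, here is a small check (a sketch using the regex crate; not normative) showing that for the input 0b1010u8 both binary_integer and decimal_integer match a prefix, and the preferred binary_integer is also the longer match:

use regex::Regex;

fn main() {
    let binary_integer = Regex::new(r"^0b_*[01][_01]*").unwrap();
    let decimal_integer = Regex::new(r"^[0-9][_0-9]*").unwrap();

    let input = "0b1010u8";
    let binary_len = binary_integer.find(input).map_or(0, |m| m.end());
    let decimal_len = decimal_integer.find(input).map_or(0, |m| m.end());

    assert_eq!(binary_len, 6);  // "0b1010"
    assert_eq!(decimal_len, 1); // just the leading "0"
    assert!(binary_len > decimal_len); // prefer binary_integer
}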

Nonregular tokens' regular prefix

block_comment
/\*
raw_string
r[#"]
raw_byte_string
br[#"]

Ties

block_comment always ties with slash.
Prefer block_comment, which is always the longer match.
raw_string always ties with identifier.
Prefer raw_string, which is always the longer match.
raw_byte_string always ties with identifier.
Prefer raw_byte_string, which is always the longer match.

Nonregular tokens

The block_comment, raw_string, and raw_byte_string tokens are not regular, so cannot be fully specified by a regular expression. Instead, the regular expression recognizes a prefix of the token, which is used to decide when to match a nonregular token instead of a regular token.

After the regular prefix is recognized, a further recognizer recognizes the full token. Like the regular expressions, these grammars should be considered anchored, matching a prefix. These recognizers are specified in PCRE, EBNF, and Rust code returning the prefix length.

If a full recognizer fails after the regular prefix is recognized, this is an error.
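
A sketch of that dispatch (assuming the recognizer functions given in the following sections, plus a hypothetical parse_raw_byte_string; the driver structure here is illustrative, not normative):

use regex::Regex;

// If a nonregular class's regular prefix matches, hand the input to that
// class's full recognizer; the recognizers below panic ("is an error") if
// the rest of the token is malformed.
fn recognize_nonregular(input: &str) -> Option<usize> {
    let block_comment_prefix = Regex::new(r"^/\*").unwrap();
    let raw_string_prefix = Regex::new(r#"^r[#"]"#).unwrap();
    let raw_byte_string_prefix = Regex::new(r#"^br[#"]"#).unwrap();

    if block_comment_prefix.is_match(input) {
        Some(parse_block_comment(input))
    } else if raw_byte_string_prefix.is_match(input) {
        Some(parse_raw_byte_string(input))
    } else if raw_string_prefix.is_match(input) {
        Some(parse_raw_string(input))
    } else {
        None
    }
}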

block_comment

block_comment is defined by a context-free grammar.

Informally: matched pairs of opening /* and closing */.

PCRE

(?xs)
(?(DEFINE)(?P<inner> (?:(?! /\* | \*/ ) . )*))
    /\*
    (?P>inner)
    (?: (?R) (?P>inner) )*
    \*/

EBNF

block comment = open
              , { ({"/"} | {"*"}), char
                | ({"/"} | {"*"}), block comment
                }
              , close
              ;

open  = "/*" ;
close = "*/" ;
char  = ? any character ? - "/" - "*" ;

Rust

pub fn parse_block_comment(s: &str) -> usize {
    let mut chars = s.chars().peekable();
    assert_eq!(chars.next(), Some('/'));
    assert_eq!(chars.next(), Some('*'));

    // Number of currently open `/*` pairs, and the length matched so far in bytes.
    let mut depth: usize = 1;
    let mut len: usize = 2;

    while depth > 0 {
        match chars.next() {
            Some('/') if matches!(chars.peek(), Some('*')) => {
                chars.next();
                depth += 1;
                len += 2;
            }
            Some('*') if matches!(chars.peek(), Some('/')) => {
                chars.next();
                depth -= 1;
                len += 2;
            }
            Some(c) => len += c.len_utf8(),
            None => panic!("exhausted source in block comment"),
        }
    }

    len
}
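
For example (not normative), nested pairs are balanced and trailing input is left for the next token:

fn main() {
    // "/* a /* b */ c */" is 17 bytes; " rest" is not consumed.
    assert_eq!(parse_block_comment("/* a /* b */ c */ rest"), 17);
}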

raw_string

raw_string is defined by a context-sensitive grammar.

Informally: opened by r#*" (that is, an r, any number of #, then "); closed by the first instance of "#* with the same number of #.

Note: an implementation is allowed to put a limit on the number of # allowed in a raw string (thus making the grammar regular), but at least 2^16 hashes must be supported. rustc uses a limit of 2^16 at the time of writing this specification. The number of states required to create a DFA matching a raw string with up to n hashes can be approximated by the function 0.5n^2 + 2.5n. For scale, that's on the order of 2^15 states to support 2^8 hashes, and 2^31 states to support 2^16 hashes.

PCRE

(?xs)
r(?P<hashes>\#*)"
[^\r]*?
"(?P=hashes)

EBNF

Note that because raw_string is context-sensitive, this EBNF has the additional restriction that only the shortest possible match is produced in ambiguous cases.

raw string = "r", inner ;

inner = "#" , inner , "#"
      | "\"", {char}, "\""
      ;
char  = ? any character ? - "\r" ;

Rust

fn parse_raw_string(s: &str) -> usize {
    let mut chars = s.chars().peekable();
    assert_eq!(chars.next(), Some('r'));

    let mut hashes: usize = 0;
    let mut len: usize = 1;

    loop {
        match chars.next() {
            Some('#') => {
                len += 1;
                hashes += 1;
            }
            Some('"') => {
                len += 1;
                break;
            }
            Some(c) => panic!("invalid char {:?} in raw string opening fence", c),
            None => panic!("exhausted source in raw string"),
        }
    }

    loop {
        match chars.next() {
            Some('"') => {
                len += 1;
                let mut hashes_seen: usize = 0;
                loop {
                    match chars.peek() {
                        Some('#') => {
                            chars.next();
                            len += 1;
                            hashes_seen += 1;
                            if hashes_seen == hashes {
                                return len;
                            }
                        }
                        _ => break,
                    }
                }
            }
            Some('\r') => panic!("bare CR not allowed in raw string"),
            Some(c) => len += c.len_utf8(),
            None => panic!("exhausted source in raw string"),
        }
    }
}
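
For example (not normative), with the zero-hash case handled as above:

fn main() {
    assert_eq!(parse_raw_string(r#"r"hi" tail"#), 5);     // r"hi"
    assert_eq!(parse_raw_string(r##"r#"hi"# tail"##), 7); // r#"hi"#
}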

raw_byte_string

Uses the same grammar as raw_string, just prefixed with a b.
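
A minimal sketch of such a recognizer, assuming it only needs to consume the leading b and then delegate to parse_raw_string:

fn parse_raw_byte_string(s: &str) -> usize {
    assert!(s.starts_with('b'));
    // The remainder is recognized exactly like a raw string.
    1 + parse_raw_string(&s[1..])
}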

Cooking notes

These are notes about cooking taken while writing this document that should not be in the final document.

  • Floating point
    • oh boy floating point
    • https://github.com/rust-lang/rust/blob/master/src/librustc_lexer/src/lib.rs#L502
    • binary_integer dot (?! dot | identifier) is an error
    • octal_integer dot (?! dot | identifier) is an error
    • hexadecimal_integer dot (?! dot | identifier) is an error
    • To be a float, decimal_integer may be followed by:
      • dot (?! dot | identifier)
      • dot decimal_integer (?! identifier)
      • dot decimal_integer identifier
      • dot binary_integer
      • dot octal_integer
      • dot hexadecimal_integer
      • oh boy e/E is annoying
      • why is floating point so hard
      • I would love to specify the grammar directly in terms of the micro lexer rather than deal with lookaround cooking floating point like this
      • But suffixes and lifetime lifetime should be handled by some stage of the lexer
      • Because the grammar has points where it accepts raw token streams
      • Namely, macros
  • Suffixed literals
    • binary_integer identifier (error if suffix starts with e or E)
    • octal_integer identifier (error if suffix starts with e or E)
    • hexadecimal_integer identifier (error if suffix starts with e or E)
    • decimal_integer identifier (unless suffix starts with e or E)
    • character identifier
    • byte identifier
    • string identifier
    • byte_string identifier
    • raw_string identifier
    • raw_byte_string identifier
  • Late errors
    • lifetime lifetime cooks to an error

Centril commented Apr 7, 2020

binary_integer
0b_*[01][_01]*

This regex seems like it's just 0b[_01]*?

character
'(?:[^\t\n\r\\']|\\['"]|\\[nrt\\0]|\\x[0-7][0-9a-fA-F]|\\u\{(?:10|[0-9])[0-9a-fA-F]{0,4}})'

There's a lot of duplication in these regexes -- I wonder if readability could be improved by extracting some sub-regexes into named ones first.


CAD97 commented Apr 7, 2020

The integer regexes are prefix _* [[:digit:]] [_[:digit:]]* because the integer literal must have at least one digit.

Yes, the character/string regex could be deduplicated using subpatterns, but I've specifically avoided subpatterns because the regex crate doesn't support subpatterns, and actually using subpatterns requires the nonregular "recurse into subpattern" ((?P>subpattern)) syntax.

Also, just to note minor issues found so far:

  • lifetime doesn't allow starting with _
  • byte strings are only supposed to allow ASCII (chars <= \x7f) internally


Centril commented Apr 7, 2020

Oh of course. I thought I missed something in the integer regexes.

Regarding subregexes, I was thinking that you could just merge them in during testing, but otherwise use subregexes for presentation?
