The raw lexer is a transformation from UTF-8 strings to "raw" lexical classes, along with the length of the token belonging to that class. The lexer and lexical classes are called "raw" because they are not final; to finish lexing Rust source, the raw tokens must be "cooked" into proper tokens.
The most obvious difference is that the raw lexer makes no provision for floating point literals. Instead, the cooking step is responsible for gluing raw tokens together to form them.
Each raw lexical class is defined by a regular expression that matches it. When two classes both match a prefix of a given string, ties are explicitly broken by always preferring one class over the other.
There are three lexical classes (`block_comment`, `raw_string`, and `raw_byte_string`) which are not regular, and thus cannot be specified with just a regular expression.
Each of these classes still has a regular prefix, which is used to decide, between the regular classes and the nonregular classes, which lexical class a string prefix belongs to.
If the regular prefix of a nonregular class is chosen, the nonregular grammar is then used to recognize the entire token.
The regular expressions used in this specification are as defined by the Rust regex crate, which is confined to a fully regular subset of modern extended regex syntax. All regexes are anchored, meaning they will only match a prefix of the input string.
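As a concrete illustration of anchoring, here is a minimal sketch (not part of the specification): `match_whitespace` stands in for the anchored regex `\p{Pattern_White_Space}+`, approximated with `char::is_whitespace` for simplicity.

```rust
// Sketch: an anchored matcher examines only the start of its input and
// returns the matched prefix length, or None if the match fails at offset 0.
// `char::is_whitespace` approximates \p{Pattern_White_Space} here.
fn match_whitespace(s: &str) -> Option<usize> {
    let len: usize = s
        .chars()
        .take_while(|c| c.is_whitespace())
        .map(char::len_utf8)
        .sum();
    if len > 0 { Some(len) } else { None }
}

fn main() {
    assert_eq!(match_whitespace("  \tfn main"), Some(3));
    // Anchored: whitespace later in the string is never found.
    assert_eq!(match_whitespace("fn  main"), None);
}
```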
- `line_comment`: `//[^\n]*`
- `whitespace`: `\p{Pattern_White_Space}+`
- `identifier`: `[_\p{XID_Start}]\p{XID_Continue}*`
- `raw_identifier`: `r#[_\p{XID_Start}]\p{XID_Continue}*`
- `lifetime`: `'[_\p{XID_Start}]\p{XID_Continue}*`
- `binary_integer`: `0b_*[01][_01]*`
- `octal_integer`: `0o_*[0-7][_0-7]*`
- `hexadecimal_integer`: `0x_*[0-9a-fA-F][_0-9a-fA-F]*`
- `decimal_integer`: `[0-9][_0-9]*`
- `character`: `'(?:[^\t\n\r\\']|\\['"]|\\[nrt\\0]|\\x[0-7][0-9a-fA-F]|\\u\{(?:10|[0-9])[0-9a-fA-F]{0,4}})'`
- `byte`: `b'(?:[^\t\n\r\\']|\\['"]|\\[nrt\\0]|\\x[0-9a-fA-F]{2})'`
- `string`: `"(?:[^\r\\"]|\\['"]|\\[nrt\\0]|\\x[0-7][0-9a-fA-F]|\\u\{(?:10|[0-9])[0-9a-fA-F]{0,4}})*"`
- `byte_string`: `b"(?:[^\r\\"]|\\['"]|\\[nrt\\0]|\\x[0-9a-fA-F]{2})*"`
- `exclamation`: `!`
- `pound`: `#`
- `dollar`: `\$`
- `percent`: `%`
- `ampersand`: `&`
- `open_parenthesis`: `\(`
- `close_parenthesis`: `\)`
- `star`: `\*`
- `plus`: `\+`
- `comma`: `,`
- `minus`: `-`
- `dot`: `\.`
- `slash`: `/`
- `colon`: `:`
- `semicolon`: `;`
- `less`: `<`
- `equal`: `=`
- `greater`: `>`
- `question`: `\?`
- `at`: `@`
- `open_bracket`: `\[`
- `close_bracket`: `\]`
- `circumflex`: `\^`
- `open_brace`: `\{`
- `bar`: `\|`
- `close_brace`: `\}`
- `tilde`: `~`
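For illustration, here is a hedged, ASCII-only sketch of the `identifier` class; the real definition is the Unicode regex above, and `match_identifier_ascii` is a name invented for this sketch, not part of the specification.

```rust
// ASCII-only approximation of the identifier class, for illustration.
// The real classes use Unicode XID_Start / XID_Continue (available via the
// unicode-xid crate); std has no built-in XID queries.
fn match_identifier_ascii(s: &str) -> Option<usize> {
    let mut bytes = s.bytes();
    match bytes.next() {
        Some(b'_' | b'a'..=b'z' | b'A'..=b'Z') => {}
        _ => return None,
    }
    Some(1 + bytes.take_while(|b| b.is_ascii_alphanumeric() || *b == b'_').count())
}

fn main() {
    assert_eq!(match_identifier_ascii("foo_bar baz"), Some(7));
    assert_eq!(match_identifier_ascii("_1"), Some(2));
    // A leading digit is not an identifier; it lexes as decimal_integer.
    assert_eq!(match_identifier_ascii("1abc"), None);
}
```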
- `binary_integer` always ties with `decimal_integer`. Prefer `binary_integer`, which is always the longer match.
- `octal_integer` always ties with `decimal_integer`. Prefer `octal_integer`, which is always the longer match.
- `hexadecimal_integer` always ties with `decimal_integer`. Prefer `hexadecimal_integer`, which is always the longer match.
- `character` can tie with `lifetime`. Prefer `character`, which is always the longer match.
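To make the tie-breaking concrete, here is a small sketch with hand-rolled stand-ins for the anchored regexes (not the specification's implementation): on `0b10`, `decimal_integer` matches only the prefix `0`, while `binary_integer` matches all four bytes, so preferring `binary_integer` coincides with taking the longer match.

```rust
// Stand-in for the anchored regex [0-9][_0-9]*.
fn match_decimal_integer(s: &str) -> Option<usize> {
    let bytes = s.as_bytes();
    match bytes.first().copied() {
        Some(b'0'..=b'9') => Some(
            1 + bytes[1..]
                .iter()
                .take_while(|&&b| b.is_ascii_digit() || b == b'_')
                .count(),
        ),
        _ => None,
    }
}

// Stand-in for the anchored regex 0b_*[01][_01]*.
fn match_binary_integer(s: &str) -> Option<usize> {
    let rest = s.strip_prefix("0b")?;
    let underscores = rest.bytes().take_while(|&b| b == b'_').count();
    let digits = rest[underscores..]
        .bytes()
        .take_while(|&b| matches!(b, b'0' | b'1' | b'_'))
        .count();
    if digits == 0 {
        None
    } else {
        Some(2 + underscores + digits)
    }
}

fn main() {
    // Both classes match a prefix of "0b10"; the tie goes to binary_integer.
    assert_eq!(match_decimal_integer("0b10"), Some(1));
    assert_eq!(match_binary_integer("0b10"), Some(4));
    // "0" alone only matches decimal_integer.
    assert_eq!(match_binary_integer("0"), None);
}
```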
- `block_comment`: `/\*`
- `raw_string`: `r[#"]`
- `raw_byte_string`: `br[#"]`
- `block_comment` always ties with `slash`. Prefer `block_comment`, which is always the longer match.
- `raw_string` always ties with `identifier`. Prefer `raw_string`, which is always the longer match.
- `raw_byte_string` always ties with `identifier`. Prefer `raw_byte_string`, which is always the longer match.
The `block_comment`, `raw_string`, and `raw_byte_string` tokens are not regular, and so cannot be fully specified by a regular expression.
Instead, the regular expression recognizes a prefix of the token, which is used to decide when to match a nonregular token instead of a regular one.
After the regular prefix is recognized, a further recognizer recognizes the full token. Like the regular expressions, these grammars should be considered anchored, matching a prefix. The recognizers are specified in PCRE, in EBNF, and as Rust code returning the prefix length.
If a full recognizer fails after the regular prefix has been recognized, this is a lexing error.
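The prefix dispatch can be sketched as follows (a simplified illustration, not the specification's implementation; `has_nonregular_prefix` is an invented name). The helper performs only the regular prefix test; once it succeeds, the appropriate full recognizer must consume the whole token.

```rust
// Regular prefix tests for the three nonregular classes:
//   block_comment    /\*
//   raw_string       r[#"]
//   raw_byte_string  br[#"]
fn has_nonregular_prefix(s: &str) -> bool {
    let b = s.as_bytes();
    b.starts_with(b"/*")
        || (b.starts_with(b"r") && matches!(b.get(1).copied(), Some(b'#' | b'"')))
        || (b.starts_with(b"br") && matches!(b.get(2).copied(), Some(b'#' | b'"')))
}

fn main() {
    assert!(has_nonregular_prefix("/* comment */"));
    assert!(has_nonregular_prefix("r#\"raw\"#"));
    assert!(has_nonregular_prefix("br\"bytes\""));
    // These fall through to the regular classes (slash, identifier):
    assert!(!has_nonregular_prefix("// line comment"));
    assert!(!has_nonregular_prefix("raw"));
}
```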
`block_comment` is a context-free grammar. Informally: matched pairs of opening `/*` and closing `*/`.
```
(?xs)
(?(DEFINE)(?P<inner> (?:(?! /\* | \*/ ) . )*))
/\*
(?P>inner)
(?: (?R) (?P>inner) )*
\*/
```
```
block comment = open
              , { ( {"/"} | {"*"} ) , char
                | ( {"/"} | {"*"} ) , block comment
                }
              , {"*"}
              , close
              ;
open  = "/*" ;
close = "*/" ;
char  = ? any character ? - "/" - "*" ;
```
```rust
pub fn parse_block_comment(s: &str) -> usize {
    let mut chars = s.chars().peekable();
    assert_eq!(chars.next(), Some('/'));
    assert_eq!(chars.next(), Some('*'));
    let mut depth: usize = 1;
    let mut len: usize = 2;
    while depth > 0 {
        match chars.next() {
            Some('/') if matches!(chars.peek(), Some('*')) => {
                chars.next();
                depth += 1;
                len += 2;
            }
            Some('*') if matches!(chars.peek(), Some('/')) => {
                chars.next();
                depth -= 1;
                len += 2;
            }
            Some(c) => len += c.len_utf8(),
            None => panic!("exhausted source in block comment"),
        }
    }
    len
}
```
`raw_string` is a context-sensitive grammar.
Informally: opened by `r#*"`, closed by the first instance of `"#*` with the same number of `#`.
Note: an implementation is allowed to put a limit on the number of `#` allowed in a raw string (thus making the grammar regular), but at least 2^16 hashes must be supported. `rustc` uses a limit of 2^16 at the time of writing this specification.
The number of states required to create a DFA matching a raw string with up to n hashes can be approximated by the function 0.5n^2 + 2.5n.
For scale, that's on the order of 2^15 states to support 2^8 hashes, and 2^31 states to support 2^16 hashes.
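The arithmetic in that note can be checked directly; the formula below is the approximation from this document, while `approx_dfa_states` is a name invented for the sketch.

```rust
// Evaluating the approximation 0.5·n^2 + 2.5·n at the two scales mentioned.
fn approx_dfa_states(n: u64) -> u64 {
    n * n / 2 + 5 * n / 2
}

fn main() {
    // 2^8 hashes: on the order of 2^15 states.
    assert_eq!(approx_dfa_states(1u64 << 8), 33_408);
    assert!(approx_dfa_states(1u64 << 8) > (1u64 << 15));
    // 2^16 hashes: on the order of 2^31 states.
    assert!(approx_dfa_states(1u64 << 16) > (1u64 << 31));
}
```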
```
(?xs)
r(?P<hashes>\#*)"
[^\r]*?
"(?P=hashes)
```
Note that because `raw_string` is context-sensitive, this EBNF has the additional restriction that only the shortest possible match is produced in ambiguous cases.
```
raw string = "r", inner ;
inner = "#" , inner , "#"
      | "\"" , { char } , "\""
      ;
char = ? any character ? - "\r" ;
```
```rust
fn parse_raw_string(s: &str) -> usize {
    let mut chars = s.chars().peekable();
    assert_eq!(chars.next(), Some('r'));
    let mut hashes: usize = 0;
    let mut len: usize = 1;
    loop {
        match chars.next() {
            Some('#') => {
                len += 1;
                hashes += 1;
            }
            Some('"') => {
                len += 1;
                break;
            }
            Some(c) => panic!("invalid char {:?} in raw string opening fence", c),
            None => panic!("exhausted source in raw string"),
        }
    }
    loop {
        match chars.next() {
            Some('"') => {
                len += 1;
                let mut hashes_seen: usize = 0;
                // Check before scanning so that zero-hash raw strings (r"...")
                // terminate at the first quote.
                loop {
                    if hashes_seen == hashes {
                        return len;
                    }
                    match chars.peek() {
                        Some('#') => {
                            chars.next();
                            len += 1;
                            hashes_seen += 1;
                        }
                        _ => break,
                    }
                }
            }
            Some('\r') => panic!("bare CR not allowed in raw string"),
            Some(c) => len += c.len_utf8(),
            None => panic!("exhausted source in raw string"),
        }
    }
}
```
`raw_byte_string` uses the same grammar as `raw_string`, just prefixed with a `b`.
These are notes about cooking taken while writing this document that should not be in the final document.

- Floating point
  - oh boy floating point
  - https://github.com/rust-lang/rust/blob/master/src/librustc_lexer/src/lib.rs#L502
  - `binary_integer dot (?! dot | identifier)` is an error
  - `octal_integer dot (?! dot | identifier)` is an error
  - `hexadecimal_integer dot (?! dot | identifier)` is an error
  - To be a float, `decimal_integer` may be followed by:
    - `dot (?! dot | identifier)`
    - `dot decimal_integer (?! identifier)`
    - `dot decimal_integer identifier`
    - `dot binary_integer`
    - `dot octal_integer`
    - `dot hexadecimal_integer`
  - oh boy `e`/`E` is annoying - why is floating point so hard
- I would love to specify the grammar directly in terms of the micro lexer rather than deal with lookaround cooking floating point like this
  - But suffixes and `lifetime lifetime` should be handled by some stage of the lexer
  - Because the grammar has points where it accepts raw token streams
    - Namely, macros
- Suffixed literals
  - `binary_integer identifier` (error if suffix starts with `e` or `E`)
  - `octal_integer identifier` (error if suffix starts with `e` or `E`)
  - `hexadecimal_integer identifier` (error if suffix starts with `e` or `E`)
  - `decimal_integer identifier` (unless suffix starts with `e` or `E`)
  - `character identifier`
  - `byte identifier`
  - `string identifier`
  - `byte_string identifier`
  - `raw_string identifier`
  - `raw_byte_string identifier`
- Late errors
  - `lifetime lifetime` cooks to an error
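The simplest gluing shapes from the notes above can be sketched as follows. This is a hedged illustration, not rustc's implementation: the `Raw` enum and `glue_float` are invented for this sketch, and exponents and suffixed floats are deliberately out of scope.

```rust
// Hypothetical raw token classes, reduced to the three that matter here.
#[derive(Debug, Clone, PartialEq)]
enum Raw {
    DecimalInteger,
    Dot,
    Identifier,
}

// How many leading raw tokens glue into one float token, if any.
fn glue_float(toks: &[Raw]) -> Option<usize> {
    // `1.5` shape: integer, dot, integer, with no suffix identifier after.
    if let [Raw::DecimalInteger, Raw::Dot, Raw::DecimalInteger, rest @ ..] = toks {
        return (rest.first() != Some(&Raw::Identifier)).then_some(3);
    }
    // `1.` shape: integer, dot, with neither `..` (range) nor a field name after.
    if let [Raw::DecimalInteger, Raw::Dot, rest @ ..] = toks {
        return (!matches!(rest.first(), Some(Raw::Dot | Raw::Identifier))).then_some(2);
    }
    None
}

fn main() {
    use Raw::*;
    assert_eq!(glue_float(&[DecimalInteger, Dot, DecimalInteger]), Some(3)); // 1.5
    assert_eq!(glue_float(&[DecimalInteger, Dot]), Some(2)); // 1.
    assert_eq!(glue_float(&[DecimalInteger, Dot, Identifier]), None); // 1.foo (field access)
    assert_eq!(glue_float(&[DecimalInteger, Dot, Dot, DecimalInteger]), None); // 1..2 (range)
}
```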