third_party_rust_nom/doc/nom_recipes.md
LoganDark 481cd2dea6 Improve rust-style identifiers code
There is no reason to heap allocate for each individual character (or each group of characters?) when the allocated collection isn't even going to be used anyway
2022-03-14 22:32:55 +01:00

9.1 KiB

Nom Recipes

These are short recipes for accomplishing common tasks with nom.

Whitespace

Wrapper combinators that eat whitespace before and after a parser

use nom::{
  IResult,
  error::ParseError,
  combinator::value,
  sequence::delimited,
  character::complete::multispace0,
};

/// A combinator that takes a parser `inner` and produces a parser that also consumes both leading and 
/// trailing whitespace, returning the output of `inner`.
fn ws<'a, F: 'a, O, E: ParseError<&'a str>>(inner: F) -> impl FnMut(&'a str) -> IResult<&'a str, O, E>
  where
  F: Fn(&'a str) -> IResult<&'a str, O, E>,
{
  delimited(
    multispace0,
    inner,
    multispace0
  )
}

To eat only trailing whitespace, replace delimited(...) with terminated(&inner, multispace0). Likewise, the eat only leading whitespace, replace delimited(...) with preceded(multispace0, &inner). You can use your own parser instead of multispace0 if you want to skip a different set of lexemes.

Comments

// C++/EOL-style comments

This version uses % to start a comment, does not consume the newline character, and returns an output of ().

use nom::{
  IResult,
  error::ParseError,
  combinator::value,
  sequence::pair,
  bytes::complete::is_not,
  character::complete::char,
};

pub fn peol_comment<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, (), E>
{
  value(
    (), // Output is thrown away.
    pair(char('%'), is_not("\n\r"))
  )(i)
}

/* C-style comments */

Inline comments surrounded with sentinel tags (* and *). This version returns an output of () and does not handle nested comments.

use nom::{
  IResult,
  error::ParseError,
  combinator::value,
  sequence::tuple,
  bytes::complete::{tag, take_until},
};

pub fn pinline_comment<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, (), E> {
  value(
    (), // Output is thrown away.
    tuple((
      tag("(*"),
      take_until("*)"),
      tag("*)")
    ))
  )(i)
}

Identifiers

Rust-Style Identifiers

Parsing identifiers that may start with a letter (or underscore) and may contain underscores, letters and numbers may be parsed like this:

use nom::{
  IResult,
  branch::alt,
  multi::many0_count,
  combinator::recognize,
  sequence::pair,
  character::complete::{alpha1, alphanumeric1},
  bytes::complete::tag,
};

pub fn identifier(input: &str) -> IResult<&str, &str> {
  recognize(
    pair(
      alt((alpha1, tag("_"))),
      many0_count(alt((alphanumeric1, tag("_"))))
    )
  )(input)
}

Let's say we apply this to the identifier hello_world123abc. The first alt parser would recognize h. The pair combinator ensures that ello_world123abc will be piped to the next alphanumeric0 parser, which recognizes every remaining character. However, the pair combinator returns a tuple of the results of its sub-parsers. The recognize parser produces a &str of the input text that was parsed, which in this case is the entire &str hello_world123abc.

Literal Values

Escaped Strings

This is one of the examples in the examples directory.

Integers

The following recipes all return string slices rather than integer values. How to obtain an integer value instead is demonstrated for hexadecimal integers. The others are similar.

The parsers allow the grouping character _, which allows one to group the digits by byte, for example: 0xA4_3F_11_28. If you prefer to exclude the _ character, the lambda to convert from a string slice to an integer value is slightly simpler. You can also strip the _ from the string slice that is returned, which is demonstrated in the second hexdecimal number parser.

If you wish to limit the number of digits in a valid integer literal, replace many1 with many_m_n in the recipes.

Hexadecimal

The parser outputs the string slice of the digits without the leading 0x/0X.

use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::recognize,
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn hexadecimal(input: &str) -> IResult<&str, &str> { // <'a, E: ParseError<&'a str>>
  preceded(
    alt((tag("0x"), tag("0X"))),
    recognize(
      many1(
        terminated(one_of("0123456789abcdefABCDEF"), many0(char('_')))
      )
    )
  )(input)
}

If you want it to return the integer value instead, use map:

use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::{map_res, recognize},
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn hexadecimal_value(input: &str) -> IResult<&str, i64> {
  map_res(
    preceded(
      alt((tag("0x"), tag("0X"))),
      recognize(
        many1(
          terminated(one_of("0123456789abcdefABCDEF"), many0(char('_')))
        )
      )
    ),
    |out: &str| i64::from_str_radix(&str::replace(&out, "_", ""), 16)
  )(input)
}

Octal

use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::recognize,
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn octal(input: &str) -> IResult<&str, &str> {
  preceded(
    alt((tag("0o"), tag("0O"))),
    recognize(
      many1(
        terminated(one_of("01234567"), many0(char('_')))
      )
    )
  )(input)
}

Binary

use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::recognize,
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn binary(input: &str) -> IResult<&str, &str> {
  preceded(
    alt((tag("0b"), tag("0B"))),
    recognize(
      many1(
        terminated(one_of("01"), many0(char('_')))
      )
    )
  )(input)
}

Decimal

use nom::{
  IResult,
  multi::{many0, many1},
  combinator::recognize,
  sequence::terminated,
  character::complete::{char, one_of},
};

fn decimal(input: &str) -> IResult<&str, &str> {
  recognize(
    many1(
      terminated(one_of("0123456789"), many0(char('_')))
    )
  )(input)
}

Floating Point Numbers

The following is adapted from the Python parser by Valentin Lorentz (ProgVal).

use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::{opt, recognize},
  sequence::{preceded, terminated, tuple},
  character::complete::{char, one_of},
};

fn float(input: &str) -> IResult<&str, &str> {
  alt((
    // Case one: .42
    recognize(
      tuple((
        char('.'),
        decimal,
        opt(tuple((
          one_of("eE"),
          opt(one_of("+-")),
          decimal
        )))
      ))
    )
    , // Case two: 42e42 and 42.42e42
    recognize(
      tuple((
        decimal,
        opt(preceded(
          char('.'),
          decimal,
        )),
        one_of("eE"),
        opt(one_of("+-")),
        decimal
      ))
    )
    , // Case three: 42. and 42.42
    recognize(
      tuple((
        decimal,
        char('.'),
        opt(decimal)
      ))
    )
  ))(input)
}

fn decimal(input: &str) -> IResult<&str, &str> {
  recognize(
    many1(
      terminated(one_of("0123456789"), many0(char('_')))
    )
  )(input)
}

implementing FromStr

The FromStr trait provides a common interface to parse from a string.

use nom::{
  IResult, Finish, error::Error,
  bytes::complete::{tag, take_while},
};
use std::str::FromStr;

// will recognize the name in "Hello, name!"
fn parse_name(input: &str) -> IResult<&str, &str> {
  let (i, _) = tag("Hello, ")(input)?;
  let (i, name) = take_while(|c:char| c.is_alphabetic())(i)?;
  let (i, _) = tag("!")(i)?;

  Ok((i, name))
}

// with FromStr, the result cannot be a reference to the input, it must be owned
#[derive(Debug)]
pub struct Name(pub String);

impl FromStr for Name {
  // the error must be owned as well
  type Err = Error<String>;

  fn from_str(s: &str) -> Result<Self, Self::Err> {
      match parse_name(s).finish() {
          Ok((_remaining, name)) => Ok(Name(name.to_string())),
          Err(Error { input, code }) => Err(Error {
              input: input.to_string(),
              code,
          })
      }
  }
}

fn main() {
  // parsed: Ok(Name("nom"))
  println!("parsed: {:?}", "Hello, nom!".parse::<Name>());

  // parsed: Err(Error { input: "123!", code: Tag })
  println!("parsed: {:?}", "Hello, 123!".parse::<Name>());
}