third_party_rust_nom/doc/error_management.md
Wolf Thomsen 6e45c5d1f3
Remove duplicated section from error_management.md (#1529)
* Remove duplicated section from error_management.md

The section explaining the three different error types was duplicated (with minimal changes between the two sections). This (small) PR removes the redundancy.

* Update doc/error_management.md

Co-authored-by: Geoffroy Couprie <geo.couprie@gmail.com>
2022-12-30 18:01:29 +01:00


Error management

nom's errors are designed with multiple needs in mind:

  • indicate which parser failed and where in the input data
  • accumulate more context as the error goes up the parser chain
  • have a very low overhead, as errors are often discarded by the calling parser (examples: many0, alt)
  • be modifiable according to the user's needs, because some languages need a lot more information

To match these requirements, nom parsers have to return the following result type:

pub type IResult<I, O, E=nom::error::Error<I>> = Result<(I, O), nom::Err<E>>;

pub enum Err<E> {
    Incomplete(Needed),
    Error(E),
    Failure(E),
}

The result is either an Ok((I, O)) containing the remaining input and the parsed value, or an Err(nom::Err<E>) with E the error type. nom::Err<E> is an enum because combinators can have different behaviours depending on the value. The Err<E> enum expresses 3 conditions for a parser error:

  • Incomplete indicates that a parser did not have enough data to decide. It can be returned by parsers found in streaming submodules to signal that we should accumulate more data in the buffer (example: reading from a file or socket) and call the parser again. Parsers in the complete submodules assume that they have the entire input data, so if it is not sufficient, they return an Err::Error instead
  • Error is a normal parser error. If a child parser of the alt combinator returns Error, alt will try another child parser
  • Failure is an error from which we cannot recover: the alt combinator will not try other branches if a child parser returns Failure. If we know we are in the right branch (example: we found a correct prefix character but the input after that was wrong), we can transform an Err::Error into an Err::Failure with the cut() combinator
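
The difference between Error and Failure is easiest to see in a choice combinator. The following is a minimal, std-only sketch and not nom's implementation: it re-declares a simplified Err (its Incomplete variant carries no Needed payload), and the helpers alt2, cut, letter_a and letter_b are hypothetical names introduced only for this illustration.

```rust
// Simplified stand-in for nom::Err, re-declared so the sketch needs no crates.
#[derive(Debug, PartialEq)]
pub enum Err<E> {
    Incomplete, // nom's variant carries a `Needed` value; omitted here
    Error(E),
    Failure(E),
}

pub type IResult<'a, O> = Result<(&'a str, O), Err<&'static str>>;

// Hypothetical two-branch choice: only a recoverable `Error` from the
// first parser lets the second one run.
pub fn alt2<'a, O>(
    first: impl Fn(&'a str) -> IResult<'a, O>,
    second: impl Fn(&'a str) -> IResult<'a, O>,
) -> impl Fn(&'a str) -> IResult<'a, O> {
    move |input| match first(input) {
        Err(Err::Error(_)) => second(input),
        other => other, // Ok, Failure and Incomplete are returned unchanged
    }
}

// Hypothetical `cut`: upgrades a recoverable Error to a Failure.
pub fn cut<'a, O>(
    parser: impl Fn(&'a str) -> IResult<'a, O>,
) -> impl Fn(&'a str) -> IResult<'a, O> {
    move |input| match parser(input) {
        Err(Err::Error(e)) => Err(Err::Failure(e)),
        other => other,
    }
}

// Toy parsers: each one accepts a single fixed letter.
pub fn letter_a(input: &str) -> IResult<'_, char> {
    match input.strip_prefix('a') {
        Some(rest) => Ok((rest, 'a')),
        None => Err(Err::Error("expected 'a'")),
    }
}

pub fn letter_b(input: &str) -> IResult<'_, char> {
    match input.strip_prefix('b') {
        Some(rest) => Ok((rest, 'b')),
        None => Err(Err::Error("expected 'b'")),
    }
}
```

With these definitions, alt2(letter_a, letter_b) applied to "bc" falls through to the second branch, while alt2(cut(letter_a), letter_b) stops at the first branch and reports a Failure.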

If we are running a parser and know it will not return Err::Incomplete, we can directly extract the error type from Err::Error or Err::Failure with the finish() method:

let parser_result: IResult<I, O, E> = parser(input);
let result: Result<(I, O), E> = parser_result.finish();
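
Conceptually, finish() just unwraps the Err layer. The std-only sketch below shows that idea with a free function named finish and a simplified copy of Err (both are illustrative stand-ins, not nom's code). Like nom's method, it panics when handed an Incomplete value, which is why it should only be used when the parser cannot return one.

```rust
// Simplified stand-in for nom::Err; `Incomplete` carries no payload here.
#[derive(Debug, PartialEq)]
pub enum Err<E> {
    Incomplete,
    Error(E),
    Failure(E),
}

// Illustrative free function: collapses Error and Failure into the plain
// error type and passes successful results through untouched.
pub fn finish<I, O, E>(result: Result<(I, O), Err<E>>) -> Result<(I, O), E> {
    match result {
        Ok(value) => Ok(value),
        Err(Err::Error(e)) | Err(Err::Failure(e)) => Err(e),
        Err(Err::Incomplete) => {
            panic!("finish() must not be called on an incomplete parse")
        }
    }
}
```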

If we used a borrowed type as input, like &[u8] or &str, we might want to convert it to an owned type to transmit it somewhere, with the to_owned() method:

// assuming the parser uses the default error type, nom::error::Error
let result: Result<(&[u8], Value), Err<Error<Vec<u8>>>> =
  parser(data).map_err(|e: Err<Error<&[u8]>>| e.to_owned());

nom provides a powerful error system that can adapt to your needs: you can get reduced error information if you want to improve performance, or you can get a precise trace of parser application, with fine grained position information.

This is done through the third type parameter of IResult, nom's parser result type:

pub type IResult<I, O, E=nom::error::Error<I>> = Result<(I, O), Err<E>>;

pub enum Err<E> {
    Incomplete(Needed),
    Error(E),
    Failure(E),
}

This error type is completely generic in nom's combinators, so you can choose exactly which error type you want to use when you define your parsers, or directly at the call site. See the JSON parser for an example of choosing different error types at the call site.

Common error types

The default error type: nom::error::Error

#[derive(Debug, PartialEq)]
pub struct Error<I> {
  /// position of the error in the input data
  pub input: I,
  /// nom error code
  pub code: ErrorKind,
}

This structure contains a nom::error::ErrorKind indicating which kind of parser encountered an error (example: ErrorKind::Tag for the tag() combinator), and the input position of the error.

This error type is fast and has very low overhead, so it is suitable for parsers that are called repeatedly, like in network protocols. It is very limited though: it will not tell you about the chain of parser calls, so it is not enough to write user-friendly errors.

Example error returned in a JSON-like parser (from examples/json.rs):

let data = "  { \"a\"\t: 42,
\"b\": [ \"x\", \"y\", 12 ] ,
\"c\": { 1\"hello\" : \"world\"
}
} ";

// will print:
// Err(
//   Failure(
//       Error {
//           input: "1\"hello\" : \"world\"\n  }\n  } ",
//           code: Char,
//       },
//   ),
// )
println!(
  "{:#?}\n",
  json::<Error<&str>>(data)
);

Getting more information: nom::error::VerboseError

The VerboseError<I> type accumulates more information about the chain of parsers that encountered an error:

#[derive(Clone, Debug, PartialEq)]
pub struct VerboseError<I> {
  /// List of errors accumulated by `VerboseError`, containing the affected
  /// part of input data, and some context
  pub errors: crate::lib::std::vec::Vec<(I, VerboseErrorKind)>,
}

#[derive(Clone, Debug, PartialEq)]
/// Error context for `VerboseError`
pub enum VerboseErrorKind {
  /// Static string added by the `context` function
  Context(&'static str),
  /// Indicates which character was expected by the `char` function
  Char(char),
  /// Error kind given by various nom parsers
  Nom(ErrorKind),
}

It contains the input position and error code for each of those parsers. It does not accumulate errors from the different branches of alt: it will only contain errors from the last branch it tried.
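
To make the accumulation concrete, here is a minimal std-only sketch of the idea. Trace, start and add_layer are hypothetical names, and a plain string label stands in for VerboseErrorKind. The first error starts the list and every enclosing parser pushes its own entry, so the innermost failure ends up first.

```rust
// Simplified stand-in for the error list kept by `VerboseError`.
#[derive(Debug, PartialEq)]
pub struct Trace<I> {
    pub errors: Vec<(I, &'static str)>,
}

impl<I> Trace<I> {
    // Created by the parser that actually failed.
    pub fn start(input: I, what: &'static str) -> Self {
        Trace { errors: vec![(input, what)] }
    }

    // Called by each enclosing parser as the error travels up the chain.
    pub fn add_layer(mut self, input: I, what: &'static str) -> Self {
        self.errors.push((input, what));
        self
    }
}
```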

It can be used along with the nom::error::context combinator to inform about the parser chain:

context(
  "string",
  preceded(char('\"'), cut(terminated(parse_str, char('\"')))),
)(i)

It is not very usable if printed directly:

// parsed verbose: Err(
//   Failure(
//       VerboseError {
//           errors: [
//               (
//                   "1\"hello\" : \"world\"\n  }\n  } ",
//                   Char(
//                       '}',
//                   ),
//               ),
//               (
//                   "{ 1\"hello\" : \"world\"\n  }\n  } ",
//                   Context(
//                       "map",
//                   ),
//               ),
//               (
//                   "{ \"a\"\t: 42,\n  \"b\": [ \"x\", \"y\", 12 ] ,\n  \"c\": { 1\"hello\" : \"world\"\n  }\n  } ",
//                   Context(
//                       "map",
//                   ),
//               ),
//           ],
//       },
//   ),
// )
println!("parsed verbose: {:#?}", json::<VerboseError<&str>>(data));

But by looking at the original input and the chain of errors, we can build a more user-friendly error message. The nom::error::convert_error function can build such a message.

let e = json::<VerboseError<&str>>(data).finish().err().unwrap();
// here we use the `convert_error` function, to transform a `VerboseError<&str>`
// into a printable trace.
//
// This will print:
// verbose errors - `json::<VerboseError<&str>>(data)`:
// 0: at line 2:
//   "c": { 1"hello" : "world"
//          ^
// expected '}', found 1
//
// 1: at line 2, in map:
//   "c": { 1"hello" : "world"
//        ^
//
// 2: at line 0, in map:
//   { "a" : 42,
//   ^
println!(
  "verbose errors - `json::<VerboseError<&str>>(data)`:\n{}",
  convert_error(data, e)
);

Note that VerboseError and convert_error are meant as a starting point for language errors and cannot cover all use cases, so you will probably want to write a custom convert_error function.
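
A custom replacement mostly needs to turn each error's remaining input back into a position in the original text. The following is a minimal std-only sketch, assuming every error input is a suffix of the original string (as is the case with &str inputs). line_col and render are hypothetical helper names, the (remaining input, message) pairs stand in for the entries of a VerboseError, and lines and columns are counted from 1 here.

```rust
/// Returns the 1-based (line, column) at which `remaining` starts inside
/// `original`. Assumes `remaining` is a suffix of `original`.
pub fn line_col(original: &str, remaining: &str) -> (usize, usize) {
    let offset = original.len() - remaining.len();
    let consumed = &original[..offset];
    let line = consumed.matches('\n').count() + 1;
    // Byte index just after the last newline, or 0 on the first line.
    let line_start = consumed.rfind('\n').map_or(0, |pos| pos + 1);
    let column = consumed[line_start..].chars().count() + 1;
    (line, column)
}

/// Hypothetical renderer: one "line:column: message" row per error entry.
pub fn render(original: &str, errors: &[(&str, &str)]) -> String {
    errors
        .iter()
        .map(|(remaining, message)| {
            let (line, column) = line_col(original, remaining);
            format!("{}:{}: {}\n", line, column, message)
        })
        .collect()
}
```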

Improving usability: nom_locate and nom-supreme

These crates were developed to improve the user experience when writing nom parsers.

nom_locate

nom_locate wraps the input data in a Span type that can be understood by nom parsers. That type provides location information, like line and column.

nom-supreme

nom-supreme provides the ErrorTree<I> error type, which gives the same chain of parser errors as VerboseError but also accumulates errors from the various branches tried by alt.

With this error type, you can explore everything that has been tried by the parser.

The ParseError trait

If those error types are not enough, we can define our own, by implementing the ParseError<I> trait. All nom combinators are generic over that trait for their errors, so we only need to define it in the parser result type, and it will be used everywhere.

pub trait ParseError<I>: Sized {
    /// Creates an error from the input position and an [ErrorKind]
    fn from_error_kind(input: I, kind: ErrorKind) -> Self;

    /// Combines an existing error with a new one created from the input
    /// position and an [ErrorKind]. This is useful when backtracking
    /// through a parse tree, accumulating error context on the way
    fn append(input: I, kind: ErrorKind, other: Self) -> Self;

    /// Creates an error from an input position and an expected character
    fn from_char(input: I, _: char) -> Self {
        Self::from_error_kind(input, ErrorKind::Char)
    }

    /// Combines two existing errors. This function is used to compare errors
    /// generated in various branches of `alt`
    fn or(self, other: Self) -> Self {
        other
    }
}

Any error type has to implement that trait, which requires ways to build an error:

  • from_error_kind: From the input position and the ErrorKind enum that indicates in which parser we got an error
  • append: Allows the creation of a chain of errors as we backtrack through the parser tree (various combinators will add more context)
  • from_char: Creates an error that indicates which character we were expecting
  • or: In combinators like alt, allows choosing between errors from various branches (or accumulating them)
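
As a sketch of the "accumulating" option for or, the std-only example below re-declares a two-method subset of the trait under the hypothetical name MiniParseError, with a plain string label standing in for ErrorKind. Alternatives keeps the errors of every branch, while LastOnly relies on the default or and keeps only the most recent one. Both types are illustrative and not part of nom.

```rust
// Reduced stand-in for nom's ParseError, so the sketch needs no crates.
pub trait MiniParseError<I>: Sized {
    fn from_error_kind(input: I, kind: &'static str) -> Self;

    // Same default as the trait shown above: keep only the newer error.
    fn or(self, other: Self) -> Self {
        other
    }
}

// Keeps one entry per branch that was tried.
#[derive(Debug, PartialEq)]
pub struct Alternatives<I> {
    pub tried: Vec<(I, &'static str)>,
}

impl<I> MiniParseError<I> for Alternatives<I> {
    fn from_error_kind(input: I, kind: &'static str) -> Self {
        Alternatives { tried: vec![(input, kind)] }
    }

    // Merge instead of replace, so no branch is forgotten.
    fn or(mut self, other: Self) -> Self {
        self.tried.extend(other.tried);
        self
    }
}

// Uses the default `or`, so only the last branch survives.
#[derive(Debug, PartialEq)]
pub struct LastOnly<I>(pub I, pub &'static str);

impl<I> MiniParseError<I> for LastOnly<I> {
    fn from_error_kind(input: I, kind: &'static str) -> Self {
        LastOnly(input, kind)
    }
}
```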

We can also implement the ContextError trait to support the context() combinator used by VerboseError<I>:

pub trait ContextError<I>: Sized {
    fn add_context(_input: I, _ctx: &'static str, other: Self) -> Self {
        other
    }
}

There is also the FromExternalError<I, E> trait, used by map_res to wrap errors returned by other functions:

pub trait FromExternalError<I, ExternalError> {
  fn from_external_error(input: I, kind: ErrorKind, e: ExternalError) -> Self;
}
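
To illustrate what such a wrapper involves, here is a std-only sketch that re-declares the trait and a one-variant ErrorKind so it compiles without nom. NumberError and parse_u8 are hypothetical names: parse_u8 runs a fallible standard-library conversion and wraps its error, roughly the way a map_res-style step would.

```rust
use std::num::ParseIntError;

// Stand-ins re-declared locally; nom's ErrorKind has many more variants.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ErrorKind {
    MapRes,
}

pub trait FromExternalError<I, ExternalError> {
    fn from_external_error(input: I, kind: ErrorKind, e: ExternalError) -> Self;
}

// Hypothetical error type that remembers the external error's message.
#[derive(Debug, PartialEq)]
pub struct NumberError {
    pub input: String,
    pub kind: ErrorKind,
    pub cause: String,
}

impl<'a> FromExternalError<&'a str, ParseIntError> for NumberError {
    fn from_external_error(input: &'a str, kind: ErrorKind, e: ParseIntError) -> Self {
        NumberError {
            input: input.to_owned(),
            kind,
            cause: e.to_string(),
        }
    }
}

// Runs a fallible conversion and wraps its error in the caller's error type.
pub fn parse_u8<'a, E>(input: &'a str) -> Result<u8, E>
where
    E: FromExternalError<&'a str, ParseIntError>,
{
    input
        .parse::<u8>()
        .map_err(|e| E::from_external_error(input, ErrorKind::MapRes, e))
}
```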

Example usage

Let's define a debugging error type that prints something every time an error is generated. This will give us good insight into what the parser tried. Since errors can be combined with each other, we want it to keep some information about the error that was just returned. We'll just store that in a string:

// Debug is needed to print the parser result with {:#?} later on
#[derive(Debug)]
struct DebugError {
    message: String,
}

Now let's implement ParseError and ContextError on it:

impl ParseError<&str> for DebugError {
    // on one line, we show the error code and the input that caused it
    fn from_error_kind(input: &str, kind: ErrorKind) -> Self {
        let message = format!("{:?}:\t{:?}\n", kind, input);
        println!("{}", message);
        DebugError { message }
    }

    // if combining multiple errors, we show them one after the other
    fn append(input: &str, kind: ErrorKind, other: Self) -> Self {
        let message = format!("{}{:?}:\t{:?}\n", other.message, kind, input);
        println!("{}", message);
        DebugError { message }
    }

    fn from_char(input: &str, c: char) -> Self {
        let message = format!("'{}':\t{:?}\n", c, input);
        println!("{}", message);
        DebugError { message }
    }

    fn or(self, other: Self) -> Self {
        let message = format!("{}\tOR\n{}\n", self.message, other.message);
        println!("{}", message);
        DebugError { message }
    }
}

impl ContextError<&str> for DebugError {
    fn add_context(input: &str, ctx: &'static str, other: Self) -> Self {
        let message = format!("{}\"{}\":\t{:?}\n", other.message, ctx, input);
        println!("{}", message);
        DebugError { message }
    }
}

So when calling our JSON parser with this error type, we will get a trace of all the times a parser stopped and backtracked:

println!("debug: {:#?}", root::<DebugError>(data));

AlphaNumeric:   "\"\t: 42,\n  \"b\": [ \"x\", \"y\", 12 ] ,\n  \"c\": { 1\"hello\" : \"world\"\n  }\n  } "

'{':    "42,\n  \"b\": [ \"x\", \"y\", 12 ] ,\n  \"c\": { 1\"hello\" : \"world\"\n  }\n  } "

'{':    "42,\n  \"b\": [ \"x\", \"y\", 12 ] ,\n  \"c\": { 1\"hello\" : \"world\"\n  }\n  } "
"map":  "42,\n  \"b\": [ \"x\", \"y\", 12 ] ,\n  \"c\": { 1\"hello\" : \"world\"\n  }\n  } "

[..]

AlphaNumeric:   "\": { 1\"hello\" : \"world\"\n  }\n  } "

'"':    "1\"hello\" : \"world\"\n  }\n  } "

'"':    "1\"hello\" : \"world\"\n  }\n  } "
"string":       "1\"hello\" : \"world\"\n  }\n  } "

'}':    "1\"hello\" : \"world\"\n  }\n  } "

'}':    "1\"hello\" : \"world\"\n  }\n  } "
"map":  "{ 1\"hello\" : \"world\"\n  }\n  } "

'}':    "1\"hello\" : \"world\"\n  }\n  } "
"map":  "{ 1\"hello\" : \"world\"\n  }\n  } "
"map":  "{ \"a\"\t: 42,\n  \"b\": [ \"x\", \"y\", 12 ] ,\n  \"c\": { 1\"hello\" : \"world\"\n  }\n  } "

debug: Err(
    Failure(
        DebugError {
            message: "'}':\t\"1\\\"hello\\\" : \\\"world\\\"\\n  }\\n  } \"\n\"map\":\t\"{ 1\\\"hello\\\" : \\\"world
\\"\\n  }\\n  } \"\n\"map\":\t\"{ \\\"a\\\"\\t: 42,\\n  \\\"b\\\": [ \\\"x\\\", \\\"y\\\", 12 ] ,\\n  \\\"c\\\": { 1\
\"hello\\\" : \\\"world\\\"\\n  }\\n  } \"\n",
        },
    ),
)

Here we can see that when parsing { 1\"hello\" : \"world\"\n }\n }, after getting past the initial {, we tried:

  • parsing a " because we're expecting a key name, and that parser was part of the "string" parser
  • parsing a } because the map might be empty. When this fails, we backtrack, through 2 recursive map parsers:
'}':    "1\"hello\" : \"world\"\n  }\n  } "
"map":  "{ 1\"hello\" : \"world\"\n  }\n  } "
"map":  "{ \"a\"\t: 42,\n  \"b\": [ \"x\", \"y\", 12 ] ,\n  \"c\": { 1\"hello\" : \"world\"\n  }\n  } "

Debugging parsers

While you are writing your parsers, you will sometimes need to follow which part of the parser sees which part of the input.

To that end, nom provides the dbg_dmp function, which observes a parser's input and output and prints a hexdump of the input if there was an error. Here is what it could print:

fn f(i: &[u8]) -> IResult<&[u8], &[u8]> {
    dbg_dmp(tag("abcd"), "tag")(i)
}

let a = &b"efghijkl"[..];

// Will print the following message:
// tag: Error(Error(Error { input: [101, 102, 103, 104, 105, 106, 107, 108], code: Tag })) at:
// 00000000        65 66 67 68 69 6a 6b 6c         efghijkl
f(a);
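
The same idea can be reproduced by hand when a custom report is needed. The following std-only sketch is not nom's implementation: hex_line, traced and abcd are hypothetical names, the parser signature is simplified to a plain Result, and the dump format only approximates the one shown above.

```rust
/// Formats up to 16 bytes as "hex bytes<TAB>printable ASCII".
pub fn hex_line(bytes: &[u8]) -> String {
    let hex: Vec<String> = bytes
        .iter()
        .take(16)
        .map(|b| format!("{:02x}", b))
        .collect();
    let ascii: String = bytes
        .iter()
        .take(16)
        .map(|&b| if b.is_ascii_graphic() || b == b' ' { b as char } else { '.' })
        .collect();
    format!("{}\t{}", hex.join(" "), ascii)
}

/// Hypothetical wrapper in the spirit of `dbg_dmp`: runs `parser` and, on
/// error, prints the context, the error and a dump of the input to stderr.
pub fn traced<'a, O, E: std::fmt::Debug>(
    parser: impl Fn(&'a [u8]) -> Result<(&'a [u8], O), E>,
    context: &'static str,
) -> impl Fn(&'a [u8]) -> Result<(&'a [u8], O), E> {
    move |input| {
        let result = parser(input);
        if let Err(e) = &result {
            eprintln!("{}: {:?} at:\n{}", context, e, hex_line(input));
        }
        result // the parser's result is returned unchanged
    }
}

/// Toy parser for the demonstration: matches the literal prefix "abcd".
pub fn abcd(input: &[u8]) -> Result<(&[u8], &[u8]), &'static str> {
    if input.starts_with(b"abcd") {
        Ok((&input[4..], &input[..4]))
    } else {
        Err("Tag")
    }
}
```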

You can go further with the nom-trace crate.