mirror of
https://github.com/openharmony/third_party_rust_nom.git
synced 2026-07-01 21:04:01 -04:00
Move the docs out of Github's wiki, into the tree
the github wiki is not read much, and it does not allow easy contributions. A docs/ folder in the source tree will be more up to date and will accept contributions
+51
@@ -0,0 +1,51 @@
|
||||
### The compiler indicates `error: expected an item keyword` then points to the function's return type in `named!`:
|
||||
|
||||
```
|
||||
error: expected an item keyword
|
||||
named!(multi<Vec<&str>>, many0!( map_res!(tag!( "abcd" ), str::from_utf8) ) );
|
||||
^~~
|
||||
```
|
||||
|
||||
This happens because the macro processor mistakes `>>` for an operator. Adding a space fixes it, like this: `named!(multi< Vec<&str> >, ...`
|
||||
|
||||
### nom 1.0 does not compile on Rust older than 1.4
|
||||
|
||||
Typically, the error would look like this:
|
||||
|
||||
```
|
||||
src/stream.rs:74:44: 74:64 error: the parameter type `E` may not live long enough [E0309]
|
||||
src/stream.rs:74 if let &ConsumerState::Done(_,ref o) = self.apply(consumer) {
|
||||
^~~~~~~~~~~~~~~~~~~~
|
||||
note: in expansion of if let expansion
|
||||
src/stream.rs:74:5: 78:6 note: expansion site
|
||||
src/stream.rs:74:44: 74:64 help: run `rustc --explain E0309` to see a detailed explanation
|
||||
src/stream.rs:74:44: 74:64 help: consider adding an explicit lifetime bound `E: 'b`...
|
||||
src/stream.rs:74:44: 74:64 note: ...so that the reference type `&stream::ConsumerState<O, E, M>` does not outlive the data it points at
|
||||
src/stream.rs:74 if let &ConsumerState::Done(_,ref o) = self.apply(consumer) {
|
||||
^~~~~~~~~~~~~~~~~~~~
|
||||
note: in expansion of if let expansion
|
||||
src/stream.rs:74:5: 78:6 note: expansion site
|
||||
src/stream.rs:74:44: 74:64 error: the parameter type `M` may not live long enough [E0309]
|
||||
src/stream.rs:74 if let &ConsumerState::Done(_,ref o) = self.apply(consumer) {
|
||||
^~~~~~~~~~~~~~~~~~~~
|
||||
note: in expansion of if let expansion
|
||||
src/stream.rs:74:5: 78:6 note: expansion site
|
||||
src/stream.rs:74:44: 74:64 help: run `rustc --explain E0309` to see a detailed explanation
|
||||
src/stream.rs:74:44: 74:64 help: consider adding an explicit lifetime bound `M: 'b`...
|
||||
src/stream.rs:74:44: 74:64 note: ...so that the reference type `&stream::ConsumerState<O, E, M>` does not outlive the data it points at
|
||||
src/stream.rs:74 if let &ConsumerState::Done(_,ref o) = self.apply(consumer) {
|
||||
^~~~~~~~~~~~~~~~~~~~
|
||||
note: in expansion of if let expansion
|
||||
src/stream.rs:74:5: 78:6 note: expansion site
|
||||
error: aborting due to 2 previous errors
|
||||
|
||||
Could not compile `nom`.
|
||||
```
|
||||
|
||||
This is caused by some lifetime issues that may be fixed in a future version of nom. In the meantime, you can add `default-features=false` to nom's declaration in `Cargo.toml` to deactivate this part of the code:
|
||||
|
||||
```toml
|
||||
[dependencies.nom]
|
||||
version = "~1.0.0"
|
||||
default-features = false
|
||||
```
|
||||
@@ -0,0 +1,314 @@
|
||||
# Error management
|
||||
|
||||
Parser combinators are useful tools to build parsers, but they are notoriously bad at error reporting. This happens because a tree of parsers acts as a single parser, and the only error you get will come from the root parser.
|
||||
|
||||
This is especially annoying while developing, since you cannot know which parser failed, and why.
|
||||
|
||||
Nom provides a few tools to help you in reporting errors and debugging parsers.
|
||||
|
||||
## Debugging macros
|
||||
|
||||
There are two macros that you can use to check what is happening while you write your parsers: `dbg!` and `dbg_dmp!`.
|
||||
|
||||
Each takes a parser or combinator as input and, if it returns an `Error` or `Incomplete`, prints the result and the parser passed as argument. The result is returned unmodified, so the macro can be added to or removed from your parser without any impact.
|
||||
|
||||
```rust
|
||||
#[macro_use] extern crate nom;
|
||||
|
||||
fn main() {
|
||||
named!(f, dbg!( tag!( "abcd" ) ) );
|
||||
|
||||
let a = &b"efgh"[..];
|
||||
f(a);
|
||||
}
|
||||
```
|
||||
|
||||
Result:
|
||||
|
||||
```
|
||||
Error(Position(0, [101, 102, 103, 104])) at l.5 by ' tag ! ( "abcd" ) '
|
||||
```
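The behaviour described above (report failures, pass the result through untouched) can be sketched without nom. This is only an illustration: the `IResult` type below is a reduced local stand-in, and `debug_parser` and `abcd` are hypothetical names, not part of nom's API.

```rust
use std::fmt::Debug;

// Local stand-in for nom's result type, reduced to what the sketch needs.
#[derive(Debug, PartialEq)]
enum IResult<I, O> {
    Done(I, O),
    Error(u32),
    Incomplete,
}

// Hypothetical helper: runs `parser`, reports failures on stderr,
// and returns the result unmodified.
fn debug_parser<'a, O, F>(name: &str, parser: F, input: &'a [u8]) -> IResult<&'a [u8], O>
where
    O: Debug,
    F: Fn(&'a [u8]) -> IResult<&'a [u8], O>,
{
    let res = parser(input);
    match res {
        IResult::Done(..) => {}
        // Only `Error` and `Incomplete` are reported.
        _ => eprintln!("{:?} by '{}'", res, name),
    }
    res
}

// Minimal tag-like parser used to exercise the wrapper.
fn abcd(input: &[u8]) -> IResult<&[u8], &[u8]> {
    if input.len() < 4 {
        IResult::Incomplete
    } else if &input[..4] == b"abcd" {
        IResult::Done(&input[4..], &input[..4])
    } else {
        IResult::Error(0)
    }
}
```

Because the wrapper hands back exactly what the inner parser produced, wrapping or unwrapping a parser does not change the behaviour of the code around it.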
|
||||
|
||||
The result sent by `dbg_dmp!` is slightly different:
|
||||
|
||||
```rust
|
||||
#[macro_use] extern crate nom;
|
||||
|
||||
fn main() {
|
||||
named!(f, dbg_dmp!( tag!( "abcd" ) ) );
|
||||
|
||||
let a = &b"efghijkl"[..];
|
||||
f(a);
|
||||
}
|
||||
```
|
||||
|
||||
It will print, along with the result and the parser, a hexdump of the input buffer passed to the parser.
|
||||
|
||||
```
|
||||
Error(Position(0, [101, 102, 103, 104, 105, 106, 107, 108])) at l.5 by ' tag ! ( "abcd" ) '
|
||||
00000000 65 66 67 68 69 6a 6b 6c efghijkl
|
||||
```
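For readers unfamiliar with the hexdump layout, here is a minimal sketch of a formatter that produces a similar line: offset, hexadecimal bytes, then printable characters. The function name `hexdump` and the exact column layout are assumptions made for this illustration; nom's own output may differ in spacing.

```rust
// Hypothetical helper: formats `data` 16 bytes per line as
// "<offset> <hex bytes> <printable ASCII>".
fn hexdump(data: &[u8]) -> String {
    let mut out = String::new();
    for (line, chunk) in data.chunks(16).enumerate() {
        // Offset of the first byte of this line, in hexadecimal.
        out.push_str(&format!("{:08x} ", line * 16));
        for byte in chunk {
            out.push_str(&format!("{:02x} ", byte));
        }
        // Printable bytes are shown as-is, everything else as a dot.
        for &byte in chunk {
            let c = if byte.is_ascii_graphic() || byte == b' ' {
                byte as char
            } else {
                '.'
            };
            out.push(c);
        }
        out.push('\n');
    }
    out
}
```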
|
||||
|
||||
## Error reporting
|
||||
|
||||
As a reminder, here are the basic types of nom:
|
||||
|
||||
```rust
|
||||
#[derive(Debug,PartialEq,Eq,Clone)]
|
||||
pub enum Err<P,E=u32>{
|
||||
Code(ErrorKind<E>),
|
||||
Node(ErrorKind<E>, Box<Err<P,E>>),
|
||||
Position(ErrorKind<E>, P),
|
||||
NodePosition(ErrorKind<E>, P, Box<Err<P,E>>)
|
||||
}
|
||||
|
||||
#[derive(Debug,PartialEq,Eq)]
|
||||
pub enum Needed {
|
||||
Unknown,
|
||||
Size(u32)
|
||||
}
|
||||
#[derive(Debug,PartialEq,Eq)]
|
||||
pub enum IResult<I,O,E=u32> {
|
||||
Done(I,O),
|
||||
Error(Err<I,E>),
|
||||
Incomplete(Needed)
|
||||
}
|
||||
```
|
||||
|
||||
An error in nom can be either:
|
||||
- an `ErrorKind<E>` error code
|
||||
- an `ErrorKind<E>` error code and a pointer to the next error
|
||||
- an `ErrorKind<E>` error code and an input slice
|
||||
- an `ErrorKind<E>` error code, an input slice and a pointer to the next error
|
||||
|
||||
`E` is the custom error type you can provide. Otherwise, it is a `u32` by default.
|
||||
If you need more information on the errors, or want to act on them in the calling code you can use the `error!` combinator. It takes an `ErrorKind<E>` error code and a parser as argument. If the child parser returns an error, it will wrap it in another error (a `NodePosition`) with its own error code, and return it directly.
|
||||
|
||||
### Adding an error
|
||||
|
||||
Sometimes, you want to provide an error code at a specific point in the parser tree. The `add_error!` macro can be used for this:
|
||||
|
||||
```rust
|
||||
named!(err_test,
|
||||
preceded!(tag!("efgh"), add_error!(ErrorKind::Custom(42),
|
||||
chain!(
|
||||
tag!("ijkl") ~
|
||||
res: add_error!(ErrorKind::Custom(128), tag!("mnop")) ,
|
||||
|| { res }
|
||||
)
|
||||
)
|
||||
));
|
||||
let a = &b"efghblah"[..];
|
||||
let blah = &b"blah"[..];
|
||||
|
||||
let res_a = err_test(a);
|
||||
assert_eq!(res_a, Error(NodePosition(ErrorKind::Custom(42), blah, Box::new(Position(ErrorKind::Tag, blah)))));
|
||||
```
|
||||
|
||||
If the child parser returns an error, `add_error!` will add its own at the head of the error chain.
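The wrapping step can be pictured with a small function. This is a sketch under assumptions: `ParseError` is a local stand-in for nom's `Err` with plain `u32` codes, a standard `Result` replaces `IResult`, and `add_error` here is a hypothetical function, not the macro itself.

```rust
// Local stand-in for nom's `Err`, keeping only the two positional variants.
#[derive(Debug, PartialEq)]
enum ParseError<'a> {
    Position(u32, &'a [u8]),
    NodePosition(u32, &'a [u8], Box<ParseError<'a>>),
}

// Hypothetical function form of the wrapping step: on failure, the new code
// and the current input become the head of the chain; successes pass through.
fn add_error<'a, O>(
    code: u32,
    input: &'a [u8],
    res: Result<O, ParseError<'a>>,
) -> Result<O, ParseError<'a>> {
    res.map_err(|e| ParseError::NodePosition(code, input, Box::new(e)))
}
```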
|
||||
|
||||
### Early returns
|
||||
|
||||
The `error!` macro does an **early return**: it will not pass the error to the parent parser like other combinators, but will directly do a `return`, thus exiting the function. It works a bit like the "cut" operator in Prolog, in that there is no backtracking.
|
||||
|
||||
If another `error!` call is present in the parent parsing chain, it will intercept the previously returned error, and wrap it with its own error code.
|
||||
|
||||
Here is how it works in practice:
|
||||
|
||||
```rust
|
||||
use std::collections;
|
||||
|
||||
named!(err_test, alt!(
|
||||
tag!("abcd") |
|
||||
preceded!(
|
||||
tag!("efgh"),
|
||||
error!(
|
||||
42,
|
||||
chain!(
|
||||
tag!("ijkl") ~
|
||||
res: error!(128, tag!("mnop")) ,
|
||||
|| { res }
|
||||
)
|
||||
)
|
||||
)
|
||||
));
|
||||
|
||||
let a = &b"efghblah"[..];
|
||||
let b = &b"efghijklblah"[..];
|
||||
|
||||
|
||||
let blah = &b"blah"[..];
|
||||
|
||||
let res_a = err_test(a);
|
||||
let res_b = err_test(b);
|
||||
|
||||
assert_eq!(res_a, Error(NodePosition(42, blah, Box::new(Position(0, blah)))));
|
||||
assert_eq!(res_b, Error(
|
||||
NodePosition(42, &b"ijklblah"[..],
|
||||
Box::new(NodePosition(128, blah,
|
||||
Box::new(Position(0, blah))
|
||||
))
|
||||
)
|
||||
));
|
||||
```
|
||||
|
||||
With this mechanism, you get a chain of error codes and the corresponding positions in the input slice. If the `error!` calls are strategically placed, they can give a lot of information about what happened during parsing.
|
||||
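The difference between backtracking and an early return can be sketched in plain Rust, without nom. Everything here is an assumption made for illustration: `strip` and this hand-written `err_test` are hypothetical, errors are reduced to a list of `u32` codes (outermost first), and a third alternative matching `"ef"` is added only to show that it is never tried once `"efgh"` has matched.

```rust
// Hypothetical helper: returns the input after `tag`, or None if it does not match.
fn strip<'a>(input: &'a [u8], tag: &[u8]) -> Option<&'a [u8]> {
    if input.starts_with(tag) {
        Some(&input[tag.len()..])
    } else {
        None
    }
}

// Ok(rest) on success, Err(codes) on failure.
fn err_test(input: &[u8]) -> Result<&[u8], Vec<u32>> {
    // First alternative: ordinary backtracking, a failure just falls through.
    if let Some(rest) = strip(input, b"abcd") {
        return Ok(rest);
    }
    // Second alternative: once "efgh" has matched, any failure returns at once.
    if let Some(rest) = strip(input, b"efgh") {
        let rest = match strip(rest, b"ijkl") {
            Some(r) => r,
            None => return Err(vec![42, 0]),
        };
        return match strip(rest, b"mnop") {
            Some(r) => Ok(r),
            None => Err(vec![42, 128, 0]),
        };
    }
    // Third alternative: only reached when "efgh" did not match at all.
    if let Some(rest) = strip(input, b"ef") {
        return Ok(rest);
    }
    Err(vec![0])
}
```

With the input `efghblah`, the third alternative would match the first two bytes, but the early return inside the second alternative means it is never consulted.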
|
||||
## Error pattern matching
|
||||
|
||||
Once you get a chain of errors with easily identifiable codes, you probably want to match on these to provide useful error messages. This is the intended use of nom's error types.
|
||||
|
||||
### Simple matching
|
||||
|
||||
The `error_to_list` function can gather all of the error codes in a vector. This vector is essentially a signature of the parsing path and will let you distinguish between the different parsing errors.
|
||||
|
||||
```rust
use nom::util::error_to_list;

fn error_to_string<P>(e: Err<P>) -> &'static str {
  // Flatten the error chain into its list of codes, outermost first
  let v: Vec<ErrorKind> = error_to_list(e);
  if &v[..] == [ErrorKind::Custom(42), ErrorKind::Tag] {
    "missing `ijkl` tag"
  } else if &v[..] == [ErrorKind::Custom(42), ErrorKind::Custom(128), ErrorKind::Tag] {
    "missing `mnop` tag after `ijkl`"
  } else {
    "unrecognized error"
  }
}
```
|
||||
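To make the flattening step concrete, here is a stand-alone sketch of what gathering the codes of an error chain involves. It does not use nom: `ParseError` is a local stand-in with `u32` codes and no input positions, and `codes` and `describe` are hypothetical names.

```rust
// Local stand-in for nom's `Err`: either a final code, or a code plus the next error.
#[derive(Debug)]
enum ParseError {
    Code(u32),
    Node(u32, Box<ParseError>),
}

// Walks the chain from the outermost error to the innermost one,
// collecting the codes in that order.
fn codes(e: &ParseError) -> Vec<u32> {
    let mut v = Vec::new();
    let mut current = e;
    loop {
        match current {
            ParseError::Code(c) => {
                v.push(*c);
                return v;
            }
            ParseError::Node(c, next) => {
                v.push(*c);
                current = &**next;
            }
        }
    }
}

// Compares the code list against known signatures to pick a message.
fn describe(e: &ParseError) -> &'static str {
    let v = codes(e);
    if v == [42, 0] {
        "missing `ijkl` tag"
    } else if v == [42, 128, 0] {
        "missing `mnop` tag after `ijkl`"
    } else {
        "unrecognized error"
    }
}
```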
|
||||
### With slice patterns
|
||||
|
||||
If you can use the *slice patterns* feature, you can easily match on errors this way:
|
||||
|
||||
```rust
#![feature(slice_patterns)]
use nom::util::error_to_list;

fn error_to_string<P>(e: Err<P>) -> &'static str {
  // Flatten the error chain into its list of codes, outermost first
  let v: Vec<ErrorKind> = error_to_list(e);
  match &v[..] {
    [ErrorKind::Custom(42), ErrorKind::Tag] => "missing `ijkl` tag",
    [ErrorKind::Custom(42), ErrorKind::Custom(128), ErrorKind::Tag] => "missing `mnop` tag after `ijkl`",
    _ => "unrecognized error"
  }
}
```
|
||||
|
||||
### With box patterns
|
||||
|
||||
If you can use box patterns, you can match directly on the error instead of filtering with `error_to_list`.
|
||||
|
||||
```rust
#![feature(box_patterns)]

use std::str;

// The input type is fixed to `&[u8]` so the slices can be passed to `str::from_utf8`
fn error_to_string(e: Err<&[u8]>) -> String {
  match e {
    NodePosition(ErrorKind::Custom(42), _, box Position(ErrorKind::Tag, i2)) => {
      format!("missing `ijkl` tag, found '{}' instead", str::from_utf8(i2).unwrap())
    },
    NodePosition(ErrorKind::Custom(42), _, box NodePosition(ErrorKind::Custom(128), _, box Position(ErrorKind::Tag, i3))) => {
      format!("missing `mnop` tag after `ijkl`, found '{}' instead", str::from_utf8(i3).unwrap())
    },
    _ => "unrecognized error".to_string()
  }
}
```
|
||||
|
||||
### Merr pattern matching
|
||||
|
||||
This error reporting approach comes from the [Merr](https://github.com/pippijn/merr) library in OCaml. Its basic idea is that if you match directly on the error codes to generate error messages, you will have to constantly update the matching code when you modify the grammar. So, instead of matching on static code lists (like the solutions above), you generate those lists from known wrong inputs, and associate them with the corresponding messages.
|
||||
|
||||
To do this in nom, you use the `add_error_pattern` function:
|
||||
|
||||
```rust
|
||||
fn add_error_pattern<'a,I,O>(h: &mut HashMap<Vec<ErrorKind>, &'a str>, res: IResult<I,O>, message: &'a str) -> bool
|
||||
```
|
||||
|
||||
It takes as arguments a mutable hashmap that will contain the correspondence between error code lists and error messages, a parser result, and an error message.
|
||||
|
||||
To use it, fill the hashmap, before parsing, with known bad inputs (if you work with binary data, the `include_bytes!` macro might help you there). You can then look up the message by using the result of `error_to_list` as the key of the hashmap.
|
||||
|
||||
```rust
|
||||
use nom::util::{add_error_pattern, error_to_list};
|
||||
|
||||
let mut err_map = collections::HashMap::new();
|
||||
add_error_pattern(
|
||||
&mut err_map,
|
||||
err_test(&b"efghpouet"[..]),
|
||||
"missing `ijkl` tag"
|
||||
);
|
||||
|
||||
add_error_pattern(
|
||||
&mut err_map,
|
||||
err_test(&b"efghijklpouet"[..]),
|
||||
"missing `mnop` tag after `ijkl`"
|
||||
);
|
||||
|
||||
if let IResult::Error(e) = err_test(&b"efghblah"[..]) {
|
||||
assert_eq!(err_map.get(&error_to_list(e)), Some(&"missing `ijkl` tag"));
|
||||
};
|
||||
|
||||
if let IResult::Error(e) = err_test(&b"efghijklblah"[..]) {
|
||||
assert_eq!(err_map.get(&error_to_list(e)), Some(&"missing `mnop` tag after `ijkl`"));
|
||||
};
|
||||
|
||||
```
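The same idea can be tried without nom. The sketch below is an assumption-laden illustration: `toy_parser` is a hand-written stand-in for `err_test` that returns its error codes as a `Vec<u32>`, and `add_pattern` and `explain` are hypothetical helpers, not nom functions.

```rust
use std::collections::HashMap;

// Toy parser: expects "efgh", "ijkl", "mnop" in order and returns
// the chain of error codes (outermost first) on failure.
fn toy_parser(input: &[u8]) -> Result<(), Vec<u32>> {
    if !input.starts_with(b"efgh") {
        return Err(vec![0]);
    }
    if !input[4..].starts_with(b"ijkl") {
        return Err(vec![42, 0]);
    }
    if !input[8..].starts_with(b"mnop") {
        return Err(vec![42, 128, 0]);
    }
    Ok(())
}

// Records the code list produced by a known bad input under `message`.
// Returns false if the sample unexpectedly parsed.
fn add_pattern(
    map: &mut HashMap<Vec<u32>, &'static str>,
    sample: &[u8],
    message: &'static str,
) -> bool {
    match toy_parser(sample) {
        Err(codes) => {
            map.insert(codes, message);
            true
        }
        Ok(()) => false,
    }
}

// Looks up the message for a failing input; None if it parses or is unknown.
fn explain(map: &HashMap<Vec<u32>, &'static str>, input: &[u8]) -> Option<&'static str> {
    toy_parser(input)
        .err()
        .and_then(|codes| map.get(&codes).copied())
}
```

The messages are keyed by whatever code list the bad samples produce, so changing the grammar only requires regenerating the map from the same samples.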
|
||||
|
||||
## Colored hexdump
|
||||
|
||||
Visual tools can sometimes help with format discovery. The error chain system gives a correspondence between codes and positions in the input, so displaying what input has been handled by which parser is possible.
|
||||
|
||||
Let's take a parser with a few more `error!` calls:
|
||||
|
||||
```rust
|
||||
named!(err_test, alt!(
|
||||
tag!("abcd") |
|
||||
error!(12,
|
||||
preceded!(tag!("efgh"), error!(42,
|
||||
chain!(
|
||||
tag!("ijk") ~
|
||||
res: error!(128, tag!("mnop")) ,
|
||||
|| { res }
|
||||
)
|
||||
)
|
||||
)
|
||||
)
|
||||
));
|
||||
```
|
||||
|
||||
We can then define the function `display_error` as follows:
|
||||
|
||||
```rust
|
||||
use nom::util::{generate_colors,prepare_errors,print_codes,print_offsets};
|
||||
|
||||
pub fn display_error<I,O>(input: &[u8], res: IResult<I,O>) {
|
||||
let mut h: HashMap<u32, &str> = HashMap::new();
|
||||
h.insert(12, "preceded");
|
||||
h.insert(42, "chain");
|
||||
h.insert(128, "tag mnop");
|
||||
h.insert(0, "tag");
|
||||
|
||||
if let Some(v) = prepare_errors(input, res) {
|
||||
let colors = generate_colors(&v);
|
||||
println!("parsers: {}", print_codes(colors, h));
|
||||
println!("{}", print_offsets(input, 0, &v));
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
We give names to the error codes, then make a map between error codes and ANSI colors. The `nom::util::print_codes` function shows this map inline.
|
||||
|
||||
The `nom::util::print_offsets` function will print the input data in hexadecimal format, with colors applying to different parts of the input.
|
||||
|
||||
As an example, for this call:
|
||||
|
||||
```rust
|
||||
let input = &b"efghijklblahblah"[..];
|
||||
|
||||
display_error(input, err_test(input));
|
||||
```
|
||||
|
||||
We get the following output:
|
||||

|
||||
@@ -0,0 +1,13 @@
|
||||
# Nom is an awesome parser combinator library in Rust
|
||||
|
||||
To get started using nom, you can include it in your Rust projects from
|
||||
[crates.io](https://crates.io/crates/nom). Here are a few links you will find useful:
|
||||
|
||||
* [Reference documentation](http://rust.unhandledexpression.com/nom/)
|
||||
* [Gitter chat room](https://gitter.im/Geal/nom). You can also go to the #nom IRC
|
||||
channel on irc.mozilla.org, or ping 'geal' on Mozilla, Freenode, Geeknode or oftc IRC
|
||||
* [Tutorial about parsing ISO8601 dates](https://fnordig.de/2015/07/16/omnomnom-parsing-iso8601-dates-using-nom/)
|
||||
* [Making a new parser from scratch](https://github.com/Geal/nom/blob/master/docs/making_a_new_parser_from_scratch.md)
|
||||
(general tips on writing a parser and code architecture)
|
||||
* [How to handle parser errors](https://github.com/Geal/nom/blob/master/docs/error_management.md)
|
||||
* [How nom's macro combinators work](https://github.com/Geal/nom/blob/master/docs/how_nom_macros_work.md)
|
||||
@@ -0,0 +1,168 @@
|
||||
nom uses Rust macros heavily to provide a nice syntax and generate parsing code. This has multiple advantages:
|
||||
|
||||
* it gives the appearance of combining functions without the runtime cost of closures
|
||||
* it helps Rust's type inference and borrow checking (fewer lifetime issues than iterator-based solutions)
|
||||
* the generated code is very linear, just a large chain of pattern matching
|
||||
|
||||
# Defining a new macro
|
||||
|
||||
Let's take the `opt!` macro as an example: `opt!` returns `IResult<I,Option<O>>`, producing a `Some(o)` if the child parser succeeded, and `None` otherwise. Here is how you could use it:
|
||||
|
||||
```rust
|
||||
named!(opt_tag<Option<&[u8]>>, opt!(digit));
|
||||
```
|
||||
|
||||
And here is how it is defined:
|
||||
|
||||
```rust
|
||||
#[macro_export]
|
||||
macro_rules! opt(
|
||||
($i:expr, $submac:ident!( $($args:tt)* )) => (
|
||||
{
|
||||
match $submac!($i, $($args)*) {
|
||||
$crate::IResult::Done(i,o) => $crate::IResult::Done(i, Some(o)),
|
||||
$crate::IResult::Error(_) => $crate::IResult::Done($i, None),
|
||||
$crate::IResult::Incomplete(_) => $crate::IResult::Done($i, None)
|
||||
}
|
||||
}
|
||||
);
|
||||
($i:expr, $f:expr) => (
|
||||
opt!($i, call!($f));
|
||||
);
|
||||
);
|
||||
```
|
||||
|
||||
To define a Rust macro, you indicate the name of the macro, then each pattern it is meant to apply to:
|
||||
|
||||
```rust
|
||||
macro_rules! my_macro (
|
||||
(<pattern1>) => ( <generated code for pattern1> );
|
||||
(<pattern2>) => ( <generated code for pattern2> );
|
||||
);
|
||||
```
|
||||
|
||||
## Passing input
|
||||
|
||||
The first thing you can see in `opt!` is that the patterns have an additional parameter that you do not use:
|
||||
|
||||
```rust
|
||||
($i:expr, $f:expr)
|
||||
```
|
||||
|
||||
while you call:
|
||||
|
||||
```rust
|
||||
opt!(digit)
|
||||
```
|
||||
|
||||
This is the first trick of nom macros: the first parameter, usually `$i` or `$input`, is the input data, passed by the parent parser. The expression using `named!` will translate like this:
|
||||
|
||||
```rust
|
||||
named!(opt_tag<Option<&[u8]>>, opt!(digit));
|
||||
```
|
||||
|
||||
to
|
||||
|
||||
```rust
|
||||
fn opt_tag(input:&[u8]) -> IResult<&[u8], Option<&[u8]>> {
|
||||
opt!(input, digit)
|
||||
}
|
||||
```
|
||||
|
||||
This is how combinators hide all the plumbing: they receive the input automatically from the parent parser, may use that input, and pass the remaining input to the child parser.
|
||||
|
||||
When you have multiple submacros, such as this example, the input is always passed to the first, top level combinator:
|
||||
|
||||
```rust
|
||||
macro_rules! multispaced (
|
||||
($i:expr, $submac:ident!( $($args:tt)* )) => (
|
||||
delimited!($i, opt!(multispace), $submac!($($args)*), opt!(multispace));
|
||||
);
|
||||
($i:expr, $f:expr) => (
|
||||
multispaced!($i, call!($f));
|
||||
);
|
||||
);
|
||||
```
|
||||
|
||||
Here, `delimited!` will apply `opt!(multispace)` on the input, and if successful, will apply `$submac!($($args)*)` on the remaining input, and if successful, store the output and apply `opt!(multispace)` on the remaining input.
|
||||
|
||||
## Applying on macros or functions
|
||||
|
||||
The second trick you can see is the two patterns:
|
||||
|
||||
```rust
|
||||
#[macro_export]
|
||||
macro_rules! opt(
|
||||
($i:expr, $submac:ident!( $($args:tt)* )) => (
|
||||
[...]
|
||||
);
|
||||
($i:expr, $f:expr) => (
|
||||
opt!($i, call!($f));
|
||||
);
|
||||
);
|
||||
```
|
||||
|
||||
The first pattern is used to receive a macro as child parser, like this:
|
||||
|
||||
```rust
|
||||
opt!(tag!("abcd"))
|
||||
```
|
||||
|
||||
The second pattern can receive a function, wraps it in a `call!` macro invocation, then calls `opt!` again. This is done to avoid repeating code. For example, applying `opt!` with `digit` as argument starts from this:
|
||||
|
||||
```rust
|
||||
opt!(digit)
|
||||
```
|
||||
|
||||
and is transformed by the second pattern into:
|
||||
|
||||
```rust
|
||||
opt!(call!(digit))
|
||||
```
|
||||
|
||||
The `call!` macro transforms `call!(input, f)` into `f(input)`. If you need to pass more parameters to the function, you can use `call!(input, f, arg, arg2)` to get `f(input, arg, arg2)`.
|
||||
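The two tricks described so far (the hidden input parameter and the macro-or-function patterns) can be tried in isolation. The following is a reduced sketch, not nom's code: `IResult` is a local two-variant stand-in, the `call!`, `tag!` and `opt!` macros are simplified re-definitions written for this example, and `digit`, `opt_digit` and `opt_abcd` are hypothetical parsers working on `&str`.

```rust
// Local stand-in for nom's result type.
#[derive(Debug, PartialEq)]
enum IResult<I, O> {
    Done(I, O),
    Error,
}

// Simplified `call!`: the input comes first, then the function to apply.
macro_rules! call (
    ($i:expr, $f:expr) => ( $f($i) );
);

// Simplified `tag!` for string input.
macro_rules! tag (
    ($i:expr, $t:expr) => (
        if $i.starts_with($t) {
            IResult::Done(&$i[$t.len()..], &$i[..$t.len()])
        } else {
            IResult::Error
        }
    );
);

// Simplified `opt!` with the same two patterns as the definition above.
macro_rules! opt (
    // Child parser given as a macro call: the input is forwarded as its first argument.
    ($i:expr, $submac:ident!( $($args:tt)* )) => (
        match $submac!($i, $($args)*) {
            IResult::Done(i, o) => IResult::Done(i, Some(o)),
            IResult::Error => IResult::Done($i, None),
        }
    );
    // Child parser given as a function: wrap it in `call!` and start again.
    ($i:expr, $f:expr) => (
        opt!($i, call!($f))
    );
);

// Function parser: recognizes a single ASCII digit.
fn digit(input: &str) -> IResult<&str, char> {
    match input.chars().next() {
        Some(c) if c.is_ascii_digit() => IResult::Done(&input[1..], c),
        _ => IResult::Error,
    }
}

// Uses the second pattern (function child).
fn opt_digit(input: &str) -> IResult<&str, Option<char>> {
    opt!(input, digit)
}

// Uses the first pattern (macro child).
fn opt_abcd(input: &str) -> IResult<&str, Option<&str>> {
    opt!(input, tag!("abcd"))
}
```

In both functions the caller writes the input once, and the macros thread it down to the child parser.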
|
||||
## Using the macro's parameters
|
||||
|
||||
The macro argument is decomposed into `$submac:ident!`, the macro's name and a bang, and `( $($args:tt)* )`, the tokens contained between the parentheses of the macro call.
|
||||
|
||||
```rust
|
||||
($i:expr, $submac:ident!( $($args:tt)* )) => (
|
||||
{
|
||||
match $submac!($i, $($args)*) {
|
||||
$crate::IResult::Done(i,o) => $crate::IResult::Done(i, Some(o)),
|
||||
$crate::IResult::Error(_) => $crate::IResult::Done($i, None),
|
||||
$crate::IResult::Incomplete(_) => $crate::IResult::Done($i, None)
|
||||
}
|
||||
}
|
||||
);
|
||||
```
|
||||
|
||||
The macro is called with the input we got, as first argument, then we pattern match on the result. Every combinator or parser must return an `IResult`, so you know what patterns you need to verify. If you need to call two parsers in a sequence, use the first parameter of `IResult::Done(i,o)`: it is the input remaining after the first parser was applied.
|
||||
|
||||
As an example, see how the `preceded!` macro works:
|
||||
|
||||
```rust
|
||||
($i:expr, $submac:ident!( $($args:tt)* ), $submac2:ident!( $($args2:tt)* )) => (
|
||||
{
|
||||
match $submac!($i, $($args)*) {
|
||||
$crate::IResult::Error(a) => $crate::IResult::Error(a),
|
||||
$crate::IResult::Incomplete(i) => $crate::IResult::Incomplete(i),
|
||||
$crate::IResult::Done(i1,_) => {
|
||||
match $submac2!(i1, $($args2)*) {
|
||||
$crate::IResult::Error(a) => $crate::IResult::Error(a),
|
||||
$crate::IResult::Incomplete(i) => $crate::IResult::Incomplete(i),
|
||||
$crate::IResult::Done(i2,o2) => {
|
||||
$crate::IResult::Done(i2, o2)
|
||||
}
|
||||
}
|
||||
},
|
||||
}
|
||||
}
|
||||
);
|
||||
```
|
||||
|
||||
It applies the first parser and, if it succeeds, discards its result and applies the second parser to the remaining input `i1`.
|
||||
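The same sequencing can be written as an ordinary generic function, which may be easier to read than the macro. This is a sketch, not nom's implementation: `IResult` is a local stand-in without error payloads, and `preceded`, `hello`, `take5` and `prefixed` are hypothetical functions defined for this example.

```rust
// Local stand-in for nom's result type, without error payloads.
#[derive(Debug, PartialEq)]
enum IResult<I, O> {
    Done(I, O),
    Error,
    Incomplete,
}

// Function-based sketch of the sequencing done by `preceded!`.
fn preceded<I, O1, O2, F, G>(input: I, first: F, second: G) -> IResult<I, O2>
where
    F: Fn(I) -> IResult<I, O1>,
    G: Fn(I) -> IResult<I, O2>,
{
    match first(input) {
        IResult::Error => IResult::Error,
        IResult::Incomplete => IResult::Incomplete,
        // The first output is dropped; only the remaining input `i1` is kept.
        IResult::Done(i1, _) => second(i1),
    }
}

// Recognizes the bytes "hello".
fn hello(input: &[u8]) -> IResult<&[u8], &[u8]> {
    let expected = b"hello";
    if input.len() < expected.len() {
        IResult::Incomplete
    } else if input.starts_with(expected) {
        IResult::Done(&input[expected.len()..], &input[..expected.len()])
    } else {
        IResult::Error
    }
}

// Takes the next 5 bytes.
fn take5(input: &[u8]) -> IResult<&[u8], &[u8]> {
    if input.len() < 5 {
        IResult::Incomplete
    } else {
        IResult::Done(&input[5..], &input[..5])
    }
}

// "hello" followed by 5 bytes; only the 5 bytes are returned.
fn prefixed(input: &[u8]) -> IResult<&[u8], &[u8]> {
    preceded(input, hello, take5)
}
```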
|
||||
If you need more tips, please refer to [the little book of Rust macros](https://danielkeep.github.io/tlborm/book/README.html).
|
||||
|
||||
@@ -0,0 +1,248 @@
|
||||
Writing a parser is a very fun, interactive process, but sometimes a daunting task. How do you test it? How do you spot ambiguities in specifications?
|
||||
|
||||
nom is designed to abstract data manipulation (counting array offsets, converting to structures, etc) while providing a safe, composable API. It also takes care of making the code easy to test and read, but it can be confusing at first, if you are not familiar with parser combinators, or if you are not used to Rust macros.
|
||||
|
||||
This document is here to help you in getting started with nom. If you need more specific help, please ping `geal` on IRC (mozilla, freenode, geeknode, oftc), go to `#nom` on Mozilla IRC, or on the [Gitter chat room](https://gitter.im/Geal/nom).
|
||||
|
||||
# First step: the initial research
|
||||
|
||||
A big part of the initial work lies in accumulating enough documentation and samples to understand the format. The specification is useful, but specifications represent an "official" point of view, that may not be the real world usage. Any blog post or open source code is useful, because it shows how people understand the format, and how they work around each other's bugs (if you think a specification ensures every implementation is consistent with the others, think again).
|
||||
|
||||
You should get a lot of samples (file or network traces) to test your code. The easy way is to use a small number of samples coming from the same source and develop everything around them, only to realize later that they share a very specific bug.
|
||||
|
||||
# Code organization
|
||||
|
||||
While it is tempting to insert the parsing code right inside the rest of the logic, it usually results in unmaintainable code, and makes testing challenging. Parser combinators, the parsing technique used in nom, assemble a lot of small functions to make powerful parsers. This means that those functions only depend on their input, not on an external state. This makes it easy to parse the input partially, and to test those functions independently.
|
||||
|
||||
Usually, you can separate the parsing functions in their own module, so you could have a `src/lib.rs` file containing this:
|
||||
|
||||
```rust
#[macro_use]
extern crate nom;

pub mod parser;
```
|
||||
|
||||
You can then use the functions and structures from `parser` there. The `src/parser.rs` file would then import nom functions and structures if needed:
|
||||
|
||||
```rust
|
||||
use nom::{be_u16, be_u32};
|
||||
```
|
||||
|
||||
# Writing a first parser
|
||||
|
||||
Let's parse a simple expression like `(12345)`. nom parsers are functions that use the `nom::IResult` type everywhere. As an example, a parser taking a byte slice `&[u8]` and returning a 32-bit unsigned integer `u32` would have this signature: `fn parse_u32(input: &[u8]) -> IResult<&[u8], u32>`.
|
||||
|
||||
The `IResult` type depends on the input and output types, and an optional custom error type. This enum can either contain `Done(i,o)` containing the remaining input and the output value, an error, or an indication that more data is needed.
|
||||
|
||||
```rust
|
||||
#[derive(Debug,PartialEq,Eq,Clone)]
|
||||
pub enum IResult<I,O,E=u32> {
|
||||
Done(I,O),
|
||||
Error(Err<I,E>),
|
||||
Incomplete(Needed)
|
||||
}
|
||||
```
|
||||
|
||||
nom uses this type everywhere. Every combination of parsers will pattern match on this to know if it must return a value, an error, consume more data, etc. But this is done behind the scenes most of the time.
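When you do handle the result yourself, the matching looks like the sketch below. It is self-contained and makes several assumptions: `IResult` and `Needed` are reduced local copies (with a plain `u32` error and a `usize` size), `parse_u32` reads a big-endian integer purely as an example, and `report` is a hypothetical caller.

```rust
// Reduced local copies of nom's types, for illustration only.
#[derive(Debug, PartialEq)]
enum Needed {
    Unknown,
    Size(usize),
}

#[derive(Debug, PartialEq)]
enum IResult<I, O> {
    Done(I, O),
    Error(u32),
    Incomplete(Needed),
}

// Example parser with the signature given above: reads 4 bytes as a
// big-endian u32 (the byte order is an assumption of this sketch).
fn parse_u32(input: &[u8]) -> IResult<&[u8], u32> {
    if input.len() < 4 {
        return IResult::Incomplete(Needed::Size(4));
    }
    let value = u32::from_be_bytes([input[0], input[1], input[2], input[3]]);
    IResult::Done(&input[4..], value)
}

// Hypothetical caller: every variant of the result is handled explicitly.
fn report(input: &[u8]) -> String {
    match parse_u32(input) {
        IResult::Done(rest, value) => {
            format!("value {} with {} bytes left", value, rest.len())
        }
        IResult::Error(code) => format!("error code {}", code),
        IResult::Incomplete(Needed::Size(n)) => format!("need {} bytes", n),
        IResult::Incomplete(Needed::Unknown) => "need more data".to_string(),
    }
}
```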
|
||||
|
||||
nom provides a macro for function definition, called `named!`:
|
||||
|
||||
```rust
|
||||
named!(my_function( &[u8] ) -> &[u8], tag!("abcd"));
|
||||
|
||||
named!(my_function<&[u8], &[u8]>, tag!("abcd"));
|
||||
|
||||
named!(my_function, tag!("abcd"));
|
||||
```
|
||||
|
||||
But you could as easily define the function yourself like this:
|
||||
|
||||
```rust
|
||||
fn my_function(input: &[u8]) -> IResult<&[u8], &[u8]> {
|
||||
tag!(input, "abcd")
|
||||
}
|
||||
```
|
||||
|
||||
Note that we pass the input to the first parser in the manual definition, while we do not when we use `named!`. This is a macro trick specific to nom: every parser takes the input as first parameter, and the macros take care of giving the remaining input to the next parser. As an example, take a simple parser like the following one, which recognizes the word "hello" then takes the next 5 bytes:
|
||||
|
||||
```rust
|
||||
named!(prefixed, preceded!(tag!("hello"), take!(5)));
|
||||
```
|
||||
|
||||
Once the macros have expanded, this would correspond to:
|
||||
|
||||
```rust
|
||||
fn prefixed(i: &[u8]) -> ::nom::IResult<&[u8], &[u8]> {
|
||||
{
|
||||
match {
|
||||
#[inline(always)]
|
||||
fn as_bytes<T: ::nom::AsBytes>(b: &T) -> &[u8] {
|
||||
b.as_bytes()
|
||||
}
|
||||
let expected = "hello";
|
||||
let bytes = as_bytes(&expected);
|
||||
{
|
||||
let res: ::nom::IResult<&[u8], &[u8]> =
|
||||
if bytes.len() > i.len() {
|
||||
::nom::IResult::Incomplete(::nom::Needed::Size(bytes.len()))
|
||||
} else if &i[0..bytes.len()] == bytes {
|
||||
::nom::IResult::Done(&i[bytes.len()..],
|
||||
&i[0..bytes.len()])
|
||||
} else {
|
||||
::nom::IResult::Error(::nom::Err::Position(::nom::ErrorKind::Tag,
|
||||
i))
|
||||
};
|
||||
res
|
||||
}
|
||||
} {
|
||||
::nom::IResult::Error(a) => ::nom::IResult::Error(a),
|
||||
::nom::IResult::Incomplete(i) => ::nom::IResult::Incomplete(i),
|
||||
::nom::IResult::Done(i1, _) => {
|
||||
match {
|
||||
let cnt = 5 as usize;
|
||||
let res: ::nom::IResult<&[u8], &[u8]> =
|
||||
if i1.len() < cnt {
|
||||
::nom::IResult::Incomplete(::nom::Needed::Size(cnt))
|
||||
} else {
|
||||
::nom::IResult::Done(&i1[cnt..],
|
||||
&i1[0..cnt])
|
||||
};
|
||||
res
|
||||
} {
|
||||
::nom::IResult::Error(a) => ::nom::IResult::Error(a),
|
||||
::nom::IResult::Incomplete(i) =>
|
||||
::nom::IResult::Incomplete(i),
|
||||
::nom::IResult::Done(i2, o2) => {
|
||||
::nom::IResult::Done(i2, o2)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
While this may look like a lot of code, the compiler and the CPU will happily optimize everything, do not worry. You can see that the function matches on the result of the first parser, stops there if it returned an error or incomplete, and if it returned a value, takes the remaining input `i1`, applies the second parser on it, then matches on the result (and returns the value `o2` and the remaining input `i2`).
|
||||
|
||||
A lot of complex patterns are implemented that way: generic macros combining other macros or functions. This will handle partial consumption and passing data slices for you.
|
||||
|
||||
Since it is easy to combine small parsers, I encourage you to write small functions corresponding to specific parts of the format, test them independently, then combine them in more general parsers.
|
||||
|
||||
# Finding the right combinator
|
||||
|
||||
nom has a lot of different combinators, depending on the use case. They are all described in the [reference](http://rust.unhandledexpression.com/nom/).
|
||||
|
||||
[Basic functions](http://rust.unhandledexpression.com/nom/#functions) are available. They deal mostly in recognizing character types, like `alphanumeric` or `digit`. They also parse big endian and little endian integers and floats of multiple sizes.
|
||||
|
||||
Most of the macros are there to combine parsers, and they do not depend on the input type. This is the case for all of those defined in [src/macros.rs](https://github.com/Geal/nom/blob/master/src/macros.rs). The reference indicates a [possible type signature](http://rust.unhandledexpression.com/nom/#macros) for what the macros expect and return. In case of doubt, the documentation often indicates a [code example](http://rust.unhandledexpression.com/nom/macro.many0!.html) after the macro definition.
|
||||
|
||||
## Type specific combinators
|
||||
|
||||
Byte slice related macros can be found in [src/bytes.rs](https://github.com/Geal/nom/blob/master/src/bytes.rs). This file contains the following combinators: `tag!`, `is_not!`, `is_a!`, `escaped!`, `escaped_transform!`, `take_while!`, `take_while1!`, `take_till!`, `take!`, `take_str!`, `take_until_and_consume!`, `take_until_either!`, `take_until_either_and_consume!`.
|
||||
|
||||
Bit stream related macros are in [src/bits.rs](https://github.com/Geal/nom/blob/master/src/bits.rs).
|
||||
|
||||
Character related macros are in [src/character.rs](https://github.com/Geal/nom/blob/master/src/character.rs).
|
||||
|
||||
Regular expression related macros are in [src/regexp.rs](https://github.com/Geal/nom/blob/master/src/regexp.rs).
|
||||
|
||||
# Testing the parsers
|
||||
|
||||
Once you have a parser function, a good trick is to test it on a lot of the samples you gathered, and integrate this into your unit tests. To that end, put all of the test files in a folder like `assets` and refer to test files like this:
|
||||
|
||||
```rust
#[test]
fn header_test() {
  let data = include_bytes!("../assets/axolotl-piano.gif");
  println!("bytes:\n{}", &data[0..100].to_hex(8));
  let res = header(data);
  // assertions on `res` are elided here
}
```
|
||||
|
||||
The `include_bytes!` macro (provided by Rust's standard library) will integrate the file as a byte slice in your code. You can then just refer to the part of the input the parser has to handle via its offset. Here, we take the first 100 bytes of a GIF file to parse its header (complete code [here](https://github.com/Geal/gif.rs/blob/master/src/parser.rs#L305-L309)).
|
||||
|
||||
If your parser handles textual data, you can just use a lot of strings directly in the test, like this:
|
||||
|
||||
```rust
#[test]
fn factor_test() {
    assert_eq!(factor(&b"3"[..]), IResult::Done(&b""[..], 3));
    assert_eq!(factor(&b" 12"[..]), IResult::Done(&b""[..], 12));
    assert_eq!(factor(&b"537 "[..]), IResult::Done(&b""[..], 537));
    assert_eq!(factor(&b" 24 "[..]), IResult::Done(&b""[..], 24));
}
```

The more samples and test cases you get, the more you can experiment with your parser design.

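
As the number of samples grows, it can help to keep them in a table and loop over it, so that adding a case is a one-line change. The following is a minimal sketch that compiles without nom: `IResult` here is a reduced local stand-in for nom's type, and `factor` and `leading_spaces` are hypothetical hand-written functions (optional spaces, decimal digits, optional spaces), not the parser from the arithmetic example.

```rust
// Reduced stand-in for nom 1.x's `IResult` (assumption: only two outcomes needed).
#[derive(Debug, PartialEq)]
enum IResult<I, O> {
    Done(I, O),
    Error,
}

// Number of leading ASCII spaces in `i`.
fn leading_spaces(i: &[u8]) -> usize {
    i.iter().take_while(|&&b| b == b' ').count()
}

// Hypothetical `factor`: skips spaces, reads decimal digits, skips spaces again.
fn factor(input: &[u8]) -> IResult<&[u8], i64> {
    let rest = &input[leading_spaces(input)..];
    let digits = rest.iter().take_while(|b| b.is_ascii_digit()).count();
    if digits == 0 {
        return IResult::Error;
    }
    let value = rest[..digits]
        .iter()
        .fold(0i64, |acc, &b| acc * 10 + i64::from(b - b'0'));
    let rest = &rest[digits..];
    IResult::Done(&rest[leading_spaces(rest)..], value)
}

#[test]
fn factor_samples() {
    // (input, expected remaining input, expected value)
    let samples: [(&[u8], &[u8], i64); 4] = [
        (&b"3"[..], &b""[..], 3),
        (&b" 12"[..], &b""[..], 12),
        (&b"537 "[..], &b""[..], 537),
        (&b" 24 "[..], &b""[..], 24),
    ];
    for (input, rest, value) in samples.iter() {
        // The message names the failing sample, which a bare assert would not.
        assert_eq!(factor(input), IResult::Done(*rest, *value), "input: {:?}", input);
    }
}
```

The same table can later be filled from files in `assets` if the samples become too large to inline.
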
# Debugging the parsers

While Rust macros are really useful to get a simpler syntax, they can sometimes give cryptic errors. As an example, `named!(manytag, many0!(take!(5)));` would result in the following error:

```
<nom macros>:6:38: 6:41 error: mismatched types:
 expected `&[u8]`,
    found `collections::vec::Vec<&[u8]>`
(expected &-ptr,
    found struct `collections::vec::Vec`) [E0308]
<nom macros>:6 } $ crate:: IResult:: Done ( input , res ) } ) ; ( $ i : expr , $ f : expr )
                                                    ^~~
<nom macros>:20:1: 20:34 note: in this expansion of many0! (defined in <nom macros>)
tests/arithmetic.rs:13:1: 13:35 note: in this expansion of named! (defined in <nom macros>)
<nom macros>:6:38: 6:41 help: run `rustc --explain E0308` to see a detailed explanation
error: aborting due to previous error
```

This particular one is caused by `named!` generating a function returning an `IResult< &[u8], &[u8] >`, while `many0!(take!(5))` returns an `IResult< &[u8], Vec<&[u8]> >`. Declaring the output type explicitly, as in `named!(manytag< Vec<&[u8]> >, many0!(take!(5)));`, resolves the mismatch.

There are a few tools you can use to debug how code is generated.

## trace_macros

The `trace_macros` feature shows how macros are applied. To use it, add `#![feature(trace_macros)]` at the top of your file (you need Rust nightly for this), then apply it like this:

```rust
trace_macros!(true);
named!(manytag, many0!(take!(5)));
trace_macros!(false);
```

It will result in the following output during compilation:

```
named! { manytag , many0 ! ( take ! ( 5 ) ) }
many0! { i , take ! ( 5 ) }
take! { input , 5 }
```

## Pretty printing

`rustc` can show how code is expanded with the option `--pretty=expanded`. If you want to use it with cargo, use the following command line: `cargo rustc <cargo options> -- -Z unstable-options --pretty=expanded`

It will print the `manytag` function like this:

```rust
fn manytag(i: &[u8]) -> ::nom::IResult<&[u8], &[u8]> {
    {
        let mut res = Vec::new();
        let mut input = i;
        while let ::nom::IResult::Done(i, o) =
                      {
                          let cnt = 5 as usize;
                          let res: ::nom::IResult<&[u8], &[u8]> =
                              if input.len() < cnt {
                                  ::nom::IResult::Incomplete(::nom::Needed::Size(cnt))
                              } else {
                                  ::nom::IResult::Done(&input[cnt..],
                                                       &input[0..cnt])
                              };
                          res
                      } {
            if i.len() == input.len() { break ; }
            res.push(o);
            input = i;
        }
        ::nom::IResult::Done(input, res)
    }
}
```

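
To see why the return type is the problem, the expansion of `manytag` can be rewritten by hand with the output declared as `Vec<&[u8]>`. This is only a sketch under stated assumptions: `IResult` and `Needed` are reduced local stand-ins so the code compiles without nom, and `take5` is a hypothetical helper playing the role of `take!(5)`. It is not the code nom actually generates.

```rust
// Local stand-ins for nom 1.x types (assumption: trimmed to what the loop uses).
#[derive(Debug, PartialEq)]
enum Needed {
    Size(usize),
}

#[derive(Debug, PartialEq)]
enum IResult<I, O> {
    Done(I, O),
    Incomplete(Needed),
}

// Hypothetical helper doing what `take!(5)` does: split off the first 5 bytes.
fn take5(input: &[u8]) -> IResult<&[u8], &[u8]> {
    let cnt = 5usize;
    if input.len() < cnt {
        IResult::Incomplete(Needed::Size(cnt))
    } else {
        IResult::Done(&input[cnt..], &input[..cnt])
    }
}

// Same loop as the `many0!` expansion, but the output type is `Vec<&[u8]>`,
// which matches what is pushed into `res`.
fn manytag(i: &[u8]) -> IResult<&[u8], Vec<&[u8]>> {
    let mut res = Vec::new();
    let mut input = i;
    while let IResult::Done(rest, o) = take5(input) {
        // Stop if the child parser consumed nothing, to avoid looping forever.
        if rest.len() == input.len() {
            break;
        }
        res.push(o);
        input = rest;
    }
    IResult::Done(input, res)
}
```

With the output type and the accumulated value in agreement, the function type-checks; in nom itself the equivalent fix is to spell out the output type in `named!`.
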
@@ -0,0 +1,65 @@
The 1.0 release of nom is one of the biggest since the beginning of the project. Its goal was to rework some core parts to be more flexible, and to clean up code that was awkward or unclear. This resulted in breaking changes that I hope will not happen again in the future (but hey, we are Rust developers, breaking changes are FUN for us!).

Here are a few tips to update your code to run with nom 1.0:

# Error typing

`nom::Err` now depends on two generic types, the position `P` and the error type `E`:

```rust
pub enum Err<P, E = u32> {
    Code(ErrorKind<E>),
    Node(ErrorKind<E>, Box<Err<P, E>>),
    Position(ErrorKind<E>, P),
    NodePosition(ErrorKind<E>, P, Box<Err<P, E>>)
}
```

The default error type is `u32` to keep some compatibility with older code. To update your code, the first step is to **replace all usages of `nom::ErrorCode` with `nom::ErrorKind`**. `ErrorKind` is now an enum that contains the same variants as the previous `ErrorCode`, with an additional generic parameter:

```rust
pub enum ErrorKind<E = u32> {
    Custom(E),
    Tag,
    MapRes,
    MapOpt,
    Alt,
    // [...]
}
```

`ErrorKind::Custom` is where you will store your custom error type. Note that default nom parsers like `alphabetic` use `u32` as the custom type, so you may need to translate the error types coming from those parsers like this:

```rust
fix_error!(CustomErrorType, alphabetic)
```

Since the error type is now an enum instead of a `u32`, you can now **replace any `ErrorCode::Tag as u32` with `ErrorKind::Tag`**.

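
To make the role of the two generic parameters concrete, here is a small sketch of how calling code might pull a custom error out of a nested error value. It is self-contained and makes several assumptions: `ErrorKind` and `NomErr` are local, trimmed copies of the enums shown above (`NomErr` is renamed so that it does not shadow the standard `Err` variant), and `MyError` and `first_custom` are hypothetical names that nom does not provide.

```rust
// Local copy of `ErrorKind`, trimmed to three variants (assumption).
#[derive(Debug, PartialEq)]
enum ErrorKind<E = u32> {
    Custom(E),
    Tag,
    Alt,
}

// Local copy of `nom::Err`, renamed to avoid shadowing `Result::Err`.
#[derive(Debug, PartialEq)]
enum NomErr<P, E = u32> {
    Code(ErrorKind<E>),
    Node(ErrorKind<E>, Box<NomErr<P, E>>),
    Position(ErrorKind<E>, P),
    NodePosition(ErrorKind<E>, P, Box<NomErr<P, E>>),
}

// Hypothetical application-level error type stored in `ErrorKind::Custom`.
#[derive(Debug, PartialEq, Clone, Copy)]
enum MyError {
    BadHeader,
    BadLength,
}

// Walk the error chain from the outermost node inwards and return the first
// custom error found, if any.
fn first_custom<P, E: Copy>(err: &NomErr<P, E>) -> Option<E> {
    match err {
        NomErr::Code(ErrorKind::Custom(e))
        | NomErr::Position(ErrorKind::Custom(e), _)
        | NomErr::Node(ErrorKind::Custom(e), _)
        | NomErr::NodePosition(ErrorKind::Custom(e), _, _) => Some(*e),
        // Not a custom error at this level: look at the child error.
        NomErr::Node(_, child) | NomErr::NodePosition(_, _, child) => first_custom(child),
        // Leaf without a custom error.
        NomErr::Code(_) | NomErr::Position(_, _) => None,
    }
}
```

Because the custom type is an ordinary enum, the caller can match on it directly instead of comparing `u32` codes.
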
# Lifetime elision

The error type is now completely generic over the input type, so the lifetime that appeared in `IResult` is not necessary anymore. Function declarations change from this:

```rust
fn parse_status<'a>(i: &'a [u8]) -> IResult<'a, &'a [u8], Status>

// To this:
fn parse_status(i: &[u8]) -> IResult<&[u8], Status>
```

# Producers and consumers

The old implementation was not flexible, and a bit slow (because of allocations). The new implementation can be driven more precisely outside of the consumer, step by step if needed, can return a result, has custom error types, and can combine consumers. You can see [an example in the repository](https://github.com/Geal/nom/blob/master/tests/omnom.rs#).

# Changes around `Incomplete`

* `chain!` will now count how much data has been consumed before a child parser returns `Incomplete`, and return an `Incomplete` with the added data size
* an optional parser (in `opt!` or `chain!`) will return `Incomplete` if the child parser returned `Incomplete`, instead of stopping there. This is the correct behaviour, because the result will be the same if the data comes in chunks or complete from the start
* `alt!` will now return `Incomplete` if one of its alternatives returns `Incomplete` instead of skipping to the next branch

In the cases where you know that the data you get is complete, you can wrap a parser with `complete!`. This combinator will transform `Incomplete` into an `Error`.

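
The effect of `complete!` can be modelled in a few lines of plain Rust. The sketch below is an illustration only, not nom's implementation: `IResult` is a simplified local stand-in (its `Incomplete` variant carries a plain byte count instead of nom's `Needed`), and `tag_abcd` and `complete` are hypothetical functions written for this example.

```rust
// Simplified stand-in for nom 1.x's `IResult` (assumption: no error payload).
#[derive(Debug, PartialEq)]
enum IResult<I, O> {
    Done(I, O),
    Error,
    Incomplete(usize), // total number of bytes the parser needs
}

// Hypothetical parser recognizing the 4-byte tag "abcd".
fn tag_abcd(input: &[u8]) -> IResult<&[u8], &[u8]> {
    let tag: &[u8] = b"abcd";
    if input.len() < tag.len() {
        // Not enough data to decide yet.
        IResult::Incomplete(tag.len())
    } else if input.starts_with(tag) {
        IResult::Done(&input[tag.len()..], &input[..tag.len()])
    } else {
        IResult::Error
    }
}

// Model of `complete!`: when no more data will arrive, `Incomplete` is a failure.
fn complete<I, O>(res: IResult<I, O>) -> IResult<I, O> {
    match res {
        IResult::Incomplete(_) => IResult::Error,
        other => other,
    }
}
```

On the input `b"ab"`, `tag_abcd` alone reports `Incomplete(4)`, while `complete(tag_abcd(..))` reports `Error`; successful and failing results pass through unchanged.
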
# Other changes

`filter!` has been renamed to `take_while!`.