Use big-endian match/replace for both blobs and ints.
Sometimes we have unmarked blobs (no little/big-endian info);
for ANYBLOBs we intentionally lose all marking;
but even for marked ints we may need this too.
Consider kernel code that does not convert the data
(i.e. not ntohs(pkt->proto) == ETH_P_BATMAN),
but instead converts the constant (i.e. pkt->proto == htons(ETH_P_BATMAN)).
In such a case we will see a dynamic operand that does not
match what we have in the program.
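A minimal sketch of what byte-order-insensitive matching could look like (hypothetical helper, not the actual syzkaller code):

```go
package prog

// matchesEitherEndian reports whether a dynamic operand observed in the
// kernel (val) corresponds to a value in the program (progVal) in either
// byte order. Hypothetical helper for illustration only.
func matchesEitherEndian(val, progVal uint64, size int) bool {
	return val == progVal || val == swapBytes(progVal, size)
}

// swapBytes reverses the order of the low `size` bytes of v.
func swapBytes(v uint64, size int) uint64 {
	var r uint64
	for i := 0; i < size; i++ {
		r = r<<8 | v&0xff
		v >>= 8
	}
	return r
}
```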
Even during mutation of a single call we want to analyze the whole program
to find all used addresses (rather than stopping at the selected call).
Also update the address during ANY mutation if the size has increased.
Squash complex structs into a flat byte array and mutate this array
with generic blob mutations. This allows us to mutate what we currently
consider padding, to add/remove padding in structs, etc.
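A rough sketch of the idea (illustrative names, not the real API): flatten the typed representation into bytes, then apply an untyped mutation anywhere in the blob, without respecting field boundaries:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"math/rand"
)

// Example struct standing in for a complex syscall argument.
type sockaddrIn struct {
	Family uint16
	Port   uint16
	Addr   uint32
}

func main() {
	s := sockaddrIn{Family: 2, Port: 0x1234, Addr: 0x7f000001}
	// Squash the typed value into a flat byte array.
	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, s)
	blob := buf.Bytes()
	// Generic blob mutation: flip a random bit without respecting
	// field boundaries or what we believe to be padding.
	bit := rand.Intn(len(blob) * 8)
	blob[bit/8] ^= 1 << uint(bit%8)
	fmt.Printf("% x\n", blob)
}
```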
1. Always mmap all memory, without explicit mmap calls in the program.
This makes lots of things much easier and removes lots of code.
It also makes mmap a non-special syscall and allows fuzzing without mmap enabled.
2. Change the address assignment algorithm.
The current algorithm allocates unmapped addresses too frequently
and allows collisions between arguments of a single syscall.
The new algorithm analyzes the actual allocations in the program
and places new arguments at unused locations (see the sketch below).
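A minimal sketch of the second point, with illustrative names: collect the ranges the program already uses and put the new argument into the first gap that fits:

```go
package prog

import "sort"

// assignAddr returns an address for a new argument of `size` bytes, placed
// in the first gap between existing allocations. Illustrative sketch only,
// not the actual implementation.
func assignAddr(used [][2]uint64, size, totalMem uint64) (uint64, bool) {
	sort.Slice(used, func(i, j int) bool { return used[i][0] < used[j][0] })
	var pos uint64
	for _, r := range used {
		if r[0] > pos && r[0]-pos >= size {
			return pos, true
		}
		if r[1] > pos {
			pos = r[1]
		}
	}
	if totalMem >= pos+size {
		return pos, true
	}
	return 0, false // no free gap; the caller has to fall back
}
```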
There are currently 2 bugs:
1. mutationArgs recurses into special types,
even though they must only be mutated as a whole.
2. When mutationArgs is called from Gen.MutateArg,
it includes the top special type as well;
it must not, because at this point only the subargs
must be mutated.
Fix both problems.
Make Foreach* callbacks accept the arg and a context struct
that can contain lots of aux info.
This (1) removes lots of unused base/parent args,
(2) provides a foundation for stopping recursion,
(3) allows merging foreachSubargOffset.
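The callback shape this moves towards, sketched with approximate names:

```go
package prog

type Arg interface {
	SubArgs() []Arg
}

// ArgCtx carries aux info about the current arg; new fields can be added
// here without changing every Foreach* call site. Names are approximate.
type ArgCtx struct {
	Base Arg  // parent of the current arg (nil for the root)
	Stop bool // the callback sets this to stop recursion into subargs
}

func foreachArg(arg Arg, ctx ArgCtx, f func(Arg, *ArgCtx)) {
	f(arg, &ctx)
	if ctx.Stop {
		return // e.g. special types that are mutated only as a whole
	}
	ctx.Base = arg
	for _, sub := range arg.SubArgs() {
		foreachArg(sub, ctx, f)
	}
}
```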
Unions with only 1 field are not actually unions
and can always be replaced with the type of that single option.
However, they are still useful when more options will be
added in the future but only 1 is described so far.
The alternatives are:
- not using a union (but then all existing programs will be
broken when the union is finally introduced)
- adding a fake field (ugly, and it reduces fuzzer efficiency)
So allow unions with only 1 field.
Fixes #188
We now write just ""/1000 to denote a 1000-byte output buffer.
We also no longer store a 1000-byte buffer in memory just to denote its size.
The old format is still parsed.
In some cases we need to extend a buffer by a large
margin to pass the next if in the kernel (a size check).
Currently we only append a single byte, so we can
never pass the if incrementally (the size is always
smaller than the threshold, so inputs that are 1 byte
larger are not added to the corpus).
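One possible shape of the fix, with illustrative constants: occasionally extend by a large, randomly sized chunk instead of always one byte:

```go
package prog

import "math/rand"

// extendBuffer grows a data buffer. Mostly it appends a single byte as
// before, but sometimes it appends up to a page at once, so a kernel
// check like `if (len < THRESHOLD)` can be passed in a single mutation.
func extendBuffer(r *rand.Rand, data []byte) []byte {
	if r.Intn(3) == 0 {
		n := 1 << uint(r.Intn(13)) // 1..4096 bytes, biased to smaller
		return append(data, make([]byte, n)...)
	}
	return append(data, 0)
}
```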
The call index check sporadically fails:
2017/10/02 22:07:32 bad call index 1, calls 1, program:
under unknown circumstances. I've looked at the code again
and don't see where/how we can mess up CallIndex.
Added a new test for minimization that specifically checks the
resulting CallIndex.
It would be good to understand what happens, but we don't have
any reproducers. CallIndex is actually unused at this point;
Manager only needs the call name. So remove CallIndex entirely.
Now each prog function accepts the desired target explicitly.
There is no global, implicit state involved.
This is much cleaner and allows cross-OS/arch testing, etc.
The large overhaul moves syscalls and arg types from sys to prog.
The sys package now depends on prog and contains only generated
descriptions of syscalls.
Introduce a prog.Target type that encapsulates all target properties,
like the syscall list, ptr/page size, etc. It also moves OS-dependent
pieces, like mmap call generation, from prog to sys.
Update #191
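An abridged sketch of what the new type looks like (field names approximate):

```go
package prog

type Syscall struct {
	Name string
	// ... argument types, etc.
}

type Call struct {
	// ... one syscall invocation in a program.
}

// Target encapsulates all target properties in one place instead of
// global state; OS-dependent hooks such as mmap call generation are
// plugged in by the sys package.
type Target struct {
	OS       string
	Arch     string
	PtrSize  uint64
	PageSize uint64
	Syscalls []*Syscall

	MakeMmap func(start, npages uint64) *Call
}
```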
We currently use uintptr for all values.
This won't work for 32-bit archs.
Moreover, in some cases we use uintptr but assume
that it is always 64 bits (e.g. in encodingexec).
Switch everything to uint64.
Update #324
During minimization we create a single memory mapping that contains all
the smaller mmap() ranges, so that other mmap() calls can be dropped.
This "uber-mmap" used to start at 0x7f0000000000 regardless of where the
smaller mappings were located. Change its starting address to the
beginning of the first small mmap() range.
Due to https://github.com/google/syzkaller/issues/316 there are too many
mmap() calls in programs, and syzkaller spends quite a bit of
time mutating them. Most of the time changing mmap() calls won't give
us new coverage, so let's not do it too often.
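A sketch of the kind of throttling this implies (the probability is illustrative):

```go
package prog

import "math/rand"

// okToMutateCall reports whether a call should be considered for mutation;
// mmap calls are picked only occasionally since mutating them rarely
// yields new coverage. The 1/20 probability is illustrative, not the
// actual value.
func okToMutateCall(r *rand.Rand, callName string) bool {
	if callName == "mmap" {
		return r.Intn(20) == 0
	}
	return true
}
```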
Right now Arg is a huge struct (160 bytes) with many different fields
used for different arg kinds. Since most of the args we see in a typical
corpus are ArgConst, this results in significant memory overuse.
This change:
- makes Arg an interface instead of a struct
- adds a SomethingArg struct for each arg kind we have
- converts all *Arg pointers into just Arg, since an interface variable
itself contains a pointer to the actual data
- removes ArgPageSize; ConstArg is now used instead
- consolidates the correspondence between arg kinds and types, see the
comments before each SomethingArg struct definition
- LenType args that denote the length of VmaType args are now serialized as
"0x1000" instead of "(0x1000)"; to preserve backwards compatibility
syzkaller is able to parse the old format for now
- multiple small changes all over to make the above work
After this change syzkaller uses half as much memory after deserializing a
typical corpus.
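An abridged sketch of the new representation (simplified; names taken from the description above):

```go
package prog

type Type interface {
	Size() uint64
}

// Arg is now a small interface; each arg kind below implements it.
type Arg interface {
	Type() Type
	Size() uint64
}

// ArgCommon holds the fields shared by all arg kinds.
type ArgCommon struct {
	typ Type
}

func (a ArgCommon) Type() Type { return a.typ }

// ConstArg covers const-like kinds and, after this change, the page
// sizes that ArgPageSize used to represent.
type ConstArg struct {
	ArgCommon
	Val uint64
}

func (a *ConstArg) Size() uint64 { return a.typ.Size() }

// DataArg holds raw bytes for buffer-like kinds.
type DataArg struct {
	ArgCommon
	Data []byte
}

func (a *DataArg) Size() uint64 { return uint64(len(a.Data)) }
```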
This change adds a `csum[kind, type]` type.
The only available kind right now is `ipv4`.
Using `csum[ipv4, int16be]` in `ipv4_header` makes syzkaller calculate
and embed correct checksums into ipv4 packets.
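For reference, the value a `csum[ipv4, int16be]` field must hold is the standard internet checksum (RFC 1071) over the header with the checksum field zeroed; a self-contained sketch:

```go
package main

import "fmt"

// ipChecksum computes the internet checksum (RFC 1071) over the IPv4
// header bytes: ones' complement of the ones' complement sum of all
// 16-bit words, with the checksum field itself zeroed beforehand.
func ipChecksum(data []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(data); i += 2 {
		sum += uint32(data[i])<<8 | uint32(data[i+1])
	}
	if len(data)%2 == 1 {
		sum += uint32(data[len(data)-1]) << 8
	}
	for sum > 0xffff {
		sum = sum&0xffff + sum>>16
	}
	return ^uint16(sum)
}

func main() {
	// Example 20-byte IPv4 header with the checksum bytes (10-11) zeroed.
	hdr := []byte{
		0x45, 0x00, 0x00, 0x28, 0x12, 0x34, 0x40, 0x00,
		0x40, 0x06, 0x00, 0x00, 0x0a, 0x00, 0x00, 0x01,
		0x0a, 0x00, 0x00, 0x02,
	}
	fmt.Printf("0x%04x\n", ipChecksum(hdr))
}
```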
The optimization change removed validation too aggressively.
We do need program validation during deserialization,
because we can get bad programs from the corpus or hub.
Restore program validation after deserialization.
A bunch of spot optimizations after CPU/memory profiling:
1. Optimize hot-path coverage comparison in the fuzzer.
2. Don't allocate and copy the serialized program, serialize directly into shmem.
3. Reduce allocations during parsing of the output shmem (encoding/binary sucks).
4. Don't allocate and copy coverage arrays, refer directly to the shmem region
(we are not going to mutate them; see the sketch after this list).
5. Don't validate programs outside of tests, validation allocates tons of memory.
6. Replace the choose primitive with simpler switches.
Choose allocates loads of memory (for the int, the func, and everything the func refers to).
7. Other minor optimizations.
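A sketch of point 4, assuming the coverage PCs are 32-bit values at a 4-byte-aligned offset in the output region:

```go
package ipc

import "unsafe"

// coverSlice reinterprets n 32-bit PCs starting at shmem[off] as a
// []uint32 in place, with no allocation or copy. The caller must not
// mutate the result, and off must be 4-byte aligned.
func coverSlice(shmem []byte, off, n int) []uint32 {
	return unsafe.Slice((*uint32)(unsafe.Pointer(&shmem[off])), n)
}
```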
Currently we generate arrays of size [0,5] with equal probability.
Generate [0,10] with a bias towards smaller arrays, with 0 having the
lowest probability (see the sketch below).
I've benchmarked a slightly different change with a max array size of 20;
the results are somewhat inconclusive: it was better than the baseline
almost all the way, but the baseline suddenly caught up at the end. It
also considerably reduced executions per second (by ~20%). So increasing
the max array size to 10 should be a win.
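A sketch of such a distribution with illustrative weights (0 least likely, then decreasing from 1 to 10; the exact numbers are made up):

```go
package prog

import "math/rand"

// Illustrative weights for array sizes 0..10: size 0 is the least likely,
// and larger sizes are progressively less likely than smaller ones.
var sizeWeights = []int{1, 20, 18, 14, 11, 9, 7, 6, 5, 4, 3}

func arraySize(r *rand.Rand) int {
	total := 0
	for _, w := range sizeWeights {
		total += w
	}
	n := r.Intn(total)
	for size, w := range sizeWeights {
		if n < w {
			return size
		}
		n -= w
	}
	panic("unreachable")
}
```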
Currently we stop mutating with 50% probability.
Stop mutating with 33% probability instead (sketched below).
A benchmark shows both a coverage increase and a corpus reduction:
            baseline   oneof3     diff
coverage       65467    65604      137
corpus         35423    35354      -69
exec total   5474879  5023268  -451611
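The loop this changes, sketched with stub names for the surrounding pieces:

```go
package prog

import "math/rand"

type Prog struct{ /* ... */ }

func mutateOnce(r *rand.Rand, p *Prog) { /* one mutation step */ }

// Mutate applies at least one mutation, then after each step stops with
// probability 1/3 instead of 1/2, so programs receive more mutations per
// round on average.
func Mutate(r *rand.Rand, p *Prog) {
	for {
		mutateOnce(r, p)
		if r.Intn(3) == 0 {
			return
		}
	}
}
```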
Add a new pseudo syscall syz_kvm_setup_cpu that sets up a VCPU in
interesting states for execution. KVM is too difficult to set up otherwise.
Lots of improvements are possible, but this is a starting point.
This allows writing:
string[salg_type, 14]
which will give a string buffer of size 14 regardless of the actual string size.
Convert salg_type/salg_name to this.
Allow defining string flags in txt descriptions, e.g.:
filesystem = "ext2", "ext3", "ext4"
and then using them in the string type:
ptr[in, string[filesystem]]
Eliminate the assignTypeAndDir function and instead assign
types to all args during construction.
This will allow considerable simplification of assignSizes.