Most likely reports without proper stack traces were caused by a bug in the
unwinder and are now fixed in 187b96db5ca7 "x86/unwind/orc: Fix
unwind_get_return_address_ptr() for inactive tasks".
Disable trying to use questionable frames for now.
Fixes#1834
Some terms are normalised on the technical level
but may be oppressive on a societal level.
Replace them with more technically neutral terms.
See the following doc for a longer version:
https://tools.ietf.org/id/draft-knodel-terminology-00.html
I though maybe we need special handling for them:
stop at kmem_cache_alloc function. But now I am not sure.
This can also be an infinite loop which calls kmalloc/kfree.
Let's not change code for now, just fix things with tests
(this is a good representative set).
If we produce no guilty file at all, the report is mailed only to LKML,
which is mostly equivalent to mailing to nobody.
If we skip all files, return the first one.
Add new __ia32_compat_sys_ioctl anchor frame
(something seems to have been changed in compat ioctl's).
Also skip all compat_ioctl frames, it's pretty common naming
convention and it may help to avoid some dups across
compat/non-compat paths.
The previous commit "pkg/report: handle cases when whole stack is questionable"
mishandles frames that start with [PC] prefix before " ? ".
Restore that part.
1. Always append diagnosis output at the end.
Don't intermix it with kernel output. It's confusing and not useful.
2. Don't include diagnosis output into Report.
It's too verbose and is not the crash. Keep it only in the Output.
If the report is identified as corrupted because there are no frames at all,
try to re-extract using questionable frames.
This is a bit risky and may produce lots of one-off corrupted reports
at random locations. But we won't know until we deploy this...
Fixes#1216
nmi_check_duration() prints "INFO: NMI handler took too long" on slow debug kernels.
It happens a lot in qemu, and the messages are frequently corrupted
(intermixed with other kernel output as they are printed from NMI)
and are not matched against the suppression in pkg/report.
This write prevents these messages from being printed.
Overall idea of netlink checking.
Currnetly we check netlink policies for common detectable mistakes.
First, we detect what looks like a netlink policy in our descriptions
(these are structs/unions only with nlattr/nlnext/nlnetw fields).
Then we find corresponding symbols (offset/size) in vmlinux using nm.
Then we read elf headers and locate where these symbols are in the rodata section.
Then read in the symbol data, which is an array of nla_policy structs.
These structs allow to easily figure out type/size of attributes.
Finally we compare our descriptions with the kernel policy description.
Update #590
On X86-64, dereferencing a non-canonical address normally causes a #GP, for
which syzkaller already has a pattern. However, if the base register of the
non-canonical address is RBP (which can happen in builds that use RBP as a
general-purpose register because they don't use frame pointer unwinding),
#SS is thrown instead, for which syzkaller did not yet have a pattern.
To see this kind of fault, you can insert the following code in
kernel_init() after the call to rcu_end_inkernel_boot():
asm volatile(
"movabs $0x8000000000000000, %rbp\n\t"
"movq (%rbp), %rax\n\t"
"ud2\n\t"
);
Linux prints a different error message for #SS, so add that error message
to syzkaller's list of patterns.
The the added test for exception from exception corner case.
"BUG: spinlock lockup" fails to respect panic_on_warn and panic
after printing report (though, it's a BUG already, so it should
have been paniced even without panic_on_warn).
As the result we got "spinlock lockup" followed by "rcu stall" report.
And we have that special exception for rcu stalls b/c for them
the most of the report is irrelevant up to apic_timer_interrupt frame.
The code did not expect this weird double-report case and skipped
everything up to apic_timer_interrupt, though it's actually
a lockup in netfilter code.
An upcoming patch for Linux will change the error reporting pattern for
general protection faults such that the colon doesn't necessarily come
immediately after the string "general protection fault" (see
https://lore.kernel.org/lkml/20191118142144.GC6363@zn.tnic/).
Change the pattern in syzkaller before that happens.
Note that this is not necessarily the final format; in particular, the
ordering of the KASAN note and the "general protection fault" line might
swap.
The port-based exception APIs have been deprecated on Fuchsia and will
be removed shortly. Delete them from the syscall definitions and
modify the Fuchsia executor to use the new channel-based APIs instead.