Right after committing the on_each_cpu change,
another report came in where smp_call_function
is not called from on_each_cpu. There are
actually more such callers in the code,
as existing tests also show. smp_call_function seems
to be the better root cause indication.
It also has a high branching factor, and a bug is more likely in the callback.
For the added test we used to say:
INFO: rcu detected stall in __sys_sendmsg
now we say the more useful:
INFO: rcu detected stall in tc_modify_qdisc
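A minimal sketch of the idea, written in Python for illustration only; it is not the pkg/report code. It assumes frames are listed innermost first and that the frame right after the anchor is the one to name in the title. The function name guilty_frame and the sample stack are hypothetical.

```python
def guilty_frame(frames, anchors=("smp_call_function",)):
    """Pick the frame to blame for a stall.

    frames is ordered innermost first, as in a kernel stack trace.
    If an anchor frame is present, blame its caller (the next frame);
    otherwise fall back to the innermost frame.
    """
    for i, name in enumerate(frames):
        if name in anchors and i + 1 < len(frames):
            return frames[i + 1]
    return frames[0] if frames else None


# Illustrative stack, not taken from a real report.
stack = [
    "csd_lock_wait",
    "smp_call_function",
    "tc_modify_qdisc",
    "rtnetlink_rcv_msg",
    "__sys_sendmsg",
]
print("INFO: rcu detected stall in " + guilty_frame(stack))
```

Without the anchor, the same function simply returns the innermost frame, which is the less informative behaviour described above.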
Most likely reports without proper stack traces were caused by a bug in the
unwinder and are now fixed in 187b96db5ca7 "x86/unwind/orc: Fix
unwind_get_return_address_ptr() for inactive tasks".
Disable trying to use questionable frames for now.
Fixes #1834
I thought maybe we need special handling for them:
stop at kmem_cache_alloc function. But now I am not sure.
This can also be an infinite loop which calls kmalloc/kfree.
Let's not change code for now, just fix things with tests
(this is a good representative set).
If we produce no guilty file at all, the report is mailed only to LKML,
which is mostly equivalent to mailing to nobody.
If we skip all files, return the first one.
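The fallback can be sketched as follows. This is an illustrative Python sketch, not the actual implementation; the function name guilty_file, the skip patterns and the file paths are assumptions made for the example.

```python
import re


def guilty_file(files, skip_patterns):
    """Return the first file that no skip pattern matches.

    If every file is skipped, return the first file anyway, so that
    the report still names some file. Returns "" for an empty list.
    """
    skip = [re.compile(p) for p in skip_patterns]
    for f in files:
        if not any(r.search(f) for r in skip):
            return f
    return files[0] if files else ""


# Hypothetical skip list and file lists.
skips = [r"^mm/", r"^lib/"]
print(guilty_file(["mm/slab.c", "net/sched/sch_api.c"], skips))
print(guilty_file(["mm/slab.c", "lib/list_debug.c"], skips))
```

The second call shows the new behaviour: all files are skipped, so the first one is returned instead of nothing.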
Add new __ia32_compat_sys_ioctl anchor frame
(something seems to have been changed in compat ioctl's).
Also skip all compat_ioctl frames: it's a pretty common naming
convention, and skipping them may help to avoid some dups across
compat/non-compat paths.
The previous commit "pkg/report: handle cases when whole stack is questionable"
mishandles frames that start with a [PC] prefix before " ? ".
Restore that part.
If the report is identified as corrupted because there are no frames at all,
try to re-extract using questionable frames.
This is a bit risky and may produce lots of one-off corrupted reports
at random locations. But we won't know until we deploy this...
Fixes #1216
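A rough sketch of the re-extraction, in Python for illustration only. It assumes stack lines of the form "name+0xOFF/0xSIZE", optionally preceded by a "[<PC>]" prefix and by the " ? " marker for questionable frames, and it ignores timestamps and other line prefixes. The regexp and function names are hypothetical, not those used in pkg/report.

```python
import re

# Optional "[<PC>]" prefix, optional "? " marker, then name+0xOFF/0xSIZE.
FRAME_RE = re.compile(
    r"^\s*(?:\[<[0-9a-f]+>\]\s+)?(\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def extract_frames(lines, questionable=False):
    """Return function names; keep '?' frames only if questionable is set."""
    frames = []
    for line in lines:
        m = FRAME_RE.match(line)
        if not m:
            continue
        if m.group(1) and not questionable:
            continue
        frames.append(m.group(2))
    return frames


def frames_with_fallback(lines):
    """Prefer reliable frames; if there are none, retry with '?' frames."""
    frames = extract_frames(lines)
    if not frames:
        frames = extract_frames(lines, questionable=True)
    return frames


# Illustrative lines, not from a real report.
only_questionable = [
    " [<ffffffff81000001>] ? foo+0x1/0x2",
    " ? bar+0x10/0x20",
]
print(frames_with_fallback(only_questionable))
```

The fallback only triggers when no reliable frame was found, which is the case that would otherwise mark the report as corrupted.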
On X86-64, dereferencing a non-canonical address normally causes a #GP, for
which syzkaller already has a pattern. However, if the base register of the
non-canonical address is RBP (which can happen in builds that use RBP as a
general-purpose register because they don't use frame pointer unwinding),
#SS is thrown instead, for which syzkaller did not yet have a pattern.
To see this kind of fault, you can insert the following code in
kernel_init() after the call to rcu_end_inkernel_boot():
asm volatile(
"movabs $0x8000000000000000, %rbp\n\t"
"movq (%rbp), %rax\n\t"
"ud2\n\t"
);
Linux prints a different error message for #SS, so add that error message
to syzkaller's list of patterns.
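For illustration, a small Python sketch of matching both oops kinds. It assumes the #SS report starts with "stack segment: " in the same way the #GP report starts with "general protection fault: ", and that the faulting function can be read from the RIP line. The title strings, the "unknown" placeholder and the sample log (including its offsets) are made up for the example.

```python
import re

# (pattern, title format) pairs; the title formats are assumptions.
PATTERNS = [
    (re.compile(r"general protection fault: "), "general protection fault in {}"),
    (re.compile(r"stack segment: "), "stack segment fault in {}"),
]
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def oops_title(log):
    """Return a report title for the first matching pattern, or None."""
    for pattern, fmt in PATTERNS:
        if pattern.search(log):
            m = RIP_RE.search(log)
            return fmt.format(m.group(1) if m else "unknown")
    return None


# Illustrative log; the offsets are invented.
sample = "stack segment: 0000 [#1] SMP\nRIP: 0010:kernel_init+0x2b/0x110\n"
print(oops_title(sample))
```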
See the added test for the exception-from-exception corner case.
"BUG: spinlock lockup" fails to respect panic_on_warn and does not panic
after printing the report (though it's a BUG already, so it should
have panicked even without panic_on_warn).
As a result we got a "spinlock lockup" report followed by an "rcu stall" report.
And we have that special exception for rcu stalls because for them
most of the report is irrelevant up to the apic_timer_interrupt frame.
The code did not expect this weird double-report case and skipped
everything up to apic_timer_interrupt, though it's actually
a lockup in netfilter code.
An upcoming patch for Linux will change the error reporting pattern for
general protection faults such that the colon doesn't necessarily come
immediately after the string "general protection fault" (see
https://lore.kernel.org/lkml/20191118142144.GC6363@zn.tnic/).
Change the pattern in syzkaller before that happens.
Note that this is not necessarily the final format; in particular, the
ordering of the KASAN note and the "general protection fault" line might
swap.
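A sketch of the relaxed pattern, in Python for illustration. The old regexp requires the colon right after the string; the new one allows any text on the same line before the colon. The second sample line only illustrates the kind of wording under discussion and, as noted, may not be the final format.

```python
import re

OLD_RE = re.compile(r"general protection fault: ")
# Allow arbitrary text (but no colon or newline) before the colon.
NEW_RE = re.compile(r"general protection fault[^:\n]*: ")

old_line = "general protection fault: 0000 [#1] SMP"
# Illustrative wording only; the final kernel format may differ.
new_line = (
    "general protection fault, probably for non-canonical address "
    "0xdffffc0000000000: 0000 [#1] SMP"
)

for line in (old_line, new_line):
    print(bool(OLD_RE.search(line)), bool(NEW_RE.search(line)))
```

The relaxed pattern keeps matching the current format, so it can be deployed before the kernel change lands.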
The port-based exception APIs have been deprecated on Fuchsia and will
be removed shortly. Delete them from the syscall definitions and
modify the Fuchsia executor to use the new channel-based APIs instead.
Some syzkaller panics happen due to memory corruptions,
but it still would be useful at least to get some visibility into these crashes.
On some OSes we actually already detect them as they have "panic:" oops pattern,
but not e.g. on linux.
Fixes #318
The problem with task hung reports is that they manifest at random victim stacks,
rather than at the root cause stack. E.g. if there is something wrong with the RCU subsystem,
we get hangs all over the kernel on all synchronize_* calls.
So before resorting to the common logic of skipping some common frames,
we look for 2 common buckets: hangs on synchronize_rcu and hangs on rtnl_lock
and group these together.
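The bucketing can be sketched like this, in Python for illustration only. It assumes frames are listed innermost first; the function name hung_task_frame, the skip list and the sample stacks are hypothetical.

```python
BUCKETS = ("synchronize_rcu", "rtnl_lock")
# Hypothetical list of common frames to skip in the generic path.
SKIP = ("__schedule", "schedule", "schedule_timeout", "wait_for_completion")


def hung_task_frame(frames, buckets=BUCKETS, skip=SKIP):
    """Pick the frame used to title a task hung report.

    First look for one of the well-known buckets anywhere in the stack;
    only if none is present, fall back to the first non-skipped frame.
    """
    for bucket in buckets:
        if bucket in frames:
            return bucket
    for name in frames:
        if name not in skip:
            return name
    return None


# Illustrative stacks, not from real reports.
rcu_victim = ["__schedule", "schedule", "wait_for_completion",
              "synchronize_rcu", "some_victim_caller"]
other = ["__schedule", "schedule", "some_driver_wait"]
print(hung_task_frame(rcu_victim))
print(hung_task_frame(other))
```

With the buckets checked first, different victims of the same RCU or rtnl problem end up under one title instead of many.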