Rather than writing the diagnosis to the kernel console, Diagnose can
now directly return the extra debugging info, which will be appended ot
the kernel console log.
This gives almost 100% coverage for MonitorExecution.
Test all corner cases like lost connection, no output,
diagnose, exiting/non-exiting programs, etc.
Update #875
Diagnose currently sends the panic signal to generate a traceback for
additional context.
However, Diagnose is also called in otherwise successful scenarios
(vm.Instance.MonitorExecution -> vm.monitor.extractError). Triggering a
panic will make this successful scenario look like a failure.
We could simply suppress this panic, but 1) that means we never shutdown
cleanly (not important, but ugly), and 2) we're less likely to detect
delayed crashes since we kill the sandbox immediately (that's what
MonitorExecution is checking for).
Instead, switch from -panic-signal to -trace-signal, which simply logs a
traceback without exiting. This option was added to runsc in
24c1158b9c.
The other uses of Diagnose will always generate a report regardless of
an additional panic, so we're not losing any reports.
* vm/qemu: Improve debug output.
When running in debug mode, the number of VMs is reduced to 1.
State this in the debug output.
* vm/qemu: Don't start debug output with a capital letter.
As requested by Dimitry.
* vm: Provide debug message when reduing number of VMs.
Apply this change to all affected platforms for consistency.
Suggested by Dmitry.
* Add myself to AUTHORS/CONTRIBUTORS files.
* vm: Fix compilation issues missed in earlier commit.
* vm: Use logging to write debug message.
Allow setting qemu_args to "" in the config file. This is needed
when running qemu from the qemu-devel package on FreeBSD, which
does not support the -enable-kvm option.
Without this patch, an entry "" is added to the list of command
line parameters, which breaks the starting of the qemu instances.
* build/openbsd: minor cleanup (use tuples instead of maps)
* Grammar nits in comments.
* Simplify openbsd.Create, will defer when there's more than one error exit.
* pkg/build: Support copying kernel into GCE image
* Simple test for openbsd image copy build.
* Cleanup in case something failed before.
* Support multi-processor VMs on GCE.
* More debug
* Reformat
* OpenBSD gce image needs to be raw.
* GC
* Force format to GNU directly on Go 1.10 or newer.
* Use vmType passed as a parameter inside openbsd.go
* gofmt
* more fmt
* Can't use GENERIC.mp just yet.
* capitalize
* Copyright
The ddb(4) debugger defaults to showing 24 lines at a time, the next chunk of
lines will be displayed only after receiving keyboard input. Setting maxlines to
0 disables pagination completely.
As of commit 3f053259, gVisor sentry panics are no longer sent to the
stderr for "runsc run" by default, as that stderr belongs exclusively to
the application.
As a result, syzbot never sees the gVisor panic stack trace, and is only
reporting errors that occur when waiting for a dead sandbox.
Passing the "-debug" flag to runsc will make the sentry panics visible
to syzbot again.
As a result, the boot time is significantly improved since there's no longer any
need to copy the complete disk.
This feature was recently committed to OpenBSD-current. Any existing base image
used must be recreated, this time using the qcow2 disk format.
vmctl start periodically fails with:
vmctl: start vm command failed: Operation already in progress
So try to sleep for a bit after vmctl stop.
And detect when vmctl start terminates prematurely
to avoid 10 minute timeout for ip extraction.
vmctl console fails from time to time with:
vmctl: console not found
Probably there is some race (most of these things assume
that there is a human typing commands with delays).
Also, vmctl start can connect to console itself with -c flag.
So use that because it both solves the console race and
also makes code much more similar to other VM implementations (qemu, gvisor).
This also eliminates 3 additional goroutines per VM.
A dozen of vmm's running on a GCE machine can be really slow to boot.
Timeouts have only single goal: preventing complete system stalls
when/if external commands episodically hang. There is no value
in keeping them as close as possible to expected durations.
This can only lead to various flakes. Increase timeouts by an
order of magnitude.
The goroutine sends on bootOutputStop to notify about its completion,
but the main goroutine is not receiving from the chan on success
and since the chan in unbuffered, the output reading goroutine
hangs on the send forever.
I am getting:
failed to run vmctl -name syzkaller-ci-openbsd-main-test-0
vmctl: name too long
The name is auto-generated from parts which ensure that it is unique.
We can't easily name it shorter. So strip the syzkaller prefix,
which is not strictly necessary.
We currently have this list in multiple places (somewhat diverged).
Specify this "overcommit" property in VM implementations.
In particular, we also want to allow overcommit for "vmm" type.
Update #712
The IP address of a VM is calculated based on the formula 100.64.X.3 where X
being the ID of the VM, starting from 0. After starting 256 VMs 64 will flip
over to 65 and so on. A more robust solution to calculating the IP is to simply
read it from output during boot.
While here, stop using the VM ID as the identifier since the VM name also works.
os.Args[0] can be just binary name which was looked up using $PATH.
In such case copy will fail because the path does not exist.
Lookup binary name using $PATH.
mgrconfig was used only by syz-manager initially,
but now it's used by a dozen of packages and it's
weird to import from under a binary dir.
pkg/ is much more reasonable dir for a widely used
helper package.
The problem with that commit is that for GCE implementation
we immidiately kill console connection too when receive diagnose signal.
This leads to truncated output.
By using UDEV rules, we can create device nodes which exist at
/dev/ttyUSB.{android device serial}
Which makes it easier to determine which console belongs to a device.
While this is non-standard behavior, it's an inexpensive path check
and makes the lookup faster and deterministic.
Some kernel bugs don't stop kernel.
For such bugs whiel vm.MonitorExecution waits for kernel output for 10 secs,
fuzzer continues running programs and produces tons of output
after the kernel bug message. Kill fuzzer once MonitorExecution
detects a kernel bug.
See #516 for description of the problem.
The new scheme is:
1. RCU stalls the highest priority.
CONFIG_RCU_CPU_STALL_TIMEOUT=100
which results in stalls detected after 100-101 secs.
2. Then softlockup detector.
kernel.watchdog_thresh = 55 (sysctl)
which surprisingly detects stalls after 110-132 secs.
3. Then hung tasks and workqueue stalls.
Unfortunately we can't separate them because that would
require setting "no output" timeout to 10+ minutes.
workqueue.watchdog_thresh=140 (cmdline)
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=140
Both are detected after 140-280 secs.
4. Finally, "no output" crashes.
Detected by vm.MonitorExecution after 300 secs.
Fixes#516
Currently all (linux-specific) suppressions are hardcoded in mgrconfig.
This is very wrong. Move them to pkg/report and allow to specify per OS.
Add gvisor-specific suppressions.
This required a bit of refactoring. Introduce mgrconfig.KernelObj finally.
Make report.NewReporter and vm.Create accept mgrconfig directly
instead of passing it as multiple scattered args.
Remove tools/syz-parse and it always did the same as tools/syz-symbolize.
Simplify global vars in syz-manager/cover.go.
Create reporter eagerly in manager. Use sort.Slice more.
Overall -90 lines removed.
We have fallback coverage implmentation for freebsd.
1. It's broken after some recent changes.
2. We need it for fuchsia, windows, akaros, linux too.
3. It's painful to work with C code.
Move fallback coverage to ipc package,
fix it and provide for all OSes.
echo 0 to kptr_restrict in /proc/sys/kernel to unhide
kernel pointers when fuzzing for more reliable crash
dedup and easier debugging when analyzing crash.
Some machine configurations have strict limits on number of CPUs
and don't support NUMA (e.g. arm vexpress-a15).
maxcpu and numa options make qemu fail.
Don't be too clever. If necessary maxcpu and numa options
can be added in qemu_args.
In pkg/report we add up to 5 lines of kernel output before the report.
However, MonitorExecution leaves only up to 128 bytes of preceeding output,
so frequently preceeding lines are not included in the report.
Increase the context to 512 bytes.
Don't connect by hostname, this seems to be broken on GCE.
Episodically connecting by hostname gives:
Could not resolve hostname: Name or service not known
Try extracting report from console output only first. If that doesn't work,
try extracting it from the whole log.
Add regexp for executor printed BUGs.
Optimize regexps for rcu detected stalls.
Update rep.StartPos and rep.EndPos in vm/vm.go as well as rep.Output.
GCE serial reply seems to be buggy, we see lots of "serialport: VM disconnected"
and "packet_write_wait: Connection to 1.2.3.4 port 9600: Broken pipe"
errors, which do not have any explanation.
Ignore all serial relay errors.
Make it possible to monitor health and operation
of all managers from dashboard.
1. Notify dashboard about internal syz-ci errors
(currently we don't know when/if they happen).
2. Send statistics from managers to dashboard.
Boot and minimally test images before declaring them as good
and switching to using them.
If image build/boot/test fails, upload report about this to dashboard.
Currently getting a complete report requires a complex,
multi-step dance (including getting information that
external users are not interested in -- guilty file).
Simplify interface down to 2 functions: Parse and Symbolize.
Parse does what it did before, Symbolize symbolizes report
and fills in maintainers. This simplifies both implementations
of Reporter interface and all users of the interface.
Potentially we could get this down to 1 function Parse
that does everything. However, (1) Symbolize can fail,
while Parse cannot, (2) usually we want to ignore (log)
Symbolize errors, but otherwise proceed with the report,
(3) repro does not need symbolization for all but the
last report.
Whole raw output is indivisble part of Report,
currently we always pass Output separately along with Report.
Make Output a Report field.
Then, put whole Report into manager Crash and repro context and Result.
There is little point in passing Report as aa bunch of separate fields.
Turns out GetSerialPortOutput API does not work if instance has
serial port connections enabled (which we always have).
Get output from serial port relay service instead.
This allows callers to get access to Report.Corrupted.
Better than adding 6-th return value and will allow
to pipe other report properties if necessary.
New console output code crashes with nil deref,
because we shadow outer err variable and then
dereference nil err.
Also express ssh connect timeout in real time.
Currently the timeout is on par of ~25 mins
(5s sleep + 10s connect timeout) * 100.
Reduce timeout to 5m of real time.
When manager is stopped there are sometimes runaway qemu
processes still running. Set PDEATHSIG for all subprocesses.
We never need child processes outliving parents.
We currently have several names for crash attributes, which is disturbing.
E.g. crash title is called "Title" or "Desc". Name them consistently.
Title - single line bug identity.
Report - whole crash text.
Log - whole fuzzer/kernel output.