linux

mirror of https://github.com/FEX-Emu/linux.git synced 2024-12-23 09:56:00 +00:00

Thomas Gleixner a1cbcaa9ea sched_clock: Prevent 64bit inatomicity on 32bit systems

The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value.

    CPU0                           CPU1

    sched_clock_local()            sched_clock_remote(CPU0)
    ...
                                   remote_clock = scd[CPU0]->clock
                                       read_low32bit(scd[CPU0]->clock)
    cmpxchg64(scd->clock,...)
                                       read_high32bit(scd[CPU0]->clock)

While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts.

It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary.

The resulting misbehaviour is that CPU1 will see the sched_clock on CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem.

The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1's sched_clock happens in the context of a realtime thread, which then gets charged 4 seconds of RT runtime, which causes the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well.

The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint:

      <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
      <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
    irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
      <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_clock_remote(). The time jump gets propagated to CPU0 via sched_clock_remote() and stays stale on both cores for ~4 seconds.

There are only two other possibilities, which could cause a stale sched clock:

1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value.

2) sched_clock() which reads the TSC returns a sporadic wrong value.

#1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale.

#2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy.

After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems.

So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock.

Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue.

Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org

2013-04-08 11:50:44 +02:00
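
The "~4 seconds" in the commit message above can be checked with a little arithmetic. Assuming scd->clock counts nanoseconds (the message does not state the unit), one carry into the upper 32bit word is worth 2^32 ns, roughly 4.29 seconds. The following is a minimal user-space sketch of such a torn readout, not kernel code: the names torn_read_u64() and torn_read_error_ns() are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/*
 * Model of a non-atomic 64bit load on a 32bit CPU (illustration only):
 * the low word is taken from the clock value before the writer's
 * update, the high word from the value after it.
 */
uint64_t torn_read_u64(uint64_t before, uint64_t after)
{
	uint32_t low = (uint32_t)before;          /* low 32bit, read first  */
	uint32_t high = (uint32_t)(after >> 32);  /* high 32bit, read last  */

	return ((uint64_t)high << 32) | low;
}

/*
 * Signed error of the torn value relative to the real post-update
 * clock, in the same unit as the clock (assumed: nanoseconds).
 */
int64_t torn_read_error_ns(uint64_t before, uint64_t after)
{
	return (int64_t)(torn_read_u64(before, after) - after);
}
```

With before = 0x1FFFFFF00 and after = 0x200000100 (a 512 ns update that crosses the 32bit boundary), the torn value is 0x2FFFFFF00, which is 4294966784 ns (about 4.29 s) ahead of the real clock. If the update does not cross the boundary, for example 0x100000100 to 0x100000300, the torn value is merely the old value, 512 ns behind, which is why the race only becomes visible at the boundary.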
..
debug	module: add new state MODULE_STATE_UNFORMED.	2013-01-12 13:27:05 +10:30
events	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-02-19 17:49:41 -08:00
gcov
irq	Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-02-19 19:07:27 -08:00
power	Merge branch 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup	2012-12-12 08:18:24 -08:00
sched	sched_clock: Prevent 64bit inatomicity on 32bit systems	2013-04-08 11:50:44 +02:00
time	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-02-19 19:05:45 -08:00
trace	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-02-19 18:19:48 -08:00
.gitignore
acct.c	cputime: Use accessors to read task cputime stats	2013-01-27 19:23:31 +01:00
async.c	Merge branch 'for-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	2013-02-19 22:01:33 -08:00
audit_tree.c	audit: catch possible NULL audit buffers	2013-01-11 14:54:55 -08:00
audit_watch.c	audit: catch possible NULL audit buffers	2013-01-11 14:54:55 -08:00
audit.c	kernel/audit.c: avoid negative sleep durations	2013-01-11 14:54:56 -08:00
audit.h	audit: optimize audit_compare_dname_path	2012-10-12 00:32:02 -04:00
auditfilter.c	audit: fix auditfilter.c kernel-doc warnings	2013-01-10 14:35:23 -08:00
auditsc.c	audit: catch possible NULL audit buffers	2013-01-11 14:54:55 -08:00
backtracetest.c
bounds.c
capability.c	userns: Teach inode_capable to understand inodes whose uids map to other namespaces.	2012-05-15 14:59:24 -07:00
cgroup_freezer.c	cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()	2012-11-19 08:13:38 -08:00
cgroup.c	Merge branch 'akpm' (Andrew's patch-bomb)	2012-12-17 20:58:12 -08:00
compat.c	x32: fix sigtimedwait	2012-12-26 01:15:03 -05:00
configs.c
context_tracking.c	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-02-19 18:19:48 -08:00
cpu_pm.c	kernel/cpu_pm.c: fix various typos	2012-05-31 17:49:27 -07:00
cpu.c	Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-02-19 19:04:55 -08:00
cpuset.c	cpuset: use N_MEMORY instead N_HIGH_MEMORY	2012-12-12 17:38:32 -08:00
crash_dump.c
cred.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2012-12-18 10:55:28 -08:00
delayacct.c	cputime: Use accessors to read task cputime stats	2013-01-27 19:23:31 +01:00
dma.c	Remove all #inclusions of asm/system.h	2012-03-28 18:30:03 +01:00
elfcore.c
exec_domain.c
exit.c	cputime: Use accessors to read task cputime stats	2013-01-27 19:23:31 +01:00
extable.c	extable: Skip sorting if sorted at build time.	2012-04-19 15:06:55 -07:00
fork.c	This implements the cputime accounting on full dynticks CPUs.	2013-02-05 13:10:33 +01:00
freezer.c	freezer: change ptrace_stop/do_signal_stop to use freezable_schedule()	2012-10-26 14:27:49 -07:00
futex_compat.c	futex: Mark get_robust_list as deprecated	2012-03-29 11:37:17 +02:00
futex.c	sched/rt: Move rt specific bits into new header file	2013-02-07 20:51:08 +01:00
groups.c	userns: Convert in_group_p and in_egroup_p to use kgid_t	2012-05-03 03:29:33 -07:00
hrtimer.c	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-02-19 19:05:45 -08:00
hung_task.c	hung task debugging: Inject NMI when hung and going to panic	2012-04-25 12:39:25 +02:00
irq_work.c	Merge branch 'nohz/printk-v8' into irq/core	2013-02-05 00:48:46 +01:00
itimer.c	itimer: Use printk_once instead of WARN_ONCE	2012-04-10 11:00:30 +02:00
jump_label.c	jump_label: Export jump_label_rate_limit()	2012-08-06 19:00:35 +03:00
kallsyms.c	vsprintf: fix %ps on non symbols when using kallsyms	2012-05-29 16:22:32 -07:00
kcmp.c	kcmp: include linux/ptrace.h	2012-12-20 17:40:19 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks	locking: Adjust spin lock inlining Kconfig options	2012-09-13 17:56:13 +02:00
Kconfig.preempt	locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage	2012-03-23 13:18:57 +01:00
kexec.c	kdump: remove unneeded include	2012-10-06 03:05:19 +09:00
kfifo.c	[media] kernel:kfifo: export __kfifo_max_r symbol	2012-04-11 18:24:37 -03:00
kmod.c	Merge branch 'master' into for-3.9-async	2013-01-23 09:31:01 -08:00
kprobes.c	Merge branch 'for-3.9-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	2013-02-19 21:58:52 -08:00
ksysfs.c	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2012-12-11 18:10:49 -08:00
kthread.c	kthread: use N_MEMORY instead N_HIGH_MEMORY	2012-12-12 17:38:33 -08:00
latencytop.c
lglock.c	brlocks/lglocks: turn into functions	2012-05-29 23:28:41 -04:00
lockdep_internals.h
lockdep_proc.c	lockdep: Use KSYM_NAME_LEN'ed buffer for __get_key_name()	2012-10-24 12:39:09 +02:00
lockdep_states.h
lockdep.c	lockdep: Check if nested lock is actually held	2012-09-13 17:00:44 +02:00
Makefile	Nothing all that exciting; a new module-from-fd syscall for those who want	2012-12-19 07:55:08 -08:00
modsign_certificate.S	MODSIGN: Avoid using .incbin in C source	2012-12-14 13:06:44 +10:30
modsign_pubkey.c	keys: use keyring_alloc() to create module signing keyring	2012-12-20 17:40:21 -08:00
module_signing.c	MODSIGN: Don't use enum-type bitfields in module signature info block	2012-12-05 11:27:24 +10:30
module-internal.h	MODSIGN: Move the magic string to the end of a module and eliminate the search	2012-10-19 17:30:40 -07:00
module.c	module: fix missing module_mutex unlock	2013-01-20 20:22:58 -08:00
mutex-debug.c
mutex-debug.h
mutex.c	sched/rt: Move rt specific bits into new header file	2013-02-07 20:51:08 +01:00
mutex.h
notifier.c
nsproxy.c	userns: Implement unshare of the user namespace	2012-11-20 04:18:14 -08:00
padata.c	padata: use __this_cpu_read per-cpu helper	2012-12-06 17:16:23 +08:00
panic.c	panic: fix a possible deadlock in panic()	2012-07-30 17:25:13 -07:00
params.c	params: replace printk(KERN_<LVL>...) with pr_<lvl>(...)	2012-05-04 17:28:18 -07:00
pid_namespace.c	pidns: Stop pid allocation when init dies	2012-12-25 16:10:05 -08:00
pid.c	kernel/pid.c: reenable interrupts when alloc_pid() fails because init has exited	2013-02-12 14:34:00 -08:00
posix-cpu-timers.c	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-02-19 19:05:45 -08:00
posix-timers.c	posix-timers: Fix clock_adjtime to always return timex data on success	2013-01-15 18:16:07 -08:00
printk.c	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-02-19 17:47:58 -08:00
profile.c	profiling: Remove unused timer hook	2013-01-24 15:37:26 +01:00
ptrace.c	uprobes: Add exports for module use	2013-02-08 17:47:13 +01:00
range.c
rcu.h	rcu: Provide RCU CPU stall warnings for tiny RCU	2013-01-28 22:06:21 -08:00
rcupdate.c	Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD	2013-01-28 22:25:21 -08:00
rcutiny_plugin.h	rcu: Provide RCU CPU stall warnings for tiny RCU	2013-01-28 22:06:21 -08:00
rcutiny.c	Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD	2013-01-28 22:25:21 -08:00
rcutorture.c	rcu: Allow rcutorture to be built at low optimization levels	2013-02-04 12:18:20 -08:00
rcutree_plugin.h	rcu: Make rcu_nocb_poll an early_param instead of module_param	2013-01-08 14:12:19 -08:00
rcutree_trace.c	rcu: Separate accounting of callbacks from callback-free CPUs	2012-11-16 10:05:57 -08:00
rcutree.c	Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD	2013-01-28 22:25:21 -08:00
rcutree.h	Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD	2013-01-28 22:25:21 -08:00
relay.c	splice: fix racy pipe->buffers uses	2012-06-13 21:16:42 +02:00
res_counter.c	res_counter: return amount of charges after res_counter_uncharge()	2012-12-18 15:02:12 -08:00
resource.c	kernel/resource.c: fix stack overflow in __reserve_region_with_split()	2012-10-06 03:05:31 +09:00
rtmutex_common.h
rtmutex-debug.c	sched/rt: Move rt specific bits into new header file	2013-02-07 20:51:08 +01:00
rtmutex-debug.h
rtmutex-tester.c	sched/rt: Move rt specific bits into new header file	2013-02-07 20:51:08 +01:00
rtmutex.c	sched/rt: Move rt specific bits into new header file	2013-02-07 20:51:08 +01:00
rtmutex.h
rwsem.c	lockdep, rwsem: provide down_write_nest_lock()	2013-01-11 14:54:55 -08:00
seccomp.c	seccomp: Make syscall skipping and nr changes more consistent	2012-10-02 21:14:29 +10:00
semaphore.c	semaphore: fix improper comment reference to mutex	2012-04-05 17:15:55 -07:00
signal.c	This implements the cputime accounting on full dynticks CPUs.	2013-02-05 13:10:33 +01:00
smp.c	smp: Fix SMP function call empty cpu mask race	2013-01-28 11:21:57 +01:00
smpboot.c	smpboot: Allow selfparking per cpu threads	2013-02-14 15:29:37 +01:00
smpboot.h	smpboot: Provide infrastructure for percpu hotplug threads	2012-08-13 17:01:07 +02:00
softirq.c	cputime: Safely read cputime of full dynticks CPUs	2013-01-27 20:35:47 +01:00
spinlock.c	locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage	2012-03-23 13:18:57 +01:00
srcu.c	srcu: use ACCESS_ONCE() to access sp->completed in srcu_read_lock()	2013-02-07 15:19:36 -08:00
stacktrace.c
stop_machine.c	stop_machine: Use smpboot threads	2013-02-14 15:29:38 +01:00
sys_ni.c	module: add syscall to load module from fd	2012-12-14 13:05:22 +10:30
sys.c	cputime: Rename thread_group_times to thread_group_cputime_adjusted	2012-11-28 17:07:57 +01:00
sysctl_binary.c	pidns: Use task_active_pid_ns where appropriate	2012-11-19 05:59:09 -08:00
sysctl.c	sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice	2013-02-07 20:51:07 +01:00
task_work.c	task_work: task_work_add() should not succeed after exit_task_work()	2012-09-13 16:47:34 +02:00
taskstats.c	taskstats: cgroupstats_user_cmd() may leak on error	2012-10-06 03:05:31 +09:00
test_kprobes.c
time.c	time, Fix setting of hardware clock in NTP code	2013-02-08 15:07:05 -08:00
timeconst.pl	timeconst.pl: Eliminate Perl warning	2013-02-07 17:14:08 -08:00
timer.c	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-02-19 18:19:48 -08:00
tracepoint.c	static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()	2012-02-24 10:05:59 +01:00
tsacct.c	cputime: Use accessors to read task cputime stats	2013-01-27 19:23:31 +01:00
uid16.c	userns: Convert setting and getting uid and gid system calls to use kuid and kgid	2012-05-03 03:28:41 -07:00
up.c
user_namespace.c	userns: Fix typo in description of the limitation of userns_install	2012-12-14 18:36:36 -08:00
user-return-notifier.c
user.c	proc: Usable inode numbers for the namespace file descriptors.	2012-11-20 04:19:49 -08:00
utsname_sysctl.c
utsname.c	userns: Require CAP_SYS_ADMIN for most uses of setns.	2012-12-14 16:12:03 -08:00
wait.c	propagate name change to comments in kernel source	2012-12-06 10:39:54 +01:00
watchdog.c	sched/rt: Move rt specific bits into new header file	2013-02-07 20:51:08 +01:00
workqueue_internal.h	workqueue: rename cpu_workqueue to pool_workqueue	2013-02-13 19:29:12 -08:00
workqueue.c	workqueue: un-GPL function delayed_work_timer_fn()	2013-02-19 10:09:13 -08:00