Fix incorrect usage of css_get and css_put to put a different css in
pids_{cancel_,}attach() than the one grabbed in pids_can_attach(). This
could lead to quite serious memory leakage (and unsafe operations on the
putted css).
tj: minor comment update
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There is a race condition in SMP bootup code, which may result
in
WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
workqueue_cpu_up_callback()
or
kernel BUG at kernel/smpboot.c:135!
It can be triggered with a bit of luck in Linux guests running
on busy hosts.
CPU0 CPUn
==== ====
_cpu_up()
__cpu_up()
start_secondary()
set_cpu_online()
cpumask_set_cpu(cpu,
to_cpumask(cpu_online_bits));
cpu_notify(CPU_ONLINE)
<do stuff, see below>
cpumask_set_cpu(cpu,
to_cpumask(cpu_active_bits));
During the various CPU_ONLINE callbacks CPUn is online but not
active. Several things can go wrong at that point, depending on
the scheduling of tasks on CPU0.
Variant 1:
cpu_notify(CPU_ONLINE)
workqueue_cpu_up_callback()
rebind_workers()
set_cpus_allowed_ptr()
This call fails because it requires an active CPU; rebind_workers()
ends with a warning:
WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
workqueue_cpu_up_callback()
Variant 2:
cpu_notify(CPU_ONLINE)
smpboot_thread_call()
smpboot_unpark_threads()
..
__kthread_unpark()
__kthread_bind()
wake_up_state()
..
select_task_rq()
select_fallback_rq()
The ->wake_cpu of the unparked thread is not allowed, making a call
to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
find an allowed, active CPU and promptly resets the allowed CPUs, so
that the task in question ends up on CPU0.
When those unparked tasks are eventually executed, they run
immediately into a BUG:
kernel BUG at kernel/smpboot.c:135!
Just changing the order in which the online/active bits are set
(and adding some memory barriers), would solve the two issues
above. However, it would change the order of operations back to
the one before commit 6acbfb96976f ("sched: Fix hotplug vs.
set_cpus_allowed_ptr()"), thus, reintroducing that particular
problem.
Going further back into history, we have at least the following
commits touching this topic:
- commit 2baab4e90495 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
- commit 5fbd036b552f ("sched: Cleanup cpu_active madness")
Together, these give us the following non-working solutions:
- secondary CPU sets active before online, because active is assumed to
be a subset of online;
- secondary CPU sets online before active, because the primary CPU
assumes that an online CPU is also active;
- secondary CPU sets online and waits for primary CPU to set active,
because it might deadlock.
Commit 875ebe940d77 ("powerpc/smp: Wait until secondaries are
active & online") introduces an arch-specific solution to this
arch-independent problem.
Now, go for a more general solution without explicit waiting and
simply set active twice: once on the secondary CPU after online
was set and once on the primary CPU after online was seen.
set_cpus_allowed_ptr()")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Wilson <msw@amazon.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
Link: http://lkml.kernel.org/r/1439408156-18840-1-git-send-email-jschoenh@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU cleanup from Paul E. McKenney:
"Privatize smp_mb__after_unlock_lock(). This commit moves the
definition of smp_mb__after_unlock_lock() to kernel/rcu/tree.h,
in recognition of the fact that RCU is the only thing using
this, that nothing else is likely to use it, and that it is
likely to go away completely."
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull irq fixes from Thomas Gleixner:
"A series of small fixlets for a regression visible on OMAP devices
caused by the conversion of the OMAP interrupt chips to hierarchical
interrupt domains. Mostly one liners on the driver side plus a small
helper function in the core to avoid open coded mess in the drivers"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/crossbar: Restore set_wake functionality
irqchip/crossbar: Restore the mask on suspend behaviour
ARM: OMAP: wakeupgen: Restore the irq_set_type() mechanism
irqchip/crossbar: Restore the irq_set_type() mechanism
genirq: Introduce irq_chip_set_type_parent() helper
genirq: Don't return ENOSYS in irq_chip_retrigger_hierarchy
Pull timer fixes from Thomas Gleixner:
"Two minimalistic fixes for 4.2 regressions:
- Eric fixed a thinko in the timer_list base switching code caused by
the overhaul of the timer wheel. It can cause a cpu to see the
wrong base for a timer while we move the timer around.
- Guenter fixed a regression for IMX if booted w/o device tree, where
the timer interrupt is not initialized and therefor the machine
fails to boot"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/imx: Fix boot with non-DT systems
timer: Write timer->flags atomically
Commit 75e3b37d0598 ("hrtimer: Drop return code of hrtimer_switch_to_hres()")
drops the return code of hrtimer_switch_to_hres(). While doing so, it also
drops the return statement itself on failure. This may cause a system hang.
Seen when running arm:multi_v7_defconfig in qemu with devicetree file
vexpress-v2p-ca9.
Fixes: 75e3b37d0598 ("hrtimer: Drop return code of hrtimer_switch_to_hres()")
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Link: http://lkml.kernel.org/r/1440231047-16256-1-git-send-email-linux@roeck-us.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This helper is required for irq chips which do not implement a
irq_set_type callback and need to call down the irq domain hierarchy
for the actual trigger type change.
This helper is required to fix further wreckage caused by the
conversion of TI OMAP to hierarchical irq domains and therefor tagged
for stable.
[ tglx: Massaged changelog ]
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: <linux@arm.linux.org.uk>
Cc: <nsekhar@ti.com>
Cc: <jason@lakedaemon.net>
Cc: <balbi@ti.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: <tony@atomide.com>
Cc: <marc.zyngier@arm.com>
Cc: stable@vger.kernel.org # 4.1
Link: http://lkml.kernel.org/r/1439554830-19502-3-git-send-email-grygorii.strashko@ti.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
irq_chip_retrigger_hierarchy() returns -ENOSYS if it was not able to
find at least one .irq_retrigger() callback implemented in the IRQ
domain hierarchy.
That's wrong, because check_irq_resend() expects a 0 return value from
the callback in case that the hardware assisted resend was not
possible. If the return value is non zero the core code assumes
hardware resend success and the software resend is not invoked.
This results in lost interrupts on platforms where none of the parent
irq chips in the hierarchy implements the retrigger callback.
This is observable on TI OMAP, where the hierarchy is:
ARM GIC <- OMAP wakeupgen <- TI Crossbar
Return 0 instead so the software resend mechanism gets invoked.
[ tglx: Massaged changelog ]
Fixes: 85f08c17de26 ('genirq: Introduce helper functions...')
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: <linux@arm.linux.org.uk>
Cc: <nsekhar@ti.com>
Cc: <jason@lakedaemon.net>
Cc: <balbi@ti.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: <tony@atomide.com>
Cc: stable@vger.kernel.org # 4.1
Link: http://lkml.kernel.org/r/1439554830-19502-2-git-send-email-grygorii.strashko@ti.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This allows cgroup subsystems to use a different name on the unified
hierarchy. cgroup_subsys->name is used on the unified hierarchy,
->legacy_name elsewhere. If ->legacy_name is not explicitly set, it's
automatically set to ->name and the userland visible behavior remains
unchanged.
v2: Make parse_cgroupfs_options() only consider ->legacy_name as mount
options are used only on legacy hierarchies. Suggested by Li
Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
It doesn't make sense to print subsystems on mount option or
/proc/PID/cgroup for the default hierarchy.
* cgroup.controllers file at the root of the default hierarchy lists
the currently attached controllers.
* The default hierarchy is catch-all for unmounted subsystems.
* The default hierarchy doesn't accept any mount options.
Suppress subsystem printing on mount options and /proc/PID/cgroup for
the default hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
The variable called "this_base" is confusing because its name suggests
it's of "struct hrtimer_clock_base" type, along with "base" and "new_base"
which doesn't help understanding this complicated function.
Make its name clearer and fix the misleading comment while at it.
[ tglx: Fixed the comment for real ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1439907509-9553-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
lock_timer_base() cannot prevent the following :
CPU1 ( in __mod_timer()
timer->flags |= TIMER_MIGRATING;
spin_unlock(&base->lock);
base = new_base;
spin_lock(&base->lock);
// The next line clears TIMER_MIGRATING
timer->flags &= ~TIMER_BASEMASK;
CPU2 (in lock_timer_base())
see timer base is cpu0 base
spin_lock_irqsave(&base->lock, *flags);
if (timer->flags == tf)
return base; // oops, wrong base
timer->flags |= base->cpu // too late
We must write timer->flags in one go, otherwise we can fool other cpus.
Fixes: bc7a34b8b9eb ("timer: Reduce timer migration overhead if disabled")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Christopherson <jon@jons.org>
Cc: David Miller <davem@davemloft.net>
Cc: xen-devel@lists.xen.org
Cc: david.vrabel@citrix.com
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Link: http://lkml.kernel.org/r/1439831928.32680.11.camel@edumazet-glaptop2.roam.corp.google.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Pull cgroup fix from Tejun Heo:
"A fix for a subtle bug introduced back during 3.17 cycle which
interferes with setting configurations under specific conditions"
* 'for-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: use trialcs->mems_allowed as a temp variable
The conversion between struct timespec and jiffies is not year 2038
safe on 32bit systems. Introduce timespec64_to_jiffies() and
jiffies_to_timespec64() functions which use struct timespec64 to
make it ready for 2038 issue.
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The current_kernel_time() is not year 2038 safe on 32bit systems
since it returns a timespec value. Introduce current_kernel_time64()
which returns a timespec64 value.
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The weak update_persistent_clock64() calls update_persistent_clock(),
if the architecture defines an update_persistent_clock64() to replace
and remove its update_persistent_clock() version, when building the
kernel the linker will throw an undefined symbol error, that is, any
arch that switches to update_persistent_clock64() will have this issue.
To solve the issue, we add the common weak update_persistent_clock().
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Two issues were found on an IMX6 development board without an
enabled RTC device(resulting in the boot time and monotonic
time being initialized to 0).
Issue 1:exportfs -a generate:
"exportfs: /opt/nfs/arm does not support NFS export"
Issue 2:cat /proc/stat:
"btime 4294967236"
The same issues can be reproduced on x86 after running the
following code:
int main(void)
{
struct timeval val;
int ret;
val.tv_sec = 0;
val.tv_usec = 0;
ret = settimeofday(&val, NULL);
return 0;
}
Two issues are different symptoms of same problem:
The reason is a positive wall_to_monotonic pushes boot time back
to the time before Epoch, and getboottime will return negative
value.
In symptom 1:
negative boot time cause get_expiry() to overflow time_t
when input expire time is 2147483647, then cache_flush()
always clears entries just added in ip_map_parse.
In symptom 2:
show_stat() uses "unsigned long" to print negative btime
value returned by getboottime.
This patch fix the problem by prohibiting time from being set to a value which
would cause a negative boot time. As a result one can't set the CLOCK_REALTIME
time prior to (1970 + system uptime).
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wang YanQing <udknight@gmail.com>
[jstultz: reworded commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
timespec_trunc() avoids rounding if granularity <= nanoseconds-per-jiffie
(or TICK_NSEC). This optimization assumes that:
1. current_kernel_time().tv_nsec is already rounded to TICK_NSEC (i.e.
with HZ=1000 you'd get 1000000, 2000000, 3000000... but never 1000001).
This is no longer true (probably since hrtimers introduced in 2.6.16).
2. TICK_NSEC is evenly divisible by all possible granularities. This may
be true for HZ=100, 250, 1000, but obviously not for HZ=300 /
TICK_NSEC=3333333 (introduced in 2.6.20).
Thus, sub-second portions of in-core file times are not rounded to on-disk
granularity. I.e. file times may change when the inode is re-read from disk
or when the file system is remounted.
This affects all file systems with file time granularities > 1 ns and < 1s,
e.g. CEPH (1000 ns), UDF (1000 ns), CIFS (100 ns), NTFS (100 ns) and FUSE
(configurable from user mode via struct fuse_init_out.time_gran).
Steps to reproduce with e.g. UDF:
$ dd if=/dev/zero of=udfdisk count=10000 && mkudffs udfdisk
$ mkdir udf && mount udfdisk udf
$ touch udf/test && stat -c %y udf/test
2015-06-09 10:22:56.130006767 +0200
$ umount udf && mount udfdisk udf
$ stat -c %y udf/test
2015-06-09 10:22:56.130006000 +0200
Remounting truncates the mtime to 1 µs.
Fix the rounding in timespec_trunc() and update the documentation.
timespec_trunc() is exclusively used to calculate inode's [acm]time (mostly
via current_fs_time()), and always with super_block.s_time_gran as second
argument. So this can safely be changed without side effects.
Note: This does _not_ fix the issue for FAT's 2 second mtime resolution,
as super_block.s_time_gran isn't prepared to handle different ctime /
mtime / atime resolutions nor resolutions > 1 second.
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
I noticed for non-monotonic timers in timer_list, some of the
output looked a little confusing.
For example:
#1: <0000000000000000>, posix_timer_fn, S:01, hrtimer_start_range_ns, leap-a-day/2360
# expires at 1434412800000000000-1434412800000000000 nsecs [in 1434410725062375469 to 1434410725062375469 nsecs]
You'll note the relative time till the expiration "[in xxx to
yyy nsecs]" is incorrect. This is because its printing the delta
between CLOCK_MONOTONIC time to the CLOCK_REALTIME expiration.
This patch fixes this issue by adding the clock offset to the
"now" time which we use to calculate the delta.
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Pull locking fix from Ingo Molnar:
"A single fix for a locking self-test crash"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/pvqspinlock: Fix kernel panic in locking-selftest
Existing users of ioremap_cache() are mapping memory that is known in
advance to not have i/o side effects. These users are forced to cast
away the __iomem annotation, or otherwise neglect to fix the sparse
errors thrown when dereferencing pointers to this memory. Provide
memremap() as a non __iomem annotated ioremap_*() in the case when
ioremap is otherwise a pointer to cacheable memory. Empirically,
ioremap_<cacheable-type>() call sites are seeking memory-like semantics
(e.g. speculative reads, and prefetching permitted).
memremap() is a break from the ioremap implementation pattern of adding
a new memremap_<type>() for each mapping type and having silent
compatibility fall backs. Instead, the implementation defines flags
that are passed to the central memremap() and if a mapping type is not
supported by an arch memremap returns NULL.
We introduce a memremap prototype as a trivial wrapper of
ioremap_cache() and ioremap_wt(). Later, once all ioremap_cache() and
ioremap_wt() usage has been removed from drivers we teach archs to
implement arch_memremap() with the ability to strictly enforce the
mapping type.
Cc: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Move certificate handling out of the kernel/ directory and into a certs/
directory to get all the weird stuff in one place and move the generated
signing keys into this directory.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
Conflicts:
drivers/net/ethernet/cavium/Kconfig
The cavium conflict was overlapping dependency
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
The Intel build-bot detected a sparse warning with with a patch I posted a
couple of days ago that was accepted in the audit/next tree:
Subject: [linux-next:master 6689/6751] kernel/audit_watch.c:543:36: sparse: dereference of noderef expression
Date: Friday, August 07, 2015, 06:57:55 PM
From: kbuild test robot <fengguang.wu@intel.com>
tree: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
head: e6455bc5b91f41f842f30465c9193320f0568707
commit: 2e3a8aeb63e5335d4f837d453787c71bcb479796 [6689/6751] Merge remote- tracking branch 'audit/next'
sparse warnings: (new ones prefixed by >>)
>> kernel/audit_watch.c:543:36: sparse: dereference of noderef expression
kernel/audit_watch.c:544:28: sparse: dereference of noderef expression
34d99af5 Richard Guy Briggs 2015-08-05 541 int audit_exe_compare(struct task_struct *tsk, struct audit_fsnotify_mark *mark)
34d99af5 Richard Guy Briggs 2015-08-05 542 {
34d99af5 Richard Guy Briggs 2015-08-05 @543 unsigned long ino = tsk->mm- >exe_file->f_inode->i_ino;
34d99af5 Richard Guy Briggs 2015-08-05 544 dev_t dev = tsk->mm->exe_file- >f_inode->i_sb->s_dev;
:::::: The code at line 543 was first introduced by commit
:::::: 34d99af52ad40bd498ba66970579a5bc1fb1a3bc audit: implement audit by executable
tsk->mm->exe_file requires RCU access. The warning was reproduceable by adding
"C=1 CF=-D__CHECK_ENDIAN__" to the build command, and verified eliminated with
this patch.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
The code that places signals in signal queues computes the uids, gids,
and pids at the time the signals are enqueued. Which means that tasks
that share signal queues must be in the same pid and user namespaces.
Sharing signal handlers is fine, but bizarre.
So make the code in fork and userns_install clearer by only testing
for what is functionally necessary.
Also update the comment in unshare about unsharing a user namespace to
be a little more explicit and make a little more sense.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In the logic in the initial commit of unshare made creating a new
thread group for a process, contingent upon creating a new memory
address space for that process. That is wrong. Two separate
processes in different thread groups can share a memory address space
and clone allows creation of such proceses.
This is significant because it was observed that mm_users > 1 does not
mean that a process is multi-threaded, as reading /proc/PID/maps
temporarily increments mm_users, which allows other processes to
(accidentally) interfere with unshare() calls.
Correct the check in check_unshare_flags() to test for
!thread_group_empty() for CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM.
For sighand->count > 1 for CLONE_SIGHAND and CLONE_VM.
For !current_is_single_threaded instead of mm_users > 1 for CLONE_VM.
By using the correct checks in unshare this removes the possibility of
an accidental denial of service attack.
Additionally using the correct checks in unshare ensures that only an
explicit unshare(CLONE_VM) can possibly trigger the slow path of
current_is_single_threaded(). As an explict unshare(CLONE_VM) is
pointless it is not expected there are many applications that make
that call.
Cc: stable@vger.kernel.org
Fixes: b2e0d98705e60e45bbb3c0032c48824ad7ae0704 userns: Implement unshare of the user namespace
Reported-by: Ricky Zhou <rickyz@chromium.org>
Reported-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
A PKCS#7 or CMS message can have per-signature authenticated attributes
that are digested as a lump and signed by the authorising key for that
signature. If such attributes exist, the content digest isn't itself
signed, but rather it is included in a special authattr which then
contributes to the signature.
Further, we already require the master message content type to be
pkcs7_signedData - but there's also a separate content type for the data
itself within the SignedData object and this must be repeated inside the
authattrs for each signer [RFC2315 9.2, RFC5652 11.1].
We should really validate the authattrs if they exist or forbid them
entirely as appropriate. To this end:
(1) Alter the PKCS#7 parser to reject any message that has more than one
signature where at least one signature has authattrs and at least one
that does not.
(2) Validate authattrs if they are present and strongly restrict them.
Only the following authattrs are permitted and all others are
rejected:
(a) contentType. This is checked to be an OID that matches the
content type in the SignedData object.
(b) messageDigest. This must match the crypto digest of the data.
(c) signingTime. If present, we check that this is a valid, parseable
UTCTime or GeneralTime and that the date it encodes fits within
the validity window of the matching X.509 cert.
(d) S/MIME capabilities. We don't check the contents.
(e) Authenticode SP Opus Info. We don't check the contents.
(f) Authenticode Statement Type. We don't check the contents.
The message is rejected if (a) or (b) are missing. If the message is
an Authenticode type, the message is rejected if (e) is missing; if
not Authenticode, the message is rejected if (d) - (f) are present.
The S/MIME capabilities authattr (d) unfortunately has to be allowed
to support kernels already signed by the pesign program. This only
affects kexec. sign-file suppresses them (CMS_NOSMIMECAP).
The message is also rejected if an authattr is given more than once or
if it contains more than one element in its set of values.
(3) Add a parameter to pkcs7_verify() to select one of the following
restrictions and pass in the appropriate option from the callers:
(*) VERIFYING_MODULE_SIGNATURE
This requires that the SignedData content type be pkcs7-data and
forbids authattrs. sign-file sets CMS_NOATTR. We could be more
flexible and permit authattrs optionally, but only permit minimal
content.
(*) VERIFYING_FIRMWARE_SIGNATURE
This requires that the SignedData content type be pkcs7-data and
requires authattrs. In future, this will require an attribute
holding the target firmware name in addition to the minimal set.
(*) VERIFYING_UNSPECIFIED_SIGNATURE
This requires that the SignedData content type be pkcs7-data but
allows either no authattrs or only permits the minimal set.
(*) VERIFYING_KEXEC_PE_SIGNATURE
This only supports the Authenticode SPC_INDIRECT_DATA content type
and requires at least an SpcSpOpusInfo authattr in addition to the
minimal set. It also permits an SPC_STATEMENT_TYPE authattr (and
an S/MIME capabilities authattr because the pesign program doesn't
remove these).
(*) VERIFYING_KEY_SIGNATURE
(*) VERIFYING_KEY_SELF_SIGNATURE
These are invalid in this context but are included for later use
when limiting the use of X.509 certs.
(4) The pkcs7_test key type is given a module parameter to select between
the above options for testing purposes. For example:
echo 1 >/sys/module/pkcs7_test_key/parameters/usage
keyctl padd pkcs7_test foo @s </tmp/stuff.pkcs7
will attempt to check the signature on stuff.pkcs7 as if it contains a
firmware blob (1 being VERIFYING_FIRMWARE_SIGNATURE).
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
Fix up the dependencies somewhat too, while we're at it.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Pull RCU changes from Paul E. McKenney:
- The combination of tree geometry-initialization simplifications
and OS-jitter-reduction changes to expedited grace periods.
These two are stacked due to the large number of conflicts
that would otherwise result.
[ With one addition, a temporary commit to silence a lockdep false
positive. Additional changes to the expedited grace-period
primitives (queued for 4.4) remove the cause of this false
positive, and therefore include a revert of this temporary commit. ]
- Documentation updates.
- Torture-test updates.
- Miscellaneous fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The "dl_boosted" flag is set by comparing *absolute* deadlines
(c.f., rt_mutex_setprio()).
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1438782979-9057-2-git-send-email-parri.andrea@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment is "misleading"; fix it by adapting a comment from
push_rt_tasks().
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1438782979-9057-1-git-send-email-parri.andrea@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change the calling context of sched_class::set_cpus_allowed() such
that we can assume the task is inactive.
This allows us to easily make changes that affect accounting done by
enqueue/dequeue. This does in fact completely remove
set_cpus_allowed_rt() and greatly reduces set_cpus_allowed_dl().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.667516139@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Give every class a set_cpus_allowed() method, this enables some small
optimization in the RT,DL implementation by avoiding a double
cpumask_weight() call.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.614517487@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
without locks, a caller might observe an old value and race with the
set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
it:
__kthread_bind()
do_set_cpus_allowed()
<SYSCALL>
sched_setaffinity()
if (p->flags & PF_NO_SETAFFINITIY)
set_cpus_allowed_ptr()
p->flags |= PF_NO_SETAFFINITY
Fix the bug by putting everything under the regular scheduler locks.
This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Current code ensures that a task has a normalized vruntime when switching away
from the fair class, but it does not ensure the task has a non-normalized
vruntime when switching back to the fair class.
This is an example breaking this consistency:
1. a task is in fair class and !queued
2. changes its class to RT class (still !queued)
3. changes its class to fair class again (still !queued)
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439197375-27927-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Systems which have all nodes at a distance of at most 1 hop should be
identified as 'NUMA_DIRECT'.
However, the scheduler incorrectly identifies it as 'NUMA_BACKPLANE'.
This is because 'n' is assigned to sched_max_numa_distance but the
code (mis)interprets it to mean 'number of hops'.
Rik had actually used sched_domains_numa_levels for detecting a
'NUMA_DIRECT' topology:
http://marc.info/?l=linux-kernel&m=141279712429834&w=2
But that was changed when he removed the hops table in the
subsequent version:
http://marc.info/?l=linux-kernel&m=141353106106771&w=2
Fixing the issue here.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439256048-3748-1-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The qrwlock implementation is slightly heavy in its use of memory
barriers, mainly through the use of _cmpxchg() and _return() atomics, which
imply full barrier semantics.
This patch modifies the qrwlock code to use the more relaxed atomic
routines so that we can reduce the unnecessary barrier overhead on
weakly-ordered architectures.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hp.com
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1438880084-18856-7-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A question [1] was raised about the use of page::private in AUX buffer
allocations, so let's add a clarification about its intended use.
The private field and flag are used by perf's rb_alloc_aux() path to
tell the pmu driver the size of each high-order allocation, so that the
driver can program those appropriately into its hardware. This only
matters for PMUs that don't support hardware scatter tables. Otherwise,
every page in the buffer is just a page.
This patch adds a comment about the private field to the AUX buffer
allocation path.
[1] http://marc.info/?l=linux-kernel&m=143803696607968
Reported-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1438063204-665-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>