21763 Commits

Author SHA1 Message Date
Tom Zanussi
4e4a4d7570 tracing: Update cond flag when enabling or disabling a trigger
When a trigger is enabled, the cond flag should be set beforehand,
otherwise a trigger that's expecting to process a trace record
(e.g. one with post_trigger set) could be invoked without one.

Likewise a trigger's cond flag should be reset after it's disabled,
not before.

Link: http://lkml.kernel.org/r/a420b52a67b1c2d3cab017914362d153255acb99.1448303214.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25 15:24:14 -05:00
Steven Rostedt (Red Hat)
4239c38fe0 ring-buffer: Process commits whenever moving to a new page.
When crossing over to a new page, commit the current work. This will allow
readers to get data with less latency, and also simplifies the work to get
timestamps working for interrupted events.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25 15:24:05 -05:00
Arnd Bergmann
d2b4365809 cpuset: Replace all instances of time_t with time64_t
The following patch replaces all instances of time_t with time64_t i.e.
change the type used for representing time from 32-bit to 64-bit. All
32-bit kernels to date use a signed 32-bit time_t type, which can only
represent time until January 2038. Since embedded systems running 32-bit
Linux are going to survive beyond that date, we have to change all
current uses, in a backwards compatible way.

The patch also changes the function get_seconds() that returns a 32-bit
integer to ktime_get_seconds() that returns seconds as 64-bit integer.

The patch changes the type of ticks from time_t to u32. We keep ticks as
32-bits as the function uses 32-bit arithmetic which would prove less
expensive than 64-bit arithmetic and the function is expected to be
called atleast once every 32 seconds.

Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-25 14:00:05 -05:00
Daniel Borkmann
c9da161c65 bpf: fix clearing on persistent program array maps
Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.

Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.

Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
this flush until the last reference to a user is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.

Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.

Joint work with Alexei.

Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-25 12:14:09 -05:00
Eric Dumazet
81b1a832d7 pidns: fix NULL dereference in __task_pid_nr_ns()
I got a crash during a "perf top" session that was caused by a race in
__task_pid_nr_ns() :

pid_nr_ns() was inlined, but apparently compiler chose to read
task->pids[type].pid twice, and the pid->level dereference crashed
because we got a NULL pointer at the second read :

    if (pid && ns->level <= pid->level) { // CRASH

Just use RCU API properly to solve this race, and not worry about "perf
top" crashing hosts :(

get_task_pid() can benefit from same fix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-24 12:03:55 -08:00
Steven Rostedt (Red Hat)
70004986ff ring-buffer: Remove redundant update of page timestamp
The first commit of a buffer page updates the timestamp of that page. No
need to have the update to the next page add the timestamp too. It will only
be replaced by the first commit on that page anyway.

Only update to a page if it contains an event.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24 09:29:16 -05:00
Steven Rostedt (Red Hat)
8573636ea7 ring-buffer: Use READ_ONCE() for most tail_page access
As cpu_buffer->tail_page may be modified by interrupts at almost any time,
the flow of logic is very important. Do not let gcc get smart with
re-reading cpu_buffer->tail_page by adding READ_ONCE() around most of its
accesses.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24 09:29:15 -05:00
Steven Rostedt (Red Hat)
bd1b7cd360 ring-buffer: Put back the length if crossed page with add_timestamp
Commit fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing
data" added a descriptor that holds various data instead of passing around
several variables through parameters. The problem was that one of the
parameters was modified in a function and the code was designed not to have
an effect on that modified  parameter. Now that the parameter is a
descriptor and any modifications to it are non-volatile, the size of the
data could be unnecessarily expanded.

Remove the extra space added if a timestamp was added and the event went
across the page.

Cc: stable@vger.kernel.org # 4.3+
Fixes: fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing data"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24 09:27:25 -05:00
Steven Rostedt (Red Hat)
b81f472a20 ring-buffer: Update read stamp with first real commit on page
Do not update the read stamp after swapping out the reader page from the
write buffer. If the reader page is swapped out of the buffer before an
event is written to it, then the read_stamp may get an out of date
timestamp, as the page timestamp is updated on the first commit to that
page.

rb_get_reader_page() only returns a page if it has an event on it, otherwise
it will return NULL. At that point, check if the page being returned has
events and has not been read yet. Then at that point update the read_stamp
to match the time stamp of the reader page.

Cc: stable@vger.kernel.org # 2.6.30+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24 09:23:17 -05:00
Andy Lutomirski
ed11a7f1b3 context_tracking: Switch to new static_branch API
This is much less error-prone than the old code.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/812df7e64f120c5c7c08481f36a8caa9f53b2199.1447361906.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-24 09:56:43 +01:00
Paul E. McKenney
6cf1008122 rcu: Add transitivity to remaining rcu_node ->lock acquisitions
The rule is that all acquisitions of the rcu_node structure's ->lock
must provide transitivity:  The lock is not acquired that frequently,
and sorting out exactly which required it and which did not would be
a maintenance nightmare.  This commit therefore supplies the needed
transitivity to the remaining ->lock acquisitions.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-11-23 10:37:35 -08:00
Peter Zijlstra
2a67e741bb rcu: Create transitive rnp->lock acquisition functions
Providing RCU's memory-ordering guarantees requires that the rcu_node
tree's locking provide transitive memory ordering, which the Linux kernel's
spinlocks currently do not provide unless smp_mb__after_unlock_lock()
is used.  Having a separate smp_mb__after_unlock_lock() after each and
every lock acquisition is error-prone, hard to read, and a bit annoying,
so this commit provides wrapper functions that pull in the
smp_mb__after_unlock_lock() invocations.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-11-23 10:37:35 -08:00
Waiman Long
d78045306c locking/pvqspinlock, x86: Optimize the PV unlock code path
The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.

For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also be in a separate cacheline from
__pv_queued_spin_unlock().

This patch optimizes the PV unlock code path by:

 1) Moving the unlock slowpath code from the fastpath into a separate
    __pv_queued_spin_unlock_slowpath() function to make the fastpath
    as simple as possible..

 2) For x86-64, hand-coded an assembly function to combine the register
    saving thunk code with the fastpath code. Only registers that
    are used in the fastpath will be saved and restored. If the
    fastpath fails, the slowpath function will be called via another
    PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
    __pv_queued_spin_unlock() code as the thunk saves and restores
    only one 32-bit register.

With a microbenchmark of 5M lock-unlock loop, the table below shows
the execution times before and after the patch with different number
of threads in a VM running on a 32-core Westmere-EX box with x86-64
4.2-rc1 based kernels:

  Threads	Before patch	After patch	% Change
  -------	------------	-----------	--------
     1		   134.1 ms	  119.3 ms	  -11%
     2		   1286  ms	   953  ms	  -26%
     3		   3715  ms	  3480  ms	  -6.3%
     4		   4092  ms	  3764  ms	  -8.0%

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-5-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 10:02:02 +01:00
Waiman Long
aa68744f80 locking/qspinlock: Avoid redundant read of next pointer
With optimistic prefetch of the next node cacheline, the next pointer
may have been properly inititalized. As a result, the reading
of node->next in the contended path may be redundant. This patch
eliminates the redundant read if the next pointer value is not NULL.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 10:02:00 +01:00
Waiman Long
81b5598665 locking/qspinlock: Prefetch the next node cacheline
A queue head CPU, after acquiring the lock, will have to notify
the next CPU in the wait queue that it has became the new queue
head. This involves loading a new cacheline from the MCS node of the
next CPU. That operation can be expensive and add to the latency of
locking operation.

This patch addes code to optmistically prefetch the next MCS node
cacheline if the next pointer is defined and it has been spinning
for the MCS lock for a while. This reduces the locking latency and
improves the system throughput.

The performance change will depend on whether the prefetch overhead
can be hidden within the latency of the lock spin loop. On really
short critical section, there may not be performance gain at all. With
longer critical section, however, it was found to have a performance
boost of 5-10% over a range of different queue depths with a spinlock
loop microbenchmark.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 10:01:59 +01:00
Waiman Long
64d816cba0 locking/qspinlock: Use _acquire/_release() versions of cmpxchg() & xchg()
This patch replaces the cmpxchg() and xchg() calls in the native
qspinlock code with the more relaxed _acquire or _release versions of
those calls to enable other architectures to adopt queued spinlocks
with less memory barrier performance overhead.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 10:01:58 +01:00
Byungchul Park
525628c73b sched/fair: Modify the comment about lock assumptions in migrate_task_rq_fair()
The comment describing migrate_task_rq_fair() says that the caller
should hold p->pi_lock. But in some cases the caller can hold
task_rq(p)->lock instead of p->pi_lock. So the comment is broken and
this patch fixes it.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447806899-20303-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:20 +01:00
Oleg Nesterov
accaf6ea3d stop_machine: Clean up the usage of the preemption counter in cpu_stopper_thread()
1. Change this code to use preempt_count_inc/preempt_count_dec; this way
   it works even if CONFIG_PREEMPT_COUNT=n, and we avoid the unnecessary
   __preempt_schedule() check (stop_sched_class is not preemptible).

   And this makes clear that we only want to make preempt_count() != 0
   for __might_sleep() / schedule_debug().

2. Change WARN_ONCE() to use %pf to print the function name and remove
   kallsyms_lookup/ksym_buf.

3. Move "int ret" into the "if (work)" block, this looks more consistent.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193332.GA8281@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:20 +01:00
Oleg Nesterov
dd2e3121e3 stop_machine: Shift the 'done != NULL' check from cpu_stop_signal_done() to callers
Change cpu_stop_queue_work() and cpu_stopper_thread() to check done != NULL
before cpu_stop_signal_done(done). This makes the code more clean imo, note
that cpu_stopper_thread() has to do this check anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193329.GA8274@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:19 +01:00
Oleg Nesterov
6fa3b826bc stop_machine: Kill cpu_stop_done->executed
Now that cpu_stop_done->executed becomes write-only (ignoring WARN_ON()
checks) we can remove it.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193326.GA8269@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:19 +01:00
Oleg Nesterov
4aff1ca697 stop_machine: Change __stop_cpus() to rely on cpu_stop_queue_work()
Change queue_stop_cpus_work() to return true if it queues at least one
work, this means that the caller should wait.

__stop_cpus() can check the value returned by queue_stop_cpus_work() and
avoid done.executed, just like stop_one_cpu() does.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193323.GA8262@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:19 +01:00
Oleg Nesterov
958c5f848e stop_machine: Change stop_one_cpu() to rely on cpu_stop_queue_work()
Change stop_one_cpu() to return -ENOENT if cpu_stop_queue_work() fails.
Otherwise we know that ->executed must be true after wait_for_completion()
so we can just return done.ret.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193320.GA8259@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:18 +01:00
Oleg Nesterov
1b034bd989 stop_machine: Make cpu_stop_queue_work() and stop_one_cpu_nowait() return bool
Change cpu_stop_queue_work() to return true if the work was queued and
change stop_one_cpu_nowait() to return the result of cpu_stop_queue_work().
This makes it more useful, for example now you can alloc cpu_stop_work for
stop_one_cpu_nowait() and free it in the callback or if stop_one_cpu_nowait()
fails, currently this is impossible because you can't know if @fn will be
called or not.

Also, this allows to kill cpu_stop_done->executed, see the next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151117170523.GA13955@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:18 +01:00
Oleg Nesterov
6a19005157 stop_machine: Don't disable preemption in stop_two_cpus()
Now that stop_two_cpus() path does not check cpu_active() we can remove
preempt_disable(), it was only needed to ensure that stop_machine() can
not be called after we observe cpu_active() == T and before we queue the
new work.

Also, turn the pointless and confusing ->executed check into WARN_ON().
We know that both works must be executed, otherwise we have a bug. And
in fact I think that done->executed should die, see the next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193314.GA8249@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:18 +01:00
Oleg Nesterov
64038f292a stop_machine: Fix possible cpu_stopper_thread() crash
stop_one_cpu_nowait(fn) will crash the kernel if the callback returns
nonzero, work->done == NULL in this case.

This needs more cleanups, cpu_stop_signal_done() is called right after
we check done != NULL and it does the same check.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193311.GA8242@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:17 +01:00
Geliang Tang
01783e0d45 sched/core: Use list_is_singular() in sched_can_stop_tick()
Use list_is_singular() to check if run_list has only one entry.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a5453fafd735affcf28e53a1d0a3d6965cb5dbb5.1447582547.git.geliangtang@163.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:17 +01:00
Joonwoo Park
3ea94de15c sched/core: Fix incorrect wait time and wait count statistics
At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: ohaugan@codeaurora.org
Link: http://lkml.kernel.org/r/20151113033854.GA4247@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:17 +01:00
Peter Zijlstra
90eec103b9 treewide: Remove old email address
There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:44:58 +01:00
Rik van Riel
51170840fe sched/numa: Cap PTE scanning overhead to 3% of run time
There is a fundamental mismatch between the runtime based NUMA scanning
at the task level, and the wall clock time NUMA scanning at the mm level.
On a severely overloaded system, with very large processes, this mismatch
can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa(),
and only gets two ticks of CPU time in the real time between two scan
intervals of the mm.

This patch ensures that a task never spends more than 3% of run
time scanning PTEs. It does that by ensuring that in-between
task_numa_work() runs, the task spends at least 32x as much time on
other things than it did on task_numa_work().

This is done stochastically: if a timer tick happens, or the task
gets rescheduled during task_numa_work(), we delay a future run of
task_numa_work() until the task has spent at least 32x the amount of
CPU time doing something else, as it spent inside task_numa_work().
The longer task_numa_work() takes, the more likely it is this happens.

If task_numa_work() takes very little time, chances are low that that
code will do anything, but we will not care.

Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1446756983-28173-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:54 +01:00
Byungchul Park
525705d15e sched/fair: Consider missed ticks in NOHZ_FULL in update_cpu_load_nohz()
Usually the tick can be stopped for an idle CPU in NOHZ. However in NOHZ_FULL
mode, a non-idle CPU's tick can also be stopped. However, update_cpu_load_nohz()
does not consider the case a non-idle CPU's tick has been stopped at all.

This patch makes the update_cpu_load_nohz() know if the calling path comes
from NOHZ_FULL or idle NOHZ.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447115762-19734-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:53 +01:00
Byungchul Park
5954327548 sched/fair: Prepare __update_cpu_load() to handle active tickless
There are some cases where distance between ticks is more than one tick
while the CPU is not idle, e.g. full NOHZ.

However __update_cpu_load() assumes it is the idle tickless case if the
distance between ticks is more than 1, even though it can be the active
tickless case as well. Thus in the active tickless case, updating the CPU
load will not be performed correctly.

Where the current code assumes the load for each tick is zero, this is
(obviously) not true in non-idle tickless case. We can approximately
consider the load ~= this_rq->cpu_load[0] during tickless in non-idle
tickless case.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444816056-11886-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:53 +01:00
Peter Zijlstra
d937cdc59e sched/fair: Clean up the explanation around decaying load update misses
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:52 +01:00
Dietmar Eggemann
38c6ade2dd sched/fair: Remove empty idle enter and exit functions
Commit cd126afe838d ("sched/fair: Remove rq's runnable avg") got rid of
rq->avg and so there is no need to update it any more when entering or
exiting idle.

Remove the now empty functions idle_{enter|exit}_fair().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1445342681-17171-1-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:51 +01:00
Arnd Bergmann
89b411081d sched/rt: Hide the push_irq_work_func() declaration
The push_irq_work_func() function is conditionally defined only
when both CONFIG_SMP and HAVE_RT_PUSH_IPI are defined, but the
forward declaration remains visibile without HAVE_RT_PUSH_IPI,
causing a gcc warning in ARM64 allnoconfig:

  kernel/sched/rt.c:68:13: warning: 'push_irq_work_func' declared 'static' but never defined [-Wunused-function]

This changes the code to use the same condition for both the
declaration and the function definition, which gets rid of the
warning.

As Peter Zijlstra, we can possibly get rid of the whole HAVE_RT_PUSH_IPI
thing after:

  8053871d0f7f ("smp: Fix smp_call_function_single_async() locking")

Until that is done, this patch can be used to avoid the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
Link: http://lkml.kernel.org/r/3828565.oKfGk7yNIT@wuerfel
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:25:08 +01:00
Stephane Eranian
614e4c4ebc perf/core: Robustify the perf_cgroup_from_task() RCU checks
This patch reinforces the lockdep checks performed by
perf_cgroup_from_tsk() by passing the perf_event_context
whenever possible. It is okay to not hold the RCU read lock
when we know we hold the ctx->lock. This patch makes sure this
property holds.

In some functions, such as perf_cgroup_sched_in(), we do not
pass the context because we are sure we are holding the RCU
read lock.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: edumazet@google.com
Link: http://lkml.kernel.org/r/1447322404-10920-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:21:03 +01:00
Stephane Eranian
ddaaf4e291 perf/core: Fix RCU problem with cgroup context switching code
The RCU checker detected RCU violation in the cgroup switching routines
perf_cgroup_sched_in() and perf_cgroup_sched_out(). We were dereferencing
cgroup from task without holding the RCU lock.

Fix this by holding the RCU read lock. We move the locking from
perf_cgroup_switch() to avoid double locking.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: edumazet@google.com
Link: http://lkml.kernel.org/r/1447322404-10920-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:21:03 +01:00
Daniel Vetter
92907cbbef Linux 4.4-rc2
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWUmHZAAoJEHm+PkMAQRiGHtcH/RVRsn8re0WdRWYaTr9+Hknm
 CGlRJN4LKecttgYQ/2bS1QsDbt8usDPBiiYVopqGXQxPBmjyDAqPjsa+8VzCaVc6
 WA+9LDB+PcW28lD6BO+qSZCOAm7hHSZq7dtw9x658IqO+mI2mVeCybsAyunw2iWi
 Kf5q90wq6tIBXuT8YH9MXGrSCQw00NclbYeYwB9CmCt9hT/koEFBdl7uFUFitB+Q
 GSPTz5fXhgc5Lms85n7flZlrVKoQKmtDQe4/DvKZm+SjsATHU9ru89OxDBdS5gSG
 YcEIM4zc9tMjhs3GC9t6WXf6iFOdctum8HOhUoIN/+LVfeOMRRwAhRVqtGJ//Xw=
 =DCUg
 -----END PGP SIGNATURE-----

Merge tag 'v4.4-rc2' into drm-intel-next-queued

Linux 4.4-rc2

Backmerge to get at

commit 1b0e3a049efe471c399674fd954500ce97438d30
Author: Imre Deak <imre.deak@intel.com>
Date:   Thu Nov 5 23:04:11 2015 +0200

    drm/i915/skl: disable display side power well support for now

so that we can proplery re-eanble skl power wells in -next.

Conflicts are just adjacent lines changed, except for intel_fbdev.c
where we need to interleave the changs. Nothing nefarious.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
2015-11-23 09:04:05 +01:00
Vitaly Kuznetsov
7625b3a000 kernel/panic.c: turn off locks debug before releasing console lock
Commit 08d78658f393 ("panic: release stale console lock to always get the
logbuf printed out") introduced an unwanted bad unlock balance report when
panic() is called directly and not from OOPS (e.g.  from out_of_memory()).
The difference is that in case of OOPS we disable locks debug in
oops_enter() and on direct panic call nobody does that.

Fixes: 08d78658f393 ("panic: release stale console lock to always get the logbuf printed out")
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-20 16:17:32 -08:00
Richard Weinberger
9d8a765211 kernel/signal.c: unexport sigsuspend()
sigsuspend() is nowhere used except in signal.c itself, so we can mark it
static do not pollute the global namespace.

But this patch is more than a boring cleanup patch, it fixes a real issue
on UserModeLinux.  UML has a special console driver to display ttys using
xterm, or other terminal emulators, on the host side.  Vegard reported
that sometimes UML is unable to spawn a xterm and he's facing the
following warning:

  WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0()

It turned out that this warning makes absolutely no sense as the UML
xterm code calls sigsuspend() on the host side, at least it tries.  But
as the kernel itself offers a sigsuspend() symbol the linker choose this
one instead of the glibc wrapper.  Interestingly this code used to work
since ever but always blocked signals on the wrong side.  Some recent
kernel change made the WARN_ON() trigger and uncovered the bug.

It is a wonderful example of how much works by chance on computers. :-)

Fixes: 68f3f16d9ad0f1 ("new helper: sigsuspend()")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>	[3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-20 16:17:32 -08:00
Tejun Heo
16af439645 cgroup: implement cgroup_get_from_path() and expose cgroup_put()
Implement cgroup_get_from_path() using kernfs_walk_and_get() which
obtains a default hierarchy cgroup from its path.  This will be used
to allow cgroup path based matching from outside cgroup proper -
e.g. networking and perf.

v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).

Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-20 15:55:52 -05:00
Tejun Heo
b11cfb5807 cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using it
cgroup_is_descendant() currently walks up the hierarchy and compares
each ancestor to the cgroup in question.  While enough for cgroup core
usages, this can't be used in hot paths to test cgroup membership.
This patch adds cgroup->ancestor_ids[] which records the IDs of all
ancestors including self and cgroup->level for the nesting level.

This allows testing whether a given cgroup is a descendant of another
in three finite steps - testing whether the two belong to the same
hierarchy, whether the descendant candidate is at the same or a higher
level than the ancestor and comparing the recorded ancestor_id at the
matching level.  cgroup_is_descendant() is accordingly reimplmented
and made inline.

Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-20 15:55:52 -05:00
Daniel Borkmann
f99bf205da bpf: add show_fdinfo handler for maps
Add a handler for show_fdinfo() to be used by the anon-inodes
backend for eBPF maps, and dump the map specification there. Not
only useful for admins, but also it provides a minimal way to
compare specs from ELF vs pinned object.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 11:04:15 -05:00
Linus Torvalds
a3d66b5a17 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:
 "A fix for module handling in case kASLR has been enabled, from Zhou
  Chengming"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: x86: fix relocation computation with kASLR
2015-11-19 12:16:12 -08:00
Lukas Wunner
581da2cab5 async: export current_is_async()
Introduced by 84b233adcca3 ("workqueue: implement current_is_async()").

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2015-11-19 17:51:48 +01:00
Sudeep Holla
a946e8c717 genirq: Delay incrementing interrupt count if it's disabled/pending
In case of a wakeup interrupt, irq_pm_check_wakeup disables the interrupt
and marks it pending and suspended, disables it and notifies the pm core
about the wake event. The interrupt gets handled later once the system
is resumed.

However the irq stats is updated twice: once when it's disabled waiting
for the system to resume and later when it's handled, resulting in wrong
counting of the wakeup interrupt when waking up the system.

This patch updates the interrupt count so that it's updated only when
the interrupt gets handled. It's already handled correctly in
handle_edge_irq and handle_edge_eoi_irq.

Reported-by: Manoil Claudiu <claudiu.manoil@freescale.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1446661957-1019-1-git-send-email-sudeep.holla@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-11-16 17:55:55 +01:00
Tejun Heo
67e9c74b8a cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
With major controllers - cpu, memory and io - shaping up for the
unified hierarchy, cgroup2 is about ready to be, gradually, released
into the wild.  Replace __DEVEL__sane_behavior flag which was used to
select the unified hierarchy with a separate filesystem type "cgroup2"
so that unified hierarchy can be mounted as follows.

  mount -t cgroup2 none $MOUNT_POINT

The cgroup2 fs has its own magic number - 0x63677270 ("cgrp").

v2: Assign a different magic number to cgroup2 fs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2015-11-16 11:13:34 -05:00
Tejun Heo
34c06254ff cgroup: fix cftype->file_offset handling
6f60eade2433 ("cgroup: generalize obtaining the handles of and
notifying cgroup files") introduced cftype->file_offset so that the
handles for per-css file instances can be recorded.  These handles
then can be used, for example, to generate file modified
notifications.

Unfortunately, it made the wrong assumption that files are created
once for a given css and removed on its destruction.  Due to the
dependencies among subsystems, a css may be hidden from userland and
then later shown again.  This is implemented by removing and
re-creating the affected files, so the associated kernfs_node for a
given cgroup file may change over time.  This incorrect assumption led
to the corruption of css->files lists.

Reimplement cftype->file_offset handling so that cgroup_file->kn is
protected by a lock and updated as files are created and destroyed.
This also makes keeping them on per-cgroup list unnecessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: James Sedgwick <jsedgwick@fb.com>
Fixes: 6f60eade2433 ("cgroup: generalize obtaining the handles of and notifying cgroup files")
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2015-11-16 10:58:26 -05:00
Linus Torvalds
0ca9b67606 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
 "Mostly updates to the perf tool plus two fixes to the kernel core code:

   - Handle tracepoint filters correctly for inherited events (Peter
     Zijlstra)

   - Prevent a deadlock in perf_lock_task_context (Paul McKenney)

   - Add missing newlines to some pr_err() calls (Arnaldo Carvalho de
     Melo)

   - Print full source file paths when using 'perf annotate --print-line
     --full-paths' (Michael Petlan)

   - Fix 'perf probe -d' when just one out of uprobes and kprobes is
     enabled (Wang Nan)

   - Add compiler.h to list.h to fix 'make perf-tar-src-pkg' generated
     tarballs, i.e. out of tree building (Arnaldo Carvalho de Melo)

   - Add the llvm-src-base.c and llvm-src-kbuild.c files, generated by
     the 'perf test' LLVM entries, when running it in-tree, to
     .gitignore (Yunlong Song)

   - libbpf error reporting improvements, using a strerror interface to
     more precisely tell the user about problems with the provided
     scriptlet, be it in C or as a ready made object file (Wang Nan)

   - Do not be case sensitive when searching for matching 'perf test'
     entries (Arnaldo Carvalho de Melo)

   - Inform the user about objdump failures in 'perf annotate' (Andi
     Kleen)

   - Improve the LLVM 'perf test' entry, introduce a new ones for BPF
     and kbuild tests to check the environment used by clang to compile
     .c scriptlets (Wang Nan)"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  perf/x86/intel/rapl: Remove the unused RAPL_EVENT_DESC() macro
  tools include: Add compiler.h to list.h
  perf probe: Verify parameters in two functions
  perf session: Add missing newlines to some pr_err() calls
  perf annotate: Support full source file paths for srcline fix
  perf test: Add llvm-src-base.c and llvm-src-kbuild.c to .gitignore
  perf: Fix inherited events vs. tracepoint filters
  perf: Disable IRQs across RCU RS CS that acquires scheduler lock
  perf test: Do not be case sensitive when searching for matching tests
  perf test: Add 'perf test BPF'
  perf test: Enhance the LLVM tests: add kbuild test
  perf test: Enhance the LLVM test: update basic BPF test program
  perf bpf: Improve BPF related error messages
  perf tools: Make fetch_kernel_version() publicly available
  bpf tools: Add new API bpf_object__get_kversion()
  bpf tools: Improve libbpf error reporting
  perf probe: Cleanup find_perf_probe_point_from_map to reduce redundancy
  perf annotate: Inform the user about objdump failures in --stdio
  perf stat: Make stat options global
  perf sched latency: Fix thread pid reuse issue
  ...
2015-11-15 09:36:24 -08:00
Linus Torvalds
051b29f279 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "A single fix to prevent math underflow in the numa balancing code"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Fix math underflow in task_tick_numa()
2015-11-15 09:35:33 -08:00
Linus Torvalds
511601bdbc Merge branches 'irq-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq and timer fixes from Thomas Gleixner:

 - An irq regression fix to restore the wakeup behaviour of chained
   interrupts.

 - A timer fix for a long standing race versus timers scheduled on a
   target cpu which got exposed by recent changes in the workqueue
   implementation.

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/PM: Restore system wake up from chained interrupts

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Use proper base migration in add_timer_on()
2015-11-15 09:30:48 -08:00