The scheduler can handle per cpu threads before the cpu is set to active and
it does not allow user space threads on the cpu before active is
set. Attaching to the scheduling domains is also not required before user
space threads can be handled.
Move the activation to the end of the hotplug state space. That also means
that deactivation is the first action when a cpu is shut down.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.597477199@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The alleged requirement that the migration notifier has a lower priority than
perf is completely undocumented and there is no indication at all that this is
true. perf does not even handle the CPU_ONLINE notification and perf really
has nothing to do with migration.
Move the CPU_ONLINE code into the sched_activate_cpu() state callback.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.421743581@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It really does not matter when we fold the load for the outgoing cpu. It's
almost dead anyway, so there is no harm if we fail to fold the few
microseconds which are required for going fully away.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.328739226@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We can piggy pack that on the SCHED_STARTING state. It's not required before
the cpu actually comes online. Name the function proper as it has nothing to
do with migration.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.248226511@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The sync_rcu stuff is specificically for clearing bits in the active
mask, such that everybody will observe the bit cleared and will not
consider the cleared CPU for load-balancing etc.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160310120025.169219710@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now that we reduced everything into single notifiers, it's simple to move them
into the hotplug state machine space.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is the last operation on the cpu before vanishing. No point in calling
that on CPU_DEAD.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We can maintain the ordering of the scheduler cpu hotplug functionality nicely
in one notifer. Get rid of the maze.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Prevent the SMP scheduler related notifiers to be executed before the smp
scheduler is initialized and install them early.
This is a preparatory change for further consolidation of the hotplug notifier
maze.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Start distangling the maze of hotplug notifiers in the scheduler.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In order to enable symmetric hotplug, we must mirror the online &&
!active state of cpu-down on the cpu-up side.
However, to retain sanity, limit this state to per-cpu kthreads.
Aside from the change to set_cpus_allowed_ptr(), which allow moving
the per-cpu kthreads on, the other critical piece is the cpu selection
for pinned tasks in select_task_rq(). This avoids dropping into
select_fallback_rq().
select_fallback_rq() cannot be allowed to select !active cpus because
its used to migrate user tasks away. And we do not want to move user
tasks onto cpus that are in transition.
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160301152303.GV6356@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit:
26657848502b7847 ("perf/core: Verify we have a single perf_hw_context PMU")
forcefully prevents multiple PMUs from sharing perf_hw_context, as this
generally doesn't make sense. It is a common bug for uncore PMUs to
use perf_hw_context rather than perf_invalid_context, which this detects.
However, systems exist with heterogeneous CPUs (and hence heterogeneous
HW PMUs), for which sharing perf_hw_context is necessary, and possible
in some limited cases.
To make this work we have to perform some gymnastics, as we did in these
commits:
66eb579e66ecfea5 ("perf: allow for PMU-specific event filtering")
c904e32a69b7c779 ("arm: perf: filter unschedulable events")
To allow those systems to work, we must allow PMUs for heterogeneous
CPUs to share perf_hw_context, though we must still disallow sharing
otherwise to detect the common misuse of perf_hw_context.
This patch adds a new PERF_PMU_CAP_HETEROGENEOUS_CPUS for this, updates
the core logic to account for this, and makes use of it in the arm_pmu
code that is used for systems with heterogeneous CPUs. Comments are
added to make the rationale clear and hopefully avoid accidental abuse.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/20160426103346.GA20836@leverpostej
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Export an additional common attribute for PMUs that support address range
filtering to let the perf userspace identify such PMUs in a uniform way.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-8-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Many instruction tracing PMUs out there support address range-based
filtering, which would, for example, generate trace data only for a
given range of instruction addresses, which is useful for tracing
individual functions, modules or libraries. Other PMUs may also
utilize this functionality to allow filtering to or filtering out
code at certain address ranges.
This patch introduces the interface for userspace to specify these
filters and for the PMU drivers to apply these filters to hardware
configuration.
The user interface is an ASCII string that is passed via an ioctl()
and specifies (in the form of an ASCII string) address ranges within
certain object files or within kernel. There is no special treatment
for kernel modules yet, but it might be a worthy pursuit.
The PMU driver interface basically adds two extra callbacks to the
PMU driver structure, one of which validates the filter configuration
proposed by the user against what the hardware is actually capable of
doing and the other one translates hardware-independent filter
configuration into something that can be programmed into the
hardware.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-6-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Trace filtering code needs an iterator that can go through all events in
a context, including inactive and filtered, to be able to update their
filters' address ranges based on mmap or exec events.
This patch changes perf_event_aux_ctx() to optionally do this.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-5-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For instruction trace filtering, namely, for communicating filter
definitions from userspace, I'd like to re-use the SET_FILTER code
that the tracepoints are using currently.
To that end, move the relevant code out from behind the
CONFIG_EVENT_TRACING dependency.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1461771888-10409-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All the atomic operations have their arguments the wrong way around;
make atomic_fetch_or() consistent and flip them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Specifically around the debugfs file creation calls,
I have no idea if they could ever possibly fail, but
this is core code (debug aside) so lets at least
check the return value and inform anything fishy.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160420041725.GC3472@linux-uzut.site
Signed-off-by: Ingo Molnar <mingo@kernel.org>
... remove the redundant second iteration, this is most
likely a copy/past buglet.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: waiman.long@hpe.com
Link: http://lkml.kernel.org/r/1460961103-24953-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment in calculate_imbalance() was introduced in commit:
2dd73a4f09be ("[PATCH] sched: implement smpnice")
which described the logic as it was then, but a later commit:
b18855500fc4 ("sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()")
.. complicated this logic some more so that the comment does not match anymore.
Update the comment to match the code.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461958364-675-3-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 8e7fbcbc22c1 ("sched: Remove stale power aware scheduling remnants
and dysfunctional knobs") deleted the power aware scheduling support.
This patch gets rid of the remaining power aware scheduling related
comments in the code as well.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461958364-675-2-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If we're accessing rq_clock() (e.g. in sched_avg_update()) we should
update the rq clock before calling cpu_load_update(), otherwise any
time calculations will be stale.
All other paths currently call update_rq_clock().
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1462304814-11715-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The max_idle_balance_cost and avg_idle values which are tracked and ar used to
capture short idle incidents, are not associated with schedstats, however the
information of these two values isn't printed out on !CONFIG_SCHEDSTATS kernels.
Fix this by moving the value printout out of the CONFIG_SCHEDSTATS section.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1462250305-4523-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__compute_runnable_contrib() uses a loop to compute sum, whereas a
table lookup can do it faster in a constant amount of time.
The program to generate the constants is located at:
Documentation/scheduler/sched-avg.txt
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@arm.com
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1462226078-31904-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After cleaning up the sched metrics, there are two definitions that are
ambiguous and confusing: SCHED_LOAD_SHIFT and SCHED_LOAD_SHIFT.
Resolve this:
- Rename SCHED_LOAD_SHIFT to NICE_0_LOAD_SHIFT, which better reflects what
it is.
- Replace SCHED_LOAD_SCALE use with SCHED_CAPACITY_SCALE and remove SCHED_LOAD_SCALE.
Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-3-git-send-email-yuyang.du@intel.com
[ Rewrote the changelog and fixed the build on 32-bit kernels. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Integer metric needs fixed point arithmetic. In sched/fair, a few
metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity,
may have different fixed point ranges, which makes their update and
usage error-prone.
In order to avoid the errors relating to the fixed point range, we
definie a basic fixed point range, and then formalize all metrics to
base on the basic range.
The basic range is 1024 or (1 << 10). Further, one can recursively
apply the basic range to have larger range.
Pointed out by Ben Segall, weight (visible to user, e.g., NICE-0 has
1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they
must be well calibrated.
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1459829551-21625-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike ran into the low load resolution limitation on his big machine.
So reenable these bits; nobody could ever reproduce/analyze the
reported power usage claim and Google has been running with this for
years as well.
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The problem with the existing lock pinning is that each pin is of
value 1; this mean you can simply unpin if you know its pinned,
without having any extra information.
This scheme generates a random (16 bit) cookie for each pin and
requires this same cookie to unpin. This means you have to keep the
cookie in context.
No objsize difference for !LOCKDEP kernels.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to be able to pass around more than just the IRQ flags in the
future, add a rq_flags structure.
No difference in code generation for the x86_64-defconfig build I
tested.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Drop accidentally repeated word in comment.
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
sigaltstack()'s reported previous state uses a somewhat odd
convention, but the concept of flag bits is new, and we can do the
flag bits sensibly. Specifically, let's just report them directly.
This will allow saving and restoring the sigaltstack state using
sigaltstack() to work correctly.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amanieu d'Antras <amanieu@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Stas Sergeev <stsp@list.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/94b291ec9fd47741a9264851e316e158ded0b00d.1462296606.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
net/ipv4/ip_gre.c
Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
ftrace subsystem of events that it can cause an oops. This file is only
writable by root, but still is a bug that needs to be fixed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXKSHAAAoJEKKk/i67LK/8wgQH/jEWQOJhxexkjG8v/RW5iabi
mGZQeJ/pYTWulW5OJItKGs38fgtSMkT5G4/UitB92l/8/JV4rFyrQrNp29Vfmzv6
KDnColJ/DYRQoY7+lz7hG0Y2Dft23qaj7l1hBIxtSYJgDC8Omq1dwHCRzzjyJzep
WPuA/f0W5QGRqupgP7zBd8o0hGO16faMFZQgBcSs263HFY9hssL/y4dfrdN/mi3e
VjfytGxjwyY6XLH7JTcuhy+F29h0Bo0BT1C9YuXuBuOq1D9ijjQaObUtuGqvsmn0
HrxLzllj/6w9bcUFxgAHp02yWN5TV8aWgHmvQYmdVWcupp3Vl+4OaO/w9X3MlCQ=
=l3NN
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v4.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Chunyu Hu noticed that if one writes into the trigger files within the
ftrace subsystem of events that it can cause an oops. This file is
only writable by root, but still is a bug that needs to be fixed"
* tag 'trace-fixes-v4.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Don't display trigger file for events that can't be enabled
Filtering of events requires the data to be written to the ring buffer
before it can be decided to filter or not. This is because the parameters of
the filter are based on the result that is written to the ring buffer and
not on the parameters that are passed into the trace functions.
The ftrace ring buffer is optimized for writing into the ring buffer and
committing. The discard procedure used when filtering decides the event
should be discarded is much more heavy weight. Thus, using a temporary
filter when filtering events can speed things up drastically.
Without a temp buffer we have:
# trace-cmd start -p nop
# perf stat -r 10 hackbench 50
0.790706626 seconds time elapsed ( +- 0.71% )
# trace-cmd start -e all
# perf stat -r 10 hackbench 50
1.566904059 seconds time elapsed ( +- 0.27% )
# trace-cmd start -e all -f 'common_preempt_count==20'
# perf stat -r 10 hackbench 50
1.690598511 seconds time elapsed ( +- 0.19% )
# trace-cmd start -e all -f 'common_preempt_count!=20'
# perf stat -r 10 hackbench 50
1.707486364 seconds time elapsed ( +- 0.30% )
The first run above is without any tracing, just to get a based figure.
hackbench takes ~0.79 seconds to run on the system.
The second run enables tracing all events where nothing is filtered. This
increases the time by 100% and hackbench takes 1.57 seconds to run.
The third run filters all events where the preempt count will equal "20"
(this should never happen) thus all events are discarded. This takes 1.69
seconds to run. This is 10% slower than just committing the events!
The last run enables all events and filters where the filter will commit all
events, and this takes 1.70 seconds to run. The filtering overhead is
approximately 10%. Thus, the discard and commit of an event from the ring
buffer may be about the same time.
With this patch, the numbers change:
# trace-cmd start -p nop
# perf stat -r 10 hackbench 50
0.778233033 seconds time elapsed ( +- 0.38% )
# trace-cmd start -e all
# perf stat -r 10 hackbench 50
1.582102692 seconds time elapsed ( +- 0.28% )
# trace-cmd start -e all -f 'common_preempt_count==20'
# perf stat -r 10 hackbench 50
1.309230710 seconds time elapsed ( +- 0.22% )
# trace-cmd start -e all -f 'common_preempt_count!=20'
# perf stat -r 10 hackbench 50
1.786001924 seconds time elapsed ( +- 0.20% )
The first run is again the base with no tracing.
The second run is all tracing with no filtering. It is a little slower, but
that may be well within the noise.
The third run shows that discarding all events only took 1.3 seconds. This
is a speed up of 23%! The discard is much faster than even the commit.
The one downside is shown in the last run. Events that are not discarded by
the filter will take longer to add, this is due to the extra copy of the
event.
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently register functions for events will be called
through the 'reg' field of event class directly without
any check when seting up triggers.
Triggers for events that don't support register through
debug fs (events under events/ftrace are for trace-cmd to
read event format, and most of them don't have a register
function except events/ftrace/functionx) can't be enabled
at all, and an oops will be hit when setting up trigger
for those events, so just not creating them is an easy way
to avoid the oops.
Link: http://lkml.kernel.org/r/1462275274-3911-1-git-send-email-chuhu@redhat.com
Cc: stable@vger.kernel.org # 3.14+
Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch implements the SS_AUTODISARM flag that can be OR-ed with
SS_ONSTACK when forming ss_flags.
When this flag is set, sigaltstack will be disabled when entering
the signal handler; more precisely, after saving sas to uc_stack.
When leaving the signal handler, the sigaltstack is restored by
uc_stack.
When this flag is used, it is safe to switch from sighandler with
swapcontext(). Without this flag, the subsequent signal will corrupt
the state of the switched-away sighandler.
To detect the support of this functionality, one can do:
err = sigaltstack(SS_DISABLE | SS_AUTODISARM);
if (err && errno == EINVAL)
unsupported();
Signed-off-by: Stas Sergeev <stsp@list.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Amanieu d'Antras <amanieu@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jason Low <jason.low2@hp.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-api@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1460665206-13646-4-git-send-email-stsp@list.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds SS_FLAG_BITS - the mask that splits sigaltstack
mode values and bit-flags. Since there is no bit-flags yet, the
mask is defined to 0. The flags are added by subsequent patches.
With every new flag, the mask should have the appropriate bit cleared.
This makes sure if some flag is tried on a kernel that doesn't
support it, the -EINVAL error will be returned, because such a
flag will be treated as an invalid mode rather than the bit-flag.
That way the existence of the particular features can be probed
at run-time.
This change was suggested by Andy Lutomirski:
https://lkml.org/lkml/2016/3/6/158
Signed-off-by: Stas Sergeev <stsp@list.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amanieu d'Antras <amanieu@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-api@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1460665206-13646-3-git-send-email-stsp@list.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
1) MODULE_FIRMWARE firmware string not correct for iwlwifi 8000 chips,
from Sara Sharon.
2) Fix SKB size checks in batman-adv stack on receive, from Sven
Eckelmann.
3) Leak fix on mac80211 interface add error paths, from Johannes Berg.
4) Cannot invoke napi_disable() with BH disabled in myri10ge driver,
fix from Stanislaw Gruszka.
5) Fix sign extension problem when computing feature masks in
net_gso_ok(), from Marcelo Ricardo Leitner.
6) lan78xx driver doesn't count packets and packet lengths in its
statistics properly, fix from Woojung Huh.
7) Fix the buffer allocation sizes in pegasus USB driver, from Petko
Manolov.
8) Fix refcount overflows in bpf, from Alexei Starovoitov.
9) Unified dst cache handling introduced a preempt warning in
ip_tunnel, fix by resetting rather then setting the cached route.
From Paolo Abeni.
10) Listener hash collision test fix in soreuseport, from Craig Gallak
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
gre: do not pull header in ICMP error processing
net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case
tipc: only process unicast on intended node
cxgb3: fix out of bounds read
net/smscx5xx: use the device tree for mac address
soreuseport: Fix TCP listener hash collision
net: l2tp: fix reversed udp6 checksum flags
ip_tunnel: fix preempt warning in ip tunnel creation/updating
samples/bpf: fix trace_output example
bpf: fix check_map_func_compatibility logic
bpf: fix refcnt overflow
drivers: net: cpsw: use of_phy_connect() in fixed-link case
dt: cpsw: phy-handle, phy_id, and fixed-link are mutually exclusive
drivers: net: cpsw: don't ignore phy-mode if phy-handle is used
drivers: net: cpsw: fix segfault in case of bad phy-handle
drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac config
MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver
gre: reject GUE and FOU in collect metadata mode
pegasus: fixes reported packet length
pegasus: fixes URB buffer allocation size;
...
In order to prepare the genirq layer for the concept of partitionned
percpu interrupts, let's allow an affinity to be associated with
such an interrupt. We introduce:
- irq_set_percpu_devid_partition: flag an interrupt as a percpu-devid
interrupt, and associate it with an affinity
- irq_get_percpu_devid_partition: allow the affinity of that interrupt
to be retrieved.
This will allow a driver to discover which CPUs the per-cpu interrupt
can actually fire on.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: devicetree@vger.kernel.org
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Link: http://lkml.kernel.org/r/1460365075-7316-3-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When iterating over the irq domain list, we try to match a domain
either by calling a match() function or by comparing a number
of fields passed as parameters.
Both approaches are a bit restrictive:
- match() is DT specific and only takes a device node
- the fallback case only deals with the fwnode_handle
It would be useful if we had a per-domain function that would
actually perform the matching check on the whole of the
irq_fwspec structure. This would allow for a domain to triage
matching attempts that need to extend beyond the fwnode.
Let's introduce irq_find_matching_fwspec(), which takes a full
blown irq_fwspec structure, and call into a select() function
implemented by the irqdomain. irq_find_matching_fwnode() is
made a wrapper around irq_find_matching_fwspec in order to
preserve compatibility.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: devicetree@vger.kernel.org
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Link: http://lkml.kernel.org/r/1460365075-7316-2-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make these functions return appropriate error codes when something goes
wrong.
Previously irq_destroy_ipi returned void making it impossible to notify
the caller if the request could not be fulfilled. Patch 1 in the series
added another condition in which this could fail in addition to the
existing ones. irq_reserve_ipi returned an unsigned int meaning it could
only return 0 on failure and give the caller no indication as to why the
request failed.
As time goes on there are likely to be further conditions added in which
these functions can fail. These APIs and the IPI IRQ domain are new in
4.6 and the number of existing call sites are low, changing the API now
has little impact on the code, while making it easier for these
functions to grow over time.
Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: jason@lakedaemon.net
Cc: marc.zyngier@arm.com
Cc: ralf@linux-mips.org
Cc: Qais Yousef <qsyousef@gmail.com>
Cc: lisa.parratt@imgtec.com
Cc: jiang.liu@linux.intel.com
Link: http://lkml.kernel.org/r/1461568464-31701-2-git-send-email-matt.redfearn@imgtec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Previously irq_destroy_ipi() would destroy IPIs to all CPUs that were
configured by irq_reserve_ipi(). This change makes it possible to
destroy just a subset of the IPIs. This may be useful to remove IPIs to
CPUs that have been hot removed so that the IRQ numbers allocated within
the IPI domain can be re-used.
The original behaviour is restored by passing the complete mask that the
IPI was created with.
There are currently no users of this function that would break from the
API change.
Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: jason@lakedaemon.net
Cc: marc.zyngier@arm.com
Cc: ralf@linux-mips.org
Cc: Qais Yousef <qsyousef@gmail.com>
Cc: lisa.parratt@imgtec.com
Cc: jiang.liu@linux.intel.com
Link: http://lkml.kernel.org/r/1461568464-31701-1-git-send-email-matt.redfearn@imgtec.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>