24693 Commits

Author SHA1 Message Date
Linus Torvalds
3cb6653552 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "Nothing exciting from the irq side for this merge window:

   - a new driver for a Mediatek SoC

   - ACPI support for ARM GICV3

   - support for shared nested interrupts

   - the usual pile of fixes and updates all over te place"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  irqchip/mbigen: Fix return value check in mbigen_device_probe()
  irqchip/mips-gic: Replace static map with dynamic
  irqchip/mips-gic: Remove device IRQ domain
  irqchip/mips-gic: Separate IPI reservation & usage tracking
  genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs
  genirq: Use cpumask_available() for check of cpumask variable
  cpumask: Add helper cpumask_available()
  irqchip/irq-imx-gpcv2: Clear OF_POPULATED flag
  irqchip/atmel-aic5: Handle suspend to RAM
  irqchip: Add Mediatek mtk-cirq driver
  dt-bindings: mtk-cirq: Add binding document
  irqchip/gic-v3-its: Add IORT hook for platform MSI support
  irqchip/mbigen: Add ACPI support
  irqchip/mbigen: Introduce mbigen_of_create_domain()
  irqchip/mbigen: Drop module owner
  platform-msi: Make platform_msi_create_device_domain() ACPI aware
  irqchip/gicv3-its: platform-msi: Scan MADT to create platform msi domain
  irqchip/gicv3-its: platform-msi: Refactor its_pmsi_init() to prepare for ACPI
  irqchip/gicv3-its: platform-msi: Refactor its_pmsi_prepare()
  irqchip/gic-v3-its: Keep the include header files in alphabetic order
  ...
2017-05-01 15:46:13 -07:00
Linus Torvalds
5db6db0d40 Merge branch 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess unification updates from Al Viro:
 "This is the uaccess unification pile. It's _not_ the end of uaccess
  work, but the next batch of that will go into the next cycle. This one
  mostly takes copy_from_user() and friends out of arch/* and gets the
  zero-padding behaviour in sync for all architectures.

  Dealing with the nocache/writethrough mess is for the next cycle;
  fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
  sold on access_ok() in there, BTW; just not in this pile), same for
  reducing __copy_... callsites, strn*... stuff, etc. - there will be a
  pile about as large as this one in the next merge window.

  This one sat in -next for weeks. -3KLoC"

* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
  HAVE_ARCH_HARDENED_USERCOPY is unconditional now
  CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
  m32r: switch to RAW_COPY_USER
  hexagon: switch to RAW_COPY_USER
  microblaze: switch to RAW_COPY_USER
  get rid of padding, switch to RAW_COPY_USER
  ia64: get rid of copy_in_user()
  ia64: sanitize __access_ok()
  ia64: get rid of 'segment' argument of __do_{get,put}_user()
  ia64: get rid of 'segment' argument of __{get,put}_user_check()
  ia64: add extable.h
  powerpc: get rid of zeroing, switch to RAW_COPY_USER
  esas2r: don't open-code memdup_user()
  alpha: fix stack smashing in old_adjtimex(2)
  don't open-code kernel_setsockopt()
  mips: switch to RAW_COPY_USER
  mips: get rid of tail-zeroing in primitives
  mips: make copy_from_user() zero tail explicitly
  mips: clean and reorder the forest of macros...
  mips: consolidate __invoke_... wrappers
  ...
2017-05-01 14:41:04 -07:00
Linus Torvalds
0e285e9088 Power management updates for v4.12-rc1
- Rework the intel_pstate driver's sysfs interface to make it
    more straightforward and more intuitive (Rafael Wysocki).
 
  - Make intel_pstate support all processors which advertise HWP
    (hardware-managed P-states) to the kernel in all operation modes
    and make it use the load-based P-state selection algorithm on a
    wider range of systems in the active mode (Rafael Wysocki).
 
  - Add cpufreq driver for Tegra186 (Mikko Perttunen).
 
  - Add support for Gemini Lake SoCs to intel_pstate (David Box).
 
  - Add support for MT8176 and MT817x to the Mediatek cpufreq driver
    and clean up that driver a bit (Daniel Kurtz).
 
  - Clean up intel_pstate and optimize it slightly (Rafael Wysocki).
 
  - Update the schedutil cpufreq governor, mostly to fix a couple of
    issues with it related to specific workloads, and rework its sysfs
    tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).
 
  - Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
    (Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
    YuanTian Tang).
 
  - Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
    Uytterhoeven).
 
  - Add support for "always on" power domains to the genpd (generic
    power domains) framework and clean up that code somewhat (Ulf
    Hansson, Lina Iyer, Viresh Kumar).
 
  - Fix minor issues in the powernv cpuidle driver and clean it up
    (Anton Blanchard, Gautham Shenoy).
 
  - Move the AnalyzeSuspend utility under tools/power/pm-graph/ and
    add an analogous boot-profiling utility called AnalyzeBoot to it
    (Todd Brandt).
 
  - Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
    Scaling) driver (David Wu).
 
  - Fix minor issues in the cpuidle core, the intel_pstate_tracer
    utility, the devfreq framework and the PM core documentation
    (Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJZB5gGAAoJEILEb/54YlRx+78QAJRu+xAA9KtW2+loNyV8iBOB
 EFmQLrvz9jCDyYWsHE5huA1k6EVu5QE74HBfgDn4od9s1VqU1zWdEjqKYiaMwlCt
 EHxYCZ4YKeF31O3P3CtearBz9IXrckRx/XZ3F1jRsGGWooWv7o3U6PPN9iREmCzi
 9dB2j0UD4lCwrnpsDMrJ0GqLu4agn9pcIKDtu4VfszVwYtza0vOQvvlgg1fQS1jX
 BnNfaxN0lmpSlxDjtWfM//hfLzEWK8NlHiKWJFPnWFxJIAX1QL2QKnznF/Tqi5N5
 el9tQXCTRujlD7BLyl6FdsaowbiUUEcMqeh2k01Vz20WSmNDHIpfrV9MtzJ7biUK
 /DopyShUrpgJclKNF7BARJAJc19+PMkv3HMnOnsnhOsBNBpJjsL6FZPA95MMjI0o
 xmRQkixl31NWIMk60EIC6PaMLsxjhAWjiYi5D93bzkhnJTnQswvNb4ROPG+X2FCU
 6YgBogsYKkqp93TTJ49OFXIvu3o7NwOyEQQW8mnNY8ffaFdWuGzOX4HkOoKHCMTD
 rT0qT/2q+7LPK87YwTPIVtpGVltnCr/SVI/FtAGXPysyghu2Z+4GGP4eh2+pSAUj
 7Dqxdw3q7zs8ou6LOThTyNJrR+N/w1JPloprBleJR3TNcJjOy/SBjfp7yL0uopIK
 j5exr76ImVq7zzjyrYoa
 =6M3e
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "This time the majority of changes go to the cpufreq subsystem (and to
  the intel_pstate driver in particular) and there are some updates in
  the generic power domains framework, cpuidle, tools and a couple of
  other places.

  One thing worth mentioning is that the intel_pstate's sysfs interface
  has been reworked to be more consistent with the general expectations
  of the cpufreq core and less confusing, hopefully for the better.
  Also, we have a new cpufreq driver for Tegra186 and new hardware
  support in intel_pstata and the Mediatek cpufreq driver.

  Apart from that, the AnalyzeSuspend utility for system suspend
  profiling gets a companion called AnalyzeBoot for the analogous
  profiling of system boot and they both go into one place under
  tools/power/pm-graph/.

  The rest is mostly fixes, cleanups and code reorganization.

  Specifics:

   - Rework the intel_pstate driver's sysfs interface to make it more
     straightforward and more intuitive (Rafael Wysocki).

   - Make intel_pstate support all processors which advertise HWP
     (hardware-managed P-states) to the kernel in all operation modes
     and make it use the load-based P-state selection algorithm on a
     wider range of systems in the active mode (Rafael Wysocki).

   - Add cpufreq driver for Tegra186 (Mikko Perttunen).

   - Add support for Gemini Lake SoCs to intel_pstate (David Box).

   - Add support for MT8176 and MT817x to the Mediatek cpufreq driver
     and clean up that driver a bit (Daniel Kurtz).

   - Clean up intel_pstate and optimize it slightly (Rafael Wysocki).

   - Update the schedutil cpufreq governor, mostly to fix a couple of
     issues with it related to specific workloads, and rework its sysfs
     tunable and initialization a bit (Rafael Wysocki, Viresh Kumar).

   - Fix minor issues in the imx6q, dbx500 and qoriq cpufreq drivers
     (Christophe Jaillet, Irina Tirdea, Leonard Crestez, Viresh Kumar,
     YuanTian Tang).

   - Add file patterns for cpufreq DT bindings to MAINTAINERS (Geert
     Uytterhoeven).

   - Add support for "always on" power domains to the genpd (generic
     power domains) framework and clean up that code somewhat (Ulf
     Hansson, Lina Iyer, Viresh Kumar).

   - Fix minor issues in the powernv cpuidle driver and clean it up
     (Anton Blanchard, Gautham Shenoy).

   - Move the AnalyzeSuspend utility under tools/power/pm-graph/ and add
     an analogous boot-profiling utility called AnalyzeBoot to it (Todd
     Brandt).

   - Add rk3328 support to the rockchip-io AVS (Adaptive Voltage
     Scaling) driver (David Wu).

   - Fix minor issues in the cpuidle core, the intel_pstate_tracer
     utility, the devfreq framework and the PM core documentation
     (Chanwoo Choi, Doug Smythies, Johan Hovold, Marcin Nowakowski)"

* tag 'pm-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (56 commits)
  PM / runtime: Document autosuspend-helper side effects
  PM / runtime: Fix autosuspend documentation
  tools: power: pm-graph: Package makefile and man pages
  tools: power: pm-graph: AnalyzeBoot v2.0
  tools: power: pm-graph: AnalyzeSuspend v4.6
  cpufreq: Add Tegra186 cpufreq driver
  cpufreq: imx6q: Fix error handling code
  cpufreq: imx6q: Set max suspend_freq to avoid changes during suspend
  cpufreq: imx6q: Fix handling EPROBE_DEFER from regulator
  cpuidle: powernv: Avoid a branch in the core snooze_loop() loop
  cpuidle: powernv: Don't continually set thread priority in snooze_loop()
  cpuidle: powernv: Don't bounce between low and very low thread priority
  cpuidle: cpuidle-cps: remove unused variable
  tools/power/x86/intel_pstate_tracer: Adjust directory ownership
  cpufreq: schedutil: Use policy-dependent transition delays
  cpufreq: schedutil: Reduce frequencies slower
  PM / devfreq: Move struct devfreq_governor to devfreq directory
  PM / Domains: Ignore domain-idle-states that are not compatible
  cpufreq: intel_pstate: Add support for Gemini Lake
  powernv-cpuidle: Validate DT property array size
  ...
2017-05-01 14:09:46 -07:00
Linus Torvalds
9410091dd5 Merge branch 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Nothing major. Two notable fixes are Li's second stab at fixing the
  long-standing race condition in the mount path and suppression of
  spurious warning from cgroup_get(). All other changes are trivial"

* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: mark cgroup_get() with __maybe_unused
  cgroup: avoid attaching a cgroup root to two different superblocks, take 2
  cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
  cgroup: move cgroup_subsys_state parent field for cache locality
  cpuset: Remove cpuset_update_active_cpus()'s parameter.
  cgroup: switch to BUG_ON()
  cgroup: drop duplicate header nsproxy.h
  kernel: convert css_set.refcount from atomic_t to refcount_t
  kernel: convert cgroup_namespace.count from atomic_t to refcount_t
2017-05-01 13:52:24 -07:00
Linus Torvalds
ad1490bcd2 Merge branch 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue update from Tejun Heo:
 "One trivial patch to use setup_deferrable_timer() instead of
  open-coding the initialization"

* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: use setup_deferrable_timer
2017-05-01 13:49:27 -07:00
Jiri Kosina
a0841609f6 Merge branches 'for-4.12/upstream' and 'for-4.12/klp-hybrid-consistency-model' into for-linus 2017-05-01 21:49:28 +02:00
Tejun Heo
310b4816a5 cgroup: mark cgroup_get() with __maybe_unused
a590b90d472f ("cgroup: fix spurious warnings on cgroup_is_dead() from
cgroup_sk_alloc()") converted most cgroup_get() usages to
cgroup_get_live() leaving cgroup_sk_alloc() the sole user of
cgroup_get().  When !CONFIG_SOCK_CGROUP_DATA, this ends up triggering
unused warning for cgroup_get().

Silence the warning by adding __maybe_unused to cgroup_get().

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20170501145340.17e8ef86@canb.auug.org.au
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-05-01 15:24:14 -04:00
Linus Torvalds
694752922b Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:

 - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
   was initially a fork of CFQ, but subsequently changed to implement
   fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
   to be used on desktop type single drives, providing good fairness.
   From Paolo.

 - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
   using a scalable token based algorithm that throttles IO based on
   live completion IO stats, similary to blk-wbt. From Omar.

 - A series from Jan, moving users to separately allocated backing
   devices. This continues the work of separating backing device life
   times, solving various problems with hot removal.

 - A series of updates for lightnvm, mostly from Javier. Includes a
   'pblk' target that exposes an open channel SSD as a physical block
   device.

 - A series of fixes and improvements for nbd from Josef.

 - A series from Omar, removing queue sharing between devices on mostly
   legacy drivers. This helps us clean up other bits, if we know that a
   queue only has a single device backing. This has been overdue for
   more than a decade.

 - Fixes for the blk-stats, and improvements to unify the stats and user
   windows. This both improves blk-wbt, and enables other users to
   register a need to receive IO stats for a device. From Omar.

 - blk-throttle improvements from Shaohua. This provides a scalable
   framework for implementing scalable priotization - particularly for
   blk-mq, but applicable to any type of block device. The interface is
   marked experimental for now.

 - Bucketized IO stats for IO polling from Stephen Bates. This improves
   efficiency of polled workloads in the presence of mixed block size
   IO.

 - A few fixes for opal, from Scott.

 - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
   From a variety of folks, mostly Sagi and James Smart.

 - A series from Bart, improving our exposed info and capabilities from
   the blk-mq debugfs support.

 - A series from Christoph, cleaning up how handle WRITE_ZEROES.

 - A series from Christoph, cleaning up the block layer handling of how
   we track errors in a request. On top of being a nice cleanup, it also
   shrinks the size of struct request a bit.

 - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
   never used by platforms, and the latter has outlived it's usefulness.

 - Various little bug fixes and cleanups from a wide variety of folks.

* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
  block: hide badblocks attribute by default
  blk-mq: unify hctx delay_work and run_work
  block: add kblock_mod_delayed_work_on()
  blk-mq: unify hctx delayed_run_work and run_work
  nbd: fix use after free on module unload
  MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
  blk-mq-sched: alloate reserved tags out of normal pool
  mtip32xx: use runtime tag to initialize command header
  scsi: Implement blk_mq_ops.show_rq()
  blk-mq: Add blk_mq_ops.show_rq()
  blk-mq: Show operation, cmd_flags and rq_flags names
  blk-mq: Make blk_flags_show() callers append a newline character
  blk-mq: Move the "state" debugfs attribute one level down
  blk-mq: Unregister debugfs attributes earlier
  blk-mq: Only unregister hctxs for which registration succeeded
  blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
  blk-mq: Let blk_mq_debugfs_register() look up the queue name
  blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
  ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
  ide-pm: always pass 0 error to __blk_end_request_all
  ..
2017-05-01 10:39:57 -07:00
Yonghong Song
332270fdc8 bpf: enhance verifier to understand stack pointer arithmetic
llvm 4.0 and above generates the code like below:
....
440: (b7) r1 = 15
441: (05) goto pc+73
515: (79) r6 = *(u64 *)(r10 -152)
516: (bf) r7 = r10
517: (07) r7 += -112
518: (bf) r2 = r7
519: (0f) r2 += r1
520: (71) r1 = *(u8 *)(r8 +0)
521: (73) *(u8 *)(r2 +45) = r1
....
and the verifier complains "R2 invalid mem access 'inv'" for insn #521.
This is because verifier marks register r2 as unknown value after #519
where r2 is a stack pointer and r1 holds a constant value.

Teach verifier to recognize "stack_ptr + imm" and
"stack_ptr + reg with const val" as valid stack_ptr with new offset.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 11:40:23 -04:00
Steven Rostedt (VMware)
73a757e631 ring-buffer: Return reader page back into existing ring buffer
When reading the ring buffer for consuming, it is optimized for splice,
where a page is taken out of the ring buffer (zero copy) and sent to the
reading consumer. When the read is finished with the page, it calls
ring_buffer_free_read_page(), which simply frees the page. The next time the
reader needs to get a page from the ring buffer, it must call
ring_buffer_alloc_read_page() which allocates and initializes a reader page
for the ring buffer to be swapped into the ring buffer for a new filled page
for the reader.

The problem is that there's no reason to actually free the page when it is
passed back to the ring buffer. It can hold it off and reuse it for the next
iteration. This completely removes the interaction with the page_alloc
mechanism.

Using the trace-cmd utility to record all events (causing trace-cmd to
require reading lots of pages from the ring buffer, and calling
ring_buffer_alloc/free_read_page() several times), and also assigning a
stack trace trigger to the mm_page_alloc event, we can see how many times
the ring_buffer_alloc_read_page() needed to allocate a page for the ring
buffer.

Before this change:

  # trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1
  # trace-cmd report |grep ring_buffer_alloc_read_page | wc -l
  9968

After this change:

  # trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1
  # trace-cmd report |grep ring_buffer_alloc_read_page | wc -l
  4

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-01 10:26:40 -04:00
Dan Williams
7138970383 mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.

The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)

But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.

One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)

So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and and restructure the pmem completion-wait and teardown machinery:

Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.

This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.

This speeds up things while also making the pmem refcounting more robust going
forward.

Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-01 09:15:53 +02:00
Zefan Li
9732adc5d6 cgroup: avoid attaching a cgroup root to two different superblocks, take 2
Commit bfb0b80db5f9 ("cgroup: avoid attaching a cgroup root to two
different superblocks") is broken.  Now we try to fix the race by
delaying the initialization of cgroup root refcnt until a superblock
has been allocated.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-28 18:04:54 -04:00
Rafael J. Wysocki
0807ee0f52 Merge branch 'pm-cpufreq'
* pm-cpufreq: (37 commits)
  cpufreq: Add Tegra186 cpufreq driver
  cpufreq: imx6q: Fix error handling code
  cpufreq: imx6q: Set max suspend_freq to avoid changes during suspend
  cpufreq: imx6q: Fix handling EPROBE_DEFER from regulator
  cpufreq: schedutil: Use policy-dependent transition delays
  cpufreq: schedutil: Reduce frequencies slower
  cpufreq: intel_pstate: Add support for Gemini Lake
  cpufreq: intel_pstate: Eliminate intel_pstate_get_min_max()
  cpufreq: intel_pstate: Do not walk policy->cpus
  cpufreq: intel_pstate: Introduce pid_in_use()
  cpufreq: intel_pstate: Drop struct cpu_defaults
  cpufreq: intel_pstate: Move cpu_defaults definitions
  cpufreq: intel_pstate: Add update_util callback to pstate_funcs
  cpufreq: intel_pstate: Use different utilization update callbacks
  cpufreq: intel_pstate: Modify check in intel_pstate_update_status()
  cpufreq: intel_pstate: Drop driver_registered variable
  cpufreq: intel_pstate: Skip unnecessary PID resets on init
  cpufreq: intel_pstate: Set HWP sampling interval once
  cpufreq: intel_pstate: Clean up intel_pstate_busy_pid_reset()
  cpufreq: intel_pstate: Fold intel_pstate_reset_all_pid() into the caller
  ...
2017-04-28 23:14:00 +02:00
Rafael J. Wysocki
2addac72af Merge schedutil governor updates for v4.12. 2017-04-28 23:13:33 +02:00
Hannes Frederic Sowa
d24f7c7fb9 bpf: bpf_lock on kallsysms doesn't need to be irqsave
Hannes rightfully spotted that the bpf_lock doesn't need to be
irqsave variant. We never perform any such updates where this
would be necessary (neither right now nor in future), therefore
relax this further.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 15:48:14 -04:00
Tejun Heo
a590b90d47 cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
cgroup_get() expected to be called only on live cgroups and triggers
warning on a dead cgroup; however, cgroup_sk_alloc() may be called
while cloning a socket which is left in an empty and removed cgroup
and thus may legitimately duplicate its reference on a dead cgroup.
This currently triggers the following warning spuriously.

 WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
 ...
  [<ffffffff8107e123>] __warn+0xd3/0xf0
  [<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20
  [<ffffffff810ff465>] cgroup_get+0x55/0x60
  [<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0
  [<ffffffff81761beb>] sk_clone_lock+0x2db/0x390
  [<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0
  [<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0
  [<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670
  [<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0
  [<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00
  [<ffffffff81837429>] ip6_input_finish+0x59/0x3e0
  [<ffffffff81837cb2>] ip6_input+0x32/0xb0
  [<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0
  [<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0
  [<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0
  [<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70
  [<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80
  [<ffffffff817787d8>] napi_gro_frags+0x208/0x270
  [<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40
  [<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90
  [<ffffffff81778b30>] net_rx_action+0x210/0x350
  [<ffffffff8188c426>] __do_softirq+0x106/0x2c7
  [<ffffffff81082bad>] irq_exit+0x9d/0xa0 [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0
  [<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI>
  [<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20
  [<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0
  [<ffffffff8103edd1>] start_secondary+0xf1/0x100

This patch renames the existing cgroup_get() with the dead cgroup
warning to cgroup_get_live() after cgroup_kn_lock_live() and
introduces the new cgroup_get() which doesn't check whether the cgroup
is live or dead.

All existing cgroup_get() users except for cgroup_sk_alloc() are
converted to use cgroup_get_live().

Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
Cc: stable@vger.kernel.org # v4.5+
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-28 15:28:20 -04:00
Paul E. McKenney
b5fe223a4b srcu: Adjust default auto-expediting holdoff
The default value for the kernel boot parameter srcutree.exp_holdoff
is 50 microseconds, which is too long for good Tree SRCU performance
(compared to Classic SRCU) on the workloads tested by Mike Galbraith.
This commit therefore sets the default value to 25 microseconds, which
shows excellent results in Mike's testing.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
2017-04-27 08:35:24 -07:00
Frederic Weisbecker
25e2d8c1b9 sched/cputime: Fix ksoftirqd cputime accounting regression
irq_time_read() returns the irqtime minus the ksoftirqd time. This
is necessary because irq_time_read() is used to substract the IRQ time
from the sum_exec_runtime of a task. If we were to include the softirq
time of ksoftirqd, this task would substract its own CPU time everytime
it updates ksoftirqd->sum_exec_runtime which would therefore never
progress.

But this behaviour got broken by:

  a499a5a14db ("sched/cputime: Increment kcpustat directly on irqtime account")

... which now includes ksoftirqd softirq time in the time returned by
irq_time_read().

This has resulted in wrong ksoftirqd cputime reported to userspace
through /proc/stat and thus "top" not showing ksoftirqd when it should
after intense networking load.

ksoftirqd->stime happens to be correct but it gets scaled down by
sum_exec_runtime through task_cputime_adjusted().

To fix this, just account the strict IRQ time in a separate counter and
use it to report the IRQ time.

Reported-and-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-27 09:08:26 +02:00
Eric Biggers
cda37124f4 fs: constify tree_descr arrays passed to simple_fill_super()
simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory.  Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
David S. Miller
b1513c3531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 22:39:08 -04:00
Paul E. McKenney
22607d66bb srcu: Specify auto-expedite holdoff time
On small systems, in the absence of readers, expedited SRCU grace
periods can complete in less than a microsecond.  This means that an
eight-CPU system can have all CPUs doing synchronize_srcu() in a tight
loop and almost always expedite.  This might actually be desirable in
some situations, but in general it is a good way to needlessly burn
CPU cycles.  And in those situations where it is desirable, your friend
is the function synchronize_srcu_expedited().

For other situations, this commit adds a kernel parameter that specifies
a holdoff between completing the last SRCU grace period and auto-expediting
the next.  If the next grace period starts before the holdoff expires,
auto-expediting is disabled.  The holdoff is 50 microseconds by default,
and can be tuned to the desired number of nanoseconds.  A value of zero
disables auto-expediting.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
2017-04-26 16:32:17 -07:00
Paul E. McKenney
2da4b2a7fd srcu: Expedite first synchronize_srcu() when idle
Classic SRCU in effect expedites the first synchronize_srcu() when SRCU
is idle, and Mike Galbraith demonstrated that some use cases do in fact
rely on this behavior.  In particular, Mike showed that Steven Rostedt's
hotplug stress script takes 55 seconds with Classic SRCU and more than
16 -minutes- when running Tree SRCU.  Assuming that each Tree SRCU's call
to synchronize_srcu() takes four milliseconds, this implies that Steven's
test invokes synchronize_srcu() in isolation, but more than once per
200 microseconds.  Mike used ftrace to demonstrate that the time between
successive calls to synchronize_srcu() ranged from 118 to 342 microseconds,
with one outlier at 80 milliseconds.  This data clearly indicates that
Tree SRCU needs to expedite the first invocation of synchronize_srcu()
during an SRCU idle period.

This commit therefor introduces a srcu_might_be_idle() function that
probabilistically checks whether or not SRCU is idle.  This function is
used by synchronize_rcu() as an additional criterion in deciding whether
or not to expedite.

(Hat trick to Peter Zijlstra for his earlier suggestion that this might
in fact be a problem.  Which for all I know might have motivated Mike to
look into it.)

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
2017-04-26 16:32:16 -07:00
Paul E. McKenney
1e9a038b7f srcu: Expedited grace periods with reduced memory contention
Commit f60d231a87c5 ("srcu: Crude control of expedited grace periods")
introduced a per-srcu_struct atomic counter to track outstanding
requests for grace periods.  This works, but represents a memory-contention
bottleneck.  This commit therefore uses the srcu_node combining tree
to remove this bottleneck.

This commit adds new ->srcu_gp_seq_needed_exp fields to the
srcu_data, srcu_node, and srcu_struct structures, which track the
farthest-in-the-future grace period that must be expedited, which in
turn requires that all nearer-term grace periods also be expedited.
Requests for expediting start with the srcu_data structure, run up
through the srcu_node tree, and end at the srcu_struct structure.
Note that it may be necessary to expedite a grace period that just
now started, and this is handled by a new srcu_funnel_exp_start()
function, which is invoked when the grace period itself is already
in its way, but when that grace period was not marked as expedited.

A new srcu_get_delay() function returns zero if there is at least one
expedited SRCU grace period in flight, or SRCU_INTERVAL otherwise.
This function is used to calculate delays:  Normal grace periods
are allowed to extend in order to cover more requests with a given
grace-period computation, which decreases per-request overhead.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
2017-04-26 16:32:16 -07:00
Paul E. McKenney
7f6733c3c6 srcu: Make rcutorture writer stalls print SRCU GP state
In the past, SRCU was simple enough that there was little point in
making the rcutorture writer stall messages print the SRCU grace-period
number state.  With the advent of Tree SRCU, this has changed.  This
commit therefore makes Classic, Tiny, and Tree SRCU report this state
to rcutorture as needed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
2017-04-26 11:23:28 -07:00
Paul E. McKenney
c7e88067c1 srcu: Exact tracking of srcu_data structures containing callbacks
The current Tree SRCU implementation schedules a workqueue for every
srcu_data covered by a given leaf srcu_node structure having callbacks,
even if only one of those srcu_data structures actually contains
callbacks.  This is clearly inefficient for workloads that don't feature
callbacks everywhere all the time.  This commit therefore adds an array
of masks that are used by the leaf srcu_node structures to track exactly
which srcu_data structures contain callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
2017-04-26 11:23:12 -07:00
Teng Qin
8fe4592438 bpf: map_get_next_key to return first key on NULL
When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.

This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.

Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:57:45 -04:00
Naveen N. Rao
1758618827 kallsyms: Use bounded strnchr() when parsing string
When parsing for the <module:name> format, we use strchr() to look for
the separator, when we know that the module name can't be longer than
MODULE_NAME_LEN. Enforce the same using strnchr().

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-04-24 14:07:28 -07:00
Daniel Borkmann
e390b55d5a bpf: make bpf_xdp_adjust_head support mandatory
Now that also the last in-tree user of the xdp_adjust_head bit has
been removed, we can remove the flag from struct bpf_prog altogether.

This, at the same time, also makes sure that any future driver for
XDP comes with bpf_xdp_adjust_head() support right away.

A rejection based on this flag would also mean that tail calls
couldn't be used with such driver as per c2002f983767 ("bpf: fix
checking xdp_adjust_head on tail calls") fix, thus lets not allow
for it in the first place.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 16:18:10 -04:00
Linus Torvalds
fa8d7cdc84 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
 "The (hopefully) final fix for the irq affinity spreading logic"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/affinity: Fix calculating vectors to assign
2017-04-23 12:48:05 -07:00
Ingo Molnar
58d30c36d4 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Documentation updates.

 - Miscellaneous fixes.

 - Parallelize SRCU callback handling (plus overlapping patches).

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23 11:12:44 +02:00
Eric W. Biederman
6c478ae920 signal: Make kill_proc_info static
There are no users outside of signal.c so make the function static so
the compiler and other developers have that information.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-04-21 22:46:25 -05:00
David S. Miller
fb796707d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Both conflict were simple overlapping changes.

In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 20:23:53 -07:00
Eric W. Biederman
cad4ea546b rlimit: Properly call security_task_setrlimit
Modify do_prlimit to call security_task_setrlimit passing the task
whose rlimit we are changing not the tsk->group_leader.

In general this should not matter as the lsms implementing
security_task_setrlimit apparmor and selinux both examine the
task->cred to see what should be allowed on the destination task.

That task->cred is shared between tasks created with CLONE_THREAD
unless thread keyrings are in play, in which case both apparmor and
selinux create duplicate security contexts.

So the only time when it will matter which thread is passed to
security_task_setrlimit is if one of the threads of a process performs
an operation that changes only it's credentials.  At which point if a
thread has done that we don't want to hide that information from the
lsms.

So fix the call of security_task_setrlimit.  With the removal
of tsk->group_leader this makes the code slightly faster,
more comprehensible and maintainable.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-04-21 16:08:19 -05:00
David S. Miller
1f4407e254 net: Remove NET_CORE_BUDGET_USECS from sysctl binary interface.
We are not supposed to add new entries to this thing
any more.

Thanks to Eric Dumazet for noticing this.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 15:59:52 -04:00
Matthew Whitehead
7acf8a1e8a Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning
Constants used for tuning are generally a bad idea, especially as hardware
changes over time. Replace the constant 2 jiffies with sysctl variable
netdev_budget_usecs to enable sysadmins to tune the softirq processing.
Also document the variable.

For example, a very fast machine might tune this to 1000 microseconds,
while my regression testing 486DX-25 needs it to be 4000 microseconds on
a nearly idle network to prevent time_squeeze from being incremented.

Version 2: changed jiffies to microseconds for predictable units.

Signed-off-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 13:22:34 -04:00
Paul E. McKenney
f2094107ac Merge branches 'doc.2017.04.12a', 'fixes.2017.04.19a' and 'srcu.2017.04.21a' into HEAD
doc.2017.04.12a: Documentation updates
fixes.2017.04.19a: Miscellaneous fixes
srcu.2017.04.21a: Parallelize SRCU callback handling
2017-04-21 06:00:13 -07:00
Paul E. McKenney
bcbfdd01dc rcu: Make non-preemptive schedule be Tasks RCU quiescent state
Currently, a call to schedule() acts as a Tasks RCU quiescent state
only if a context switch actually takes place.  However, just the
call to schedule() guarantees that the calling task has moved off of
whatever tracing trampoline that it might have been one previously.
This commit therefore plumbs schedule()'s "preempt" parameter into
rcu_note_context_switch(), which then records the Tasks RCU quiescent
state, but only if this call to schedule() was -not- due to a preemption.

To avoid adding overhead to the common-case context-switch path,
this commit hides the rcu_note_context_switch() check under an existing
non-common-case check.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-21 05:59:27 -07:00
Paul E. McKenney
0497b489b8 srcu: Expedite srcu_schedule_cbs_snp() callback invocation
Although Tree SRCU does reduce delays when there is at least one
synchronize_srcu_expedited() invocation pending, srcu_schedule_cbs_snp()
still waits for SRCU_INTERVAL before invoking callbacks.  Since
synchronize_srcu_expedited() now posts a callback and waits for
that callback to do a wakeup, this destroys the expedited nature of
synchronize_srcu_expedited().  This destruction became apparent to
Marc Zyngier in the guise of a guest-OS bootup slowdown from five
seconds to no fewer than forty seconds.

This commit therefore invokes callbacks immediately at the end of the
grace period when there is at least one synchronize_srcu_expedited()
invocation pending.  This brought Marc's guest-OS bootup times back
into the realm of reason.

Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
2017-04-21 05:59:27 -07:00
Paul E. McKenney
da915ad5cf srcu: Parallelize callback handling
Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2],
however, there are workloads that could result in a high volume of
concurrent invocations of call_srcu(), which with current SRCU would
result in excessive lock contention on the srcu_struct structure's
->queue_lock, which protects SRCU's callback lists.  This commit therefore
moves SRCU to per-CPU callback lists, thus greatly reducing contention.

Because a given SRCU instance no longer has a single centralized callback
list, starting grace periods and invoking callbacks are both more complex
than in the single-list Classic SRCU implementation.  Starting grace
periods and handling callbacks are now handled using an srcu_node tree
that is in some ways similar to the rcu_node trees used by RCU-bh,
RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is
controlled by exactly the same Kconfig options and boot parameters that
control the shape of the rcu_node tree).

In addition, the old per-CPU srcu_array structure is now named srcu_data
and contains an rcu_segcblist structure named ->srcu_cblist for its
callbacks (and a spinlock to protect this).  The srcu_struct gets
an srcu_gp_seq that is used to associate callback segments with the
corresponding completion-time grace-period number.  These completion-time
grace-period numbers are propagated up the srcu_node tree so that the
grace-period workqueue handler can determine whether additional grace
periods are needed on the one hand and where to look for callbacks that
are ready to be invoked.

The srcu_barrier() function must now wait on all instances of the per-CPU
->srcu_cblist.  Because each ->srcu_cblist is protected by ->lock,
srcu_barrier() can remotely add the needed callbacks.  In theory,
it could also remotely start grace periods, but in practice doing so
is complex and racy.  And interestingly enough, it is never necessary
for srcu_barrier() to start a grace period because srcu_barrier() only
enqueues a callback when a callback is already present--and it turns out
that a grace period has to have already been started for this pre-existing
callback.  Furthermore, it is only the callback that srcu_barrier()
needs to wait on, not any particular grace period.  Therefore, a new
rcu_segcblist_entrain() function enqueues the srcu_barrier() function's
callback into the same segment occupied by the last pre-existing callback
in the list.  The special case where all the pre-existing callbacks are
on a different list (because they are in the process of being invoked)
is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL
segment, relying on the done-callbacks check that takes place after all
callbacks are inovked.

Note that the readers use the same algorithm as before.  Note that there
is a separate srcu_idx that tells the readers what counter to increment.
This unfortunately cannot be combined with srcu_gp_seq because they
need to be incremented at different times.

This commit introduces some ugly #ifdefs in rcutorture.  These will go
away when I feel good enough about Tree SRCU to ditch Classic SRCU.

Some crude performance comparisons, courtesy of a quickly hacked rcuperf
asynchronous-grace-period capability:

			Callback Queuing Overhead
			-------------------------
	# CPUS		Classic SRCU	Tree SRCU
	------          ------------    ---------
	     2              0.349 us     0.342 us
	    16             31.66  us     0.4   us
	    41             ---------     0.417 us

The times are the 90th percentiles, a statistic that was chosen to reject
the overheads of the occasional srcu_barrier() call needed to avoid OOMing
the test machine.  The rcuperf test hangs when running Classic SRCU at 41
CPUs, hence the line of dashes.  Despite the hacks to both the rcuperf code
and that statistics, this is a convincing demonstration of Tree SRCU's
performance and scalability advantages.

[1] https://lwn.net/Articles/309030/
[2] https://patchwork.kernel.org/patch/5108281/

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]
2017-04-21 05:59:26 -07:00
Jason A. Donenfeld
69b348449b padata: get_next is never NULL
Per Dan's static checker warning, the code that returns NULL was removed
in 2010, so this patch updates the comments and fixes the code
assumptions.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-04-21 20:30:46 +08:00
Steven Rostedt (VMware)
dcc19d2809 tracing/ftrace: Allow for instances to trigger their own stacktrace probes
Have the stacktrace function trigger probe trigger stack traces within the
instance that they were added to in the set_ftrace_filter.

 ># cd /sys/kernel/debug/tracing
 ># mkdir instances/foo
 ># cd instances/foo
 ># echo schedule:stacktrace:1 > set_ftrace_filter
 ># cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 1/1   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
           <idle>-0     [001] .N.2   202.585010: <stack trace>
  =>
  => schedule
  => schedule_preempt_disabled
  => do_idle
  => cpu_startup_entry
  => start_secondary
  => verify_cpu

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:49 -04:00
Steven Rostedt (VMware)
2290f2c589 tracing/ftrace: Allow for the traceonoff probe be unique to instances
Have the traceon/off function probe triggers affect only the instance they
are set in. This required making the trace_on/off accessible for other files
in the tracing directory.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:48 -04:00
Steven Rostedt (VMware)
cab5037950 tracing/ftrace: Enable snapshot function trigger to work with instances
Modify the snapshot probe trigger to work with instances. This way the
snapshot function trigger will only affect the instance that it is added to
in the set_ftrace_filter file.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:48 -04:00
Steven Rostedt (VMware)
d2afd57a4b tracing/ftrace: Allow instances to have their own function probes
Pass around the local trace_array that is the descriptor for tracing
instances, when enabling and disabling probes. This by default sets the
enable/disable of event probe triggers to work with instances.

The other probes will need some more work to get them working with
instances.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:47 -04:00
Steven Rostedt (VMware)
6e4443199e tracing/ftrace: Add a better way to pass data via the probe functions
With the redesign of the registration and execution of the function probes
(triggers), data can now be passed from the setup of the probe to the probe
callers that are specific to the trace_array it is on. Although, all probes
still only affect the toplevel trace array, this change will allow for
instances to have their own probes separated from other instances and the
top array.

That is, something like the stacktrace probe can be set to trace only in an
instance and not the toplevel trace array. This isn't implement yet, but
this change sets the ground work for the change.

When a probe callback is triggered (someone writes the probe format into
set_ftrace_filter), it calls register_ftrace_function_probe() passing in
init_data that will be used to initialize the probe. Then for every matching
function, register_ftrace_function_probe() will call the probe_ops->init()
function with the init data that was passed to it, as well as an address to
a place holder that is associated with the probe and the instance. The first
occurrence will have a NULL in the pointer. The init() function will then
initialize it. If other probes are added, or more functions are part of the
probe, the place holder will be passed to the init() function with the place
holder data that it was initialized to the last time.

Then this place_holder is passed to each of the other probe_ops functions,
where it can be used in the function callback. When the probe_ops free()
function is called, it can be called either with the rip of the function
that is being removed from the probe, or zero, indicating that there are no
more functions attached to the probe, and the place holder is about to be
freed. This gives the probe_ops a way to free the data it assigned to the
place holder if it was allocade during the first init call.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:46 -04:00
Steven Rostedt (VMware)
7b60f3d876 ftrace: Dynamically create the probe ftrace_ops for the trace_array
In order to eventually have each trace_array instance have its own unique
set of function probes (triggers), the trace array needs to hold the ops and
the filters for the probes.

This is the first step to accomplish this. Instead of having the private
data of the probe ops point to the trace_array, create a separate list that
the trace_array holds. There's only one private_data for a probe, we need
one per trace_array. The probe ftrace_ops will be dynamically created for
each instance, instead of being static.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:46 -04:00
Steven Rostedt (VMware)
b5f081b563 tracing: Pass the trace_array into ftrace_probe_ops functions
Pass the trace_array associated to a ftrace_probe_ops into the probe_ops
func(), init() and free() functions. The trace_array is the descriptor that
describes a tracing instance. This will help create the infrastructure that
will allow having function probes unique to tracing instances.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:45 -04:00
Steven Rostedt (VMware)
04ec7bb642 tracing: Have the trace_array hold the list of registered func probes
Add a link list to the trace_array to hold func probes that are registered.
Currently, all function probes are the same for all instances as it was
before, that is, only the top level trace_array holds the function probes.
But this lays the ground work to have function probes be attached to
individual instances, and having the event trigger only affect events in the
given instance. But that work is still to be done.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:45 -04:00
Steven Rostedt (VMware)
8d70725e45 ftrace: If the hash for a probe fails to update then free what was initialized
If the ftrace_hash_move_and_update_ops() fails, and an ops->free() function
exists, then it needs to be called on all the ops that were added by this
registration.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:44 -04:00
Steven Rostedt (VMware)
eee8ded131 ftrace: Have the function probes call their own function
Now that the function probes have their own ftrace_ops, there's no reason to
continue using the ftrace_func_hash to find which probe to call in the
function callback. The ops that is passed in to the function callback is
part of the probe_ops to call.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20 22:06:43 -04:00