18766 Commits

Author SHA1 Message Date
Vladimir Davydov
9a3f4d85d5 page-cgroup: get rid of NR_PCG_FLAGS
It's not used anywhere today, so let's remove it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:18 -07:00
Johannes Weiner
747db954ca mm: memcontrol: use page lists for uncharge batching
Pages are now uncharged at release time, and all sources of batched
uncharges operate on lists of pages.  Directly use those lists, and
get rid of the per-task batching state.

This also batches statistics accounting, in addition to the res
counter charges, to reduce IRQ-disabling and re-enabling.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:18 -07:00
Johannes Weiner
00501b531c mm: memcontrol: rewrite charge API
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages.  This drastically simplifies the code and
reduces charging and uncharging overhead.  The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.

Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
 executing in the root memcg).  Before:

    15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
    13.31%              cat  [kernel.kallsyms]   [k] memset
    11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
     4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
     2.38%              cat  [kernel.kallsyms]   [k] put_page
     2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
     2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
     1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
     1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
     1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn

After:

    15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
    13.48%           cat  [kernel.kallsyms]   [k] memset
    11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
     3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
     2.46%           cat  [kernel.kallsyms]   [k] put_page
     2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
     1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
     1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
     1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
     1.30%           cat  [kernel.kallsyms]   [k] kfree

As you can see, the memcg footprint has shrunk quite a bit.

   text    data     bss     dec     hex filename
  37970    9892     400   48262    bc86 mm/memcontrol.o.old
  35239    9892     400   45531    b1db mm/memcontrol.o

This patch (of 4):

The memcg charge API charges pages before they are rmapped - i.e.  have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on.  Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.

Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:

  mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
  pages from the memcg if necessary.

  mem_cgroup_commit_charge() commits the page to the charge once it
  has a valid page->mapping and PageAnon() reliably tells the type.

  mem_cgroup_cancel_charge() aborts the transaction.

This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.

As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again.  Revive lru_cache_add_active_or_unevictable().

[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:17 -07:00
Masami Hiramatsu
f96f56780c kprobes: Skip kretprobe hit in NMI context to avoid deadlock
Skip kretprobe hit in NMI context, because if an NMI happens
inside the critical section protected by kretprobe_table.lock
and another(or same) kretprobe hit, pre_kretprobe_handler
tries to lock kretprobe_table.lock again.
Normal interrupts have no problem because they are disabled
with the lock.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20140804031016.11433.65539.stgit@kbuild-fedora.novalocal
[ Minor edits for clarity. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-08-08 10:38:04 +02:00
Ionut Alexa
2577d92ebd kernel/acct.c: fix coding style warnings and errors
Signed-off-by: Ionut Alexa <ionut.m.alexa@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Al Viro
3064c3563b death to mnt_pinned
Rather than playing silly buggers with vfsmount refcounts, just have
acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
and replace it with said clone.  Then attach the pin to original
vfsmount.  Voila - the clone will be alive until the file gets closed,
making sure that underlying superblock remains active, etc., and
we can drop the original vfsmount, so that it's not kept busy.
If the file lives until the final mntput of the original vfsmount,
we'll notice that there's an fs_pin (one in bsd_acct_struct that
holds that file) and mnt_pin_kill() will take it out.  Since
->kill() is synchronous, we won't proceed past that point until
these files are closed (and private clones of our vfsmount are
gone), so we get the same ordering warranties we used to get.

mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
it never became usable outside of kernel/acct.c (and racy wrt
umount even there).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Al Viro
efb170c228 take fs_pin stuff to fs/*
Add a new field to fs_pin - kill(pin).  That's what umount and r/o remount
will be calling for all pins attached to vfsmount and superblock resp.
Called after bumping the refcount, so it won't go away under us.  Dropping
the refcount is responsibility of the instance.  All generic stuff moved to
fs/fs_pin.c; the next step will rip all the knowledge of kernel/acct.c from
fs/super.c and fs/namespace.c.  After that - death to mnt_pin(); it was
intended to be usable as generic mechanism for code that wants to attach
objects to vfsmount, so that they would not make the sucker busy and
would get killed on umount.  Never got it right; it remained acct.c-specific
all along.  Now it's very close to being killable.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
1629d0eb3e start carving bsd_acct_struct up
pull generic parts into struct fs_pin.  Eventually we want those
to replace mnt_pin()/mnt_unpin() mess; that stuff will move to
fs/*.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
215748e67d acct: move mnt_pin() upwards.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
17c0a5aaff make acct_kill() wait for file closing.
Do actual closing of file via schedule_work().  And use
__fput_sync() there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
2798d4ce61 acct: get rid of acct_lock for acct->count
* make acct->count atomic and acct freeing - rcu-delayed.
* instead of grabbing acct_lock around the places where we take a reference,
do that under rcu_read_lock() with atomic_long_inc_not_zero().
* have the new acct locked before making ns->bacct point to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
215752fce3 acct: get rid of acct_list
Put these suckers on per-vfsmount and per-superblock lists instead.
Note: right now it's still acct_lock for everything, but that's
going to change.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
54a4d58a64 acct: simplify check_free_space()
a) file can't be NULL
b) file can't be changed under us
c) all writes are serialized by acct->lock; no need to mess with
spinlock there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
b8f00e6be4 acct: new lifetime rules
Do not reuse bsd_acct_struct after closing the damn thing.
Structure lifetime is controlled by refcount now.  We also
have a mutex in there, held over closing and writing (the
file is O_APPEND, so we are not losing any concurrency).

As the result, we do not need to bother with get_file()/fput()
on log write anymore.  Moreover, do_acct_process() only needs
acct itself; file and pidns are picked from it.

Killed instances are distinguished by having NULL ->ns.
Refcount is protected by acct_lock; anybody taking the
mutex needs to grab a reference first.

The things will get a lot simpler in the next commits - this
is just the minimal chunk switching to the new lifetime rules.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
9df7fa16ee acct: serialize acct_on()
brute-force - on a global mutex that isn't nested into anything.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
795a2f22a8 acct() should honour the limits from the very beginning
We need to check free space on the first write to freshly opened log.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro
e25ff11ff1 split the slow path in acct_process() off
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro
cdd37e2309 separate namespace-independent parts of filling acct_t
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro
ed44724b79 acct: switch to __kernel_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro
ecfdb33d1f acct: encode_comp_t(0) is 0, fortunately...
There was an amusing bogosity in ac_rw calculation - it tried to
do encode_comp_t(encode_comp_t(0) / 1024).  Seeing that comp_t is
a 3-bit exponent + 13-bit mantissa... it's a good thing that 0 is
represented by all-bits-clear.

The history of that one is interesting - it was introduced in
2.1.68pre1, when acct.c had been reworked and moved to separate
file.  Two months later (2.1.86) somebody has noticed that the
sucker won't compile - there was no task_struct::io_usage.
At which point the ac_io calculation had changed from
encode_comp_t(current->io_usage) to encode_comp_t(0) and the
bug in the next line (absolutely real back then, had it ever
managed to compile) become a harmless bogosity.  Looks like
nobody has ever noticed until now.

Anyway, let's bury that idiocy now that it got noticed.  17 years
is long enough...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro
82df9c8beb Merge commit 'ccbf62d8a284cf181ac28c8e8407dd077d90dd4b' into for-next
backmerge to avoid kernel/acct.c conflict
2014-08-07 14:07:57 -04:00
Linus Torvalds
33caee3992 Merge branch 'akpm' (patchbomb from Andrew Morton)
Merge incoming from Andrew Morton:
 - Various misc things.
 - arch/sh updates.
 - Part of ocfs2.  Review is slow.
 - Slab updates.
 - Most of -mm.
 - printk updates.
 - lib/ updates.
 - checkpatch updates.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (226 commits)
  checkpatch: update $declaration_macros, add uninitialized_var
  checkpatch: warn on missing spaces in broken up quoted
  checkpatch: fix false positives for --strict "space after cast" test
  checkpatch: fix false positive MISSING_BREAK warnings with --file
  checkpatch: add test for native c90 types in unusual order
  checkpatch: add signed generic types
  checkpatch: add short int to c variable types
  checkpatch: add for_each tests to indentation and brace tests
  checkpatch: fix brace style misuses of else and while
  checkpatch: add --fix option for a couple OPEN_BRACE misuses
  checkpatch: use the correct indentation for which()
  checkpatch: add fix_insert_line and fix_delete_line helpers
  checkpatch: add ability to insert and delete lines to patch/file
  checkpatch: add an index variable for fixed lines
  checkpatch: warn on break after goto or return with same tab indentation
  checkpatch: emit a warning on file add/move/delete
  checkpatch: add test for commit id formatting style in commit log
  checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings
  checkpatch: improve "no space after cast" test
  checkpatch: allow multiple const * types
  ...
2014-08-06 21:14:42 -07:00
Linus Torvalds
7725131982 ACPI and power management updates for 3.17-rc1
- ACPICA update to upstream version 20140724.  That includes
    ACPI 5.1 material (support for the _CCA and _DSD predefined names,
    changes related to the DMAR and PCCT tables and ARM support among
    other things) and cleanups related to using ACPICA's header files.
    A major part of it is related to acpidump and the core code used
    by that utility.  Changes from Bob Moore, David E Box, Lv Zheng,
    Sascha Wildner, Tomasz Nowicki, Hanjun Guo.
 
  - Radix trees for memory bitmaps used by the hibernation core from
    Joerg Roedel.
 
  - Support for waking up the system from suspend-to-idle (also known
    as the "freeze" sleep state) using ACPI-based PCI wakeup signaling
    (Rafael J Wysocki).
 
  - Fixes for issues related to ACPI button events (Rafael J Wysocki).
 
  - New device ID for an ACPI-enumerated device included into the
    Wildcat Point PCH from Jie Yang.
 
  - ACPI video updates related to backlight handling from Hans de Goede
    and Linus Torvalds.
 
  - Preliminary changes needed to support ACPI on ARM from Hanjun Guo
    and Graeme Gregory.
 
  - ACPI PNP core cleanups from Arjun Sreedharan and Zhang Rui.
 
  - Cleanups related to ACPI_COMPANION() and ACPI_HANDLE() macros
    (Rafael J Wysocki).
 
  - ACPI-based device hotplug cleanups from Wei Yongjun and
    Rafael J Wysocki.
 
  - Cleanups and improvements related to system suspend from
    Lan Tianyu, Randy Dunlap and Rafael J Wysocki.
 
  - ACPI battery cleanup from Wei Yongjun.
 
  - cpufreq core fixes from Viresh Kumar.
 
  - Elimination of a deadband effect from the cpufreq ondemand
    governor and intel_pstate driver cleanups from Stratos Karafotis.
 
  - 350MHz CPU support for the powernow-k6 cpufreq driver from
    Mikulas Patocka.
 
  - Fix for the imx6 cpufreq driver from Anson Huang.
 
  - cpuidle core and governor cleanups from Daniel Lezcano,
    Sandeep Tripathy and Mohammad Merajul Islam Molla.
 
  - Build fix for the big_little cpuidle driver from Sachin Kamat.
 
  - Configuration fix for the Operation Performance Points (OPP)
    framework from Mark Brown.
 
  - APM cleanup from Jean Delvare.
 
  - cpupower utility fixes and cleanups from Peter Senna Tschudin,
    Andrey Utkin, Himangi Saraogi, Rickard Strandqvist, Thomas Renninger.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJT4nhtAAoJEILEb/54YlRxtZEP/2rtVQFSFdAW8l0Xm1SeSsl4
 EnZpSNT1TFn+NdG23vSIot5Jzdz1/dLfeoJEbXpoVt4DPC9/PK4HPlv5FEDQYfh5
 srftvvGcAva969sXzSBRNUeR+M8Yd2RdoYCfmqTEUjzf8GJLL4jC0VAIwMtsQklt
 EbiQX8JaHQS7RIql7MDg1N2vaTo+zxkf39Kkcl56usmO/uATP7cAPjFreF/xQ3d8
 OyBhz1cOXIhPw7bd9Dv9AgpJzA8WFpktDYEgy2sluBWMv+mLYjdZRCFkfpIRzmea
 pt+hJDeAy8ZL6/bjWCzz2x6wG7uJdDLblreI28sgnJx/VHR3Co6u4H1BqUBj18ct
 CHV6zQ55WFmx9/uJqBtwFy333HS2ysJziC5ucwmg8QjkvAn4RK8S0qHMfRvSSaHj
 F9ejnHGxyrc3zzfsngUf/VXIp67FReaavyKX3LYxjHjMPZDMw2xCtCWEpUs52l2o
 fAbkv8YFBbUalIv0RtELH5XnKQ2ggMP8UgvT74KyfXU6LaliH8lEV20FFjMgwrPI
 sMr2xk04eS8mNRNAXL8OMMwvh6DY/Qsmb7BVg58RIw6CdHeFJl834yztzcf7+j56
 4oUmA16QYBCFA3udGQ3Tb07mi8XTfrMdTOGA0koQG9tjswKXuLUXUk9WAXZe4vml
 ItRpZKE86BCs3mLJMYre
 =ZODv
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management updates from Rafael Wysocki:
 "Again, ACPICA leads the pack (47 commits), followed by cpufreq (18
  commits) and system suspend/hibernation (9 commits).

  From the new code perspective, the ACPICA update brings ACPI 5.1 to
  the table, including a new device configuration object called _DSD
  (Device Specific Data) that will hopefully help us to operate device
  properties like Device Trees do (at least to some extent) and changes
  related to supporting ACPI on ARM.

  Apart from that we have hibernation changes making it use radix trees
  to store memory bitmaps which should speed up some operations carried
  out by it quite significantly.  We also have some power management
  changes related to suspend-to-idle (the "freeze" sleep state) support
  and more preliminary changes needed to support ACPI on ARM (outside of
  ACPICA).

  The rest is fixes and cleanups pretty much everywhere.

  Specifics:

   - ACPICA update to upstream version 20140724.  That includes ACPI 5.1
     material (support for the _CCA and _DSD predefined names, changes
     related to the DMAR and PCCT tables and ARM support among other
     things) and cleanups related to using ACPICA's header files.  A
     major part of it is related to acpidump and the core code used by
     that utility.  Changes from Bob Moore, David E Box, Lv Zheng,
     Sascha Wildner, Tomasz Nowicki, Hanjun Guo.

   - Radix trees for memory bitmaps used by the hibernation core from
     Joerg Roedel.

   - Support for waking up the system from suspend-to-idle (also known
     as the "freeze" sleep state) using ACPI-based PCI wakeup signaling
     (Rafael J Wysocki).

   - Fixes for issues related to ACPI button events (Rafael J Wysocki).

   - New device ID for an ACPI-enumerated device included into the
     Wildcat Point PCH from Jie Yang.

   - ACPI video updates related to backlight handling from Hans de Goede
     and Linus Torvalds.

   - Preliminary changes needed to support ACPI on ARM from Hanjun Guo
     and Graeme Gregory.

   - ACPI PNP core cleanups from Arjun Sreedharan and Zhang Rui.

   - Cleanups related to ACPI_COMPANION() and ACPI_HANDLE() macros
     (Rafael J Wysocki).

   - ACPI-based device hotplug cleanups from Wei Yongjun and Rafael J
     Wysocki.

   - Cleanups and improvements related to system suspend from Lan
     Tianyu, Randy Dunlap and Rafael J Wysocki.

   - ACPI battery cleanup from Wei Yongjun.

   - cpufreq core fixes from Viresh Kumar.

   - Elimination of a deadband effect from the cpufreq ondemand governor
     and intel_pstate driver cleanups from Stratos Karafotis.

   - 350MHz CPU support for the powernow-k6 cpufreq driver from Mikulas
     Patocka.

   - Fix for the imx6 cpufreq driver from Anson Huang.

   - cpuidle core and governor cleanups from Daniel Lezcano, Sandeep
     Tripathy and Mohammad Merajul Islam Molla.

   - Build fix for the big_little cpuidle driver from Sachin Kamat.

   - Configuration fix for the Operation Performance Points (OPP)
     framework from Mark Brown.

   - APM cleanup from Jean Delvare.

   - cpupower utility fixes and cleanups from Peter Senna Tschudin,
     Andrey Utkin, Himangi Saraogi, Rickard Strandqvist, Thomas
     Renninger"

* tag 'pm+acpi-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (118 commits)
  ACPI / LPSS: add LPSS device for Wildcat Point PCH
  ACPI / PNP: Replace faulty is_hex_digit() by isxdigit()
  ACPICA: Update version to 20140724.
  ACPICA: ACPI 5.1: Update for PCCT table changes.
  ACPICA/ARM: ACPI 5.1: Update for GTDT table changes.
  ACPICA/ARM: ACPI 5.1: Update for MADT changes.
  ACPICA/ARM: ACPI 5.1: Update for FADT changes.
  ACPICA: ACPI 5.1: Support for the _CCA predifined name.
  ACPICA: ACPI 5.1: New notify value for System Affinity Update.
  ACPICA: ACPI 5.1: Support for the _DSD predefined name.
  ACPICA: Debug object: Add current value of Timer() to debug line prefix.
  ACPICA: acpihelp: Add UUID support, restructure some existing files.
  ACPICA: Utilities: Fix local printf issue.
  ACPICA: Tables: Update for DMAR table changes.
  ACPICA: Remove some extraneous printf arguments.
  ACPICA: Update for comments/formatting. No functional changes.
  ACPICA: Disassembler: Add support for the ToUUID opererator (macro).
  ACPICA: Remove a redundant cast to acpi_size for ACPI_OFFSET() macro.
  ACPICA: Work around an ancient GCC bug.
  ACPI / processor: Make it possible to get local x2apic id via _MAT
  ...
2014-08-06 20:34:19 -07:00
Linus Torvalds
6b22df74f7 SCSI misc on 20140806
This patch set consists of the usual driver updates (ufs, storvsc, pm8001
 hpsa).  It also has removal of the user space target driver code (everyone is
 using LIO now), a partial PCI MSI-X update, more multi-queue updates,
 conversion to 64 bit LUNs (so we could theoretically cope with any LUN
 returned by a device) and placeholder support for the ZBC device type (Shingle
 drives), plus an assortment of minor updates and bug fixes.
 
 Signed-off-by: James Bottomley <JBottomley@Parallels.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJT4mS9AAoJEDeqqVYsXL0Mq34H/2AeXiM8GEVO3PIsBtF3TFZ9
 poJvAyb8t//+VwAIVLHU9wrssIrIcyvNQmNHH/InGt5rOaXwGQRsnEc73bBtot4b
 aC1t+hAnp2Ddvu6phmyUg7iY2GmQhAoZmeaj7krGIu2XgtLGiPg26eSsgk4Yv/U9
 cuULEuOc/UnTj3w5VK8SvpyXMybVF6oQhSrS1slOglfFwPTlTI/NHU9xo7Wc3qHT
 VifHXNphIvye5EH8zwtKX5p8qCrFW0pevJwyfPz7Hp2CTA9XYKx3SoeOh+n9F9ez
 udBBggg7Vb1tb4mPKUoZ78UrtCVdFSCmesBU/RJe7cIh8daKaO5MVr3WPSx2JhM=
 =yGai
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This patch set consists of the usual driver updates (ufs, storvsc,
  pm8001 hpsa).  It also has removal of the user space target driver
  code (everyone is using LIO now), a partial PCI MSI-X update, more
  multi-queue updates, conversion to 64 bit LUNs (so we could
  theoretically cope with any LUN returned by a device) and placeholder
  support for the ZBC device type (Shingle drives), plus an assortment
  of minor updates and bug fixes"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (143 commits)
  scsi: do not issue SCSI RSOC command to Promise Vtrak E610f
  vmw_pvscsi: Use pci_enable_msix_exact() instead of pci_enable_msix()
  pm8001: Fix invalid return when request_irq() failed
  lpfc: Remove superfluous call to pci_disable_msix()
  isci: Use pci_enable_msix_exact() instead of pci_enable_msix()
  bfa: Use pci_enable_msix_exact() instead of pci_enable_msix()
  bfa: Cleanup bfad_setup_intr() function
  bfa: Do not call pci_enable_msix() after it failed once
  fnic: Use pci_enable_msix_exact() instead of pci_enable_msix()
  scsi: use short driver name for per-driver cmd slab caches
  scsi_debug: support scsi-mq, queues and locks
  Drivers: add blist flags
  scsi: ufs: fix endianness sparse warnings
  scsi: ufs: make undeclared functions static
  bnx2i: Update driver version to 2.7.10.1
  pm8001: fix a memory leak in nvmd_resp
  pm8001: fix update_flash
  pm8001: fix a memory leak in flash_update
  pm8001: Cleaning up uninitialized variables
  pm8001: Fix to remove null pointer checks that could never happen
  ...
2014-08-06 20:10:32 -07:00
Neil Zhang
d25d9feced kernel/printk/printk.c: fix bool assignements
Fix coccinelle warnings.

Signed-off-by: Neil Zhang <zhangwm@marvell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:24 -07:00
Jan Kara
5874af2003 printk: enable interrupts before calling console_trylock_for_printk()
We need interrupts disabled when calling console_trylock_for_printk()
only so that cpu id we pass to can_use_console() remains valid (for
other things console_sem provides all the exclusion we need and
deadlocks on console_sem due to interrupts are impossible because we use
down_trylock()).  However if we are rescheduled, we are guaranteed to
run on an online cpu so we can easily just get the cpu id in
can_use_console().

We can lose a bit of performance when we enable interrupts in
vprintk_emit() and then disable them again in console_unlock() but OTOH
it can somewhat reduce interrupt latency caused by console_unlock().

We differ from (reverted) commit 939f04bec1a4 in that we avoid calling
console_unlock() from vprintk_emit() with lockdep enabled as that has
unveiled quite some bugs leading to system freezes during boot (e.g.
  https://lkml.org/lkml/2014/5/30/242,
  https://lkml.org/lkml/2014/6/28/521).

Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Andreas Bombe <aeb@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:24 -07:00
Alex Elder
249771b830 printk: miscellaneous cleanups
Some small cleanups to kernel/printk/printk.c.  None of them should
cause any change in behavior.

  - When CONFIG_PRINTK is defined, parenthesize the value of LOG_LINE_MAX.
  - When CONFIG_PRINTK is *not* defined, there is an extra LOG_LINE_MAX
    definition; delete it.
  - Pull an assignment out of a conditional expression in console_setup().
  - Use isdigit() in console_setup() rather than open coding it.
  - In update_console_cmdline(), drop a NUL-termination assignment;
    the strlcpy() call that precedes it guarantees it's not needed.
  - Simplify some logic in printk_timed_ratelimit().

Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:24 -07:00
Alex Elder
e99aa46166 printk: use a clever macro
Use the IS_ENABLED() macro rather than #ifdef blocks to set certain
global values.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:24 -07:00
Alex Elder
0b90fec3b9 printk: fix some comments
Fix a few comments that don't accurately describe their corresponding
code.  It also fixes some minor typographical errors.

Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:24 -07:00
Alex Elder
42a9dc0b3d printk: rename DEFAULT_MESSAGE_LOGLEVEL
Commit a8fe19ebfbfd ("kernel/printk: use symbolic defines for console
loglevels") makes consistent use of symbolic values for printk() log
levels.

The naming scheme used is different from the one used for
DEFAULT_MESSAGE_LOGLEVEL though.  Change that symbol name to be
MESSAGE_LOGLEVEL_DEFAULT for consistency.  And because the value of that
symbol comes from a similarly-named config option, rename
CONFIG_DEFAULT_MESSAGE_LOGLEVEL as well.

Signed-off-by: Alex Elder <elder@linaro.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:24 -07:00
Alex Elder
e97e1267e9 printk: tweak do_syslog() to match comments
In do_syslog() there's a path used by kmsg_poll() and kmsg_read() that
only needs to know whether there's any data available to read (and not
its size).  These callers only check for non-zero return.  As a
shortcut, do_syslog() returns the difference between what has been
logged and what has been "seen."

The comments say that the "count of records" should be returned but it's
not.  Instead it returns (log_next_idx - syslog_idx), which is a
difference between buffer offsets--and the result could be negative.

The behavior is the same (it'll be zero or not in the same cases), but
the count of records is more meaningful and it matches what the comments
say.  So change the code to return that.

Signed-off-by: Alex Elder <elder@linaro.org>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:24 -07:00
Luis R. Rodriguez
23b2899f7f printk: allow increasing the ring buffer depending on the number of CPUs
The default size of the ring buffer is too small for machines with a
large amount of CPUs under heavy load.  What ends up happening when
debugging is the ring buffer overlaps and chews up old messages making
debugging impossible unless the size is passed as a kernel parameter.
An idle system upon boot up will on average spew out only about one or
two extra lines but where this really matters is on heavy load and that
will vary widely depending on the system and environment.

There are mechanisms to help increase the kernel ring buffer for tracing
through debugfs, and those interfaces even allow growing the kernel ring
buffer per CPU.  We also have a static value which can be passed upon
boot.  Relying on debugfs however is not ideal for production, and
relying on the value passed upon bootup is can only used *after* an
issue has creeped up.  Instead of being reactive this adds a proactive
measure which lets you scale the amount of contributions you'd expect to
the kernel ring buffer under load by each CPU in the worst case
scenario.

We use num_possible_cpus() to avoid complexities which could be
introduced by dynamically changing the ring buffer size at run time,
num_possible_cpus() lets us use the upper limit on possible number of
CPUs therefore avoiding having to deal with hotplugging CPUs on and off.
This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT
which is used to specify the maximum amount of contributions to the
kernel ring buffer in the worst case before the kernel ring buffer flips
over, the size is specified as a power of 2.  The total amount of
contributions made by each CPU must be greater than half of the default
kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) in order to trigger
an increase upon bootup.  The kernel ring buffer is increased to the
next power of two that would fit the required minimum kernel ring buffer
size plus the additional CPU contribution.  For example if LOG_BUF_SHIFT
is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs
in order to trigger an increase of the kernel ring buffer.  With a
LOG_CPU_BUF_SHIFT of 12 (4 KB) you'd require at least anything over > 64
possible CPUs to trigger an increase.  If you had 128 possible CPUs the
amount of minimum required kernel ring buffer bumps to:

   ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB

Since we require the ring buffer to be a power of two the new required
size would be 1024 KB.

This CPU contributions are ignored when the "log_buf_len" kernel
parameter is used as it forces the exact size of the ring buffer to an
expected power of two value.

[pmladek@suse.cz: fix build]
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Tested-by: Davidlohr Bueso <davidlohr@hp.com>
Tested-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: Arun KS <arunks.linux@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:23 -07:00
Luis R. Rodriguez
f54051722e printk: make dynamic units clear for the kernel ring buffer
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Suggested-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: Arun KS <arunks.linux@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:23 -07:00
Luis R. Rodriguez
c0a318a361 printk: move power of 2 practice of ring buffer size to a helper
In practice the power of 2 practice of the size of the kernel ring
buffer remains purely historical but not a requirement, specially now
that we have LOG_ALIGN and use it for both static and dynamic
allocations.  It could have helped with implicit alignment back in the
days given the even the dynamically sized ring buffer was guaranteed to
be aligned so long as CONFIG_LOG_BUF_SHIFT was set to produce a
__LOG_BUF_LEN which is architecture aligned, since log_buf_len=n would
be allowed only if it was > __LOG_BUF_LEN and we always ended up
rounding the log_buf_len=n to the next power of 2 with
roundup_pow_of_two(), any multiple of 2 then should be also architecture
aligned.  These assumptions of course relied heavily on
CONFIG_LOG_BUF_SHIFT producing an aligned value but users can always
change this.

We now have precise alignment requirements set for the log buffer size
for both static and dynamic allocations, but lets upkeep the old
practice of using powers of 2 for its size to help with easy expected
scalable values and the allocators for dynamic allocations.  We'll reuse
this later so move this into a helper.

Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: Arun KS <arunks.linux@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:23 -07:00
Luis R. Rodriguez
7030017752 printk: make dynamic kernel ring buffer alignment explicit
We have to consider alignment for the ring buffer both for the default
static size, and then also for when an dynamic allocation is made when
the log_buf_len=n kernel parameter is passed to set the size
specifically to a size larger than the default size set by the
architecture through CONFIG_LOG_BUF_SHIFT.

The default static kernel ring buffer can be aligned properly if
architectures set CONFIG_LOG_BUF_SHIFT properly, we provide ranges for
the size though so even if CONFIG_LOG_BUF_SHIFT has a sensible aligned
value it can be reduced to a non aligned value.  Commit 6ebb017de9
("printk: Fix alignment of buf causing crash on ARM EABI") by Andrew
Lunn ensures the static buffer is always aligned and the decision of
alignment is done by the compiler by using __alignof__(struct log).

When log_buf_len=n is used we allocate the ring buffer dynamically.
Dynamic allocation varies, for the early allocation called before
setup_arch() memblock_virt_alloc() requests a page aligment and for the
default kernel allocation memblock_virt_alloc_nopanic() requests no
special alignment, which in turn ends up aligning the allocation to
SMP_CACHE_BYTES, which is L1 cache aligned.

Since we already have the required alignment for the kernel ring buffer
though we can do better and request explicit alignment for LOG_ALIGN.
This does that to be safe and make dynamic allocation alignment
explicit.

Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Tested-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Petr Mladek <pmladek@suse.cz>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: Arun KS <arunks.linux@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:23 -07:00
Sasha Levin
618fde8721 kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path
The rarely-executed memry-allocation-failed callback path generates a
WARN_ON_ONCE() when smp_call_function_single() succeeds.  Presumably
it's supposed to warn on failures.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:22 -07:00
David Rientjes
fb794bcbb4 mm, oom: remove unnecessary exit_state check
The oom killer scans each process and determines whether it is eligible
for oom kill or whether the oom killer should abort because of
concurrent memory freeing.  It will abort when an eligible process is
found to have TIF_MEMDIE set, meaning it has already been oom killed and
we're waiting for it to exit.

Processes with task->mm == NULL should not be considered because they
are either kthreads or have already detached their memory and killing
them would not lead to memory freeing.  That memory is only freed after
exit_mm() has returned, however, and not when task->mm is first set to
NULL.

Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
is no longer considered for oom kill, but only until exit_mm() has
returned.  This was fragile in the past because it relied on
exit_notify() to be reached before no longer considering TIF_MEMDIE
processes.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:21 -07:00
David Rientjes
ed4d4902eb mm, hugetlb: remove hugetlb_zero and hugetlb_infinity
They are unnecessary: "zero" can be used in place of "hugetlb_zero" and
passing extra2 == NULL is equivalent to infinity.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Luiz Capitulino <lcapitulino@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:19 -07:00
Fabian Frederick
656c3b79f7 kernel/watchdog.c: convert printk/pr_warning to pr_foo()
Replace some obsolete functions.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:13 -07:00
Fabian Frederick
bab5e2d652 kernel/auditfilter.c: replace count*size kmalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:12 -07:00
Lee, Chun-Yi
84c91b7ae0 PM / hibernate: avoid unsafe pages in e820 reserved regions
When the machine doesn't well handle the e820 persistent when hibernate
resuming, then it may cause page fault when writing image to snapshot
buffer:

[   17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
[   17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40
[   17.933469] PGD 2194067 PUD 77ffff067 PMD 2197067 PTE 0
[   17.933469] Oops: 0002 [#1] SMP
...

The ffff880069d4f000 page is in e820 reserved region of resume boot
kernel:

[    0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
...
[    0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]

So snapshot.c mark the pfn to forbidden pages map. But, this
page is also in the memory bitmap in snapshot image because it's an
original page used by image kernel, so it will also mark as an
unsafe(free) page in prepare_image().

That means the page in e820 when resuming mark as "forbidden" and
"free", it causes get_buffer() treat it as an allocated unsafe page.
Then snapshot_write_next() return this page to load_image, load_image
writing content to this address, but this page didn't really allocated
. So, we got page fault.

Although the root cause is from BIOS, I think aggressive check and
significant message in kernel will better then a page fault for
issue tracking, especially when serial console unavailable.

This patch adds code in mark_unsafe_pages() for check does free pages in
nosave region. If so, then it print message and return fault to stop whole
S4 resume process:

[    8.166004] PM: Image loading progress:   0%
[    8.658717] PM: 0x6796c000 in e820 nosave region: [mem 0x6796c000-0x6796cfff]
[    8.918737] PM: Read 2511940 kbytes in 1.04 seconds (2415.32 MB/s)
[    8.926633] PM: Error -14 resuming
[    8.933534] PM: Failed to load hibernation image, recovering.

Reviewed-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
[rjw: Subject]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-08-06 23:50:07 +02:00
Steven Rostedt (Red Hat)
651e22f270 ring-buffer: Always reset iterator to reader page
When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with a empty page and this page that
was swapped out becomes the new reader page. The reader page
is owned by the reader and since it was swapped out of the ring
buffer, writers do not have access to it (there's an exception
to that rule, but it's out of scope for this commit).

When reading the "trace" file, it is a non consuming read, which
means that the data in the ring buffer will not be modified.
When the trace file is opened, a ring buffer iterator is allocated
and writes to the ring buffer are disabled, such that the iterator
will not have issues iterating over the data.

Although the ring buffer disabled writes, it does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.

My tests would sometimes trigger this bug on my i386 box:

WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
 00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
 f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
 ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
Call Trace:
 [<c18796b3>] dump_stack+0x4b/0x75
 [<c103a0e3>] warn_slowpath_common+0x7e/0x95
 [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
 [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
 [<c103a185>] warn_slowpath_fmt+0x33/0x35
 [<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
 [<c10bed04>] trace_find_cmdline+0x40/0x64
 [<c10c3c16>] trace_print_context+0x27/0xec
 [<c10c4360>] ? trace_seq_printf+0x37/0x5b
 [<c10c0b15>] print_trace_line+0x319/0x39b
 [<c10ba3fb>] ? ring_buffer_read+0x47/0x50
 [<c10c13b1>] s_show+0x192/0x1ab
 [<c10bfd9a>] ? s_next+0x5a/0x7c
 [<c112e76e>] seq_read+0x267/0x34c
 [<c1115a25>] vfs_read+0x8c/0xef
 [<c112e507>] ? seq_lseek+0x154/0x154
 [<c1115ba2>] SyS_read+0x54/0x7f
 [<c188488e>] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
>>>> ##### CPU 1 buffer started ####

Which was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.

After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but instead would trigger
only when running all the tests. I believe that was the case because it
required one of the tests to be shutting down via delayed instances while
a new test started up.

After spending several days debugging this, I found that it was caused by
the iterator becoming corrupted. Debugging further, I found out why
the iterator became corrupted. It happened with the rb_iter_reset().

As consuming reads may not read the full reader page, and only part
of it, there's a "read" field to know where the last read took place.
The iterator, must also start at the read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the iterator
would start at the head page within the ring buffer (where writes still
happen). But the mistake there was that it still used the "read" field
to start the iterator on the head page, where it should always start
at zero because readers never read from within the ring buffer where
writes occur.

I originally wrote a patch to have it set the iter->head to 0 instead
of iter->head_page->read, but then I questioned why it wasn't always
setting the iter to point to the reader page, as the reader page is
still valid.  The list_empty(reader_page->list) just means that it was
successful in swapping out. But the reader_page may still have data.

There was a bug report a long time ago that was not reproducible that
had something about trace_pipe (consuming read) not matching trace
(iterator read). This may explain why that happened.

Anyway, the correct answer to this bug is to always use the reader page
an not reset the iterator to inside the writable ring buffer.

Cc: stable@vger.kernel.org # 2.6.28+
Fixes: d769041f8653 "ring_buffer: implement new locking"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-08-06 16:06:34 -04:00
Steven Rostedt (Red Hat)
021de3d904 ring-buffer: Up rb_iter_peek() loop count to 3
After writting a test to try to trigger the bug that caused the
ring buffer iterator to become corrupted, I hit another bug:

 WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
 Modules linked in: ipt_MASQUERADE sunrpc [...]
 CPU: 1 PID: 5281 Comm: grep Tainted: G        W     3.16.0-rc3-test+ #143
 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
  0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
  ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
  ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
 Call Trace:
  [<ffffffff81503fb0>] ? dump_stack+0x4a/0x75
  [<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
  [<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
  [<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
  [<ffffffff810c14df>] ? ring_buffer_iter_peek+0x2d/0x5c
  [<ffffffff810c6f73>] ? tracing_iter_reset+0x6e/0x96
  [<ffffffff810c74a3>] ? s_start+0xd7/0x17b
  [<ffffffff8112b13e>] ? kmem_cache_alloc_trace+0xda/0xea
  [<ffffffff8114cf94>] ? seq_read+0x148/0x361
  [<ffffffff81132d98>] ? vfs_read+0x93/0xf1
  [<ffffffff81132f1b>] ? SyS_read+0x60/0x8e
  [<ffffffff8150bf9f>] ? tracesys+0xdd/0xe2

Debugging this bug, which triggers when the rb_iter_peek() loops too
many times (more than 2 times), I discovered there's a case that can
cause that function to legitimately loop 3 times!

rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
only deals with the reader page (it's for consuming reads). The
rb_iter_peek() is for traversing the buffer without consuming it, and as
such, it can loop for one more reason. That is, if we hit the end of
the reader page or any page, it will go to the next page and try again.

That is, we have this:

 1. iter->head > iter->head_page->page->commit
    (rb_inc_iter() which moves the iter to the next page)
    try again

 2. event = rb_iter_head_event()
    event->type_len == RINGBUF_TYPE_TIME_EXTEND
    rb_advance_iter()
    try again

 3. read the event.

But we never get to 3, because the count is greater than 2 and we
cause the WARNING and return NULL.

Up the counter to 3.

Cc: stable@vger.kernel.org # 2.6.37+
Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-08-06 16:06:33 -04:00
Mel Gorman
372ba8cb46 cpuidle: menu: Lookup CPU runqueues less
The menu governer makes separate lookups of the CPU runqueue to get
load and number of IO waiters but it can be done with a single lookup.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-08-06 21:17:45 +02:00
Linus Torvalds
ae045e2455 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Steady transitioning of the BPF instructure to a generic spot so
      all kernel subsystems can make use of it, from Alexei Starovoitov.

   2) SFC driver supports busy polling, from Alexandre Rames.

   3) Take advantage of hash table in UDP multicast delivery, from David
      Held.

   4) Lighten locking, in particular by getting rid of the LRU lists, in
      inet frag handling.  From Florian Westphal.

   5) Add support for various RFC6458 control messages in SCTP, from
      Geir Ola Vaagland.

   6) Allow to filter bridge forwarding database dumps by device, from
      Jamal Hadi Salim.

   7) virtio-net also now supports busy polling, from Jason Wang.

   8) Some low level optimization tweaks in pktgen from Jesper Dangaard
      Brouer.

   9) Add support for ipv6 address generation modes, so that userland
      can have some input into the process.  From Jiri Pirko.

  10) Consolidate common TCP connection request code in ipv4 and ipv6,
      from Octavian Purdila.

  11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

  12) Generic resizable RCU hash table, with intial users in netlink and
      nftables.  From Thomas Graf.

  13) Maintain a name assignment type so that userspace can see where a
      network device name came from (enumerated by kernel, assigned
      explicitly by userspace, etc.) From Tom Gundersen.

  14) Automatic flow label generation on transmit in ipv6, from Tom
      Herbert.

  15) New packet timestamping facilities from Willem de Bruijn, meant to
      assist in measuring latencies going into/out-of the packet
      scheduler, latency from TCP data transmission to ACK, etc"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
  cxgb4 : Disable recursive mailbox commands when enabling vi
  net: reduce USB network driver config options.
  tg3: Modify tg3_tso_bug() to handle multiple TX rings
  amd-xgbe: Perform phy connect/disconnect at dev open/stop
  amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
  net: sun4i-emac: fix memory leak on bad packet
  sctp: fix possible seqlock seadlock in sctp_packet_transmit()
  Revert "net: phy: Set the driver when registering an MDIO bus device"
  cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
  team: Simplify return path of team_newlink
  bridge: Update outdated comment on promiscuous mode
  net-timestamp: ACK timestamp for bytestreams
  net-timestamp: TCP timestamping
  net-timestamp: SCHED timestamp on entering packet scheduler
  net-timestamp: add key to disambiguate concurrent datagrams
  net-timestamp: move timestamp flags out of sk_flags
  net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
  cxgb4i : Move stray CPL definitions to cxgb4 driver
  tcp: reduce spurious retransmits due to transient SACK reneging
  qlcnic: Initialize dcbnl_ops before register_netdev
  ...
2014-08-06 09:38:14 -07:00
Linus Torvalds
bb2cbf5e93 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "In this release:

   - PKCS#7 parser for the key management subsystem from David Howells
   - appoint Kees Cook as seccomp maintainer
   - bugfixes and general maintenance across the subsystem"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
  X.509: Need to export x509_request_asymmetric_key()
  netlabel: shorter names for the NetLabel catmap funcs/structs
  netlabel: fix the catmap walking functions
  netlabel: fix the horribly broken catmap functions
  netlabel: fix a problem when setting bits below the previously lowest bit
  PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
  tpm: simplify code by using %*phN specifier
  tpm: Provide a generic means to override the chip returned timeouts
  tpm: missing tpm_chip_put in tpm_get_random()
  tpm: Properly clean sysfs entries in error path
  tpm: Add missing tpm_do_selftest to ST33 I2C driver
  PKCS#7: Use x509_request_asymmetric_key()
  Revert "selinux: fix the default socket labeling in sock_graft()"
  X.509: x509_request_asymmetric_keys() doesn't need string length arguments
  PKCS#7: fix sparse non static symbol warning
  KEYS: revert encrypted key change
  ima: add support for measuring and appraising firmware
  firmware_class: perform new LSM checks
  security: introduce kernel_fw_from_file hook
  PKCS#7: Missing inclusion of linux/err.h
  ...
2014-08-06 08:06:39 -07:00
Richard Weinberger
828b1f65d2 Rip out get_signal_to_deliver()
Now we can turn get_signal() to the main function.

Signed-off-by: Richard Weinberger <richard@nod.at>
2014-08-06 13:03:44 +02:00
Richard Weinberger
10b1c7ac8b Clean up signal_delivered()
- Pass a ksignal struct to it
 - Remove unused regs parameter
 - Make it private as it's nowhere outside of kernel/signal.c is used

Signed-off-by: Richard Weinberger <richard@nod.at>
2014-08-06 13:03:43 +02:00
Richard Weinberger
df5601f9c3 tracehook_signal_handler: Remove sig, info, ka and regs
These parameters are nowhere used, so we can remove them.

Signed-off-by: Richard Weinberger <richard@nod.at>
2014-08-06 13:03:43 +02:00
Linus Torvalds
e7fda6c4c3 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer and time updates from Thomas Gleixner:
 "A rather large update of timers, timekeeping & co

   - Core timekeeping code is year-2038 safe now for 32bit machines.
     Now we just need to fix all in kernel users and the gazillion of
     user space interfaces which rely on timespec/timeval :)

   - Better cache layout for the timekeeping internal data structures.

   - Proper nanosecond based interfaces for in kernel users.

   - Tree wide cleanup of code which wants nanoseconds but does hoops
     and loops to convert back and forth from timespecs.  Some of it
     definitely belongs into the ugly code museum.

   - Consolidation of the timekeeping interface zoo.

   - A fast NMI safe accessor to clock monotonic for tracing.  This is a
     long standing request to support correlated user/kernel space
     traces.  With proper NTP frequency correction it's also suitable
     for correlation of traces accross separate machines.

   - Checkpoint/restart support for timerfd.

   - A few NOHZ[_FULL] improvements in the [hr]timer code.

   - Code move from kernel to kernel/time of all time* related code.

   - New clocksource/event drivers from the ARM universe.  I'm really
     impressed that despite an architected timer in the newer chips SoC
     manufacturers insist on inventing new and differently broken SoC
     specific timers.

[ Ed. "Impressed"? I don't think that word means what you think it means ]

   - Another round of code move from arch to drivers.  Looks like most
     of the legacy mess in ARM regarding timers is sorted out except for
     a few obnoxious strongholds.

   - The usual updates and fixlets all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  timekeeping: Fixup typo in update_vsyscall_old definition
  clocksource: document some basic timekeeping concepts
  timekeeping: Use cached ntp_tick_length when accumulating error
  timekeeping: Rework frequency adjustments to work better w/ nohz
  timekeeping: Minor fixup for timespec64->timespec assignment
  ftrace: Provide trace clocks monotonic
  timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
  seqcount: Add raw_write_seqcount_latch()
  seqcount: Provide raw_read_seqcount()
  timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
  timekeeping: Create struct tk_read_base and use it in struct timekeeper
  timekeeping: Restructure the timekeeper some more
  clocksource: Get rid of cycle_last
  clocksource: Move cycle_last validation to core code
  clocksource: Make delta calculation a function
  wireless: ath9k: Get rid of timespec conversions
  drm: vmwgfx: Use nsec based interfaces
  drm: i915: Use nsec based interfaces
  timekeeping: Provide ktime_get_raw()
  hangcheck-timer: Use ktime_get_ns()
  ...
2014-08-05 17:46:42 -07:00