The tilegx USB OHCI support needs the bounce pool since we're not
using the IOMMU to handle 32-bit addresses.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Merge Andrew's remaining patches for 3.5:
"Nine fixes"
* Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (9 commits)
mm: fix lost kswapd wakeup in kswapd_stop()
m32r: make memset() global for CONFIG_KERNEL_BZIP2=y
m32r: add memcpy() for CONFIG_KERNEL_GZIP=y
m32r: consistently use "suffix-$(...)"
m32r: fix 'fix breakage from "m32r: use generic ptrace_resume code"' fallout
m32r: fix pull clearing RESTORE_SIGMASK into block_sigmask() fallout
m32r: remove duplicate definition of PTRACE_O_TRACESYSGOOD
mn10300: fix "pull clearing RESTORE_SIGMASK into block_sigmask()" fallout
bootmem: make ___alloc_bootmem_node_nopanic() really nopanic
Offlining memory may block forever, waiting for kswapd() to wake up
because kswapd() does not check the event kthread->should_stop before
sleeping.
The proper pattern, from Documentation/memory-barriers.txt, is:
--- waker ---
event_indicated = 1;
wake_up_process(event_daemon);
--- sleeper ---
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (event_indicated)
break;
schedule();
}
set_current_state() may be wrapped by:
prepare_to_wait();
In the kswapd() case, event_indicated is kthread->should_stop.
=== offlining memory (waker) ===
kswapd_stop()
kthread_stop()
kthread->should_stop = 1
wake_up_process()
wait_for_completion()
=== kswapd_try_to_sleep (sleeper) ===
kswapd_try_to_sleep()
prepare_to_wait()
.
.
schedule()
.
.
finish_wait()
The schedule() needs to be protected by a test of kthread->should_stop,
which is wrapped by kthread_should_stop().
Reproducer:
Do heavy file I/O in background.
Do a memory offline/online in a tight loop
Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In reaction to commit 99ab7b19440a ("mm: sparse: fix usemap allocation
above node descriptor section") Johannes said:
| while backporting the below patch, I realised that your fix busted
| f5bf18fa22f8 again. The problem was not a panicking version on
| allocation failure but when the usemap size was too large such that
| goal + size > limit triggers the BUG_ON in the bootmem allocator. So
| we need a version that passes limit ONLY if the usemap is smaller than
| the section.
after checking the code, the name of ___alloc_bootmem_node_nopanic()
does not reflect the fact.
Make bootmem really not panic.
Hope will kill bootmem sooner.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [3.3.x, 3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull CMA and DMA-mapping fixes from Marek Szyprowski:
"Another set of minor fixups for recently merged Contiguous Memory
Allocator and ARM DMA-mapping changes. Those patches fix mysterious
crashes on systems with CMA and Himem enabled as well as some corner
cases caused by typical off-by-one bug."
* 'fixes-for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
ARM: dma-mapping: modify condition check while freeing pages
mm: cma: fix condition check when setting global cma area
mm: cma: don't replace lowmem pages with highmem
boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit fd3142a59af2012a7c5dc72ec97a4935ff1c5fc6 broke
slob since a piece of a change for a later patch slipped into
it.
Fengguang Wu writes:
The commit crashes the kernel w/o any dmesg output (the attached one is
created by the script as a summary for that run). This is very
reproducible in kvm for the attached config.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
memblock_free_reserved_regions() calls memblock_free(), but
memblock_free() would double reserved.regions too, so we could free the
old range for reserved.regions.
Also tj said there is another bug which could be related to this.
| I don't think we're saving any noticeable
| amount by doing this "free - give it to page allocator - reserve
| again" dancing. We should just allocate regions aligned to page
| boundaries and free them later when memblock is no longer in use.
in that case, when DEBUG_PAGEALLOC, will get panic:
memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39
BUG: unable to handle kernel paging request at ffff88102febd948
IP: [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155
PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU 0
Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447
RIP: 0010:[<ffffffff836a5774>] [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155
See the discussion at https://lkml.org/lkml/2012/6/13/469
So try to allocate with PAGE_SIZE alignment and free it later.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit f5bf18fa22f8 ("bootmem/sparsemem: remove limit constraint
in alloc_bootmem_section"), usemap allocations may easily be placed
outside the optimal section that holds the node descriptor, even if
there is space available in that section. This results in unnecessary
hotplug dependencies that need to have the node unplugged before the
section holding the usemap.
The reason is that the bootmem allocator doesn't guarantee a linear
search starting from the passed allocation goal but may start out at a
much higher address absent an upper limit.
Fix this by trying the allocation with the limit at the section end,
then retry without if that fails. This keeps the fix from f5bf18fa22f8
of not panicking if the allocation does not fit in the section, but
still makes sure to try to stay within the section at first.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [3.3.x, 3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 238305bb4d41 ("mm: remove sparsemem allocation details from the
bootmem allocator") introduced a bug in the allocation goal calculation
that put section usemaps not in the same section as the node
descriptors, creating unnecessary hotplug dependencies between them:
node 0 must be removed before remove section 16399
node 1 must be removed before remove section 16399
node 2 must be removed before remove section 16399
node 3 must be removed before remove section 16399
node 4 must be removed before remove section 16399
node 5 must be removed before remove section 16399
node 6 must be removed before remove section 16399
The reason is that it applies PAGE_SECTION_MASK to the physical address
of the node descriptor when finding a suitable place to put the usemap,
when this mask is actually intended to be used with PFNs. Because the
PFN mask is wider, the target address will point beyond the wanted
section holding the node descriptor and the node must be offlined before
the section holding the usemap can go.
Fix this by extending the mask to address width before use.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shmem_add_to_page_cache() has three callsites, but only one of them wants
the radix_tree_preload() (an exceptional entry guarantees that the radix
tree node is present in the other cases), and only that site can achieve
mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a no-op in the
other cases). We did it this way originally to reflect
add_to_page_cache_locked(); but it's confusing now, so move the radix_tree
preloading and mem_cgroup uncharging to that one caller.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When adding the page_private checks before calling shmem_replace_page(), I
did realize that there is a further race, but thought it too unlikely to
need a hurried fix.
But independently I've been chasing why a mem cgroup's memory.stat
sometimes shows negative rss after all tasks have gone: I expected it to
be a stats gathering bug, but actually it's shmem swapping's fault.
It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
the page may have been removed from swapcache before getting the lock; or
it may have been freed and reused and be back in swapcache; and it can
even be using the same swap location as before (page_private same).
The swapoff case is already secure against this (swap cannot be reused
until the whole area has been swapped off, and a new swapped on); and
shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check for
the expected radix_tree entry - but a little too late.
By that time, we might have already decided to shmem_replace_page(): I
don't know of a problem from that, but I'd feel more at ease not to do so
spuriously. And we have already done mem_cgroup_cache_charge(), on
perhaps the wrong mem cgroup: and this charge is not then undone on the
error path, because PageSwapCache ends up preventing that.
It's this last case which causes the occasional negative rss in
memory.stat: the page is charged here as cache, but (sometimes) found to
be anon when eventually it's uncharged - and in between, it's an
undeserved charge on the wrong memcg.
Fix this by adding an earlier check on the radix_tree entry: it's
inelegant to descend the tree twice, but swapping is not the fast path,
and a better solution would need a pair (try+commit) of memcg calls, and a
rework of shmem_replace_page() to keep out of the swapcache.
We can use the added shmem_confirm_swap() function to replace the
find_get_page+page_cache_release we were already doing on the error path.
And add a comment on that -EEXIST: it seems a peculiar errno to be using,
but originates from its use in radix_tree_insert().
[It can be surprising to see positive rss left in a memcg's memory.stat
after all tasks have gone, since it is supposed to count anonymous but not
shmem. Aside from sharing anon pages via fork with a task in some other
memcg, it often happens after swapping: because a swap page can't be freed
while under writeback, nor while locked. So it's not an error, and these
residual pages are easily freed once pressure demands.]
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE"). I believe
it's correct, and it's been nice to have from rc1 to rc6; but as the
original commit said:
I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
would be of any use to them on tmpfs. This code adds 92 lines and 752
bytes on x86_64 - is that bloat or worthwhile?
Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs to
the dumb generic support for v3.5. We can always reinstate it later if
useful, and anyone needing it in a hurry can just get it out of git.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Josef Bacik <josef@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Jeff liu <jeff.liu@oracle.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should goto error to release memory resource if hotadd_new_pgdat()
failed.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki ISIMATU <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If page migration cannot charge the temporary page to the memcg,
migrate_pages() will return -ENOMEM. This isn't considered in memory
compaction however, and the loop continues to iterate over all
pageblocks trying to isolate and migrate pages. If a small number of
very large memcgs happen to be oom, however, these attempts will mostly
be futile leading to an enormous amout of cpu consumption due to the
page migration failures.
This patch will short circuit and fail memory compaction if
migrate_pages() returns -ENOMEM. COMPACT_PARTIAL is returned in case
some migrations were successful so that the page allocator will retry.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit dad1743e5993f1 ("x86/mce: Only restart instruction after machine
check recovery if it is safe") we fixed mce_notify_process() to force a
signal to the current process if it was not restartable (RIPV bit not
set in MCG_STATUS). But doing it here means that the process doesn't
get told the virtual address of the fault via siginfo_t->si_addr. This
would prevent application level recovery from the fault.
Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
that we will provide the right information with the signal.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org # 3.4+
kmemcheck_alloc_shadow() requires irqs to be enabled, so wait to disable
them until after its called for __GFP_WAIT allocations.
This fixes a warning for such allocations:
WARNING: at kernel/lockdep.c:2739 lockdep_trace_alloc+0x14e/0x1c0()
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Move the mutex handling into the common kmem_cache_create()
function.
Then we can also move more checks out of SLAB's kmem_cache_create()
into the common code.
Reviewed-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Use the mutex definition from SLAB and make it the common way to take a sleeping lock.
This has the effect of using a mutex instead of a rw semaphore for SLUB.
SLOB gains the use of a mutex for kmem_cache_create serialization.
Not needed now but SLOB may acquire some more features later (like slabinfo
/ sysfs support) through the expansion of the common code that will
need this.
Reviewed-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
All allocators have some sort of support for the bootstrap status.
Setup a common definition for the boot states and make all slab
allocators use that definition.
Reviewed-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Kmem_cache_create() does a variety of sanity checks but those
vary depending on the allocator. Use the strictest tests and put them into
a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM.
This patch has the effect of adding sanity checks for SLUB and SLOB
under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
If list_for_each_entry, etc complete a traversal of the list, the iterator
variable ends up pointing to an address at an offset from the list head,
and not a meaningful structure. Thus this value should not be used after
the end of the iterator. The patch replaces s->name by al->name, which is
referenced nearby.
This problem was found using Coccinelle (http://coccinelle.lip6.fr/).
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Otherwise the code races with munmap (causing a use-after-free
of the vma) or with close (causing a use-after-free of the struct
file).
The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix
mmap_sem i_mutex deadlock")
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The filesystem layer expects pages in the block device's mapping to not
be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
currently replace lowmem pages with highmem pages, leading to crashes in
filesystem code such as the one below:
Unable to handle kernel NULL pointer dereference at virtual address 00000400
pgd = c0c98000
[00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
Internal error: Oops: 817 [#1] PREEMPT SMP ARM
CPU: 0 Not tainted (3.5.0-rc5+ #80)
PC is at __memzero+0x24/0x80
...
Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
Backtrace:
[<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98)
[<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc)
r4:c15337f0
[<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98)
[<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac)
r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
[<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24)
r6:beccdcf0 r5:00074000 r4:beccdbbc
[<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)
Fix this by replacing only highmem pages with highmem.
Reported-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Pull block bits from Jens Axboe:
"As vacation is coming up, thought I'd better get rid of my pending
changes in my for-linus branch for this iteration. It contains:
- Two patches for mtip32xx. Killing a non-compliant sysfs interface
and moving it to debugfs, where it belongs.
- A few patches from Asias. Two legit bug fixes, and one killing an
interface that is no longer in use.
- A patch from Jan, making the annoying partition ioctl warning a bit
less annoying, by restricting it to !CAP_SYS_RAWIO only.
- Three bug fixes for drbd from Lars Ellenberg.
- A fix for an old regression for umem, it hasn't really worked since
the plugging scheme was changed in 3.0.
- A few fixes from Tejun.
- A splice fix from Eric Dumazet, fixing an issue with pipe
resizing."
* 'for-linus' of git://git.kernel.dk/linux-block:
scsi: Silence unnecessary warnings about ioctl to partition
block: Drop dead function blk_abort_queue()
block: Mitigate lock unbalance caused by lock switching
block: Avoid missed wakeup in request waitqueue
umem: fix up unplugging
splice: fix racy pipe->buffers uses
drbd: fix null pointer dereference with on-congestion policy when diskless
drbd: fix list corruption by failing but already aborted reads
drbd: fix access of unallocated pages and kernel panic
xen/blkfront: Add WARN to deal with misbehaving backends.
blkcg: drop local variable @q from blkg_destroy()
mtip32xx: Create debugfs entries for troubleshooting
mtip32xx: Remove 'registers' and 'flags' from sysfs
blkcg: fix blkg_alloc() failure path
block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
block: fix return value on cfq_init() failure
mtip32xx: Remove version.h header file inclusion
xen/blkback: Copy id field when doing BLKIF_DISCARD.
During kmem_cache_init_late(), we transition to the LATE state,
and after some more work, to the FULL state, its last state
This is quite different from slub, that will only transition to
its last state (previously SYSFS), in a (late)initcall, after a lot
more of the kernel is ready.
This means that in slab, we have no way to taking actions dependent
on the initialization of other pieces of the kernel that are supposed
to start way after kmem_init_late(), such as cgroups initialization.
To achieve more consistency in this behavior, that patch only
transitions to the UP state in kmem_init_late. In my analysis,
setup_cpu_cache() should be happy to test for >= UP, instead of
== FULL. It also has passed some tests I've made.
We then only mark FULL state after the reap timers are in place,
meaning that no further setup is expected.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Commit 8c138b only sits in Pekka's and linux-next tree now, which tries
to replace obj_size(cachep) with cachep->object_size, but has a typo in
kmem_cache_free() by using "size" instead of "object_size", which casues
some regressions.
Reported-and-tested-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Commit 3b0efdf ("mm, sl[aou]b: Extract common fields from struct
kmem_cache") renamed the kmem_cache structure's "next" field to "list"
but forgot to update one instance in leaks_show().
Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
A consistent name with slub saves us an acessor function.
In both caches, this field represents the same thing. We would
like to use it from the mem_cgroup code.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Conflicts:
include/linux/mmzone.h
Synced with Linus' tree so that trivial patch can be applied
on top of up-to-date code properly.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Since there are five lists in LRU cache, the array nr in get_scan_count
should be:
nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
Signed-off-by: Wanpeng Li <liwp.linux@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This patch enabled the tlb flush range support in generic mmu layer.
Most of arch has self tlb flush range support, like ARM/IA64 etc.
X86 arch has no this support in hardware yet. But another instruction
'invlpg' can implement this function in some degree. So, enable this
feather in generic layer for x86 now. and maybe useful for other archs
in further.
Generic mmu_gather struct is protected by micro
HAVE_GENERIC_MMU_GATHER. Other archs that has flush range supported
own self mmu_gather struct. So, now this change is safe for them.
In future we may unify this struct and related functions on multiple
archs.
Thanks for Peter Zijlstra time and time reminder for multiple
architecture code safe!
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-7-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
mempool_create_node() currently assumes %GFP_KERNEL. Its only user,
blk_init_free_list(), is about to be updated to use other allocation
flags - add @gfp_mask argument to the function.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the range passed to mbind() is not allocated on nodes set in the
nodemask, it migrates the pages to respect the constraint.
The final formal of migrate_pages() is a mode of type enum migrate_mode,
not a boolean. do_mbind() is currently passing "true" which is the
equivalent of MIGRATE_SYNC_LIGHT. This should instead be MIGRATE_SYNC
for synchronous page migration.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_memory_core_early() asks memblock for a range of memory then try
to reserve it. If the reserved region array lacks space for the new
range, memblock_double_array() is called to allocate more space for the
array. If memblock is used to allocate memory for the new array it can
end up using a range that overlaps with the range originally allocated in
__alloc_memory_core_early(), leading to possible data corruption.
With this patch memblock_double_array() now calls memblock_find_in_range()
with a narrowed candidate range (in cases where the reserved.regions array
is being doubled) so any memory allocated will not overlap with the
original range that was being reserved. The range is narrowed by passing
in the starting address and size of the previously allocated range. Then
the range above the ending address is searched and if a candidate is not
found, the range below the starting address is searched.
Signed-off-by: Greg Pearson <greg.pearson@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warnings in mm/memory.c:
Warning(mm/memory.c:1377): No description found for parameter 'start'
Warning(mm/memory.c:1377): Excess function parameter 'address' description in 'zap_page_range'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warnings such as
Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'
Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea asked for addr, end, vma->vm_start, and vma->vm_end to be emitted
when !rwsem_is_locked(&tlb->mm->mmap_sem). Otherwise, debugging the
underlying issue is more difficult.
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If use_hierarchy is set, reclaim testing soon oopses in css_is_ancestor()
called from __mem_cgroup_same_or_subtree() called from page_referenced():
when processes are exiting, it's easy for mm_match_cgroup() to pass along
a NULL memcg coming from a NULL mm->owner.
Check for that in __mem_cgroup_same_or_subtree(). Return true or false?
False because we cannot know if it was in the hierarchy, but also false
because it's better not to count a reference from an exiting process.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The divide in p->signal->oom_score_adj * totalpages / 1000 within
oom_badness() was causing an overflow of the signed long data type.
This adds both the root bias and p->signal->oom_score_adj before doing the
normalization which fixes the issue and also cleans up the calculation.
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current implementation of unfreeze_partials() is so complicated,
but benefit from it is insignificant. In addition many code in
do {} while loop have a bad influence to a fail rate of cmpxchg_double_slab.
Under current implementation which test status of cpu partial slab
and acquire list_lock in do {} while loop,
we don't need to acquire a list_lock and gain a little benefit
when front of the cpu partial slab is to be discarded, but this is a rare case.
In case that add_partial is performed and cmpxchg_double_slab is failed,
remove_partial should be called case by case.
I think that these are disadvantages of current implementation,
so I do refactoring unfreeze_partials().
Minimizing code in do {} while loop introduce a reduced fail rate
of cmpxchg_double_slab. Below is output of 'slabinfo -r kmalloc-256'
when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done.
** before **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 182685
Unlocked Cmpxchg Double redos 0
** after **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 177995
Unlocked Cmpxchg Double redos 1
We can see cmpxchg_double_slab fail rate is improved slightly.
Bolow is output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
** before **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% )
2,919,550 context-switches # 0.027 M/sec ( +- 3.07% )
100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% )
124,201 page-faults # 0.001 M/sec ( +- 0.15% )
401,500,234,387 cycles # 3.700 GHz ( +- 0.24% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% )
45,934,956,860 branches # 423.297 M/sec ( +- 0.14% )
188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% )
13.691837307 seconds time elapsed ( +- 0.24% )
** after **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% )
2,834,781 context-switches # 0.026 M/sec ( +- 2.33% )
93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% )
123,967 page-faults # 0.001 M/sec ( +- 0.15% )
398,781,421,836 cycles # 3.700 GHz ( +- 0.22% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% )
45,855,370,128 branches # 425.436 M/sec ( +- 0.10% )
169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% )
13.596272341 seconds time elapsed ( +- 0.22% )
No regression is found, but rather we can see slightly better result.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
get_freelist(), unfreeze_partials() are only called with interrupt disabled,
so __cmpxchg_double_slab() is suitable.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
slab_node() could access current->mempolicy from interrupt context.
However there's a race condition during exit where the mempolicy
is first freed and then the pointer zeroed.
Using this from interrupts seems bogus anyways. The interrupt
will interrupt a random process and therefore get a random
mempolicy. Many times, this will be idle's, which noone can change.
Just disable this here and always use local for slab
from interrupts. I also cleaned up the callers of slab_node a bit
which always passed the same argument.
I believe the original mempolicy code did that in fact,
so it's likely a regression.
v2: send version with correct logic
v3: simplify. fix typo.
Reported-by: Arun Sharma <asharma@fb.com>
Cc: penberg@kernel.org
Cc: cl@linux.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
[tdmackey@twitter.com: Rework control flow based on feedback from
cl@linux.com, fix logic, and cleanup current task_struct reference]
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Mackey <tdmackey@twitter.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Minchan Kim reports that when a system has many swap areas, and tmpfs
swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
back the page cannot locate it, and the read fails with -ENOMEM.
Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
swp_entry_to_pte()) technique for determining maximum usable swap
offset, without stopping to realize that that actually depends upon the
pte swap encoding shifting swap offset to the higher bits and truncating
it there. Whereas our radix_tree swap encoding leaves offset in the
lower bits: it's swap "type" (that is, index of swap area) that was
truncated.
Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().
This does not reduce the usable size of a swap area any further, it
leaves it as claimed when making the original commit: no change from 3.0
on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB
per swapfile on i386 with PAE. It's not a change I would have risked
five years ago, but with x86_64 supported for ten years, I believe it's
appropriate now.
Hmm, and what if some architecture implements its swap pte with offset
encoded below type? That would equally break the maximum usable swap
offset check. Happily, they all follow the same tradition of encoding
offset above type, but I'll prepare a check on that for next.
Reported-and-Reviewed-and-Tested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org [3.1, 3.2, 3.3, 3.4]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core updates (RCU and locking) from Ingo Molnar:
"Most of the diffstat comes from the RCU slow boot regression fixes,
but there's also a debuggability improvements/fixes."
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
memblock: Document memblock_is_region_{memory,reserved}()
rcu: Precompute RCU_FAST_NO_HZ timer offsets
rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure
rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks
rcu: RCU_FAST_NO_HZ detection of callback adoption
spinlock: Indicate that a lockup is only suspected
kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop()
panic: Make panic_on_oops configurable
The size of the slab object is frequently needed. Since we now
have a size field directly in the kmem_cache structure there is no
need anymore of the obj_size macro/function.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Define a struct that describes common fields used in all slab allocators.
A slab allocator either uses the common definition (like SLOB) or is
required to provide members of kmem_cache with the definition given.
After that it will be possible to share code that
only operates on those fields of kmem_cache.
The patch basically takes the slob definition of kmem cache and
uses the field namees for the other allocators.
It also standardizes the names used for basic object lengths in
allocators:
object_size Struct size specified at kmem_cache_create. Basically
the payload expected to be used by the subsystem.
size The size of memory allocator for each object. This size
is larger than object_size and includes padding, alignment
and extra metadata for each object (f.e. for debugging
and rcu).
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>