linux

mirror of https://github.com/FEX-Emu/linux.git synced 2024-12-16 14:02:10 +00:00

Author	SHA1	Message	Date
Minchan Kim	48db54ee2f	mm/migration: fix page corruption during hugepage migration If migrate_huge_page by memory-failure fails , it calls put_page in itself to decrease page reference and caller of migrate_huge_page also calls putback_lru_pages. It can do double free of page so it can make page corruption on page holder. In addtion, clean of pages on caller is consistent behavior with migrate_pages by `cf608ac19c` ("mm: compaction: fix COMPACTPAGEFAILED counting"). Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-02 16:03:18 -08:00
Andrea Arcangeli	57fc4a5ee3	mm: when migrate_pages returns 0, all pages must have been released In some cases migrate_pages could return zero while still leaving a few pages in the pagelist (and some caller wouldn't notice it has to call putback_lru_pages after commit `cf608ac19c` ("mm: compaction: fix COMPACTPAGEFAILED counting")). Add one missing putback_lru_pages not added by commit `cf608ac19c` ("mm: compaction: fix COMPACTPAGEFAILED counting"). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-02 16:03:18 -08:00
Michal Hocko	552b372ba9	memsw: deprecate noswapaccount kernel parameter and schedule it for removal noswapaccount couldn't be used to control memsw for both on/off cases so we have added swapaccount[=0\|1] parameter. This way we can turn the feature in two ways noswapaccount resp. swapaccount=0. We have kept the original noswapaccount but I think we should remove it after some time as it just makes more command line parameters without any advantages and also the code to handle parameters is uglier if we want both parameters. Signed-off-by: Michal Hocko <mhocko@suse.cz> Requested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-02 16:03:18 -08:00
Michal Hocko	fceda1bf49	memsw: handle swapaccount kernel parameter correctly __setup based kernel command line parameters handlers which are handled in obsolete_checksetup are provided with the parameter value including = (more precisely everything right after the parameter name). This means that the current implementation of swapaccount[=1\|0] doesn't work at all because if there is a value for the parameter then we are testing for "0" resp. "1" but we are getting "=0" resp. "=1" and if there is no parameter value we are getting an empty string rather than NULL. The original noswapccount parameter, which doesn't care about the value, works correctly. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-02 16:03:18 -08:00
Michel Lespinasse	fdf4c587a7	mlock: operate on any regions with protection != PROT_NONE As Tao Ma noticed, change `5ecfda0` breaks blktrace. This is because blktrace mmaps a file with PROT_WRITE permissions but without PROT_READ, so my attempt to not unnecessarity break COW during mlock ended up causing mlock to fail with a permission problem. I am proposing to let mlock ignore vma protection in all cases except PROT_NONE. In particular, mlock should not fail for PROT_WRITE regions (as in the blktrace case, which broke at `5ecfda0`) or for PROT_EXEC regions (which seem to me like they were always broken). Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-02 10:20:50 +11:00
Eric Paris	2a7dba391e	fs/vfs/security: pass last path component to LSM on inode creation SELinux would like to implement a new labeling behavior of newly created inodes. We currently label new inodes based on the parent and the creating process. This new behavior would also take into account the name of the new object when deciding the new label. This is not the (supposed) full path, just the last component of the path. This is very useful because creating /etc/shadow is different than creating /etc/passwd but the kernel hooks are unable to differentiate these operations. We currently require that userspace realize it is doing some difficult operation like that and than userspace jumps through SELinux hoops to get things set up correctly. This patch does not implement new behavior, that is obviously contained in a seperate SELinux patch, but it does pass the needed name down to the correct LSM hook. If no such name exists it is fine to pass NULL. Signed-off-by: Eric Paris <eparis@redhat.com>	2011-02-01 11:12:29 -05:00
Linus Torvalds	4fda116852	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm: kmemleak: Allow kmemleak metadata allocations to fail kmemleak: remove memset by using kzalloc	2011-01-31 12:55:38 +10:00
Catalin Marinas	6ae4bd1f0b	kmemleak: Allow kmemleak metadata allocations to fail This patch adds __GFP_NORETRY and __GFP_NOMEMALLOC flags to the kmemleak metadata allocations so that it has a smaller effect on the users of the kernel slab allocator. Since kmemleak allocations can now fail more often, this patch also reduces the verbosity by passing __GFP_NOWARN and not dumping the stack trace when a kmemleak allocation fails. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Toralf Förster <toralf.foerster@gmx.de> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: David Rientjes <rientjes@google.com> Cc: Ted Ts'o <tytso@mit.edu>	2011-01-27 18:32:06 +00:00
Jesper Juhl	0a08739e81	kmemleak: remove memset by using kzalloc We don't need to memset if we just use kzalloc() rather than kmalloc() in kmemleak_test_init(). Signed-off-by: Jesper Juhl <jj@chaosbits.net> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2011-01-27 18:31:51 +00:00
KAMEZAWA Hiroyuki	52dbb90509	memcg: fix race at move_parent around compound_order() A fix up mem_cgroup_move_parent() which use compound_order() in asynchronous manner. This compound_order() may return unknown value because we don't take lock. Use PageTransHuge() and HPAGE_SIZE instead of it. Also clean up for mem_cgroup_move_parent(). - remove unnecessary initialization of local variable. - rename charge_size -> page_size - remove unnecessary (wrong) comment. - added a comment about THP. Note: Current design take compound_page_lock() in caller of move_account(). This should be revisited when we implement direct move_task of hugepage without splitting. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:50:04 +10:00
KAMEZAWA Hiroyuki	3d37c4a919	memcg: bugfix check mem_cgroup_disabled() at split fixup mem_cgroup_disabled() should be checked at splitting. If disabled, no heavy work is necesary. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:50:03 +10:00
KAMEZAWA Hiroyuki	01c88e2d6b	memcg: fix account leak at failure of memsw acconting Commit `4b53433468` ("memcg: clean up try_charge main loop") removes a cancel of charge at case: memory charge-> success. mem+swap charge-> failure. This leaks usage of memory. Fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: <stable@kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:50:03 +10:00
Minchan Kim	28bd65781c	mm: migration: clarify migrate_pages() comment Callers of migrate_pages should putback_lru_pages to return pages isolated to LRU or free list. Now comment is rather confusing. It says caller always have to call it. It is more clear to point out that the caller has to call it if migrate_pages's return value isn't zero. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:50:02 +10:00
Andrea Arcangeli	33a938774f	mm: compaction: don't depend on HUGETLB_PAGE Commit `5d6892407` ("thp: select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled") causes this warning during the configuration process: warning: (TRANSPARENT_HUGEPAGE) selects COMPACTION which has unmet direct dependencies (EXPERIMENTAL && HUGETLB_PAGE && MMU) COMPACTION doesn't depend on HUGETLB_PAGE, it doesn't depend on THP either, it is also useful for regular alloc_pages(order > 0) including the very kernel stack during fork (THREAD_ORDER = 1). It's always better to enable COMPACTION. The warning should be an error because we would end up with MIGRATION not selected, and COMPACTION wouldn't work without migration (despite it seems to build with an inline migrate_pages returning -ENOSYS). I'd also like to remove EXPERIMENTAL: compaction has been in the kernel for some releases (for full safety the default remains disabled which I think is enough). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Luca Tettamanti <kronos.it@gmail.com> Tested-by: Luca Tettamanti <kronos.it@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:50:02 +10:00
Jesper Juhl	8dba474f03	mm/memcontrol.c: fix uninitialized variable use in mem_cgroup_move_parent() In mm/memcontrol.c::mem_cgroup_move_parent() there's a path that jumps to the 'put_back' label ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge); if (ret \|\| !parent) goto put_back; where we'll if (charge > PAGE_SIZE) compound_unlock_irqrestore(page, flags); but, we have not assigned anything to 'flags' at this point, nor have we called 'compound_lock_irqsave()' (which is what sets 'flags'). The 'put_back' label should be moved below the call to compound_unlock_irqrestore() as per this patch. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:50:01 +10:00
David Rientjes	2ff754fa8f	mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator Commit `0e093d9976` ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone") uncovered a livelock in the page allocator that resulted in tasks infinitely looping trying to find memory and kswapd running at 100% cpu. The issue occurs because drain_all_pages() is called immediately following direct reclaim when no memory is freed and try_to_free_pages() returns non-zero because all zones in the zonelist do not have their all_unreclaimable flag set. When draining the per-cpu pagesets back to the buddy allocator for each zone, the zone->pages_scanned counter is cleared to avoid erroneously setting zone->all_unreclaimable later. The problem is that no pages may actually be drained and, thus, the unreclaimable logic never fails direct reclaim so the oom killer may be invoked. This apparently only manifested after wait_iff_congested() was introduced and the zone was full of anonymous memory that would not congest the backing store. The page allocator would infinitely loop if there were no other tasks waiting to be scheduled and clear zone->pages_scanned because of drain_all_pages() as the result of this change before kswapd could scan enough pages to trigger the reclaim logic. Additionally, with every loop of the page allocator and in the reclaim path, kswapd would be kicked and would end up running at 100% cpu. In this scenario, current and kswapd are all running continuously with kswapd incrementing zone->pages_scanned and current clearing it. The problem is even more pronounced when current swaps some of its memory to swap cache and the reclaimable logic then considers all active anonymous memory in the all_unreclaimable logic, which requires a much higher zone->pages_scanned value for try_to_free_pages() to return zero that is never attainable in this scenario. Before wait_iff_congested(), the page allocator would incur an unconditional timeout and allow kswapd to elevate zone->pages_scanned to a level that the oom killer would be called the next time it loops. The fix is to only attempt to drain pcp pages if there is actually a quantity to be drained. The unconditional clearing of zone->pages_scanned in free_pcppages_bulk() need not be changed since other callers already ensure that draining will occur. This patch ensures that free_pcppages_bulk() will actually free memory before calling into it from drain_all_pages() so zone->pages_scanned is only cleared if appropriate. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:50:01 +10:00
David Rientjes	f33261d75b	mm: fix deferred congestion timeout if preferred zone is not allowed Before `0e093d9976` ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone"), preferred_zone was only used for NUMA statistics, to determine the zoneidx from which to allocate from given the type requested, and whether to utilize memory compaction. wait_iff_congested(), though, uses preferred_zone to determine if the congestion wait should be deferred because its dirty pages are backed by a congested bdi. This incorrectly defers the timeout and busy loops in the page allocator with various cond_resched() calls if preferred_zone is not allowed in the current context, usually consuming 100% of a cpu. This patch ensures preferred_zone is an allowed zone in the fastpath depending on whether current is constrained by its cpuset or nodes in its mempolicy (when the nodemask passed is non-NULL). This is correct since the fastpath allocation always passes ALLOC_CPUSET when trying to allocate memory. In the slowpath, this patch resets preferred_zone to the first zone of the allowed type when the allocation is not constrained by current's cpuset, i.e. it does not pass ALLOC_CPUSET. This patch also ensures preferred_zone is from the set of allowed nodes when called from within direct reclaim since allocations are always constrained by cpusets in this context (it is blockable). Both of these uses of cpuset_current_mems_allowed are protected by get_mems_allowed(). Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:50:00 +10:00
Andrew Morton	f95ba941d1	mm/pgtable-generic.c: fix CONFIG_SWAP=n build mips (and sparc32): In file included from arch/mips/include/asm/tlb.h:21, from mm/pgtable-generic.c:9: include/asm-generic/tlb.h: In function `tlb_flush_mmu': include/asm-generic/tlb.h:76: error: implicit declaration of function `release_pages' include/asm-generic/tlb.h: In function `tlb_remove_page': include/asm-generic/tlb.h:105: error: implicit declaration of function `page_cache_release' free_pages_and_swap_cache() and free_page_and_swap_cache() are macros which call release_pages() and page_cache_release(). The obvious fix is to include pagemap.h in swap.h, where those macros are defined. But that breaks sparc for weird reasons. So fix it within mm/pgtable-generic.c instead. Reported-by: Yoichi Yuasa <yuasa@linux-mips.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Sam Ravnborg <sam@ravnborg.org> Cc: Sergei Shtylyov <sshtylyov@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:49:58 +10:00
Johannes Weiner	713735b423	memcg: correctly order reading PCG_USED and pc->mem_cgroup The placement of the read-side barrier is confused: the writer first sets pc->mem_cgroup, then PCG_USED. The read-side barrier has to be between testing PCG_USED and reading pc->mem_cgroup. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:06 -08:00
Jan Kara	382e27daa5	mm: fix truncate_setsize() comment Contrary to what the comment says, truncate_setsize() should be called before filesystem truncated blocks. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:06 -08:00
KAMEZAWA Hiroyuki	987eba66e0	memcg: fix rmdir, force_empty with THP Now, when THP is enabled, memcg's rmdir() function is broken because move_account() for THP page is not supported. This will cause account leak or -EBUSY issue at rmdir(). This patch fixes the issue by supporting move_account() THP pages. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:06 -08:00
KAMEZAWA Hiroyuki	ece35ca810	memcg: fix LRU accounting with THP memory cgroup's LRU stat should take care of size of pages because Transparent Hugepage inserts hugepage into LRU. If this value is the number wrong, memory reclaim will not work well. Note: only head page of THP's huge page is linked into LRU. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:06 -08:00
KAMEZAWA Hiroyuki	ca3e021417	memcg: fix USED bit handling at uncharge in THP Now, under THP: at charge: - PageCgroupUsed bit is set to all page_cgroup on a hugepage. ....set to 512 pages. at uncharge - PageCgroupUsed bit is unset on the head page. So, some pages will remain with "Used" bit. This patch fixes that Used bit is set only to the head page. Used bits for tail pages will be set at splitting if necessary. This patch adds this lock order: compound_lock() -> page_cgroup_move_lock(). [akpm@linux-foundation.org: fix warning] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:06 -08:00
KAMEZAWA Hiroyuki	e401f1761c	memcg: modify accounting function for supporting THP better mem_cgroup_charge_statisics() was designed for charging a page but now, we have transparent hugepage. To fix problems (in following patch) it's required to change the function to get the number of pages as its arguments. The new function gets following as argument. - type of page rather than 'pc' - size of page which is accounted. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:05 -08:00
Johannes Weiner	82478fb7bc	mm: compaction: prevent division-by-zero during user-requested compaction Up until `3e7d344` ("mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim"), compaction skipped calculating the fragmentation index of a zone when compaction was explicitely requested through the procfs knob. However, when compaction_suitable was introduced, it did not come with an extra check for order == -1, set on explicit compaction requests, and passed this order on to the fragmentation index calculation, where it overshifts the number of requested pages, leading to a division by zero. This patch makes sure that order == -1 is recognized as the flag it is rather than passing it along as valid order parameter. [akpm@linux-foundation.org: add comment, per Mel] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:05 -08:00
Jesper Juhl	3305de51bf	mm/vmscan.c: remove duplicate include of compaction.h Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:05 -08:00
Tomi Valkeinen	abb65272a1	memblock: fix memblock_is_region_memory() memblock_is_region_memory() uses reserved memblocks to search for the given region, while it should use the memory memblocks. I encountered the problem with OMAP's framebuffer ram allocation. Normally the ram is allocated dynamically, and this function is not called. However, if we want to pass the framebuffer from the bootloader to the kernel (to retain the boot image), this function is used to check the validity of the kernel parameters for the framebuffer ram area. Signed-off-by: Tomi Valkeinen <tomi.valkeinen@nokia.com> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:05 -08:00
Johannes Weiner	453c719261	thp: keep highpte mapped until it is no longer needed Two users reported THP-related crashes on 32-bit x86 machines. Their oops reports indicated an invalid pte, and subsequent code inspection showed that the highpte is actually used after unmap. The fix is to unmap the pte only after all operations against it are finished. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Ilya Dryomov <idryomov@gmail.com> Reported-by: werner <w.landgraf@ru.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Steven Rostedt <rostedt@goodmis.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-20 17:02:05 -08:00
Linus Torvalds	83896fb5e5	Revert "mm: simplify code of swap.c" This reverts commit `d8505dee1a`. Chris Mason ended up chasing down some page allocation errors and pages stuck waiting on the IO scheduler, and was able to narrow it down to two commits: commit `744ed14427` ("mm: batch activate_page() to reduce lock contention") and `d8505dee1a` ("mm: simplify code of swap.c"). This reverts the second one. Reported-and-debugged-by: Chris Mason <chris.mason@oracle.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: linux-mm <linux-mm@kvack.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-17 14:42:34 -08:00
Linus Torvalds	7a608572a2	Revert "mm: batch activate_page() to reduce lock contention" This reverts commit `744ed14427`. Chris Mason ended up chasing down some page allocation errors and pages stuck waiting on the IO scheduler, and was able to narrow it down to two commits: commit `744ed14427` ("mm: batch activate_page() to reduce lock contention") and `d8505dee1a` ("mm: simplify code of swap.c"). This reverts the first of them. Reported-and-debugged-by: Chris Mason <chris.mason@oracle.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: linux-mm <linux-mm@kvack.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-17 14:42:19 -08:00
Andrea Arcangeli	b3697c0255	fix non-x86 build failure in pmdp_get_and_clear pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as BUG() and they were defined only to diminish the risk of build issues on not-x86 archs and to be consistent with the generic pte methods previously defined in include/asm-generic/pgtable.h. But they are causing more trouble than they were supposed to solve, so it's simpler not to define them when THP is off. This is also correcting the export of pmdp_splitting_flush which is currently unused (x86 isn't using the generic implementation in mm/pgtable-generic.c and no other arch needs that [yet]). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Sam Ravnborg <sam@ravnborg.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-16 15:05:44 -08:00
H Hartley Sweeten	68a1b19559	mm/slab.c: make local symbols static Local symbols should be static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>	2011-01-15 13:28:36 +02:00
Pekka Enberg	597fb188cb	Merge branch 'slub/hotplug' into slab/urgent	2011-01-15 13:28:17 +02:00
Linus Torvalds	52cfd503ad	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits) ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework ACPI: fix resource check message ACPI / Battery: Update information on info notification and resume ACPI: Drop device flag wake_capable ACPI: Always check if _PRW is present before trying to evaluate it ACPI / PM: Check status of power resources under mutexes ACPI / PM: Rename acpi_power_off_device() ACPI / PM: Drop acpi_power_nocheck ACPI / PM: Drop acpi_bus_get_power() Platform / x86: Make fujitsu_laptop use acpi_bus_update_power() ACPI / Fan: Rework the handling of power resources ACPI / PM: Register power resource devices as soon as they are needed ACPI / PM: Register acpi_power_driver early ACPI / PM: Add function for updating device power state consistently ACPI / PM: Add function for device power state initialization ACPI / PM: Introduce __acpi_bus_get_power() ACPI / PM: Introduce function for refcounting device power resources ACPI / PM: Add functions for manipulating lists of power resources ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes ACPICA: Update version to 20101209 ...	2011-01-13 20:15:35 -08:00
Daisuke Nishimura	50de1dd967	memcg: fix memory migration of shmem swapcache In the current implementation mem_cgroup_end_migration() decides whether the page migration has succeeded or not by checking "oldpage->mapping". But if we are tring to migrate a shmem swapcache, the page->mapping of it is NULL from the begining, so the check would be invalid. As a result, mem_cgroup_end_migration() assumes the migration has succeeded even if it's not, so "newpage" would be freed while it's not uncharged. This patch fixes it by passing mem_cgroup_end_migration() the result of the page migration. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:51 -08:00
Jesper Juhl	17295c88a1	memcg: use [kv]zalloc[_node] rather than [kv]malloc+memset In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then followed by memset() to zero the memory. This can be more efficiently achieved by using kzalloc() and vzalloc(). There's also one situation where we can use kzalloc_node() - this is what's new in this version of the patch. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:51 -08:00
Daisuke Nishimura	dfe076b097	memcg: fix deadlock between cpuset and memcg Commit `b1dd693e` ("memcg: avoid deadlock between move charge and try_charge()") can cause another deadlock about mmap_sem on task migration if cpuset and memcg are mounted onto the same mount point. After the commit, cgroup_attach_task() has sequence like: cgroup_attach_task() ss->can_attach() cpuset_can_attach() mem_cgroup_can_attach() down_read(&mmap_sem) (1) ss->attach() cpuset_attach() mpol_rebind_mm() down_write(&mmap_sem) (2) up_write(&mmap_sem) cpuset_migrate_mm() do_migrate_pages() down_read(&mmap_sem) up_read(&mmap_sem) mem_cgroup_move_task() mem_cgroup_clear_mc() up_read(&mmap_sem) We can cause deadlock at (2) because we've already aquire the mmap_sem at (1). But the commit itself is necessary to fix deadlocks which have existed before the commit like: Ex.1) move charge \| try charge --------------------------------------+------------------------------ mem_cgroup_can_attach() \| down_write(&mmap_sem) mc.moving_task = current \| .. mem_cgroup_precharge_mc() \| __mem_cgroup_try_charge() mem_cgroup_count_precharge() \| prepare_to_wait() down_read(&mmap_sem) \| if (mc.moving_task) -> cannot aquire the lock \| -> true \| schedule() \| -> move charge should wake it up Ex.2) move charge \| try charge --------------------------------------+------------------------------ mem_cgroup_can_attach() \| mc.moving_task = current \| mem_cgroup_precharge_mc() \| mem_cgroup_count_precharge() \| down_read(&mmap_sem) \| .. \| up_read(&mmap_sem) \| \| down_write(&mmap_sem) mem_cgroup_move_task() \| .. mem_cgroup_move_charge() \| __mem_cgroup_try_charge() down_read(&mmap_sem) \| prepare_to_wait() -> cannot aquire the lock \| if (mc.moving_task) \| -> true \| schedule() \| -> move charge should wake it up This patch fixes all of these problems by: 1. revert the commit. 2. To fix the Ex.1, we set mc.moving_task after mem_cgroup_count_precharge() has released the mmap_sem. 3. To fix the Ex.2, we use down_read_trylock() instead of down_read() in mem_cgroup_move_charge() and, if it has failed to aquire the lock, cancel all extra charges, wake up all waiters, and retry trylock. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reported-by: Ben Blum <bblum@andrew.cmu.edu> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Menage <menage@google.com> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:51 -08:00
Minchan Kim	043d18b1e5	memcg: remove unnecessary return from void-returning mem_cgroup_del_lru_list() Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:50 -08:00
Johannes Weiner	f3e8eb70b1	memcg: fix unit mismatch in memcg oom limit calculation Adding the number of swap pages to the byte limit of a memory control group makes no sense. Convert the pages to bytes before adding them. The only user of this code is the OOM killer, and the way it is used means that the error results in a higher OOM badness value. Since the cgroup limit is the same for all tasks in the cgroup, the error should have no practical impact at the moment. But let's not wait for future or changing users to trip over it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Thelen <gthelen@google.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:50 -08:00
KAMEZAWA Hiroyuki	dbd4ea78f0	memcg: add lock to synchronize page accounting and migration Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page accounting and migration code. This reworks the locking scheme of _update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK, which is always taken under IRQ disable. 1. If pages are being migrated from a memcg, then updates to that memcg page statistics are protected by grabbing PCG_MOVE_LOCK using move_lock_page_cgroup(). In an upcoming commit, memcg dirty page accounting will be updating memcg page accounting (specifically: num writeback pages) from IRQ context (softirq). Avoid a deadlocking nested spin lock attempt by disabling irq on the local processor when grabbing the PCG_MOVE_LOCK. 2. lock for update_page_stat is used only for avoiding race with move_account(). So, IRQ awareness of lock_page_cgroup() itself is not a problem. The problem is between mem_cgroup_update_page_stat() and mem_cgroup_move_account_page(). Trade-off: * Changing lock_page_cgroup() to always disable IRQ (or local_bh) has some impacts on performance and I think it's bad to disable IRQ when it's not necessary. * adding a new lock makes move_account() slower. Score is here. Performance Impact: moving a 8G anon process. Before: real 0m0.792s user 0m0.000s sys 0m0.780s After: real 0m0.854s user 0m0.000s sys 0m0.842s This score is bad but planned patches for optimization can reduce this impact. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Andrea Righi <arighi@develer.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:50 -08:00
Greg Thelen	2a7106f2cb	memcg: create extensible page stat update routines Replace usage of the mem_cgroup_update_file_mapped() memcg statistic update routine with two new routines: * mem_cgroup_inc_page_stat() * mem_cgroup_dec_page_stat() As before, only the file_mapped statistic is managed. However, these more general interfaces allow for new statistics to be more easily added. New statistics are added with memcg dirty page accounting. Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Andrea Righi <arighi@develer.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:50 -08:00
Shaohua Li	744ed14427	mm: batch activate_page() to reduce lock contention The zone->lru_lock is heavily contented in workload where activate_page() is frequently used. We could do batch activate_page() to reduce the lock contention. The batched pages will be added into zone list when the pool is full or page reclaim is trying to drain them. For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes, processes shared map to the file. Each process read access the whole file and then exit. The process exit will do unmap_vmas() and cause a lot of activate_page() call. In such workload, we saw about 58% total time reduction with below patch. Other workloads with a lot of activate_page also benefits a lot too. I tested some microbenchmarks: case-anon-cow-rand-mt 0.58% case-anon-cow-rand -3.30% case-anon-cow-seq-mt -0.51% case-anon-cow-seq -5.68% case-anon-r-rand-mt 0.23% case-anon-r-rand 0.81% case-anon-r-seq-mt -0.71% case-anon-r-seq -1.99% case-anon-rx-rand-mt 2.11% case-anon-rx-seq-mt 3.46% case-anon-w-rand-mt -0.03% case-anon-w-rand -0.50% case-anon-w-seq-mt -1.08% case-anon-w-seq -0.12% case-anon-wx-rand-mt -5.02% case-anon-wx-seq-mt -1.43% case-fork 1.65% case-fork-sleep -0.07% case-fork-withmem 1.39% case-hugetlb -0.59% case-lru-file-mmap-read-mt -0.54% case-lru-file-mmap-read 0.61% case-lru-file-mmap-read-rand -2.24% case-lru-file-readonce -0.64% case-lru-file-readtwice -11.69% case-lru-memcg -1.35% case-mmap-pread-rand-mt 1.88% case-mmap-pread-rand -15.26% case-mmap-pread-seq-mt 0.89% case-mmap-pread-seq -69.72% case-mmap-xread-rand-mt 0.71% case-mmap-xread-seq-mt 0.38% The most significent are: case-lru-file-readtwice -11.69% case-mmap-pread-rand -15.26% case-mmap-pread-seq -69.72% which use activate_page a lot. others are basically variations because each run has slightly difference. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:50 -08:00
Shaohua Li	d8505dee1a	mm: simplify code of swap.c Clean up code and remove duplicate code. Next patch will use pagevec_lru_move_fn introduced here too. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:50 -08:00
Andrew Morton	c06b1fca18	mm/page_alloc.c: don't cache `current' in a local It's old-fashioned and unneeded. akpm:/usr/src/25> size mm/page_alloc.o text data bss dec hex filename 39884 1241317 18808 1300009 13d629 mm/page_alloc.o (before) 39838 1241317 18808 1299963 13d5fb mm/page_alloc.o (after) Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:49 -08:00
Hugh Dickins	fd4a4663db	mm: fix hugepage migration 2.6.37 added an unmap_and_move_huge_page() for memory failure recovery, but its anon_vma handling was still based around the 2.6.35 conventions. Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma, drop_anon_vma in the same way as we're now changing unmap_and_move(). I don't particularly like to propose this for stable when I've not seen its problems in practice nor tested the solution: but it's clearly out of synch at present. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: <stable@kernel.org> [2.6.37, 2.6.36] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:49 -08:00
Hugh Dickins	1ce82b69e9	mm: fix migration hangs on anon_vma lock Increased usage of page migration in mmotm reveals that the anon_vma locking in unmap_and_move() has been deficient since 2.6.36 (or even earlier). Review at the time of `f18194275c` ("mm: fix hang on anon_vma->root->lock") missed the issue here: the anon_vma to which we get a reference may already have been freed back to its slab (it is in use when we check page_mapped, but that can change), and so its anon_vma->root may be switched at any moment by reuse in anon_vma_prepare. Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's not: just rely on page_lock_anon_vma() to do all the hard thinking for us, then we don't need any rcu read locking over here. In removing the rcu_unlock label: since PageAnon is a bit in page->mapping, it's impossible for a !page->mapping page to be anon; but insert VM_BUG_ON in case the implementation ever changes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: <stable@kernel.org> [2.6.37, 2.6.36] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:49 -08:00
Hugh Dickins	2919bfd075	ksm: drain pagevecs to lru It was hard to explain the page counts which were causing new LTP tests of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: CAI Qian <caiqian@redhat.com> Cc:Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:49 -08:00
Eric B Munson	73ae31e598	hugetlb: fix handling of parse errors in sysfs When parsing changes to the huge page pool sizes made from userspace via the sysfs interface, bogus input values are being covered up by nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0 when strict_strtoul returns an error. This can cause an infinite loop in the nr_hugepages_store code. This patch changes the return value for these functions to -EINVAL when strict_strtoul returns an error. Signed-off-by: Eric B Munson <emunson@mgebm.net> Reported-by: CAI Qian <caiqian@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Eric B Munson <emunson@mgebm.net> Cc: Michal Hocko <mhocko@suse.cz> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:49 -08:00
Eric B Munson	adbe8726dc	hugetlb: do not allow pagesize >= MAX_ORDER pool adjustment Huge pages with order >= MAX_ORDER must be allocated at boot via the kernel command line, they cannot be allocated or freed once the kernel is up and running. Currently we allow values to be written to the sysfs and sysctl files controling pool size for these huge page sizes. This patch makes the store functions for nr_hugepages and nr_overcommit_hugepages return -EINVAL when the pool for a page size >= MAX_ORDER is changed. [akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()] [caiqian@redhat.com: add checking in hugetlb_overcommit_handler()] Signed-off-by: Eric B Munson <emunson@mgebm.net> Reported-by: CAI Qian <caiqian@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:49 -08:00
Michal Hocko	08d4a24659	hugetlb: check the return value of string conversion in sysctl handler proc_doulongvec_minmax may fail if the given buffer doesn't represent a valid number. If we provide something invalid we will initialize the resulting value (nr_overcommit_huge_pages in this case) to a random value from the stack. The issue was introduced by `a3d0c6aa` when the default handler has been replaced by the helper function where we do not check the return value. Reproducer: echo "" > /proc/sys/vm/nr_overcommit_hugepages [akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code] Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: CAI Qian <caiqian@redhat.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:49 -08:00
Andrew Morton	684265d4a3	mm/dmapool.c: use TASK_UNINTERRUPTIBLE in dma_pool_alloc() As it stands this code will degenerate into a busy-wait if the calling task has signal_pending(). Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:48 -08:00
Rolf Eike Beer	84bc227d7f	mm/dmapool.c: take lock only once in dma_pool_free() dma_pool_free() scans for the page to free in the pool list holding the pool lock. Then it releases the lock basically to acquire it immediately again. Modify the code to only take the lock once. This will do some additional loops and computations with the lock held in if memory debugging is activated. If it is not activated the only new operations with this lock is one if and one substraction. Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:48 -08:00
KyongHo Cho	43506fad21	mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists The previous approach of calucation of combined index was page_idx & ~(1 << order)) but we have same result with page_idx & buddy_idx This reduces instructions slightly as well as enhances readability. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix used-unintialised warning] Signed-off-by: KyongHo Cho <pullip.cho@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:48 -08:00
Jiri Kosina	5520e89485	brk: fix min_brk lower bound computation for COMPAT_BRK Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can still be overriden by randomize_va_space sysctl. If this is the case, the min_brk computation in sys_brk() implementation is wrong, as it solely takes into account COMPAT_BRK setting, assuming that brk start is not randomized. But that might not be the case if randomize_va_space sysctl has been set to '2' at the time the binary has been loaded from disk. In such case, the check has to be done in a same way as in !CONFIG_COMPAT_BRK case. In addition to that, the check for the COMPAT_BRK case introduced back in `a5b4592c` ("brk: make sys_brk() honor COMPAT_BRK when computing lower bound") is slightly wrong -- the lower bound shouldn't be mm->end_code, but mm->end_data instead, as that's where the legacy applications expect brk section to start (i.e. immediately after last global variable). [akpm@linux-foundation.org: fix comment] Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:48 -08:00
Jesper Juhl	32d6feadf4	mm/hugetlb.c: fix error-path memory leak in nr_hugepages_store_common() The NODEMASK_ALLOC macro may dynamically allocate memory for its second argument ('nodes_allowed' in this context). In nr_hugepages_store_common() we may abort early if strict_strtoul() fails, but in that case we do not free the memory already allocated to 'nodes_allowed', causing a memory leak. This patch closes the leak by freeing the memory in the error path. [akpm@linux-foundation.org: use NODEMASK_FREE, per Minchan Kim] Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:48 -08:00
Mel Gorman	29c1f677d4	mm: migration: use rcu_dereference_protected when dereferencing the radix tree slot during file page migration migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for anonymous pages, as introduced by git commit `989f89c57e` ("fix rcu_read_lock() in page migraton"). The point of the RCU protection there is part of getting a stable reference to anon_vma and is only held for anon pages as file pages are locked which is sufficient protection against freeing. However, while a file page's mapping is being migrated, the radix tree is double checked to ensure it is the expected page. This uses radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held triggering the following warning. [ 173.674290] =================================================== [ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ] [ 173.676016] --------------------------------------------------- [ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection! [ 173.676016] [ 173.676016] other info that might help us debug this: [ 173.676016] [ 173.676016] [ 173.676016] rcu_scheduler_active = 1, debug_locks = 0 [ 173.676016] 1 lock held by hugeadm/2899: [ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab [ 173.676016] [ 173.676016] stack backtrace: [ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild [ 173.676016] Call Trace: [ 173.676016] [<c128cc01>] ? printk+0x14/0x1b [ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86 [ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab [ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39 [ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107 [ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107 [ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae [ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa This patch introduces radix_tree_deref_slot_protected() which calls rcu_dereference_protected(). Users of it must pass in the mapping->tree_lock that is protecting this dereference. Holding the tree lock protects against parallel updaters of the radix tree meaning that rcu_dereference_protected is allowable. [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Milton Miller <miltonm@bga.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> [2.6.37.early] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:48 -08:00
Andrea Arcangeli	22e5c47ee2	thp: add compound_trans_head() helper Cleanup some code with common compound_trans_head helper. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:48 -08:00
Andrea Arcangeli	29ad768cfc	thp: KSM on THP This makes KSM full operational with THP pages. Subpages are scanned while the hugepage is still in place and delivering max cpu performance, and only if there's a match and we're going to deduplicate memory, the single hugepages with the subpage match is split. There will be no false sharing between ksmd and khugepaged. khugepaged won't collapse 2m virtual regions with KSM pages inside. ksmd also should only split pages when the checksum matches and we're likely to split an hugepage for some long living ksm page (usual ksm heuristic to avoid sharing pages that get de-cowed). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:48 -08:00
Andrea Arcangeli	60ab3244ec	thp: khugepaged: make khugepaged aware about madvise MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after mmap and before touching the memory. While this is enough for most usages, it's little effort to make madvise more dynamic at runtime on an existing mapping by making khugepaged aware about madvise. MADV_HUGEPAGE: register in khugepaged immediately without waiting a page fault (that may not ever happen if all pages are already mapped and the "enabled" knob was set to madvise during the initial page faults). MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop collapsing pages where not needed. [akpm@linux-foundation.org: tweak comment] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:47 -08:00
Andrea Arcangeli	a664b2d855	thp: madvise(MADV_NOHUGEPAGE) Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be hugepage backed. Return -EINVAL if the vma is not of an anonymous type, or the feature isn't built into the kernel. Never silently return success. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:47 -08:00
Andrea Arcangeli	37c2ac7872	thp: compound_trans_order Read compound_trans_order safe. Noop for CONFIG_TRANSPARENT_HUGEPAGE=n. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:47 -08:00
Andrea Arcangeli	91600e9e59	thp: fix memory-failure hugetlbfs vs THP collision hugetlbfs was changed to allow memory failure to migrate the hugetlbfs pages and that broke THP as split_huge_page was then called on hugetlbfs pages too. compound_head/order was also run unsafe on THP pages that can be splitted at any time. All compound_head() invocations in memory-failure.c that are run on pages that aren't pinned and that can be freed and reused from under us (while compound_head is running) are buggy because compound_head can return a dangling pointer, but I'm not fixing this as this is a generic memory-failure bug not specific to THP but it applies to hugetlbfs too, so I can fix it later after THP is merged upstream. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:47 -08:00
Andrea Arcangeli	14d1a55cd2	thp: add debug checks for mapcount related invariants Add debug checks for invariants that if broken could lead to mapcount vs page_mapcount debug checks to trigger later in split_huge_page. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:47 -08:00
Rik van Riel	9992af1029	thp: scale nr_rotated to balance memory pressure Make sure we scale up nr_rotated when we encounter a referenced transparent huge page. This ensures pageout scanning balance is not distorted when there are huge pages on the LRU. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:46 -08:00
Rik van Riel	2c888cfbc1	thp: fix anon memory statistics with transparent hugepages Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU statistics, so the Active(anon) and Inactive(anon) statistics in /proc/meminfo are correct. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:46 -08:00
Rik van Riel	97562cd243	thp: disable transparent hugepages by default on small systems On small systems, the extra memory used by the anti-fragmentation memory reserve and simply because huge pages are smaller than large pages can easily outweigh the benefits of less TLB misses. A less obvious concern is if run on a NUMA machine with asymmetric node sizes and one of them is very small. The reserve could make the node unusable. In case of the crashdump kernel, OOMs have been observed due to the anti-fragmentation memory reserve taking up a large fraction of the crashdump image. This patch disables transparent hugepages on systems with less than 1GB of RAM, but the hugepage subsystem is fully initialized so administrators can enable THP through /sys if desired. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Avi Kiviti <avi@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:46 -08:00
Andrea Arcangeli	c5a73c3d55	thp: use compaction for all allocation orders It makes no sense not to enable compaction for small order pages as we don't want to end up with bad order 2 allocations and good and graceful order 9 allocations. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:46 -08:00
Andrea Arcangeli	5a03b051ed	thp: use compaction in kswapd for GFP_ATOMIC order > 0 This takes advantage of memory compaction to properly generate pages of order > 0 if regular page reclaim fails and priority level becomes more severe and we don't reach the proper watermarks. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:46 -08:00
Andrea Arcangeli	878aee7d6b	thp: freeze khugepaged and ksmd It's unclear why schedule friendly kernel threads can't be taken away by the CPU through the scheduler itself. It's safer to stop them as they can trigger memory allocation, if kswapd also freezes itself to avoid generating I/O they have too. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:46 -08:00
Andrea Arcangeli	8ee53820ed	thp: mmu_notifier_test_young For GRU and EPT, we need gup-fast to set referenced bit too (this is why it's correct to return 0 when shadow_access_mask is zero, it requires gup-fast to set the referenced bit). qemu-kvm access already sets the young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow paging EPT minor fault we relay on gup-fast to signal the page is in use... We also need to check the young bits on the secondary pagetables for NPT and not nested shadow mmu as the data may never get accessed again by the primary pte. Without this closer accuracy, we'd have to remove the heuristic that avoids collapsing hugepages in hugepage virtual regions that have not even a single subpage in use. ->test_young is full backwards compatible with GRU and other usages that don't have young bits in pagetables set by the hardware and that should nuke the secondary mmu mappings when ->clear_flush_young runs just like EPT does. Removing the heuristic that checks the young bit in khugepaged/collapse_huge_page completely isn't so bad either probably but I thought it was worth it and this makes it reliable. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:46 -08:00
Andrea Arcangeli	4b7167b9ff	thp: don't allow transparent hugepage support without PSE Archs implementing Transparent Hugepage Support must implement a function called has_transparent_hugepage to be sure the virtual or physical CPU supports Transparent Hugepages. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:45 -08:00
Andrea Arcangeli	94fcc585fb	thp: avoid breaking huge pmd invariants in case of vma_adjust failures An huge pmd can only be mapped if the corresponding 2M virtual range is fully contained in the vma. At times the VM calls split_vma twice, if the first split_vma succeeds and the second fail, the first split_vma remains in effect and it's not rolled back. For split_vma or vma_adjust to fail an allocation failure is needed so it's a very unlikely event (the out of memory killer would normally fire before any allocation failure is visible to kernel and userland and if an out of memory condition happens it's unlikely to happen exactly here). Nevertheless it's safer to ensure that no huge pmd can be left around if the vma is adjusted in a way that can't fit hugepages anymore at the new vm_start/vm_end address. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:45 -08:00
Andrea Arcangeli	bc835011af	thp: transhuge isolate_migratepages() It's not worth migrating transparent hugepages during compaction. Those hugepages don't create fragmentation. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:45 -08:00
Andrea Arcangeli	5d6892407c	thp: select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled With transparent hugepage support we need compaction for the "defrag" sysfs controls to be effective. At the moment THP hangs the system if COMPACTION isn't selected, as without COMPACTION lumpy reclaim wouldn't be entirely disabled. So at the moment it's not orthogonal. When lumpy will be removed from the VM I can remove the select COMPACTION in theory, but then 99% of THP users would be still doing a mistake in disabling compaction, even if the mistake won't return in fatal runtime but just slightly degraded performance. So from a theoretical standpoing forcing the below select is not needed (the dependency isn't strict nor at compile time nor at runtime) but from a practical standpoint it is safer. If anybody really wants THP to run without compaction, it'd be such a weird setup that editing the Kconfig file to allow it will be surely not a problem. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:45 -08:00
Andrea Arcangeli	13ece886d9	thp: transparent hugepage config choice Allow to choose between the always\|madvise default for page faults and khugepaged at config time. madvise guarantees zero risk of higher memory footprint for applications (applications using madvise(MADV_HUGEPAGE) won't risk to use any more memory by backing their virtual regions with hugepages). Initially set the default to N and don't depend on EMBEDDED. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:45 -08:00
Andrea Arcangeli	ce83d2174e	thp: allocate memory in khugepaged outside of mmap_sem write mode This tries to be more friendly to filesystem in userland, with userland backends that allocate memory in the I/O paths and that could deadlock if khugepaged holds the mmap_sem write mode of the userland backend while allocating memory. Memory allocation may wait for writeback I/O completion from the daemon that may be blocked in the mmap_sem read mode if a page fault happens and the daemon wasn't using mlock for the memory required for the I/O submission and completion. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:45 -08:00
Andrea Arcangeli	0bbbc0b33d	thp: add numa awareness to hugepage allocations It's mostly a matter of replacing alloc_pages with alloc_pages_vma after introducing alloc_pages_vma. khugepaged needs special handling as the allocation has to happen inside collapse_huge_page where the vma is known and an error has to be returned to the outer loop to sleep alloc_sleep_millisecs in case of failure. But it retains the more efficient logic of handling allocation failures in khugepaged in case of CONFIG_NUMA=n. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:45 -08:00
Andrea Arcangeli	d39d33c332	thp: enable direct defrag With memory compaction in, and lumpy-reclaim disabled, it seems safe enough to defrag memory during the (synchronous) transparent hugepage page faults (TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) and not only during khugepaged (async) hugepage allocations that was already enabled even before memory compaction was in (TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:45 -08:00
Andrea Arcangeli	f000565adb	thp: set recommended min free kbytes If transparent hugepage is enabled initialize min_free_kbytes to an optimal value by default. This moves the hugeadm algorithm in kernel. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:44 -08:00
Johannes Weiner	cd7548ab36	thp: mprotect: transparent huge page support Natively handle huge pmds when changing page tables on behalf of mprotect(). I left out update_mmu_cache() because we do not need it on x86 anyway but more importantly the interface works on ptes, not pmds. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:44 -08:00
Johannes Weiner	b36f5b0710	thp: mprotect: pass vma down to page table walkers Flushing the tlb for huge pmds requires the vma's anon_vma, so pass along the vma instead of the mm, we can always get the latter when we need it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:44 -08:00
Johannes Weiner	0ca1634d41	thp: mincore transparent hugepage support Handle transparent huge page pmd entries natively instead of splitting them into subpages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:44 -08:00
Johannes Weiner	f2d6bfe9ff	thp: add x86 32bit support Add support for transparent hugepages to x86 32bit. Share the same VM_ bitflag for VM_MAPPED_COPY. mm/nommu.c will never support transparent hugepages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:44 -08:00
Andrea Arcangeli	5f24ce5fd3	thp: remove PG_buddy PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can be added to page->flags without overflowing (because of the sparse section bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also has to move the memory hotplug code from _mapcount to lru.next to avoid any risk of clashes. We can't use lru.next for PG_buddy removal, but memory hotplug can use lru.next even more easily than the mapcount instead. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:43 -08:00
Andrea Arcangeli	21ae5b0175	thp: skip transhuge pages in ksm for now Skip transhuge pages in ksm for now. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:43 -08:00
Andrea Arcangeli	b15d00b6af	thp: khugepaged vma merge register in khugepaged if the vma grows. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:43 -08:00
Andrea Arcangeli	ba76149f47	thp: khugepaged Add khugepaged to relocate fragmented pages into hugepages if new hugepages become available. (this is indipendent of the defrag logic that will have to make new hugepages available) The fundamental reason why khugepaged is unavoidable, is that some memory can be fragmented and not everything can be relocated. So when a virtual machine quits and releases gigabytes of hugepages, we want to use those freely available hugepages to create huge-pmd in the other virtual machines that may be running on fragmented memory, to maximize the CPU efficiency at all times. The scan is slow, it takes nearly zero cpu time, except when it copies data (in which case it means we definitely want to pay for that cpu time) so it seems a good tradeoff. In addition to the hugepages being released by other process releasing memory, we have the strong suspicion that the performance impact of potentially defragmenting hugepages during or before each page fault could lead to more performance inconsistency than allocating small pages at first and having them collapsed into large pages later... if they prove themselfs to be long lived mappings (khugepaged scan is slow so short lived mappings have low probability to run into khugepaged if compared to long lived mappings). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:43 -08:00
Andrea Arcangeli	79134171df	thp: transparent hugepage vmstat Add hugepage stat information to /proc/vmstat and /proc/meminfo. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:43 -08:00
Andrea Arcangeli	b9bbfbe30a	thp: memcg huge memory Add memcg charge/uncharge to hugepage faults in huge_memory.c. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:43 -08:00
Daisuke Nishimura	152c9ccb75	thp: transhuge-memcg: commit tail pages at charge By this patch, when a transparent hugepage is charged, not only the head page but also all the tail pages are committed, IOW pc->mem_cgroup and pc->flags of tail pages are set. Without this patch: - Tail pages are not linked to any memcg's LRU at splitting. This causes many problems, for example, the charged memcg's directory can never be rmdir'ed because it doesn't have enough pages to scan to make the usage decrease to 0. - "rss" field in memory.stat would be incorrect. Moreover, usage_in_bytes in root cgroup is calculated by the stat not by res_counter(since 2.6.32), it would be incorrect too. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:43 -08:00
Andrea Arcangeli	ec1685109f	thp: memcg compound Teach memcg to charge/uncharge compound pages. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:43 -08:00
Andrea Arcangeli	500d65d471	thp: pmd_trans_huge migrate bugcheck No pmd_trans_huge should ever materialize in migration ptes areas, because we split the hugepage before migration ptes are instantiated. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:42 -08:00
Andrea Arcangeli	0af4e98b6b	thp: madvise(MADV_HUGEPAGE) Add madvise MADV_HUGEPAGE to mark regions that are important to be hugepage backed. Return -EINVAL if the vma is not of an anonymous type, or the feature isn't built into the kernel. Never silently return success. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:42 -08:00
Andrea Arcangeli	f66055ab6f	thp: verify pmd_trans_huge isn't leaking pte_trans_huge must not leak in certain vmas like the mmio special pfn or filebacked mappings. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:42 -08:00
Andrea Arcangeli	05759d380a	thp: split_huge_page anon_vma ordering dependency This documents how split_huge_page is safe vs new vma inserctions into the anon_vma that may have already released the anon_vma->lock but not established pmds yet when split_huge_page starts. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:42 -08:00
Hugh Dickins	8a07651ee8	thp: transparent hugepage core fixlet If you configure THP in addition to HUGETLB_PAGE on x86_32 without PAE, the p?d-folding works out that munlock_vma_pages_range() can crash to follow_page()'s pud_huge() BUG_ON(flags & FOLL_GET): it needs the same VM_HUGETLB check already there on the pmd_huge() line. Conveniently, openSUSE provides a "blogd" which tests this out at startup! Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:42 -08:00
Andrea Arcangeli	71e3aac072	thp: transparent hugepage core Lately I've been working to make KVM use hugepages transparently without the usual restrictions of hugetlbfs. Some of the restrictions I'd like to see removed: 1) hugepages have to be swappable or the guest physical memory remains locked in RAM and can't be paged out to swap 2) if a hugepage allocation fails, regular pages should be allocated instead and mixed in the same vma without any failure and without userland noticing 3) if some task quits and more hugepages become available in the buddy, guest physical memory backed by regular pages should be relocated on hugepages automatically in regions under madvise(MADV_HUGEPAGE) (ideally event driven by waking up the kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes not null) 4) avoidance of reservation and maximization of use of hugepages whenever possible. Reservation (needed to avoid runtime fatal faliures) may be ok for 1 machine with 1 database with 1 database cache with 1 database cache size known at boot time. It's definitely not feasible with a virtualization hypervisor usage like RHEV-H that runs an unknown number of virtual machines with an unknown size of each virtual machine with an unknown amount of pagecache that could be potentially useful in the host for guest not using O_DIRECT (aka cache=off). hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest both uses this patch (though the guest will limit the addition speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime). The first (and more tedious) part of this work requires allowing the VM to handle anonymous hugepages mixed with regular pages transparently on regular anonymous vmas. This is what this patch tries to achieve in the least intrusive possible way. We want hugepages and hugetlb to be used in a way so that all applications can benefit without changes (as usual we leverage the KVM virtualization design: by improving the Linux VM at large, KVM gets the performance boost too). The most important design choice is: always fallback to 4k allocation if the hugepage allocation fails! This is the _very_ opposite of some large pagecache patches that failed with -EIO back then if a 64k (or similar) allocation failed... Second important decision (to reduce the impact of the feature on the existing pagetable handling code) is that at any time we can split an hugepage into 512 regular pages and it has to be done with an operation that can't fail. This way the reliability of the swapping isn't decreased (no need to allocate memory when we are short on memory to swap) and it's trivial to plug a split_huge_page* one-liner where needed without polluting the VM. Over time we can teach mprotect, mremap and friends to handle pmd_trans_huge natively without calling split_huge_page. The fact it can't fail isn't just for swap: if split_huge_page would return -ENOMEM (instead of the current void) we'd need to rollback the mprotect from the middle of it (ideally including undoing the split_vma) which would be a big change and in the very wrong direction (it'd likely be simpler not to call split_huge_page at all and to teach mprotect and friends to handle hugepages instead of rolling them back from the middle). In short the very value of split_huge_page is that it can't fail. The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and incremental and it'll just be an "harmless" addition later if this initial part is agreed upon. It also should be noted that locking-wise replacing regular pages with hugepages is going to be very easy if compared to what I'm doing below in split_huge_page, as it will only happen when page_count(page) matches page_mapcount(page) if we can take the PG_lock and mmap_sem in write mode. collapse_huge_page will be a "best effort" that (unlike split_huge_page) can fail at the minimal sign of trouble and we can try again later. collapse_huge_page will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will work similar to madvise(MADV_MERGEABLE). The default I like is that transparent hugepages are used at page fault time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set to three values "always", "madvise", "never" which mean respectively that hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions, or never used. /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage allocation should defrag memory aggressively "always", only inside "madvise" regions, or "never". The pmd_trans_splitting/pmd_trans_huge locking is very solid. The put_page (from get_user_page users that can't use mmu notifier like O_DIRECT) that runs against a __split_huge_page_refcount instead was a pain to serialize in a way that would result always in a coherent page count for both tail and head. I think my locking solution with a compound_lock taken only after the page_first is valid and is still a PageHead should be safe but it surely needs review from SMP race point of view. In short there is no current existing way to serialize the O_DIRECT final put_page against split_huge_page_refcount so I had to invent a new one (O_DIRECT loses knowledge on the mapping status by the time gup_fast returns so...). And I didn't want to impact all gup/gup_fast users for now, maybe if we change the gup interface substantially we can avoid this locking, I admit I didn't think too much about it because changing the gup unpinning interface would be invasive. If we ignored O_DIRECT we could stick to the existing compound refcounting code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu notifier user) would call it without FOLL_GET (and if FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the current task mmu notifier list yet). But O_DIRECT is fundamental for decent performance of virtualized I/O on fast storage so we can't avoid it to solve the race of put_page against split_huge_page_refcount to achieve a complete hugepage feature for KVM. Swap and oom works fine (well just like with regular pages ;). MMU notifier is handled transparently too, with the exception of the young bit on the pmd, that didn't have a range check but I think KVM will be fine because the whole point of hugepages is that EPT/NPT will also use a huge pmd when they notice gup returns pages with PageCompound set, so they won't care of a range and there's just the pmd young bit to check in that case. NOTE: in some cases if the L2 cache is small, this may slowdown and waste memory during COWs because 4M of memory are accessed in a single fault instead of 8k (the payoff is that after COW the program can run faster). So we might want to switch the copy_huge_page (and clear_huge_page too) to not temporal stores. I also extensively researched ways to avoid this cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that fully implemented prefault) but I concluded they're not worth it and they add an huge additional complexity and they remove all tlb benefits until the full hugepage has been faulted in, to save a little bit of memory and some cache during app startup, but they still don't improve substantially the cache-trashing during startup if the prefault happens in >4k chunks. One reason is that those 4k pte entries copied are still mapped on a perfectly cache-colored hugepage, so the trashing is the worst one can generate in those copies (cow of 4k page copies aren't so well colored so they trashes less, but again this results in software running faster after the page fault). Those prefault patches allowed things like a pte where post-cow pages were local 4k regular anon pages and the not-yet-cowed pte entries were pointing in the middle of some hugepage mapped read-only. If it doesn't payoff substantially with todays hardware it will payoff even less in the future with larger l2 caches, and the prefault logic would blot the VM a lot. If one is emebdded transparent_hugepage can be disabled during boot with sysfs or with the boot commandline parameter transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages inside madvise regions) that will ensure not a single hugepage is allocated at boot time. It is simple enough to just disable transparent hugepage globally and let transparent hugepages be allocated selectively by applications in the MADV_HUGEPAGE region (both at page fault time, and if enabled with the collapse_huge_page too through the kernel daemon). This patch supports only hugepages mapped in the pmd, archs that have smaller hugepages will not fit in this patch alone. Also some archs like power have certain tlb limits that prevents mixing different page size in the same regions so they will not fit in this framework that requires "graceful fallback" to basic PAGE_SIZE in case of physical memory fragmentation. hugetlbfs remains a perfect fit for those because its software limits happen to match the hardware limits. hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped to be found not fragmented after a certain system uptime and that would be very expensive to defragment with relocation, so requiring reservation. hugetlbfs is the "reservation way", the point of transparent hugepages is not to have any reservation at all and maximizing the use of cache and hugepages at all times automatically. Some performance result: vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep ages3 memset page fault 1566023 memset tlb miss 453854 memset second tlb miss 453321 random access tlb miss 41635 random access second tlb miss 41658 vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3 memset page fault 1566471 memset tlb miss 453375 memset second tlb miss 453320 random access tlb miss 41636 random access second tlb miss 41637 vmx andrea # ./largepages3 memset page fault 1566642 memset tlb miss 453417 memset second tlb miss 453313 random access tlb miss 41630 random access second tlb miss 41647 vmx andrea # ./largepages3 memset page fault 1566872 memset tlb miss 453418 memset second tlb miss 453315 random access tlb miss 41618 random access second tlb miss 41659 vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage vmx andrea # ./largepages3 memset page fault 2182476 memset tlb miss 460305 memset second tlb miss 460179 random access tlb miss 44483 random access second tlb miss 44186 vmx andrea # ./largepages3 memset page fault 2182791 memset tlb miss 460742 memset second tlb miss 459962 random access tlb miss 43981 random access second tlb miss 43988 ============ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #define SIZE (3UL102410241024) int main() { char p = malloc(SIZE), p2; struct timeval before, after; gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset page fault %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset tlb miss %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) p2 = 0; gettimeofday(&after, NULL); printf("random access tlb miss %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) p2 = 0; gettimeofday(&after, NULL); printf("random access second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); return 0; } ============ Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:42 -08:00
Andrea Arcangeli	5c3240d92e	thp: don't alloc harder for gfp nomemalloc even if nowait Not worth throwing away the precious reserved free memory pool for allocations that can fail gracefully (either through mempool or because they're transhuge allocations later falling back to 4k allocations). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:42 -08:00
Andrea Arcangeli	32dba98e08	thp: _GFP_NO_KSWAPD Transparent hugepage allocations must be allowed not to invoke kswapd or any other kind of indirect reclaim (especially when the defrag sysfs is control disabled). It's unacceptable to swap out anonymous pages (potentially anonymous transparent hugepages) in order to create new transparent hugepages. This is true for the MADV_HUGEPAGE areas too (swapping out a kvm virtual machine and so having it suffer an unbearable slowdown, so another one with guest physical memory marked MADV_HUGEPAGE can run 30% faster if it is running memory intensive workloads, makes no sense). If a transparent hugepage allocation fails the slowdown is minor and there is total fallback, so kswapd should never be asked to swapout memory to allow the high order allocation to succeed. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:41 -08:00
Andrea Arcangeli	47ad8475c0	thp: clear_copy_huge_page Move the copy/clear_huge_page functions to common code to share between hugetlb.c and huge_memory.c. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:41 -08:00

1 2 3 4 5 ...

4800 Commits