linux/mm
Johannes Weiner a756cf5908 mm: try to distribute dirty pages fairly across zones
The maximum number of dirty pages that exist in the system at any time is
determined by a number of pages considered dirtyable and a user-configured
percentage of those, or an absolute number in bytes.

This number of dirtyable pages is the sum of memory provided by all the
zones in the system minus their lowmem reserves and high watermarks, so
that the system can retain a healthy number of free pages without having
to reclaim dirty pages.

But there is a flaw in that we have a zoned page allocator which does not
care about the global state but rather the state of individual memory
zones.  And right now there is nothing that prevents one zone from filling
up with dirty pages while other zones are spared, which frequently leads
to situations where kswapd, in order to restore the watermark of free
pages, does indeed have to write pages from that zone's LRU list.  This
can interfere so badly with IO from the flusher threads that major
filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
already, taking away the VM's only possibility to keep such a zone
balanced, aside from hoping the flushers will soon clean pages from that
zone.

Enter per-zone dirty limits.  They are to a zone's dirtyable memory what
the global limit is to the global amount of dirtyable memory, and try to
make sure that no single zone receives more than its fair share of the
globally allowed dirty pages in the first place.  As the number of pages
considered dirtyable excludes the zones' lowmem reserves and high
watermarks, the maximum number of dirty pages in a zone is such that the
zone can always be balanced without requiring page cleaning.

As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to pass
__GFP_WRITE when they know in advance that the page will be written to and
become dirty soon.  The page allocator will then attempt to allocate from
the first zone of the zonelist - which on NUMA is determined by the task's
NUMA memory policy - that has not exceeded its dirty limit.

At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case.  With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations.  Workloads that
previously could fit their dirty pages completely in the higher zone may
be forced to allocate from lower zones, but the amount of pages that
"spill over" are limited themselves by the lower zones' dirty constraints,
and thus unlikely to become a problem.

For now, the problem of unfair dirty page distribution remains for NUMA
configurations where the zones allowed for allocation are in sum not big
enough to trigger the global dirty limits, wake up the flusher threads and
remedy the situation.  Because of this, an allocation that could not
succeed on any of the considered zones is allowed to ignore the dirty
limits before going into direct reclaim or even failing the allocation,
until a future patch changes the global dirty throttling and flusher
thread activation so that they take individual zone states into account.

			Test results

15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

		seconds			nr_vmscan_write
		        (stddev)	       min|     median|        max
xfs
vanilla:	 549.747( 3.492)	     0.000|      0.000|      0.000
patched:	 550.996( 3.802)	     0.000|      0.000|      0.000

fuse-ntfs
vanilla:	1183.094(53.178)	 54349.000|  59341.000|  65163.000
patched:	 558.049(17.914)	     0.000|      0.000|     43.000

btrfs
vanilla:	 573.679(14.015)	156657.000| 460178.000| 606926.000
patched:	 563.365(11.368)	     0.000|      0.000|   1362.000

ext4
vanilla:	 561.197(15.782)	     0.000|2725438.000|4143837.000
patched:	 568.806(17.496)	     0.000|      0.000|      0.000

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10 16:30:43 -08:00
..
backing-dev.c freezer: implement and use kthread_freezable_should_stop() 2011-11-21 12:32:23 -08:00
bootmem.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
bounce.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
cleancache.c mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
compaction.c convert 'memory' sysdev_class to a regular subsystem 2011-12-21 14:48:43 -08:00
debug-pagealloc.c debug-pagealloc: add support for highmem pages 2011-10-31 17:30:48 -07:00
dmapool.c mm: fix implicit stat.h usage in dmapool.c 2011-10-31 09:20:12 -04:00
fadvise.c fadvise: only initiate writeback for specified range with FADV_DONTNEED 2012-01-10 16:30:43 -08:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap_xip.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
filemap.c should_remove_suid(): inode->i_mode is umode_t 2012-01-03 22:55:14 -05:00
fremap.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
highmem.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
huge_memory.c thp: reduce khugepaged freezing latency 2011-12-09 07:50:28 -08:00
hugetlb.c mm/hugetlb.c: fix virtual address handling in hugetlb fault 2012-01-10 16:30:42 -08:00
hwpoison-inject.c Fix common misspellings 2011-03-31 11:26:23 -03:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm: thp: tail page refcounting fix 2011-11-02 16:06:57 -07:00
Kconfig Merge branch 'master' into x86/memblock 2011-11-28 09:46:22 -08:00
Kconfig.debug mm: more intensive memory corruption debugging 2012-01-10 16:30:42 -08:00
kmemcheck.c
kmemleak-test.c kmemleak: remove memset by using kzalloc 2011-01-27 18:31:51 +00:00
kmemleak.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
ksm.c oom: fix race while temporarily setting current's oom_score_adj 2011-10-31 17:30:45 -07:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c fs: kill i_alloc_sem 2011-07-20 20:47:46 -04:00
Makefile Cross Memory Attach 2011-10-31 17:30:44 -07:00
memblock.c memblock: Reimplement memblock allocation using reverse free area iterator 2011-12-08 10:22:09 -08:00
memcontrol.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-01-09 14:46:52 -08:00
memory_hotplug.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
memory-failure.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
memory.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
mempolicy.c mm/mempolicy.c: refix mbind_range() vma issue 2011-12-29 16:31:57 -08:00
mempool.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
migrate.c mm: migrate: one less atomic operation 2012-01-10 16:30:41 -08:00
mincore.c mm: clarify the radix_tree exceptional cases 2011-08-03 14:25:24 -10:00
mlock.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
mm_init.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
mmap.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
mmu_context.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
mmu_notifier.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
mmzone.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
mprotect.c thp: mprotect: transparent huge page support 2011-01-13 17:32:44 -08:00
mremap.c thp: mremap support and TLB optimization 2011-10-31 17:30:48 -07:00
msync.c
nobootmem.c Merge branch 'master' into x86/memblock 2011-11-28 09:46:22 -08:00
nommu.c xen: map foreign pages for shared rings by updating the PTEs directly 2011-11-16 12:13:08 -05:00
oom_kill.c Merge branch 'master' into pm-sleep 2011-12-21 21:59:45 +01:00
page_alloc.c mm: try to distribute dirty pages fairly across zones 2012-01-10 16:30:43 -08:00
page_cgroup.c mm/page_cgroup.c: quiet sparse noise 2011-11-02 16:07:00 -07:00
page_io.c block: kill off REQ_UNPLUG 2011-03-10 08:52:27 +01:00
page_isolation.c
page-writeback.c mm: try to distribute dirty pages fairly across zones 2012-01-10 16:30:43 -08:00
pagewalk.c pagewalk: fix code comment for THP 2011-07-25 20:57:09 -07:00
percpu-km.c
percpu-vm.c percpu: fix chunk range calculation 2011-11-22 08:09:46 -08:00
percpu.c percpu: fix per_cpu_ptr_to_phys() handling of non-page-aligned addresses 2011-12-15 11:41:40 -08:00
pgtable-generic.c mm/pgtable-generic.c: fix CONFIG_SWAP=n build 2011-01-26 10:49:58 +10:00
prio_tree.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
process_vm_access.c Cross Memory Attach 2011-10-31 17:30:44 -07:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
rmap.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
shmem.c vfs: switch ->show_options() to struct dentry * 2012-01-06 23:19:54 -05:00
slab.c tracing/mm: Move include of trace/events/kmem.h out of header into slab.c 2012-01-09 14:19:33 -08:00
slob.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
slub.c slub: min order when debug_guardpage_minorder > 0 2012-01-10 16:30:43 -08:00
sparse-vmemmap.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
sparse.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
swap_state.c fs: move code out of buffer.c 2012-01-03 22:54:07 -05:00
swap.c mm: add free_hot_cold_page_list() helper 2012-01-10 16:30:41 -08:00
swapfile.c mm: avoid livelock on !__GFP_FS allocations 2012-01-10 16:30:42 -08:00
thrash.c mm/thrash.c: quiet sparse noise 2011-10-31 17:30:50 -07:00
truncate.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
util.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
vmalloc.c Merge branch 'devel-stable' into for-linus 2012-01-05 13:24:33 +00:00
vmscan.c vmscan: add task name to warn_scan_unevictable() messages 2012-01-10 16:30:43 -08:00
vmstat.c mm/vmstat.c: cache align vm_stat 2011-10-31 17:30:51 -07:00