linux/fs
David Rientjes a63d83f427 oom: badness heuristic rewrite
This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions.  The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead.  This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits.  This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.

The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory.  "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit.  The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it.  It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatability.  Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered.  The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity.  This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09 20:45:02 -07:00
..
9p 9p: fix sparse warnings in new xattr code 2010-08-02 14:28:38 -05:00
adfs kill spurious reference to vmtruncate 2010-05-27 22:15:42 -04:00
affs drop unused dentry argument to ->fsync 2010-05-27 22:05:02 -04:00
afs AFS: Fix the module init error handling 2010-08-07 14:23:37 -07:00
autofs fs/: do not fallback to default_llseek() when readdir() uses BKL 2010-05-27 09:12:56 -07:00
autofs4 fs/autofs4: use memdup_user 2010-05-27 09:12:41 -07:00
befs fix typos concerning "initiali[zs]e" 2010-06-16 18:05:05 +02:00
bfs rename the generic fsync implementations 2010-05-27 22:06:06 -04:00
btrfs Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable 2010-07-19 19:33:02 -07:00
cachefiles fscache: convert operation to use workqueue instead of slow-work 2010-07-22 22:58:47 +02:00
ceph ceph: use complete_all and wake_up_all 2010-07-27 13:11:17 -07:00
cifs Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 2010-08-07 12:54:46 -07:00
coda drop unused dentry argument to ->fsync 2010-05-27 22:05:02 -04:00
configfs fix setattr error handling in sysfs, configfs 2010-06-04 17:16:29 -04:00
cramfs
debugfs Add x64 support to debugfs 2010-05-19 22:41:57 -04:00
devpts Simplify devpts_get_sb() failure exits 2010-05-21 18:31:12 -04:00
dlm fs/dlm: Drop unnecessary null test 2010-08-05 14:23:45 -05:00
ecryptfs Merge branch 'master' into for-next 2010-08-04 15:14:38 +02:00
efs
exofs drop unused dentry argument to ->fsync 2010-05-27 22:05:02 -04:00
exportfs
ext2 ext2: update ctime when changing the file's permission by setfacl 2010-06-25 01:20:37 +02:00
ext3 ext3: Fix dirtying of journalled buffers in data=journal mode 2010-08-05 21:28:28 +02:00
ext4 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 2010-08-07 13:03:53 -07:00
fat fat: convert to use the new truncate convention. 2010-05-27 22:16:02 -04:00
freevxfs Merge branch 'master' into for-next 2010-06-16 18:08:13 +02:00
fscache fscache: fix build on !CONFIG_SYSCTL 2010-07-24 11:10:09 +02:00
fuse Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse 2010-08-07 13:18:36 -07:00
gfs2 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-08-07 12:57:07 -07:00
hfs include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
hfsplus hfsplus: Push down BKL into ioctl function 2010-05-17 05:27:03 +02:00
hostfs drop unused dentry argument to ->fsync 2010-05-27 22:05:02 -04:00
hpfs drop unused dentry argument to ->fsync 2010-05-27 22:05:02 -04:00
hppfs drop unused dentry argument to ->fsync 2010-05-27 22:05:02 -04:00
hugetlbfs rename the generic fsync implementations 2010-05-27 22:06:06 -04:00
isofs fs/: do not fallback to default_llseek() when readdir() uses BKL 2010-05-27 09:12:56 -07:00
jbd ext3: Fix set but unused variables 2010-07-21 16:01:47 +02:00
jbd2 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 2010-08-07 13:03:53 -07:00
jffs2 Fix up trivial spelling errors ('taht' -> 'that') 2010-07-21 09:25:42 -07:00
jfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-05-30 09:11:11 -07:00
lockd include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
logfs drop unused dentry argument to ->fsync 2010-05-27 22:05:02 -04:00
minix Minix: Clean up left over label 2010-06-04 17:16:30 -04:00
ncpfs Fix comment spelling errors in ncpfs/inode.c 2010-06-28 11:56:32 +02:00
nfs Merge branch 'nfs-for-2.6.36' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 2010-08-07 13:19:36 -07:00
nfs_common include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
nfsd Merge branch 'for-2.6.36' of git://linux-nfs.org/~bfields/linux 2010-08-07 14:24:41 -07:00
nilfs2 nilfs2: reject filesystem with unsupported block size 2010-07-25 23:29:21 +09:00
nls
notify Saner locking around deactivate_super() 2010-05-21 18:31:14 -04:00
ntfs drop unused dentry argument to ->fsync 2010-05-27 22:05:02 -04:00
ocfs2 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 2010-08-07 13:03:53 -07:00
omfs rename the generic fsync implementations 2010-05-27 22:06:06 -04:00
openpromfs
partitions Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 09:23:07 -07:00
proc oom: badness heuristic rewrite 2010-08-09 20:45:02 -07:00
qnx4 rename the generic fsync implementations 2010-05-27 22:06:06 -04:00
quota Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-08-07 12:57:07 -07:00
ramfs fs: convert simple fs to new truncate 2010-05-27 22:15:47 -04:00
reiserfs Merge branch 'master' into for-next 2010-06-16 18:08:13 +02:00
romfs
smbfs kill spurious reference to vmtruncate 2010-05-27 22:15:42 -04:00
squashfs squashfs: fix name reading in squashfs_xattr_get 2010-05-23 08:27:42 +01:00
sysfs sysfs: sysfs_chmod_file's attr can be const 2010-08-05 13:53:34 -07:00
sysv sysvfs: fix NULL deref. when allocating new inode 2010-06-29 15:29:32 -07:00
ubifs Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6 2010-08-03 14:37:02 -07:00
udf udf: super.c Fix warning: variable 'sbi' set but not used 2010-08-02 14:57:40 +02:00
ufs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-05-30 09:11:11 -07:00
xfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-08-07 12:57:07 -07:00
aio.c aio: fix wrong subsystem comments 2010-08-05 13:21:23 -07:00
anon_inodes.c Revert "anon_inode: set S_IFREG on the anon_inode" 2010-05-27 22:03:05 -04:00
attr.c fs: introduce new truncate sequence 2010-05-27 22:15:33 -04:00
bad_inode.c drop unused dentry argument to ->fsync 2010-05-27 22:05:02 -04:00
binfmt_aout.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
binfmt_elf_fdpic.c binfmt_elf_fdpic: Fix clear_user() error handling 2010-06-01 08:11:06 -07:00
binfmt_elf.c coredump: pass mm->flags as a coredump parameter for consistency 2010-03-06 11:26:46 -08:00
binfmt_em86.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
binfmt_flat.c flat: tweak default stack alignment 2010-06-29 15:29:31 -07:00
binfmt_misc.c
binfmt_script.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
binfmt_som.c
bio-integrity.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
bio.c Merge branch 'master' into for-linus 2010-03-19 08:05:10 +01:00
block_dev.c block_dev: always serialize exclusive open attempts 2010-08-04 11:17:10 -07:00
buffer.c fs: introduce new truncate sequence 2010-05-27 22:15:33 -04:00
char_dev.c Fix init ordering of /dev/console vs callers of modprobe 2010-08-06 09:17:02 -07:00
compat_binfmt_elf.c elf coredump: replace ELF_CORE_EXTRA_* macros by functions 2010-03-06 11:26:45 -08:00
compat_ioctl.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2010-08-04 15:31:02 -07:00
compat.c update email address 2010-07-19 10:56:54 +02:00
dcache.c mm: add context argument to shrinker callback 2010-07-19 14:56:17 +10:00
dcookies.c
direct-io.c direct-io: move aio_complete into ->end_io 2010-07-27 11:56:06 -04:00
drop_caches.c new helper: iterate_supers() 2010-05-21 18:31:16 -04:00
eventfd.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
eventpoll.c sched, wait: Use wrapper functions 2010-05-11 17:43:58 +02:00
exec.c Merge branch 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing 2010-08-07 17:06:54 -07:00
fcntl.c fs/fcntl.c:kill_fasync_rcu() fa_lock must be IRQ-safe 2010-06-29 15:29:32 -07:00
fifo.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
file_table.c get rid of the magic around f_count in aio 2010-05-27 22:03:07 -04:00
file.c fs: remove all rcu head initializations, except on_stack initializations 2010-06-14 16:37:26 -07:00
filesystems.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
fs_struct.c
fs-writeback.c writeback: simplify the write back thread queue 2010-07-06 08:59:53 +02:00
generic_acl.c fs: xattr_handler table should be const 2010-05-21 18:31:18 -04:00
inode.c mm: add context argument to shrinker callback 2010-07-19 14:56:17 +10:00
internal.h Bury __put_super_and_need_restart() 2010-05-21 18:31:16 -04:00
ioctl.c Introduce freeze_super and thaw_super for the fsfreeze ioctl 2010-05-21 18:31:18 -04:00
ioprio.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
Kconfig fs/Kconfig: Fix typo Userpace -> Userspace 2010-07-20 17:30:22 +02:00
Kconfig.binfmt
libfs.c wrong type for 'magic' argument in simple_fill_super() 2010-06-04 17:16:28 -04:00
locks.c Merge branch 'for-next' into for-linus 2010-03-08 16:55:37 +01:00
Makefile Take statfs variants to fs/statfs.c 2010-05-21 18:31:17 -04:00
mbcache.c mm: add context argument to shrinker callback 2010-07-19 14:56:17 +10:00
mpage.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
namei.c security: make LSMs explicitly mask off permissions 2010-08-02 15:35:07 +10:00
namespace.c Merge branch 'next' into for-linus 2010-05-18 08:57:00 +10:00
nfsctl.c Switch may_open() and break_lease() to passing O_... 2010-03-03 13:00:21 -05:00
no-block.c
open.c vfs: re-introduce MAY_CHDIR 2010-08-02 15:35:06 +10:00
pipe.c pipe: fix check in "set size" fcntl 2010-06-10 19:08:34 +02:00
pnode.c Kill CL_PROPAGATION, sanitize fs/pnode.c:get_source() 2010-03-03 13:00:22 -05:00
pnode.h VFS: Clean up shared mount flag propagation 2010-03-03 14:07:55 -05:00
posix_acl.c
read_write.c vfs: introduce noop_llseek() 2010-05-27 09:12:56 -07:00
read_write.h
readdir.c
select.c Add generic sys_old_select() 2010-03-12 15:52:32 -08:00
seq_file.c seq_file: fix new kernel-doc warnings 2010-03-07 15:48:26 -08:00
signalfd.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
splice.c splice: check f_mode for seekable file 2010-06-30 08:12:37 +02:00
stack.c
stat.c
statfs.c Take statfs variants to fs/statfs.c 2010-05-21 18:31:17 -04:00
super.c fs: fix superblock iteration race 2010-06-29 10:38:22 -07:00
sync.c Merge branch 'master' into for-linus 2010-06-01 12:42:12 +02:00
timerfd.c fs/timerfd.c: make use of wait_event_interruptible_locked_irq() 2010-05-20 13:21:42 -07:00
utimes.c
xattr_acl.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
xattr.c fs: xattr_handler table should be const 2010-05-21 18:31:18 -04:00