linux/fs
David Gibson 90481622d7 hugepages: fix use after free bug in "quota" handling
hugetlbfs_{get,put}_quota() are badly named.  They don't interact with the
general quota handling code, and they don't much resemble its behaviour.
Rather than being about maintaining limits on on-disk block usage by
particular users, they are instead about maintaining limits on in-memory
page usage (including anonymous MAP_PRIVATE copied-on-write pages)
associated with a particular hugetlbfs filesystem instance.

Worse, they work by having callbacks to the hugetlbfs filesystem code from
the low-level page handling code, in particular from free_huge_page().
This is a layering violation of itself, but more importantly, if the
kernel does a get_user_pages() on hugepages (which can happen from KVM
amongst others), then the free_huge_page() can be delayed until after the
associated inode has already been freed.  If an unmount occurs at the
wrong time, even the hugetlbfs superblock where the "quota" limits are
stored may have been freed.

Andrew Barry proposed a patch to fix this by having hugepages, instead of
storing a pointer to their address_space and reaching the superblock from
there, had the hugepages store pointers directly to the superblock,
bumping the reference count as appropriate to avoid it being freed.
Andrew Morton rejected that version, however, on the grounds that it made
the existing layering violation worse.

This is a reworked version of Andrew's patch, which removes the extra, and
some of the existing, layering violation.  It works by introducing the
concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
finite logical pool of hugepages to allocate from.  hugetlbfs now creates
a subpool for each filesystem instance with a page limit set, and a
pointer to the subpool gets added to each allocated hugepage, instead of
the address_space pointer used now.  The subpool has its own lifetime and
is only freed once all pages in it _and_ all other references to it (i.e.
superblocks) are gone.

subpools are optional - a NULL subpool pointer is taken by the code to
mean that no subpool limits are in effect.

Previous discussion of this bug found in:  "Fix refcounting in hugetlbfs
quota handling.". See:  https://lkml.org/lkml/2011/8/11/28 or
http://marc.info/?l=linux-mm&m=126928970510627&w=1

v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
alloc_huge_page() - since it already takes the vma, it is not necessary.

Signed-off-by: Andrew Barry <abarry@cray.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 17:54:59 -07:00
..
9p Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs 2012-01-10 15:09:01 -08:00
adfs vfs: switch ->show_options() to struct dentry * 2012-01-06 23:19:54 -05:00
affs affs: propagate umode_t 2012-01-03 22:55:04 -05:00
afs Merge branch 'kmap_atomic' of git://github.com/congwang/linux 2012-03-21 09:40:26 -07:00
autofs4 autofs: work around unhappy compat problem on x86-64 2012-02-25 12:10:27 -08:00
befs vfs: fix the stupidity with i_dentry in inode destructors 2012-01-03 22:52:40 -05:00
bfs switch ->create() to umode_t 2012-01-03 22:54:53 -05:00
btrfs Merge branch 'kmap_atomic' of git://github.com/congwang/linux 2012-03-21 09:40:26 -07:00
cachefiles fs: move code out of buffer.c 2012-01-03 22:54:07 -05:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2012-02-02 15:47:33 -08:00
cifs CIFS: Do not kmalloc under the flocks spinlock 2012-03-06 21:50:15 -06:00
coda coda: switch coda_cnode_make() to sane API as well, clean coda_lookup() 2012-01-10 11:13:16 -05:00
configfs configfs: convert to umode_t 2012-01-03 22:54:57 -05:00
cramfs cramfs: Fix typo in inode.c 2012-02-21 11:40:35 +01:00
debugfs Merge 3.3-rc2 into the driver-core-next branch. 2012-02-02 11:24:44 -08:00
devpts tty: rework pty count limiting 2012-01-24 14:01:01 -08:00
dlm dlm: Do not allocate a fd for peeloff 2012-03-08 13:52:09 -08:00
ecryptfs ecryptfs: fix printk format warning for size_t 2012-02-28 16:55:30 -08:00
efs vfs: fix the stupidity with i_dentry in inode destructors 2012-01-03 22:52:40 -05:00
exofs exofs: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:22 +08:00
exportfs
ext2 ext2: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:22 +08:00
ext3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2012-01-09 12:51:21 -08:00
ext4 Merge branch 'for_linus' into for_linus_merged 2012-01-10 11:54:07 -05:00
fat Merge branch 'usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb 2012-01-09 12:09:47 -08:00
freevxfs fs: propagate umode_t, misc bits 2012-01-03 22:55:10 -05:00
fscache FS-Cache: Fix __fscache_uncache_all_inode_pages()'s outer loop 2011-07-21 10:59:16 -07:00
fuse fuse: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:22 +08:00
gfs2 gfs2: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:23 +08:00
hfs vfs: switch ->show_options() to struct dentry * 2012-01-06 23:19:54 -05:00
hfsplus hfsplus: creation of hidden dir on mount can fail 2012-01-10 17:48:52 -05:00
hostfs vfs: switch ->show_options() to struct dentry * 2012-01-06 23:19:54 -05:00
hpfs switch ->mknod() to umode_t 2012-01-03 22:54:54 -05:00
hppfs vfs: for usbfs, etc. internal vfsmounts ->mnt_sb->s_root == ->mnt_root 2012-01-03 22:52:41 -05:00
hugetlbfs hugepages: fix use after free bug in "quota" handling 2012-03-21 17:54:59 -07:00
isofs isofs: inode leak on mount failure 2012-01-09 10:48:11 -05:00
jbd Power management updates for 3.4 2012-03-21 10:15:51 -07:00
jbd2 Power management updates for 3.4 2012-03-21 10:15:51 -07:00
jffs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-03-20 21:12:50 -07:00
jfs Merge branch 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm 2012-01-08 13:10:57 -08:00
lockd module_param: make bool parameters really bool (drivers & misc) 2012-01-13 09:32:20 +10:30
logfs logfs: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:24 +08:00
minix minix: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:24 +08:00
ncpfs vfs: switch ->show_options() to struct dentry * 2012-01-06 23:19:54 -05:00
nfs nfs: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:24 +08:00
nfs_common
nfsd Merge branch 'for-3.3' of git://linux-nfs.org/~bfields/linux 2012-01-14 12:26:41 -08:00
nilfs2 nilfs2: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:24 +08:00
nls NLS: raname "maxlen" to "maxout" in UTF conversion routines 2011-11-26 19:58:47 -08:00
notify fsnotify: don't BUG in fsnotify_destroy_mark() 2012-01-14 18:01:42 -08:00
ntfs Merge branch 'kmap_atomic' of git://github.com/congwang/linux 2012-03-21 09:40:26 -07:00
ocfs2 ocfs2: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:25 +08:00
omfs omfs: propagate umode_t 2012-01-03 22:55:01 -05:00
openpromfs vfs: fix the stupidity with i_dentry in inode destructors 2012-01-03 22:52:40 -05:00
proc procfs: mark thread stack correctly in proc/<pid>/maps 2012-03-21 17:54:58 -07:00
pstore pstore: gracefully handle NULL pstore_info functions 2011-11-18 13:49:00 -08:00
qnx4 qnx4: don't leak ->BitMap on late failure exits 2012-01-19 13:54:36 -05:00
quota quota: Fix deadlock with suspend and quotas 2012-02-13 20:45:39 -05:00
ramfs pohmelfs: propagate umode_t 2012-01-03 22:55:07 -05:00
reiserfs Merge branch 'kmap_atomic' of git://github.com/congwang/linux 2012-03-21 09:40:26 -07:00
romfs MTD pull for 3.3 2012-01-10 13:45:22 -08:00
squashfs squashfs: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:25 +08:00
sysfs Revert "sysfs: Kill nlink counting." 2012-03-08 13:03:10 -08:00
sysv vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb 2012-01-06 23:16:53 -05:00
ubifs ubifs: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:26 +08:00
udf udf: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:26 +08:00
ufs vfs: switch ->show_options() to struct dentry * 2012-01-06 23:19:54 -05:00
xfs xfs: make inode quota check more general 2012-02-21 10:12:43 -06:00
aio.c fs: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:21 +08:00
anon_inodes.c vfs: dont chain pipe/anon/socket on superblock s_inodes list 2011-07-26 12:57:09 -04:00
attr.c switch is_sxid() to umode_t 2012-01-03 22:55:11 -05:00
bad_inode.c switch ->mknod() to umode_t 2012-01-03 22:54:54 -05:00
binfmt_aout.c aout: move setup_arg_pages() prior to reading/mapping the binary 2012-03-05 13:51:32 -08:00
binfmt_elf_fdpic.c consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handling 2011-07-20 01:43:10 -04:00
binfmt_elf.c regset: Prevent null pointer reference on readonly regsets 2012-03-02 11:38:15 -08:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb 2012-01-06 23:16:53 -05:00
binfmt_script.c
binfmt_som.c
bio-integrity.c fs: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:21 +08:00
bio.c bio: don't overflow in bio_get_nr_vecs() 2012-02-08 22:07:18 +01:00
block_dev.c block: Fix NULL pointer dereference in sd_revalidate_disk 2012-03-02 10:38:33 +01:00
buffer.c fs: move code out of buffer.c 2012-01-03 22:54:07 -05:00
char_dev.c char_dev.c: fix up some whitespace errors 2011-12-13 11:18:17 -08:00
compat_binfmt_elf.c
compat_ioctl.c ppp: Replace uses of <linux/if_ppp.h> with <linux/ppp-ioctl.h> 2012-03-04 20:41:38 -05:00
compat.c vfs: fix compat_sys_stat() handling of overflows in st_nlink 2012-02-13 20:45:39 -05:00
dcache.c Merge branch 'dcache-word-accesses' 2012-03-19 16:37:28 -07:00
dcookies.c
direct-io.c Restore direct_io / truncate locking API 2012-02-23 15:56:21 -08:00
drop_caches.c
eventfd.c
eventpoll.c Don't limit non-nested epoll paths 2012-03-18 12:25:04 -07:00
exec.c Merge branch 'kmap_atomic' of git://github.com/congwang/linux 2012-03-21 09:40:26 -07:00
fcntl.c
fhandle.c vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb 2012-01-06 23:16:53 -05:00
fifo.c
file_table.c vfs: prevent remount read-only if pending removes 2012-01-06 23:20:13 -05:00
file.c
filesystems.c vfs: convert fs_supers to hlist 2012-01-03 22:52:39 -05:00
fs_struct.c
fs-writeback.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-03-20 21:12:50 -07:00
generic_acl.c switch posix_acl_equiv_mode() to umode_t * 2011-08-01 02:10:06 -04:00
inode.c restore smp_mb() in unlock_new_inode() 2012-03-10 17:07:28 -05:00
internal.h vfs: protect remounting superblock read-only 2012-01-06 23:20:12 -05:00
ioctl.c vfs: fix up ENOIOCTLCMD error handling 2012-01-05 15:40:12 -08:00
ioprio.c block: strip out locking optimization in put_io_context() 2012-02-07 07:51:30 +01:00
Kconfig vfs: use 'unsigned long' accesses for dcache name comparison and hashing 2012-03-08 18:08:44 -08:00
Kconfig.binfmt fs: binfmt_elf: create Kconfig variable for PIE randomization 2012-01-10 16:30:51 -08:00
libfs.c fs: move code out of buffer.c 2012-01-03 22:54:07 -05:00
locks.c vfs: fix handling of lock allocation failure in lease-break case 2011-12-26 10:25:26 -08:00
Makefile Merge branches 'vfsmount-guts', 'umode_t' and 'partitions' into Z 2012-01-06 23:15:54 -05:00
mbcache.c
mount.h vfs: keep list of mounts for each superblock 2012-01-06 23:20:12 -05:00
mpage.c fs: remove unneeded plug in mpage_readpages() 2012-01-12 09:19:54 +01:00
namei.c fs/namei.c: fix warnings on 32-bit 2012-03-21 17:54:54 -07:00
namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2012-01-08 13:21:22 -08:00
no-block.c
open.c switch security_path_chmod() to struct path * 2012-01-06 23:16:53 -05:00
pipe.c fs: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:21 +08:00
pnode.c vfs: switch pnode.h macros to struct mount * 2012-01-03 22:57:11 -05:00
pnode.h vfs: switch pnode.h macros to struct mount * 2012-01-03 22:57:11 -05:00
posix_acl.c vfs: pass all mask flags check_acl and posix_acl_permission 2011-10-28 14:58:54 +02:00
proc_namespace.c vfs: switch ->show_options() to struct dentry * 2012-01-06 23:19:54 -05:00
read_write.c Cross Memory Attach 2011-10-31 17:30:44 -07:00
read_write.h
readdir.c
select.c sys_poll: fix incorrect type for 'timeout' parameter 2012-02-21 17:24:20 -08:00
seq_file.c seq_file: fix mishandling of consecutive pread() invocations. 2012-03-21 17:54:54 -07:00
signalfd.c epoll: ep_unregister_pollwait() can use the freed pwq->whead 2012-02-24 11:42:50 -08:00
splice.c fs: remove the second argument of k[un]map_atomic() 2012-03-20 21:48:21 +08:00
stack.c filesystems: add set_nlink() 2011-11-02 12:53:43 +01:00
stat.c readlinkat: ensure we return ENOENT for the empty pathname for normal lookups 2011-11-02 12:53:42 +01:00
statfs.c vfs: new helper - vfs_ustat() 2012-01-03 22:53:07 -05:00
super.c vfs: Provide function to get superblock and wait for it to thaw 2012-02-13 20:45:38 -05:00
sync.c fs: move code out of buffer.c 2012-01-03 22:54:07 -05:00
timerfd.c
utimes.c
xattr_acl.c
xattr.c vfs: mnt_drop_write_file() 2012-01-03 22:52:40 -05:00