Commit Graph

166518 Commits

Author SHA1 Message Date
Ben Blum
102a775e36 cgroups: add a read-only "procs" file similar to "tasks" that shows only unique tgids
struct cgroup used to have a bunch of fields for keeping track of the
pidlist for the tasks file.  Those are now separated into a new struct
cgroup_pidlist, of which two are had, one for procs and one for tasks.
The way the seq_file operations are set up is changed so that just the
pidlist struct gets passed around as the private data.

Interface example: Suppose a multithreaded process has pid 1000 and other
threads with ids 1001, 1002, 1003:
$ cat tasks
1000
1001
1002
1003
$ cat cgroup.procs
1000
$

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:58 -07:00
Paul Menage
8f3ff20862 cgroups: revert "cgroups: fix pid namespace bug"
The following series adds a "cgroup.procs" file to each cgroup that
reports unique tgids rather than pids, and allows all threads in a
threadgroup to be atomically moved to a new cgroup.

The subsystem "attach" interface is modified to support attaching whole
threadgroups at a time, which could introduce potential problems if any
subsystem were to need to access the old cgroup of every thread being
moved.  The attach interface may need to be revised if this becomes the
case.

Also added is functionality for read/write locking all CLONE_THREAD
fork()ing within a threadgroup, by means of an rwsem that lives in the
sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
with the sighand's atomic count.  This scheme should introduce no extra
overhead in the fork path when there's no contention.

The final patch reveals potential for a race when forking before a
subsystem's attach function is called - one potential solution in case any
subsystem has this problem is to hang on to the group's fork mutex through
the attach() calls, though no subsystem yet demonstrates need for an
extended critical section.

This patch:

Revert

commit 096b7fe012
Author:     Li Zefan <lizf@cn.fujitsu.com>
AuthorDate: Wed Jul 29 15:04:04 2009 -0700
Commit:     Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Wed Jul 29 19:10:35 2009 -0700

    cgroups: fix pid namespace bug

This is in preparation for some clashing cgroups changes that subsume the
original commit's functionaliy.

The original commit fixed a pid namespace bug which Ben Blum fixed
independently (in the same way, but with different code) as part of a
series of patches.  I played around with trying to reconcile Ben's patch
series with Li's patch, but concluded that it was simpler to just revert
Li's, given that Ben's patch series contained essentially the same fix.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:58 -07:00
Paul Menage
2c6ab6d200 cgroups: allow cgroup hierarchies to be created with no bound subsystems
This patch removes the restriction that a cgroup hierarchy must have at
least one bound subsystem.  The mount option "none" is treated as an
explicit request for no bound subsystems.

A hierarchy with no subsystems can be useful for plain task tracking, and
is also a step towards the support for multiply-bindable subsystems.

As part of this change, the hierarchy id is no longer calculated from the
bitmask of subsystems in the hierarchy (since this is not guaranteed to be
unique) but is allocated via an ida.  Reference counts on cgroups from
css_set objects are now taken explicitly one per hierarchy, rather than
one per subsystem.

Example usage:

mount -t cgroup -o none,name=foo cgroup /mnt/cgroup

Based on the "no-op"/"none" subsystem concept proposed by
kamezawa.hiroyu@jp.fujitsu.com

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:58 -07:00
Paul Menage
7717f7ba92 cgroups: add a back-pointer from struct cg_cgroup_link to struct cgroup
Currently the cgroups code makes the assumption that the subsystem
pointers in a struct css_set uniquely identify the hierarchy->cgroup
mappings associated with the css_set; and there's no way to directly
identify the associated set of cgroups other than by indirecting through
the appropriate subsystem state pointers.

This patch removes the need for that assumption by adding a back-pointer
from struct cg_cgroup_link object to its associated cgroup; this allows
the set of cgroups to be determined by traversing the cg_links list in
the struct css_set.

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:58 -07:00
Paul Menage
fe6934354f cgroups: move the cgroup debug subsys into cgroup.c to access internal state
While it's architecturally clean to have the cgroup debug subsystem be
completely independent of the cgroups framework, it limits its usefulness
for debugging the contents of internal data structures.  Move the debug
subsystem code into the scope of all the cgroups data structures to make
more detailed debugging possible.

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:57 -07:00
Paul Menage
c6d57f3312 cgroups: support named cgroups hierarchies
To simplify referring to cgroup hierarchies in mount statements, and to
allow disambiguation in the presence of empty hierarchies and
multiply-bindable subsystems this patch adds support for naming a new
cgroup hierarchy via the "name=" mount option

A pre-existing hierarchy may be specified by either name or by subsystems;
a hierarchy's name cannot be changed by a remount operation.

Example usage:

# To create a hierarchy called "foo" containing the "cpu" subsystem
mount -t cgroup -oname=foo,cpu cgroup /mnt/cgroup1

# To mount the "foo" hierarchy on a second location
mount -t cgroup -oname=foo cgroup /mnt/cgroup2

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:57 -07:00
Xiaotian Feng
34f77a90f7 cgroups: make unlock sequence in cgroup_get_sb consistent
Make the last unlock sequence consistent with previous unlock sequeue.

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:57 -07:00
Randy Dunlap
55dff4954e docs: fix various Documentation/ paths in header files
Fix various Documentation/ paths in include/linux/.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:57 -07:00
Wu Fengguang
0b4b2ad530 page-types: add feature for walking process address space
Introduce "-p|--pid <pid>" for walking the process address space.  The
default action is to walk raw memory PFNs.

Both the virtual address and physical address of each present pages will
be listed:

	# ./tools/vm/page-types -lp $$ | head -3
	voffset offset  len     flags
	400     11bebe  1       __RU_lA____M______________________
	402     11bebc  1       __RU_lA____M______________________

Note that voffset/offset/len are now showed as hex numbers.

[akpm@linux-foundation.org: coding-style fixes]
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:57 -07:00
Josh Triplett
ba36c440ba Documentation/vm/.gitignore: add page-types
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:57 -07:00
Jaswinder Singh Rajput
2552a99b6e includecheck fix: Documentation, cfag12864b-example.c
fix the following 'make includecheck' warning:

  Documentation/auxdisplay/cfag12864b-example.c: string.h is included more than once.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:57 -07:00
Xiaotian Feng
bcadbbd4c8 Documentation: update stale definition of file-nr in fs.txt
In "documentation: update Documentation/filesystem/proc.txt and
Documentation/sysctls" (commit 760df93ec) we merged /proc/sys/fs
documentation in Documentation/sysctl/fs.txt and
Documentation/filesystem/proc.txt, but stale file-nr definition
remained.

This patch adds back the right fs-nr definition for 2.6 kernel.

Signed-off-by: Xiaotian Feng<dfeng@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:57 -07:00
Peng Tao
16c01b20ae doc/filesystems: more mount cleanups
Documentation/filesystems/sharedsubtree.txt needs updating because the
mount command in util-linux package is well aware of shared subtree
features now.  The patch also fixes two typos in sharedsubtree.txt.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:57 -07:00
Randy Dunlap
0288b95b43 doc/filesystems: remove smount program
mount(8) handles shared subtrees just fine, so remove the smount program
from Documentation/filesystems/sharedsubtree.txt.

Fix annoying "Lets" -> "Let's".
Insert space between '#' prompt and "mount" command.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:57 -07:00
Zhaolei
57f1f0874f time: add function to convert between calendar time and broken-down time for universal use
There are many similar code in kernel for one object: convert time between
calendar time and broken-down time.

Here is some source I found:
  fs/ncpfs/dir.c
  fs/smbfs/proc.c
  fs/fat/misc.c
  fs/udf/udftime.c
  fs/cifs/netmisc.c
  net/netfilter/xt_time.c
  drivers/scsi/ips.c
  drivers/input/misc/hp_sdc_rtc.c
  drivers/rtc/rtc-lib.c
  arch/ia64/hp/sim/boot/fw-emu.c
  arch/m68k/mac/misc.c
  arch/powerpc/kernel/time.c
  arch/parisc/include/asm/rtc.h
  ...

We can make a common function for this type of conversion, At least we
can get following benefit:

1: Make kernel simple and unify
2: Easy to fix bug in converting code
3: Reduce clone of code in future
   For example, I'm trying to make ftrace display walltime,
   this patch will make me easy.

This code is based on code from glibc-2.6

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:56 -07:00
From: Mel Gorman
ef1ff6b8c0 hugetlbfs: do not call user_shm_lock() for MAP_HUGETLB fix
Commit 6bfde05bf5 ("hugetlbfs: allow the creation of files suitable for
MAP_PRIVATE on the vfs internal mount") altered can_do_hugetlb_shm() to
check if a file is being created for shared memory or mmap().  If this
returns false, we then unconditionally call user_shm_lock() triggering a
warning.  This block should never be entered for MAP_HUGETLB.  This
patch partially reverts the problem and fixes the check.

Signed-off-by: Eric B Munson <ebmunson@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:56 -07:00
Izik Eidus
2c6854fdad ksm: change default values to better fit into mainline kernel
Now that ksm is in mainline it is better to change the default values to
better fit to most of the users.

This patch change the ksm default values to be:

	ksm_thread_pages_to_scan = 100 (instead of 200)
	ksm_thread_sleep_millisecs = 20 (like before)
	ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
	                        disabled by default)
	ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)

The important aspect of this patch is: it disables ksm by default, and sets
the number of the kernel_pages that can be allocated to be a reasonable
number.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:56 -07:00
Ingo Molnar
d2b5ec3aa0 input: fix build failures caused by Kconfig Winbond WPCD376I Consumer IR hardware driver Kconfig entry
Fix these warnings:

  drivers/built-in.o: In function `apanel_remove':
  apanel.c:(.text+0x56e852): undefined reference to `led_classdev_unregister'
  drivers/built-in.o: In function `apanel_probe':
  apanel.c:(.text+0x56eae3): undefined reference to `led_classdev_register'
  drivers/built-in.o: In function `acpi_fujitsu_hotkey_add':
  fujitsu-laptop.c:(.text+0x5d7647): undefined reference to `led_classdev_register'
  fujitsu-laptop.c:(.text+0x5d76b5): undefined reference to `led_classdev_register'
  drivers/built-in.o: In function `wbcir_probe':
  winbond-cir.c:(.devinit.text+0x5f375): undefined reference to `led_classdev_register'
  winbond-cir.c:(.devinit.text+0x5f663): undefined reference to `led_classdev_unregister'
  drivers/built-in.o: In function `wbcir_remove':
  winbond-cir.c:(.devexit.text+0x7f23): undefined reference to `led_classdev_unregister'
  drivers/built-in.o: In function `fujitsu_cleanup':
  fujitsu-laptop.c:(.exit.text+0xbe37): undefined reference to `led_classdev_unregister'
  fujitsu-laptop.c:(.exit.text+0xbe53): undefined reference to `led_classdev_unregister'

It happens because the new INPUT_WINBOND_CIR driver relies on new-leds
infrastructure - but does not select it in drivers/input/misc/Kconfig.
But it selects LEDS_CLASS, which confuses a number of other drivers into
thinking that all the leds infrastructure is in place.

Fix this by selecting NEW_LEDS as well, like similar drivers do.

Eventually, this whole leds infrastructure complexity should be
cleaned up, it's been going on for years.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: David Härdeman <david@hardeman.nu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-24 07:20:56 -07:00
Chris Mason
54bcf382da Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus
Conflicts:
	fs/btrfs/super.c
2009-09-24 10:00:58 -04:00
Yan Zheng
c65ddb52dc Btrfs: hash the btree inode during fill_super
The snapshot deletion  patches dropped this line, but the inode
needs to be hashed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-24 09:24:43 -04:00
Yan, Zheng
0257bb82d2 Btrfs: relocate file extents in clusters
The extent relocation code copy file extents one by one when
relocating data block group. This is inefficient if file
extents are small. This patch makes the relocation code copy
file extents in clusters. So we can can make better use of
read-ahead.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-24 09:17:31 -04:00
Yan, Zheng
f679a84034 Btrfs: don't rename file into dummy directory
A recent change enforces only one access point to each subvolume. The first
directory entry (the one added when the subvolume/snapshot was created) is
treated as valid access point, all other subvolume links are linked to dummy
empty directories. The dummy directories are temporary inodes that only in
memory, so we can not rename file into them.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-24 09:17:31 -04:00
Yan, Zheng
a571952143 Btrfs: check size of inode backref before adding hardlink
For every hardlink in btrfs, there is a corresponding inode back
reference. All inode back references for hardlinks in a given
directory are stored in single b-tree item. The size of b-tree item
is limited by the size of b-tree leaf, so we can only create limited
number of hardlinks to a given file in a directory.

The original code lacks of the check, it oops if the number of
hardlinks goes over the limit. This patch fixes the issue by adding
check to btrfs_link and btrfs_rename.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-24 09:17:31 -04:00
Eric Dumazet
a255a9981a perf tools: Fix buffer allocation
"perf top" cores dump on my dev machine, if run from a directory
where vmlinux is present:

  *** glibc detected *** malloc(): memory corruption: 0x085670d0 ***

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@kernel.org>
LKML-Reference: <4ABB6EB7.7000002@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-24 15:11:23 +02:00
npiggin@suse.de
c08d3b0e33 truncate: use new helpers
Update some fs code to make use of new helper functions introduced
in the previous patch. Should be no significant change in behaviour
(except CIFS now calls send_sig under i_lock, via inode_newsize_ok).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: linux-nfs@vger.kernel.org
Cc: Trond.Myklebust@netapp.com
Cc: linux-cifs-client@lists.samba.org
Cc: sfrench@samba.org
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 08:41:47 -04:00
npiggin@suse.de
25d9e2d152 truncate: new helpers
Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
into mm/truncate.c.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 08:41:47 -04:00
Vegard Nossum
eca6f534e6 fs: fix overflow in sys_mount() for in-kernel calls
sys_mount() reads/copies a whole page for its "type" parameter.  When
do_mount_root() passes a kernel address that points to an object which is
smaller than a whole page, copy_mount_options() will happily go past this
memory object, possibly dereferencing "wild" pointers that could be in any
state (hence the kmemcheck warning, which shows that parts of the next
page are not even allocated).

(The likelihood of something going wrong here is pretty low -- first of
all this only applies to kernel calls to sys_mount(), which are mostly
found in the boot code.  Secondly, I guess if the page was not mapped,
exact_copy_from_user() _would_ in fact handle it correctly because of its
access_ok(), etc.  checks.)

But it is much nicer to avoid the dubious reads altogether, by stopping as
soon as we find a NUL byte.  Is there a good reason why we can't do
something like this, using the already existing strndup_from_user()?

[akpm@linux-foundation.org: make copy_mount_string() static]
[AV: fix compat mount breakage, which involves undoing akpm's change above]

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: al <al@dizzy.pdmi.ras.ru>
2009-09-24 08:40:15 -04:00
Rusty Russell
b0c6fbe458 x86: Remove redundant non-NUMA topology functions
arch/x86/include/asm/topology.h declares inline fns cpu_to_node and
cpumask_of_node for !NUMA, even though they are then declared as
macros by asm-generic/topology.h, which is #included just below.

The macros (which are the same) end up being used; these functions
are just confusing.

Noticed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: "Greg Kroah-Hartman" <gregkh@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
LKML-Reference: <200909241748.45629.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-24 14:16:15 +02:00
Kirill Smelkov
67030036eb perf tools: .gitignore += perf*.html
I've tried building the docs in tools/perf/Documentation/ , and after
that `git status` showed dozen of untracked htmls. Let's ignore them.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
LKML-Reference: <1253790022-10300-1-git-send-email-kirr@mns.spb.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-24 14:01:22 +02:00
Thomas Gleixner
6d729e44a5 fs: Make unload_nls() NULL pointer safe
Most call sites of unload_nls() do:
	if (nls)
		unload_nls(nls);

Check the pointer inside unload_nls() like we do in kfree() and
simplify the call sites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steve French <sfrench@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Petr Vandrovec <vandrove@vc.cvut.cz>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:42 -04:00
Christoph Hellwig
4504230a71 freeze_bdev: grab active reference to frozen superblocks
Currently we held s_umount while a filesystem is frozen, despite that we
might return to userspace and unlock it from a different process.  Instead
grab an active reference to keep the file system busy and add an explicit
check for frozen filesystems in remount and reject the remount instead
of blocking on s_umount.

Add a new get_active_super helper to super.c for use by freeze_bdev that
grabs an active reference to a superblock from a given block device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:41 -04:00
Christoph Hellwig
4fadd7bb20 freeze_bdev: kill bd_mount_sem
Now that we have the freeze count there is not much reason for bd_mount_sem
anymore.  The actual freeze/thaw operations are serialized using the
bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
get_sb_bdev which tries to prevent mounting a filesystem while the block
device is frozen.  Instead of add a check for bd_fsfreeze_count and
return -EBUSY if a filesystem is frozen.  While that is a change in user
visible behaviour a failing mount is much better for this case rather
than having the mount process stuck uninterruptible for a long time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:39 -04:00
Boaz Harrosh
1ba50bbe93 exofs: remove BKL from super operations
the two places inside exofs that where taking the BKL were:
exofs_put_super() - .put_super
and
exofs_sync_fs() - which is .sync_fs and is also called from
                  .write_super.

Now exofs_sync_fs() is protected from itself by also taking
the sb_lock.

exofs_put_super() directly calls exofs_sync_fs() so there is no
danger between these two either.

In anyway there is absolutely nothing dangerous been done
inside exofs_sync_fs().

Unless there is some subtle race with the actual lifetime of
the super_block in regard to .put_super and some other parts
of the VFS. Which is highly unlikely.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:38 -04:00
Julia Lawall
88a0a53d70 fs/romfs: correct error-handling code
romfs_fill_super() assumes that romfs_iget() returns NULL when
it fails.  romfs_iget() actually returns ERR_PTR(-ve) in that
case...

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:37 -04:00
Miklos Szeredi
f84398068d vfs: seq_file: add helpers for data filling
Add two helpers that allow access to the seq_file's own buffer, but
hide the internal details of seq_files.

This allows easier implementation of special purpose filling
functions.  It also cleans up some existing functions which duplicated
the seq_file logic.

Make these inline functions in seq_file.h, as suggested by Al.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:35 -04:00
Jeff Layton
f9098980ff vfs: remove redundant position check in do_sendfile
As Johannes Weiner pointed out, one of the range checks in do_sendfile
is redundant and is already checked in rw_verify_area.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Robert Love <rlove@google.com>
Cc: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:34 -04:00
Jeff Layton
42cb56ae2a vfs: change sb->s_maxbytes to a loff_t
sb->s_maxbytes is supposed to indicate the maximum size of a file that can
exist on the filesystem.  It's declared as an unsigned long long.

Even if a filesystem has no inherent limit that prevents it from using
every bit in that unsigned long long, it's still problematic to set it to
anything larger than MAX_LFS_FILESIZE.  There are places in the kernel
that cast s_maxbytes to a signed value.  If it's set too large then this
cast makes it a negative number and generally breaks the comparison.

Change s_maxbytes to be loff_t instead.  That should help eliminate the
temptation to set it too large by making it a signed value.

Also, add a warning for couple of releases to help catch filesystems that
set s_maxbytes too large.  Eventually we can either convert this to a
BUG() or just remove it and in the hope that no one will get it wrong now
that it's a signed value.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Robert Love <rlove@google.com>
Cc: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:33 -04:00
Jeff Layton
5aa98b706e vfs: explicitly cast s_maxbytes in fiemap_check_ranges
If fiemap_check_ranges is passed a large enough value, then it's
possible that the value would be cast to a signed value for comparison
against s_maxbytes when we change it to loff_t. Make sure that doesn't
happen by explicitly casting s_maxbytes to an unsigned value for the
purposes of comparison.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Love <rlove@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:31 -04:00
Wu Fengguang
05cc0cee69 libfs: return error code on failed attr set
Currently all simple_attr.set handlers return 0 on success and negative
codes on error.  Fix simple_attr_write() to return these error codes.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:30 -04:00
Tetsuo Handa
7a62cc1021 seq_file: return a negative error code when seq_path_root() fails.
seq_path_root() is returning a return value of successful __d_path()
instead of returning a negative value when mangle_path() failed.

This is not a bug so far because nobody is using return value of
seq_path_root().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:29 -04:00
Andi Kleen
ce06e0b21d vfs: optimize touch_time() too
Do a similar optimization as earlier for touch_atime.  Getting the lock in
mnt_get_write is relatively costly, so try all avenues to avoid it first.

This patch is careful to still only update inode fields inside the lock
region.

This didn't show up in benchmarks, but it's easy enough to do.

[akpm@linux-foundation.org: fix typo in comment]
[hugh.dickins@tiscali.co.uk: fix inverted test of mnt_want_write_file()]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Valerie Aurora <vaurora@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:27 -04:00
Andi Kleen
b12536c270 vfs: optimization for touch_atime()
Some benchmark testing shows touch_atime to be high up in profile logs for
IO intensive workloads.  Most likely that's due to the lock in
mnt_want_write().  Unfortunately touch_atime first takes the lock, and
then does all the other tests that could avoid atime updates (like noatime
or relatime).

Do it the other way round -- first try to avoid the update and only then
if that didn't succeed take the lock.  That works because none of the
atime avoidance tests rely on locking.

This also eliminates a goto.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Valerie Aurora <vaurora@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:26 -04:00
Jan Kara
22fe404218 vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
Hugetlbfs needs to do special things instead of truncate_inode_pages().
 Currently, it copied generic_forget_inode() except for
truncate_inode_pages() call which is asking for trouble (the code there
isn't trivial).  So create a separate function generic_detach_inode()
which does all the list magic done in generic_forget_inode() and call
it from hugetlbfs_forget_inode().

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:25 -04:00
Manish Katiyar
af0d9ae811 fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
Add device-id and inode number for better debugging.  This was suggested
by Andreas in one of the threads
http://article.gmane.org/gmane.comp.file-systems.ext4/12062 .

"If anyone has a chance, fixing this error message to be not-useless would
be good...  Including the device name and the inode number would help
track down the source of the problem."

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:24 -04:00
Steven Rostedt
14be27460e libfs: make simple_read_from_buffer conventional
Impact: have simple_read_from_buffer conform to standards

It was brought to my attention by Andrew Morton, Theodore Tso, and H.
Peter Anvin that a read from userspace should only return -EFAULT if
nothing was actually read.

Looking at the simple_read_from_buffer I noticed that this function does
not conform to that rule.  This patch fixes that function.

[akpm@linux-foundation.org: simplification suggested by hpa]
[hpa@zytor.com: fix count==0 handling]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-09-24 07:47:22 -04:00
Jason Wessel
429a6e5e2c x86: early_printk: Protect against using the same device twice
If you use the kernel argument:

  earlyprintk=serial,ttyS0,115200

This will cause a recursive hang printing the same line
again and again:

 BIOS-e820: 000000003fff3000 - 0000000040000000 (ACPI data)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
bootconsole [earlyser0] enabled
Linux version 2.6.31-07863-gb64ada6 (mingo@sirius) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #16789 SMP Wed Sep 23 21:09:43 CEST 2009
Linux version 2.6.31-07863-gb64ada6 (mingo@sirius) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #16789 SMP Wed Sep 23 21:09:43 CEST 2009
Linux version 2.6.31-07863-gb64ada6 (mingo@sirius) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #16789 SMP Wed Sep 23 21:09:43 CEST 2009
Linux version 2.6.31-07863-gb64ada6 (mingo@sirius) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #16789 SMP Wed Sep 23 21:09:43 CEST 2009
Linux version 2.6.31-07863-gb64ada6 (mingo@sirius) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #16789 SMP Wed Sep 23 21:09:43 CEST 2009

Instead warn the end user that they specified the device
a second time, and ignore that second console.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <4ABAAB89.1080407@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-24 13:01:13 +02:00
Ingo Molnar
d2ff6de537 Merge branch 'linus' into x86/urgent
Merge reason: Queueing up dependent early-printk fix.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-24 12:59:18 +02:00
Mike Galbraith
dd906a0fe8 perf tools: Handle relative paths while loading module symbols
Inform util/module.c::mod_dso__load_module_paths() that relative
paths do exist in some modules.dep, and make it fail noisily should
it encounter a path that it doesn't understand, or a module it
cannot open.

Reported-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: rostedt@goodmis.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <1253779628.10513.8.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-24 11:40:35 +02:00
Roland Dreier
e23a8b6a8f x86: Reduce verbosity of "PAT enabled" kernel message
On modern systems, the kernel prints the message

    x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106

once for every CPU.

This gets kind of ridiculous on huge systems; for example, on a
64-thread system I was lucky enough to get:

    dmesg| grep 'PAT enabled' | wc
         64     704    5174

There is already a BUG() if non-boot CPUs have PAT capabilities
that don't match the boot CPU, so just print the message on the
boot CPU. (I kept the print after the wrmsrl() that enables PAT,
so that the log output continues to mean that the system survived
enabling PAT on the boot CPU)

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
LKML-Reference: <adavdj92sso.fsf@cisco.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-24 11:35:19 +02:00
Roland Dreier
ea01c0d731 x86: Reduce verbosity of "TSC is reliable" message
On modern systems, the kernel prints the message

    Skipping synchronization checks as TSC is reliable.

once for every non-boot CPU.

This gets kind of ridiculous on huge systems; for example, on a
64-thread system I was lucky enough to get:

    $ dmesg | grep 'TSC is reliable' | wc
         63     567    4221

There's no point to doing this for every CPU, since the code is
just checking the boot CPU anyway, so change this to a
printk_once() to make the message appears only once.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
LKML-Reference: <adazl8l2swc.fsf@cisco.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-24 11:35:19 +02:00