Commit Graph

234 Commits

Author SHA1 Message Date
Maxim Uvarov
b663a79c19 taskstats: add context-switch counters
Make available to the user the following task and process performance
statistics:

	* Involuntary Context Switches (task_struct->nivcsw)
	* Voluntary Context Switches (task_struct->nvcsw)

Statistics information is available from:
	1. taskstats interface (Documentation/accounting/)
	2. /proc/PID/status (task only).

This data is useful for detecting hyperactivity patterns between processes.

[akpm@linux-foundation.org: cleanup]
Signed-off-by: Maxim Uvarov <muvarov@ru.mvista.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Jonathan Lim <jlim@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:46 -07:00
Alexey Dobriyan
da58a16173 /proc/*/environ: wrong placing of ptrace_may_attach() check
It's a bit dopey-looking and can permit a task to cause a pagefault in an mm
which it doesn't have permission to read from.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:44 -07:00
Changli Gao
99fc06df72 procfs directory entry cleanup
Function proc_register() will assign proc_dir_operations and
proc_dir_inode_operations to ent's members proc_fops and proc_iops
correctly if ent is a directory. So the early assignment isn't
necessary.

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:43 -07:00
Pavel Emelianov
25216b0039 Make /proc/tty/drivers use seq_list_xxx helpers
Simple and stupid like some previous ones.  Just use new API.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:42 -07:00
Tomas Janousek
d62141414a Use boot based time for uptime in /proc
Commit 411187fb05 caused uptime not to increase
during suspend.  This may cause confusion so I restore the old behaviour by
using the boot based time instead of monotonic for uptime.

Signed-off-by: Tomas Janousek <tjanouse@redhat.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:41 -07:00
Tomas Janousek
924b42d5a2 Use boot based time for process start time and boot time in /proc
Commit 411187fb05 caused boot time to move and
process start times to become invalid after suspend.  Using boot based time
for those restores the old behaviour and fixes the issue.

[akpm@linux-foundation.org: little cleanup]
Signed-off-by: Tomas Janousek <tjanouse@redhat.com>
Cc: Tomas Smetana <tsmetana@redhat.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:41 -07:00
Alexey Dobriyan
786d7e1612 Fix rmmod/read/write races in /proc entries
Fix following races:
===========================================
1. Write via ->write_proc sleeps in copy_from_user(). Module disappears
   meanwhile. Or, more generically, system call done on /proc file, method
   supplied by module is called, module dissapeares meanwhile.

   pde = create_proc_entry()
   if (!pde)
	return -ENOMEM;
   pde->write_proc = ...
				open
				write
				copy_from_user
   pde = create_proc_entry();
   if (!pde) {
	remove_proc_entry();
	return -ENOMEM;
	/* module unloaded */
   }
				*boom*
==========================================
2. bogo-revoke aka proc_kill_inodes()

  remove_proc_entry		vfs_read
  proc_kill_inodes		[check ->f_op validness]
				[check ->f_op->read validness]
				[verify_area, security permissions checks]
	->f_op = NULL;
				if (file->f_op->read)
					/* ->f_op dereference, boom */

NOTE, NOTE, NOTE: file_operations are proxied for regular files only. Let's
see how this scheme behaves, then extend if needed for directories.
Directories creators in /proc only set ->owner for them, so proxying for
directories may be unneeded.

NOTE, NOTE, NOTE: methods being proxied are ->llseek, ->read, ->write,
->poll, ->unlocked_ioctl, ->ioctl, ->compat_ioctl, ->open, ->release.
If your in-tree module uses something else, yell on me. Full audit pending.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:39 -07:00
Ingo Molnar
43ae34cb4c sched: scheduler debugging, core
scheduler debugging core: implement /proc/sched_debug and
/proc/<PID>/sched files for scheduler debugging.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Balbir Singh
172ba844a8 sched: update delay-accounting to use CFS's precise stats
update delay-accounting to use CFS's precise stats.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar
b27f03d4bd sched: make use of precise accounting for /proc task stats
make use of CFS's precise accounting to drive /proc/<pid>/stat statistics.

this code was co-authored by:

 Balbir Singh <balbir@linux.vnet.ibm.com>
 Dmitry Adamushko <dmitry.adamushko@gmail.com>
 Ingo Molnar <mingo@elte.hu>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
2007-07-09 18:51:59 +02:00
Ingo Molnar
62480d13d5 sched: remove the SleepAVG field
remove the SleepAVG field from /proc/<pid>/status, as
with the removal of the sleep-average code this value
no longer makes sense.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Christoph Lameter
a35afb830f Remove SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
David Rientjes
4b8df8915a smaps: only define clear_refs for CONFIG_MMU
/proc/pid/clear_refs is only defined in the CONFIG_MMU case, so make sure we
don't have any references to clear_refs_smap() in generic procfs code.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 20:41:14 -07:00
Akinobu Mita
0c28f287aa procfs: use simple_read_from_buffer()
Cleanup using simple_read_from_buffer() in procfs.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:14 -07:00
John Johansen
9d0633cfed Remove redundant check from proc_sys_setattr()
notify_change() already calls security_inode_setattr() before
calling iop->setattr.

Alan sayeth

  This is a behaviour change on all of these and limits some behaviour of
  existing established security modules

  When inode_change_ok is called it has side effects.  This includes
  clearing the SGID bit on attribute changes caused by chmod.  If you make
  this change the results of some rulesets may be different before or after
  the change is made.

  I'm not saying the change is wrong but it does change behaviour so that
  needs looking at closely (ditto all other attribute twiddles)

Signed-off-by: Steve Beattie <sbeattie@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: John Johansen <jjohansen@suse.de>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:10 -07:00
John Johansen
1e8123fded Remove redundant check from proc_setattr()
notify_change() already calls security_inode_setattr() before
calling iop->setattr.

Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: John Johansen <jjohansen@suse.de>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:10 -07:00
Martin Peschke
09f0892ec7 proc: cleanup: use seq_release_private() where appropriate
We can save some lines of code by using seq_release_private().

Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:09 -07:00
Alexey Dobriyan
9d65cb4a17 Fix race between cat /proc/*/wchan and rmmod et al
kallsyms_lookup() can go iterating over modules list unprotected which is OK
for emergency situations (oops), but not OK for regular stuff like
/proc/*/wchan.

Introduce lookup_symbol_name()/lookup_module_symbol_name() which copy symbol
name into caller-supplied buffer or return -ERANGE.  All copying is done with
module_mutex held, so...

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:08 -07:00
Alexey Dobriyan
ffb4512276 Simplify kallsyms_lookup()
Several kallsyms_lookup() pass dummy arguments but only need, say, module's
name.  Make kallsyms_lookup() accept NULLs where possible.

Also, makes picture clearer about what interfaces are needed for all symbol
resolving business.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:08 -07:00
Randy Dunlap
e63340ae6b header cleaning: don't include smp_lock.h when not used
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:07 -07:00
Alexey Dobriyan
ca509f69de Protect tty drivers list with tty_mutex
Additions and removal from tty_drivers list were just done as well as
iterating on it for /proc/tty/drivers generation.

testing: modprobe/rmmod loop of simple module which does nothing but
tty_register_driver() vs cat /proc/tty/drivers loop

BUG: unable to handle kernel paging request at virtual address 6b6b6b6b
 printing eip:
c01cefa7
*pde = 00000000
Oops: 0000 [#1]
PREEMPT
last sysfs file: devices/pci0000:00/0000:00:1d.7/usb5/5-0:1.0/bInterfaceProtocol
Modules linked in: ohci_hcd af_packet e1000 ehci_hcd uhci_hcd usbcore xfs
CPU:    0
EIP:    0060:[<c01cefa7>]    Not tainted VLI
EFLAGS: 00010297   (2.6.21-rc4-mm1 #4)
EIP is at vsnprintf+0x3a4/0x5fc
eax: 6b6b6b6b   ebx: f6cb50f2   ecx: 6b6b6b6b   edx: fffffffe
esi: c0354700   edi: f6cb6000   ebp: 6b6b6b6b   esp: f31f5e68
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process cat (pid: 31864, ti=f31f4000 task=c1998030 task.ti=f31f4000)
Stack: 00000000 c0103f20 c013003a c0103f20 00000000 f6cb50da 0000000a 00000f0e
       f6cb50f2 00000010 00000014 ffffffff ffffffff 00000007 c0354753 f6cb50f2
       f73e39dc f73e39dc 00000001 c0175416 f31f5ed8 f31f5ed4 0ee00000 f32090bc
Call Trace:
 [<c0103f20>] restore_nocheck+0x12/0x15
 [<c013003a>] mark_held_locks+0x6d/0x86
 [<c0103f20>] restore_nocheck+0x12/0x15
 [<c0175416>] seq_printf+0x2e/0x52
 [<c0192895>] show_tty_range+0x35/0x1f3
 [<c0175416>] seq_printf+0x2e/0x52
 [<c0192add>] show_tty_driver+0x8a/0x1d9
 [<c01758f6>] seq_read+0x70/0x2ba
 [<c0175886>] seq_read+0x0/0x2ba
 [<c018d8e6>] proc_reg_read+0x63/0x9f
 [<c015e764>] vfs_read+0x7d/0xb5
 [<c018d883>] proc_reg_read+0x0/0x9f
 [<c015eab1>] sys_read+0x41/0x6a
 [<c0103e4e>] sysenter_past_esp+0x5f/0x99
 =======================
Code: 00 8b 4d 04 e9 44 ff ff ff 8d 4d 04 89 4c 24 50 8b 6d 00 81 fd ff 0f 00 00 b8 a4 c1 35 c0 0f 46 e8 8b 54 24 2c 89 e9 89 c8 eb 06 <80> 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 89 c6 8b 44 24 28 89
EIP: [<c01cefa7>] vsnprintf+0x3a4/0x5fc SS:ESP 0068:f31f5e68

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:05 -07:00
Alexey Dobriyan
19c5d45a09 /proc/*/oom_score oops re badness
Eternal quest to make

	while true; do cat /proc/fs/xfs/stat >/dev/null 2>/dev/null; done
	while true; do find /proc -type f 2>/dev/null | xargs cat >/dev/null 2>/dev/null; done
	while true; do modprobe xfs; rmmod xfs; done

work reliably continues and now kernel oopses in the following way:

BUG: unable to handle ... at virtual address 6b6b6b6b
EIP is at badness
process: cat
	proc_oom_score
	proc_info_read
	sys_fstat64
	vfs_read
	proc_info_read
	sys_read

Failing code is prefetch hidden in list_for_each_entry() in badness().
badness() is reachable from two points. One is proc_oom_score, another
is out_of_memory() => select_bad_process() => badness().

Second path grabs tasklist_lock, while first doesn't.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:04 -07:00
Miklos Szeredi
2793274298 add file position info to proc
Add support for finding out the current file position, open flags and
possibly other info in the future.

These new entries are added:

  /proc/PID/fdinfo/FD
  /proc/PID/task/TID/fdinfo/FD

For each fd the information is provided in the following format:

pos:	1234
flags:	0100002

[bunk@stusta.de: make struct proc_fdinfo_file_operations static]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:03 -07:00
Eric Dumazet
c5141e6d64 procfs: reorder struct pid_dentry to save space on 64bit archs, and constify them
Change the order of fields of struct pid_entry (file fs/proc/base.c) in order
to avoid a hole on 64bit archs.  (8 bytes saved per object)

Also change all pid_entry arrays to be const qualified, to make clear they
must not be modified.

Before (on x86_64) :

# size fs/proc/base.o
   text    data     bss     dec     hex filename
  15549    2192       0   17741    454d fs/proc/base.o

After :

# size fs/proc/base.o
   text    data     bss     dec     hex filename
  17229     176       0   17405    43fd fs/proc/base.o

Thats 336 bytes saved on kernel size on x86_64

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:03 -07:00
Kees Cook
5096add84b proc: maps protection
The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive
information about the memory location and usage of processes.  Issues:

- maps should not be world-readable, especially if programs expect any
  kind of ASLR protection from local attackers.
- maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
  check the maps when %n is in a *printf call, and a setuid(getuid())
  process wouldn't be able to read its own maps file.  (For reference
  see http://lkml.org/lkml/2006/1/22/150)
- a system-wide toggle is needed to allow prior behavior in the case of
  non-root applications that depend on access to the maps contents.

This change implements a check using "ptrace_may_attach" before allowing
access to read the maps contents.  To control this protection, the new knob
/proc/sys/kernel/maps_protect has been added, with corresponding updates to
the procfs documentation.

[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: New sysctl numbers are old hat]
Signed-off-by: Kees Cook <kees@outflux.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:02 -07:00
Alexey Dobriyan
578c8183c1 proc: remove pathetic ->deleted WARN_ON
WARN_ON(de && de->deleted); is sooo unreliable. Why?

proc_lookup				remove_proc_entry
===========				=================
lock_kernel();
spin_lock(&proc_subdir_lock);
[find proc entry]
spin_unlock(&proc_subdir_lock);
					spin_lock(&proc_subdir_lock);
					[find proc entry]

proc_get_inode
==============
WARN_ON(de && de->deleted);			...

					if (!atomic_read(&de->count))
						free_proc_entry(de);
					else
						de->deleted = 1;

So, if you have some strange oops [1], and doesn't see this WARN_ON it means
nothing.

[1] try_module_get() of module which doesn't exist, two lines below
    should suffice, or not?

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:02 -07:00
Darrick J. Wong
59cd0cbc75 Fix race between proc_readdir and remove_proc_entry
Fix the following race:

proc_readdir				remove_proc_entry
============				=================

spin_lock(&proc_subdir_lock);
[choose PDE to start filldir from]
spin_unlock(&proc_subdir_lock);
					spin_lock(&proc_subdir_lock);
					[find PDE]
					[free PDE, refcount is 0]
					spin_unlock(&proc_subdir_lock);
		    /* boom */
if (filldir(dirent, de->name, ...

[de_put on error path --adobriyan]
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:02 -07:00
Alexey Dobriyan
7695650a92 Fix race between proc_get_inode() and remove_proc_entry()
proc_lookup				remove_proc_entry
===========				=================

lock_kernel();
spin_lock(&proc_subdir_lock);
[find PDE with refcount 0]
spin_unlock(&proc_subdir_lock);
					spin_lock(&proc_subdir_lock);
					[find PDE with refcount 0]
					[check refcount and free PDE]
					spin_unlock(&proc_subdir_lock);
proc_get_inode:
	de_get(de); /* boom */

Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:01 -07:00
William Cohen
97dc32cdb1 reduce size of task_struct on 64-bit machines
This past week I was playing around with that pahole tool
(http://oops.ghostprotocols.net:81/acme/dwarves/) and looking at the size
of various struct in the kernel.  I was surprised by the size of the
task_struct on x86_64, approaching 4K.  I looked through the fields in
task_struct and found that a number of them were declared as "unsigned
long" rather than "unsigned int" despite them appearing okay as 32-bit
sized fields.  On x86_64 "unsigned long" ends up being 8 bytes in size and
forces 8 byte alignment.  Is there a reason there a reason they are
"unsigned long"?

The patch below drops the size of the struct from 3808 bytes (60 64-byte
cachelines) to 3760 bytes (59 64-byte cachelines).  A couple other fields
in the task struct take a signficant amount of space:

struct thread_struct       thread;               688
struct held_lock           held_locks[30];       1680

CONFIG_LOCKDEP is turned on in the .config

[akpm@linux-foundation.org: fix printk warnings]
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:58 -07:00
Alexey Dobriyan
8948e11f45 Allow access to /proc/$PID/fd after setuid()
/proc/$PID/fd has r-x------ permissions, so if process does setuid(), it
will not be able to access /proc/*/fd/. This breaks fstatat() emulation
in glibc.

open("foo", O_RDONLY|O_DIRECTORY)       = 4
setuid32(65534)                         = 0
stat64("/proc/self/fd/4/bar", 0xbfafb298) = -1 EACCES (Permission denied)

Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: Kirill Korotaev <dev@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:58 -07:00
Christoph Lameter
50953fe9e0 slab allocators: Remove SLAB_DEBUG_INITIAL flag
I have never seen a use of SLAB_DEBUG_INITIAL.  It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again?  The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free.  That also places the check near the code object
manipulation of the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on.  If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code.  But there is no such code
in the kernel.  I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e.  add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches.  Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support.  Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
David Rientjes
b813e931b4 smaps: add clear_refs file to clear reference
Adds /proc/pid/clear_refs.  When any non-zero number is written to this file,
pte_mkold() and ClearPageReferenced() is called for each pte and its
corresponding page, respectively, in that task's VMAs.  This file is only
writable by the user who owns the task.

It is now possible to measure _approximately_ how much memory a task is using
by clearing the reference bits with

	echo 1 > /proc/pid/clear_refs

and checking the reference count for each VMA from the /proc/pid/smaps output
at a measured time interval.  For example, to observe the approximate change
in memory footprint for a task, write a script that clears the references
(echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and
extracts the size in kB.  Add the sizes for each VMA together for the total
referenced footprint.  Moments later, repeat the process and observe the
difference.

For example, using an efficient Mozilla:

	accumulated time		referenced memory
	----------------		-----------------
		 0 s				 408 kB
		 1 s				 408 kB
		 2 s				 556 kB
		 3 s				1028 kB
		 4 s				 872 kB
		 5 s				1956 kB
		 6 s				 416 kB
		 7 s				1560 kB
		 8 s				2336 kB
		 9 s				1044 kB
		10 s				 416 kB

This is a valuable tool to get an approximate measurement of the memory
footprint for a task.

Cc: Hugh Dickins <hugh@veritas.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
[akpm@linux-foundation.org: build fixes]
[mpm@selenic.com: rename for_each_pmd]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
David Rientjes
f79f177c25 smaps: add pages referenced count to smaps
Adds an additional unsigned long field to struct mem_size_stats called
'referenced'.  For each pte walked in the smaps code, this field is
incremented by PAGE_SIZE if it has pte-reference bits.

An additional line was added to the /proc/pid/smaps output for each VMA to
indicate how many pages within it are currently marked as referenced or
accessed.

Cc: Hugh Dickins <hugh@veritas.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
David Rientjes
826fad1b93 smaps: extract pmd walker from smaps code
Extracts the pmd walker from smaps-specific code in fs/proc/task_mmu.c.

The new struct pmd_walker includes the struct vm_area_struct of the memory to
walk over.  Iteration begins at the vma->vm_start and completes at
vma->vm_end.  A pointer to another data structure may be stored in the private
field such as struct mem_size_stats, which acts as the smaps accumulator.  For
each pmd in the VMA, the action function is called with a pointer to its
struct vm_area_struct, a pointer to the pmd_t, its start and end addresses,
and the private field.

The interface for walking pmd's in a VMA for fs/proc/task_mmu.c is now:

	void for_each_pmd(struct vm_area_struct *vma,
			  void (*action)(struct vm_area_struct *vma,
					 pmd_t *pmd, unsigned long addr,
					 unsigned long end,
					 void *private),
			  void *private);

Since the pmd walker is now extracted from the smaps code, smaps_one_pmd() is
invoked for each pmd in the VMA.  Its behavior and efficiency is identical to
the existing implementation.

Cc: Hugh Dickins <hugh@veritas.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
Adrian Bunk
ac267728f1 mm/slab.c: proper prototypes
Add proper prototypes in include/linux/slab.h.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:52 -07:00
Ian Campbell
79e030114a [PATCH] i386: Allow i386 crash kernels to handle x86_64 dumps
The specific case I am encountering is kdump under Xen with a 64 bit
hypervisor and 32 bit kernel/userspace.  The dump created is 64 bit due to
the hypervisor but the dump kernel is 32 bit for maximum compatibility.

It's possibly less likely to be useful in a purely native scenario but I
see no reason to disallow it.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Ian Campbell <ian.campbell@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Horms <horms@verge.net.au>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:09 +02:00
Stephen Rothwell
1a38147ed0 [POWERPC] Make struct property's value a void *
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-04-13 03:55:18 +10:00
Andrew Morton
05565b65a5 [PATCH] proc: fix linkage with CONFIG_SYSCTL=y, CONFIG_PROC_SYSCTL=n
We're using #ifdef CONFIG_SYSCTL, but we should be using CONFIG_PROC_SYSCTL,
so we get

 fs/built-in.o: In function `proc_root_init':
 /usr/src/linux/fs/proc/root.c:83: undefined reference to `proc_sys_init'

Fix that up and remove an ifdef-in-C.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Helge Hafting <helgehaf@aitel.hist.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-02 10:06:08 -07:00
Mika Kukkonen
5c46010af2 [PATCH] Fix kernel build with EMBEDDED & PROC_FS & !PROC_SYSCTL
Without attached patch against current -git I get following with
!PROC_SYSCTL (with EMBEDDED and PROC_FS set):

    CC      init/version.o
    LD      init/built-in.o
    LD      vmlinux
  fs/built-in.o: In function `do_proc_sys_lookup':
  proc_sysctl.c:(.text+0x26583): undefined reference to `sysctl_head_next'
  fs/built-in.o: In function `proc_sys_revalidate':
  proc_sysctl.c:(.text+0x265bb): undefined reference to `sysctl_head_finish'
  fs/built-in.o: In function `proc_sys_readdir':
  proc_sysctl.c:(.text+0x26720): undefined reference to `sysctl_head_next'
  proc_sysctl.c:(.text+0x267d8): undefined reference to `sysctl_head_finish'
  proc_sysctl.c:(.text+0x268e7): undefined reference to `sysctl_head_next'
  proc_sysctl.c:(.text+0x26910): undefined reference to `sysctl_head_finish'
  fs/built-in.o: In function `proc_sys_write':
  proc_sysctl.c:(.text+0x2695d): undefined reference to `sysctl_perm'
  proc_sysctl.c:(.text+0x2699c): undefined reference to `sysctl_head_finish'
  fs/built-in.o: In function `proc_sys_read':
  proc_sysctl.c:(.text+0x269e9): undefined reference to `sysctl_perm'
  proc_sysctl.c:(.text+0x26a25): undefined reference to `sysctl_head_finish'
  fs/built-in.o: In function `proc_sys_permission':
  proc_sysctl.c:(.text+0x26ad1): undefined reference to `sysctl_perm'
  proc_sysctl.c:(.text+0x26adb): undefined reference to `sysctl_head_finish'
  fs/built-in.o: In function `proc_sys_lookup':
  proc_sysctl.c:(.text+0x26b39): undefined reference to `sysctl_head_finish'
  make: *** [vmlinux] Virhe 1

All those functions are in fs/proc/proc_sysctl.c, which has no CONFIG_
#define's in it, so the patch makes the compilation of that file to depend
on CONFIG_PROC_SYSCTL (the simplest choice).

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:16 -07:00
Al Viro
04ff97086b [PATCH] sanitize security_getprocattr() API
have it return the buffer it had allocated

Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-14 15:27:48 -07:00
Glauber de Oliveira Costa
63967fa911 [PATCH] Missing __user in pointer referenced within copy_from_user
Pointers to user data should be marked with a __user hint.  This one is
missing.

Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:15 -08:00
Eric W. Biederman
86a71dbd3e [PATCH] sysctl: hide the sysctl proc inodes from selinux
Since the security checks are applied on each read and write of a sysctl file,
just like they are applied when calling sys_sysctl, they are redundant on the
standard VFS constructs.  Since it is difficult to compute the security labels
on the standard VFS constructs we just mark the sysctl inodes in proc private
so selinux won't even bother with them.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:10:00 -08:00
Eric W. Biederman
77b14db502 [PATCH] sysctl: reimplement the sysctl proc support
With this change the sysctl inodes can be cached and nothing needs to be done
when removing a sysctl table.

For a cost of 2K code we will save about 4K of static tables (when we remove
de from ctl_table) and 70K in proc_dir_entries that we will not allocate, or
about half that on a 32bit arch.

The speed feels about the same, even though we can now cache the sysctl
dentries :(

We get the core advantage that we don't need to have a 1 to 1 mapping between
ctl table entries and proc files.  Making it possible to have /proc/sys vary
depending on the namespace you are in.  The currently merged namespaces don't
have an issue here but the network namespace under /proc/sys/net needs to have
different directories depending on which network adapters are visible.  By
simply being a cache different directories being visible depending on who you
are is trivial to implement.

[akpm@osdl.org: fix uninitialised var]
[akpm@osdl.org: fix ARM build]
[bunk@stusta.de: make things static]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:10:00 -08:00
Eric W. Biederman
2abc26fc6b [PATCH] sysctl: create sys/fs/binfmt_misc as an ordinary sysctl entry
binfmt_misc has a mount point in the middle of the sysctl and that mount point
is created as a proc_generic directory.

Doing it that way gets in the way of cleaning up the sysctl proc support as it
continues the existence of a horrible hack.  So instead simply create the
directory as an ordinary sysctl directory.  At least that removes the magic
special case.

[akpm@osdl.org: warning fix]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Josef 'Jeff' Sipek
ee9b6d61a2 [PATCH] Mark struct super_operations const
This patch is inspired by Arjan's "Patch series to mark struct
file_operations and struct inode_operations const".

Compile tested with gcc & sparse.

Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:47 -08:00
Arjan van de Ven
c5ef1c42c5 [PATCH] mark struct inode_operations const 3
Many struct inode_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:46 -08:00
Arjan van de Ven
00977a59b9 [PATCH] mark struct file_operations const 6
Many struct file_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:45 -08:00
Eric W. Biederman
ab521dc0f8 [PATCH] tty: update the tty layer to work with struct pid
Of kernel subsystems that work with pids the tty layer is probably the largest
consumer.  But it has the nice virtue that the assiation with a session only
lasts until the session leader exits.  Which means that no reference counting
is required.  So using struct pid winds up being a simple optimization to
avoid hash table lookups.

In the long term the use of pid_nr also ensures that when we have multiple pid
spaces mixed everything will work correctly.

Signed-off-by: Eric W. Biederman <eric@maxwell.lnxi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:32 -08:00
Alexey Dobriyan
4b98d11b40 [PATCH] ifdef ->rchar, ->wchar, ->syscr, ->syscw from task_struct
They are fat: 4x8 bytes in task_struct.
They are uncoditionally updated in every fork, read, write and sendfile.
They are used only if you have some "extended acct fields feature".

And please, please, please, read(2) knows about bytes, not characters,
why it is called "rchar"?

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 11:18:07 -08:00
Andrew Morton
100bb9349e [PATCH] proc_misc warning fix
fs/proc/proc_misc.c: In function 'proc_misc_init':
fs/proc/proc_misc.c:764: warning: unused variable 'entry'

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:31 -08:00