linux/mm
Wu Fengguang 10be0b372c readahead: introduce context readahead algorithm
Introduce page cache context based readahead algorithm.
This is to better support concurrent read streams in general.

RATIONALE
---------
The current readahead algorithm detects interleaved reads in a _passive_ way.
Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...
By checking for (offset == prev_offset + 1), it will discover the sequentialness
between 3,4 and between 1004,1005, and start doing sequential readahead for the
individual streams since page 4 and page 1005.

The context readahead algorithm guarantees to discover the sequentialness no
matter how the streams are interleaved. For the above example, it will start
sequential readahead since page 2 and 1002.

The trick is to poke for page @offset-1 in the page cache when it has no other
clues on the sequentialness of request @offset: if the current requenst belongs
to a sequential stream, that stream must have accessed page @offset-1 recently,
and the page will still be cached now. So if page @offset-1 is there, we can
take request @offset as a sequential access.

BENEFICIARIES
-------------
- strictly interleaved reads  i.e. 1,1001,2,1002,3,1003,...
  the current readahead will take them as silly random reads;
  the context readahead will take them as two sequential streams.

- cooperative IO processes   i.e. NFS and SCST
  They create a thread pool, farming off (sequential) IO requests to different
  threads which will be performing interleaved IO.

  It was not easy(or possible) to reliably tell from file->f_ra all those
  cooperative processes working on the same sequential stream, since they will
  have different file->f_ra instances. And NFSD's file->f_ra is particularly
  unusable, since their file objects are dynamically created for each request.
  The nfsd does have code trying to restore the f_ra bits, but not satisfactory.

  The new scheme is to detect the sequential pattern via looking up the page
  cache, which provides one single and consistent view of the pages recently
  accessed. That makes sequential detection for cooperative processes possible.

USER REPORT
-----------
Vladislav recommends the addition of context readahead as a result of his SCST
benchmarks. It leads to 6%~40% performance gains in various cases and achieves
equal performance in others.                http://lkml.org/lkml/2009/3/19/239

OVERHEADS
---------
In theory, it introduces one extra page cache lookup per random read.  However
the below benchmark shows context readahead to be slightly faster, wondering..

Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
each block size. The average throughputs are:

                       	original ra	context ra	gain
 4K random reads:	 65.561MB/s	 65.648MB/s	+0.1%
16K random reads:	124.767MB/s	124.951MB/s	+0.1%
64K random reads: 	162.123MB/s	162.278MB/s	+0.1%

Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Vladislav Bolkhovitin <vst@vlnb.net>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:30 -07:00
..
allocpercpu.c percpu: __percpu_depopulate_mask can take a const mask 2009-04-06 13:44:15 -07:00
backing-dev.c block: change the request allocation/congestion logic to be sync/async based 2009-04-06 08:04:53 -07:00
bootmem.c bootmem: fix slab fallback on numa 2009-06-11 19:15:54 +03:00
bounce.c Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block 2009-06-11 11:10:35 -07:00
debug-pagealloc.c generic debug pagealloc 2009-04-01 08:59:13 -07:00
dmapool.c
fadvise.c readahead: move max_sane_readahead() calls into force_page_cache_readahead() 2009-06-16 19:47:28 -07:00
failslab.c kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c 2009-04-03 12:23:01 +02:00
filemap_xip.c mm: do_xip_mapping_read: fix length calculation 2009-04-02 19:04:49 -07:00
filemap.c readahead: record mmap read-around states in file_ra_state 2009-06-16 19:47:29 -07:00
fremap.c
highmem.c mm: introduce debug_kmap_atomic 2009-04-01 08:59:14 -07:00
hugetlb.c mm: account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in hugetlbfs 2009-05-29 08:40:03 -07:00
init-mm.c mm: consolidate init_mm definition 2009-06-16 19:47:28 -07:00
internal.h nommu: there is no mlock() for NOMMU, so don't provide the bits 2009-04-01 08:59:14 -07:00
Kconfig security: use mmap_min_addr indepedently of security models 2009-06-04 12:07:48 +10:00
Kconfig.debug generic debug pagealloc: build fix 2009-04-02 19:04:48 -07:00
kmemleak-test.c kmemleak: Simple testing module for kmemleak 2009-06-11 17:04:19 +01:00
kmemleak.c kmemleak: Add the base support 2009-06-11 17:03:28 +01:00
maccess.c [S390] maccess: add weak attribute to probe_kernel_write 2009-06-12 10:27:37 +02:00
madvise.c readahead: move max_sane_readahead() calls into force_page_cache_readahead() 2009-06-16 19:47:28 -07:00
Makefile mm: consolidate init_mm definition 2009-06-16 19:47:28 -07:00
memcontrol.c memcg: fix build warning and avoid checking for mem != null again and again 2009-05-29 08:40:03 -07:00
memory_hotplug.c
memory.c mm: close page_mkwrite races 2009-05-02 15:36:09 -07:00
mempolicy.c
mempool.c
migrate.c FS-Cache: Recruit a page flags for cache management 2009-04-03 16:42:36 +01:00
mincore.c
mlock.c x86, bts, mm: clean up buffer allocation 2009-04-24 10:18:52 +02:00
mm_init.c
mmap.c Merge branch 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-06-11 14:01:07 -07:00
mmu_notifier.c
mmzone.c [ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 2009-05-18 11:22:24 +01:00
mprotect.c perf_counter: Add mmap event hooks to mprotect() 2009-06-08 23:10:43 +02:00
mremap.c
msync.c
nommu.c nommu: Provide mmap_min_addr definition. 2009-06-10 09:24:09 +10:00
oom_kill.c oom: fix possible oom_dump_tasks NULL pointer 2009-05-29 08:40:01 -07:00
page_alloc.c kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash 2009-06-11 17:03:30 +01:00
page_cgroup.c memcg: fix page_cgroup fatal error in FLATMEM 2009-06-12 11:00:54 +03:00
page_io.c block: fix bad definition of BIO_RW_SYNC 2009-02-18 10:32:00 +01:00
page_isolation.c
page-writeback.c page-writeback: fix the calculation of the oldest_jif in wb_kupdate() 2009-05-17 16:36:11 -07:00
pagewalk.c
pdflush.c Revert "mm: add /proc controls for pdflush threads" 2009-05-15 11:32:24 +02:00
percpu.c percpu: remove rbtree and use page->index instead 2009-04-08 18:31:31 +02:00
prio_tree.c
quicklist.c cpumask: replace node_to_cpumask with cpumask_of_node. 2009-03-13 14:49:46 +10:30
readahead.c readahead: introduce context readahead algorithm 2009-06-16 19:47:30 -07:00
rmap.c hugh: update email address 2009-05-21 13:14:32 -07:00
shmem_acl.c
shmem.c integrity: move ima_counts_get 2009-05-22 09:45:33 +10:00
slab.c slab: setup cpu caches later on when interrupts are enabled 2009-06-12 18:53:58 +03:00
slob.c kmemleak: Add the slob memory allocation/freeing hooks 2009-06-11 17:03:30 +01:00
slub.c slab,slub: don't enable interrupts during early boot 2009-06-12 18:53:33 +03:00
sparse-vmemmap.c
sparse.c mm: mminit_validate_memmodel_limits(): remove redundant test 2009-04-01 08:59:11 -07:00
swap_state.c memcg: fix deadlock between lock_page_cgroup and mapping tree_lock 2009-05-29 08:40:02 -07:00
swap.c mm: fix Committed_AS underflow on large NR_CPUS environment 2009-05-02 15:36:10 -07:00
swapfile.c PM/hibernate: fix "swap breaks after hibernation failures" 2009-02-21 14:17:17 -08:00
thrash.c
truncate.c memcg: fix deadlock between lock_page_cgroup and mapping tree_lock 2009-05-29 08:40:02 -07:00
util.c Merge branch 'linus' into tracing/core 2009-05-07 11:17:34 +02:00
vmalloc.c Merge branch 'for-linus' of git://linux-arm.org/linux-2.6 2009-06-11 14:15:57 -07:00
vmscan.c PM/Suspend: Do not shrink memory before suspend 2009-06-12 21:32:32 +02:00
vmstat.c [ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 2009-05-18 11:22:24 +01:00