linux/Documentation
Eric Dumazet c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
..
ABI Merge branch 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86 2013-07-13 18:08:23 -07:00
accounting Documentation/accounting/getdelays.c: avoid strncpy in accounting tool 2013-07-03 16:08:06 -07:00
acpi Power management and ACPI updates for 3.11-rc1 2013-07-03 14:35:40 -07:00
aoe
arm Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
arm64
auxdisplay
backlight
blackfin
block
blockdev
bus-devices
cdrom
cgroups Merge branch 'for-3.11/core' of git://git.kernel.dk/linux-block 2013-07-11 13:03:24 -07:00
connector
console
cpu-freq
cpuidle
cris
crypto drivers/dma: remove unused support for MEMSET operations 2013-07-03 16:07:42 -07:00
development-process
device-mapper Add a device-mapper target called dm-switch to provide a multipath 2013-07-11 13:05:40 -07:00
devicetree Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2013-07-13 18:05:13 -07:00
DocBook Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-07-13 12:09:57 -07:00
driver-model
dvb
early-userspace
EDID
extcon
fault-injection
fb Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux 2013-07-09 16:04:31 -07:00
filesystems xfs: update (#2) for 3.11-rc1 2013-07-13 11:40:24 -07:00
firmware_class
fmc FMC: add a char-device mezzanine driver 2013-06-18 15:42:15 -07:00
frv
hid
hwmon New driver to support GMT G762/G763 pwm fan controllers 2013-07-03 19:56:35 -07:00
i2c Merge branch 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2013-07-04 14:02:09 -07:00
i2o
ia64
ide
infiniband
input
ioctl s390/sclp: Add SCLP character device driver 2013-06-26 21:10:13 +02:00
isdn
ja_JP
kbuild Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild 2013-07-10 16:06:46 -07:00
kdump Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
ko_KR
laptops
leds
m68k
make
memory-devices
metag
mips
misc-devices
mmc
mn10300
mtd
namespaces
netlabel
networking tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
nfc
parisc parisc: document the shadow registers 2013-07-09 22:09:19 +02:00
PCI
pcmcia
power Merge branch 'pm-assorted' 2013-06-28 13:01:40 +02:00
powerpc powerpc/perf: Core EBB support for 64-bit book3s 2013-07-01 11:50:10 +10:00
pps
prctl
pti
ptp
rapidio rapidio: documentation update 2013-07-03 16:08:05 -07:00
RCU
s390
scheduler sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00
scsi [SCSI] megaraid_sas: Changelog and driver version update 2013-06-24 17:52:10 -07:00
security
serial
sh
sound
spi
sysctl net: rename busy poll socket op and globals 2013-07-10 17:08:27 -07:00
target
thermal Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2013-07-11 12:26:08 -07:00
timers
trace The majority of the changes here are cleanups for the large changes that 2013-07-11 09:02:09 -07:00
usb Driver core patches for 3.11-rc1 2013-07-02 11:44:19 -07:00
vDSO
video4linux Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-07-13 12:09:57 -07:00
virtual Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
vm zswap: add documentation 2013-07-10 18:11:34 -07:00
w1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
watchdog watchdog: delete mpcore_wdt driver 2013-07-11 21:47:58 +02:00
wimax
x86 arm: add support for LZ4-compressed kernel 2013-07-09 10:33:30 -07:00
xtensa
zh_CN
.gitignore
00-INDEX FMC: add documentation for the core 2013-06-18 15:41:03 -07:00
applying-patches.txt
atomic_ops.txt
bad_memory.txt
basic_profiling.txt
bcache.txt Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
binfmt_misc.txt
braille-console.txt
bt8xxgpio.txt
btmrvl.txt
BUG-HUNTING
bus-virt-phys-mapping.txt
cachetlb.txt
Changes
circular-buffers.txt
clk.txt
coccinelle.txt Coccinelle: Update information about the minimal version required 2013-07-03 22:58:20 +02:00
CodingStyle Documentation/CodingStyle: allow multiple return statements per function 2013-07-03 16:08:01 -07:00
cpu-hotplug.txt kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
cpu-load.txt
cputopology.txt
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
devices.txt /dev/oldmem: Remove the interface 2013-07-03 16:08:03 -07:00
digsig.txt
DMA-API-HOWTO.txt
DMA-API.txt
DMA-attributes.txt
dma-buf-sharing.txt
DMA-ISA-LPC.txt
dmaengine.txt
dmatest.txt
dontdiff
dynamic-debug-howto.txt
edac.txt
eisa.txt
email-clients.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcov.txt
gpio.txt
highuid.txt
HOWTO
hw_random.txt
hwspinlock.txt
init.txt
initrd.txt
intel_txt.txt
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt
kernel-docs.txt
kernel-parameters.txt The majority of the changes here are cleanups for the large changes that 2013-07-11 09:02:09 -07:00
kernel-per-CPU-kthreads.txt Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
kmemcheck.txt
kmemleak.txt
kobject.txt
kprobes.txt
kref.txt
ldm.txt
local_ops.txt
lockdep-design.txt
lockstat.txt
lockup-watchdogs.txt
logo.gif
logo.txt
magic-number.txt
Makefile
ManagementStyle
md.txt md: remove doubled description for sync_max, merging it within sync_min/sync_max 2013-07-03 09:43:28 +10:00
media-framework.txt Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-07-13 12:09:57 -07:00
memory-barriers.txt
memory-hotplug.txt
mono.txt
mutex-design.txt
nommu-mmap.txt
numastat.txt
oops-tracing.txt
padata.txt
parport-lowlevel.txt
parport.txt
percpu-rw-semaphore.txt
pi-futex.txt
pinctrl.txt Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
pnp.txt
preempt-locking.txt
printk-formats.txt lib: vsprintf: add IPv4/v6 generic %p[Ii]S[pfs] format specifier 2013-07-01 23:22:13 -07:00
pwm.txt pwm: Add sysfs interface 2013-06-21 11:32:51 +02:00
ramoops.txt
rbtree.txt
remoteproc.txt
rfkill.txt
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
rt-mutex-design.txt sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00
rt-mutex.txt
rtc.txt rtc: add ability to push out an existing wakealarm using sysfs 2013-07-03 16:07:54 -07:00
SAK.txt
SecurityBugs
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
SM501.txt
smsc_ece1099.txt
sparse.txt
spinlocks.txt sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00
stable_api_nonsense.txt
stable_kernel_rules.txt
static-keys.txt
SubmitChecklist
SubmittingDrivers
SubmittingPatches
svga.txt
sysfs-rules.txt
sysrq.txt
this_cpu_ops.txt
unaligned-memory-access.txt
unicode.txt
unshare.txt
vfio.txt vfio Updates for v3.11 2013-07-10 14:50:08 -07:00
VGA-softcursor.txt
vgaarbiter.txt
video-output.txt
vme_api.txt
volatile-considered-harmful.txt
workqueue.txt
ww-mutex-design.txt mutex: Add support for wound/wait style locks 2013-06-26 12:10:56 +02:00
xz.txt
zorro.txt