linux/net/core
Jason Baron 790ba4566c tcp: set SOCK_NOSPACE under memory pressure
Under tcp memory pressure, calling epoll_wait() in edge triggered
mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
even when there is sufficient memory available to continue making
progress. The problem is that when __sk_mem_schedule() returns 0
under memory pressure, we do not set the SOCK_NOSPACE flag in the
tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since
SOCK_NOSPACE is used to trigger wakeups when incoming acks create
sufficient new space in the write queue, all outstanding packets
are acked, but we never wake up with the the EPOLLOUT that we are
expecting from epoll_wait().

This issue is currently limited to epoll() when used in edge trigger
mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
This is sufficient for poll()/select() and epoll() in level trigger
mode. However, in edge trigger mode, epoll() is relying on the write
path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we
can only call epoll_wait() after read/write return -EAGAIN. Thus, in
the case of the socket write, we are relying on the fact that
tcp_sendmsg()/network write paths are going to issue a wakeup for
us at some point in the future when we get -EAGAIN.

Normally, epoll() edge trigger works fine when we've exceeded the
sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when
we return -EAGAIN from the write path b/c we are over the tcp memory
limits and not b/c we are over the sndbuf, we are never going to get
another wakeup.

I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule()
will return 0, or failure more readily with SO_SNDBUF:

1) create socket and set SO_SNDBUF to N
2) add socket as edge trigger
3) write to socket and block in epoll on -EAGAIN
4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem

The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory()
when the socket is non-blocking. Note that SOCK_NOSPACE, in addition
to waking up outstanding waiters is also used to expand the size of
the sk->sndbuf. However, we will not expand it by setting it in this
case because tcp_should_expand_sndbuf(), ensures that no expansion
occurs when we are under tcp memory pressure.

Note that we could still hang if sk->sk_wmem_queue is 0, when we get
the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we
are waiting for and event that will never happen. I believe
that this case is harder to hit (and did not hit in my testing),
in that over the tcp 'soft' memory limits, we continue to guarantee a
minimum write buffer size. Perhaps, we could return -ENOSPC in this
case, or maybe we simply issue a wakeup in this case, such that we
keep retrying the write. Note that this case is not specific to
epoll() ET, but rather would affect blocking sockets as well. So I
view this patch as bringing epoll() edge-trigger into sync with the
current poll()/select()/epoll() level trigger and blocking sockets
behavior.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 17:38:36 -04:00
..
datagram.c new helper: msg_data_left() 2015-04-11 15:53:35 -04:00
dev_addr_lists.c net: fix spelling for synchronized 2014-11-18 15:26:32 -05:00
dev_ioctl.c dev_ioctl: use sizeof(x) instead of sizeof x 2014-11-18 15:27:32 -05:00
dev.c tc: remove unused redirect ttl 2015-05-04 12:16:12 -04:00
drop_monitor.c net: Replace get_cpu_var through this_cpu_ptr 2014-08-26 13:45:47 -04:00
dst.c dst: no need to take reference on DST_NOCACHE dsts 2014-12-09 16:08:17 -05:00
ethtool.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-03-03 21:16:48 -05:00
fib_rules.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-04-06 22:34:15 -04:00
filter.c seccomp, filter: add and use bpf_prog_create_from_user from seccomp 2015-05-09 17:35:05 -04:00
flow_dissector.c net: Add flow_keys digest 2015-05-04 00:09:09 -04:00
flow.c flowcache: Fix kernel panic in flow_cache_flush_task 2015-02-05 14:38:53 -08:00
gen_estimator.c net: sched: make bstats per cpu and estimator RCU safe 2014-09-30 01:02:26 -04:00
gen_stats.c gen_stats.c: Duplicate xstats buffer for later use 2015-02-19 15:45:53 -05:00
link_watch.c dev: introduce dev_get_iflink() 2015-04-02 14:04:59 -04:00
Makefile net: bury net/core/iovec.c - nothing in there is used anymore 2015-02-04 01:34:15 -05:00
neighbour.c net: neighbour: Add mcast_resolicit to configure the number of multicast resolicitations in PROBE state. 2015-03-20 21:47:40 -04:00
net_namespace.c netns: remove duplicated include from net_namespace.c 2015-04-16 12:14:24 -04:00
net-procfs.c
net-sysfs.c dev: introduce dev_get_iflink() 2015-04-02 14:04:59 -04:00
net-sysfs.h net: netdev_kobject_init: annotate with __init 2014-01-05 20:27:54 -05:00
net-traces.c
netclassid_cgroup.c cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes 2014-07-15 11:05:09 -04:00
netevent.c
netpoll.c net: rename vlan_tx_* helpers since "tx" is misleading there 2015-01-13 17:51:08 -05:00
netprio_cgroup.c cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes 2014-07-15 11:05:09 -04:00
pktgen.c net: pktgen: disable xmit_clone on virtual devices 2015-02-22 22:43:20 -05:00
ptp_classifier.c net: filter: split 'struct sk_filter' into socket and bpf parts 2014-08-02 15:03:58 -07:00
request_sock.c net: convert syn_wait_lock to a spinlock 2015-03-23 16:52:26 -04:00
rtnetlink.c bridge/nl: remove wrong use of NLM_F_MULTI 2015-04-29 14:59:16 -04:00
scm.c net: introduce helper macro for_each_cmsghdr 2014-12-10 22:41:55 -05:00
secure_seq.c net: use ktime_get_ns() and ktime_get_real_ns() helpers 2014-08-22 19:57:23 -07:00
skbuff.c net: fix two sparse warnings introduced by IGMP/MLD parsing exports 2015-05-04 19:19:54 -04:00
sock_diag.c net: add real socket cookies 2015-03-11 21:55:28 -04:00
sock.c tcp: do not cache align timewait sockets 2015-04-12 21:16:05 -04:00
stream.c tcp: set SOCK_NOSPACE under memory pressure 2015-05-09 17:38:36 -04:00
sysctl_net_core.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-03-20 18:51:09 -04:00
timestamping.c net-timestamp: Make the clone operation stand-alone from phy timestamping 2014-09-05 17:43:45 -07:00
tso.c net: tso: fix unaligned access to crafted TCP header in helper API 2014-10-22 12:52:55 -04:00
utils.c net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited 2014-11-11 14:10:31 -05:00