Linux kernel source tree
Go to file
Lawrence Brakmo 40304b2a15 bpf: BPF support for sock_ops
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connections parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.

Alghough there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it oes not require
application changes and it can be updated easily at any time.

Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
only once during the connections's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code.  For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set
congestion control, etc. As a result it has "op" field to specify the
type of operation requested.

The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.

This patch only contains the framework to support the new BPF program
type, following patches add the functionality to set various connection
parameters.

This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.

Two new corresponding structs (one for the kernel one for the user/BPF
program):

/* kernel version */
struct bpf_sock_ops_kern {
        struct sock *sk;
        __u32  op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
};

/* user version
 * Some fields are in network byte order reflecting the sock struct
 * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
 * convert them to host byte order.
 */
struct bpf_sock_ops {
        __u32 op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;     /* In network byte order */
        __u32 local_ip4;      /* In network byte order */
        __u32 remote_ip6[4];  /* In network byte order */
        __u32 local_ip6[4];   /* In network byte order */
        __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte horder */
};

Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.

The reply fields of the bpf_sockt_ops struct are there in case a bpf
program needs to return a value larger than an integer.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
arch arm: sunxi: Revert changes merged through net-next. 2017-07-01 14:18:13 -07:00
block block: provide bio_uninit() free freeing integrity/task associations 2017-06-28 15:30:13 -06:00
certs scripts/spelling.txt: add "intialise(d)" pattern and fix typo instances 2017-05-08 17:15:13 -07:00
crypto net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
Documentation NFC 4.13 pull request 2017-07-01 14:30:39 -07:00
drivers Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next 2017-07-01 15:57:29 -07:00
firmware firmware/Makefile: force recompilation if makefile changes 2017-05-08 17:15:10 -07:00
fs Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-06-30 12:43:08 -04:00
include bpf: BPF support for sock_ops 2017-07-01 16:15:13 -07:00
init Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-05-10 10:30:46 -07:00
ipc mm: introduce kv[mz]alloc helpers 2017-05-08 17:15:12 -07:00
kernel bpf: BPF support for sock_ops 2017-07-01 16:15:13 -07:00
lib Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-06-30 12:43:08 -04:00
mm slub: make sysfs file removal asynchronous 2017-06-23 16:15:55 -07:00
net bpf: BPF support for sock_ops 2017-07-01 16:15:13 -07:00
samples bpf: BPF support for sock_ops 2017-07-01 16:15:13 -07:00
scripts Kbuild fixes for v4.12 (2nd) 2017-06-24 16:18:00 -07:00
security Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-06-21 17:35:22 -04:00
sound ALSA: hda - Apply quirks to Broxton-T, too 2017-06-20 07:52:49 +02:00
tools Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-06-30 12:43:08 -04:00
usr initramfs: fix disabling of initramfs (and its compression) 2017-06-02 15:07:37 -07:00
virt KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages 2017-06-06 15:28:40 +02:00
.cocciconfig
.get_maintainer.ignore
.gitattributes .gitattributes: set git diff driver for C source code files 2016-10-07 18:46:30 -07:00
.gitignore kbuild: Add support to generate LLVM assembly files 2017-04-25 08:13:52 +09:00
.mailmap power supply and reset changes for the v4.12 series (part 2) 2017-05-12 12:02:21 -07:00
COPYING
CREDITS avr32: remove support for AVR32 architecture 2017-05-01 09:27:15 +02:00
Kbuild kbuild: Consolidate header generation from ASM offset information 2017-04-13 05:43:37 +09:00
Kconfig
MAINTAINERS NFC 4.13 pull request 2017-07-01 14:30:39 -07:00
Makefile Linux 4.12-rc7 2017-06-25 18:30:05 -07:00
README README: add a new README file, pointing to the Documentation/ 2016-10-24 08:12:35 -02:00

Linux kernel
============

This file was moved to Documentation/admin-guide/README.rst

Please notice that there are several guides for kernel developers and users.
These guides can be rendered in a number of formats, like HTML and PDF.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
See Documentation/00-INDEX for a list of what is contained in each file.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.