37065 Commits

Author SHA1 Message Date
David S. Miller
40451fd013 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next.
Basically, more incremental updates for br_netfilter from Florian
Westphal, small nf_tables updates (including one fix for rb-tree
locking) and small two-liner to add extra validation for the REJECT6
target.

More specifically, they are:

1) Use the conntrack status flags from br_netfilter to know that DNAT is
   happening. Patch for Florian Westphal.

2) nf_bridge->physoutdev == NULL already indicates that the traffic is
   bridged, so let's get rid of the BRNF_BRIDGED flag. Also from Florian.

3) Another patch to prepare voidization of seq_printf/seq_puts/seq_putc,
   from Joe Perches.

4) Consolidation of nf_tables_newtable() error path.

5) Kill nf_bridge_pad used by br_netfilter from ip_fragment(),
   from Florian Westphal.

6) Access rb-tree root node inside the lock and remove unnecessary
   locking from the get path (we already hold nfnl_lock there), from
   Patrick McHardy.

7) You cannot use a NFT_SET_ELEM_INTERVAL_END when the set doesn't
   support interval, also from Patrick.

8) Enforce IP6T_F_PROTO from ip6t_REJECT to make sure the core is
   actually restricting matches to TCP.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:02:46 -04:00
Alexander Drozdov
682f048bd4 af_packet: pass checksum validation status to the user
Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the
af_packet user that at least the transport header checksum
has been already validated.

For now, the flag may be set for incoming packets only.

Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:01:28 -04:00
Alexander Drozdov
68c2e5de36 af_packet: make tpacket_rcv to not set status value before run_filter
It is just an optimization. We don't need the value of status variable
if the packet is filtered.

Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:00:36 -04:00
Fan Du
c69736696c inet: fix double request socket freeing
Eric Hugne reported following error :

I'm hitting this warning on latest net-next when i try to SSH into a machine
with eth0 added to a bridge (but i think the problem is older than that)

Steps to reproduce:
node2 ~ # brctl addif br0 eth0
[  223.758785] device eth0 entered promiscuous mode
node2 ~ # ip link set br0 up
[  244.503614] br0: port 1(eth0) entered forwarding state
[  244.505108] br0: port 1(eth0) entered forwarding state
node2 ~ # [  251.160159] ------------[ cut here ]------------
[  251.160831] WARNING: CPU: 0 PID: 3 at include/net/request_sock.h:102 tcp_v4_err+0x6b1/0x720()
[  251.162077] Modules linked in:
[  251.162496] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc3+ #18
[  251.163334] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  251.164078]  ffffffff81a8365c ffff880038a6ba18 ffffffff8162ace4 0000000000009898
[  251.165084]  0000000000000000 ffff880038a6ba58 ffffffff8104da85 ffff88003fa437c0
[  251.166195]  ffff88003fa437c0 ffff88003fa74e00 ffff88003fa43bb8 ffff88003fad99a0
[  251.167203] Call Trace:
[  251.167533]  [<ffffffff8162ace4>] dump_stack+0x45/0x57
[  251.168206]  [<ffffffff8104da85>] warn_slowpath_common+0x85/0xc0
[  251.169239]  [<ffffffff8104db65>] warn_slowpath_null+0x15/0x20
[  251.170271]  [<ffffffff81559d51>] tcp_v4_err+0x6b1/0x720
[  251.171408]  [<ffffffff81630d03>] ? _raw_read_lock_irq+0x3/0x10
[  251.172589]  [<ffffffff81534e20>] ? inet_del_offload+0x40/0x40
[  251.173366]  [<ffffffff81569295>] icmp_socket_deliver+0x65/0xb0
[  251.174134]  [<ffffffff815693a2>] icmp_unreach+0xc2/0x280
[  251.174820]  [<ffffffff8156a82d>] icmp_rcv+0x2bd/0x3a0
[  251.175473]  [<ffffffff81534ea2>] ip_local_deliver_finish+0x82/0x1e0
[  251.176282]  [<ffffffff815354d8>] ip_local_deliver+0x88/0x90
[  251.177004]  [<ffffffff815350f0>] ip_rcv_finish+0xf0/0x310
[  251.177693]  [<ffffffff815357bc>] ip_rcv+0x2dc/0x390
[  251.178336]  [<ffffffff814f5da3>] __netif_receive_skb_core+0x713/0xa20
[  251.179170]  [<ffffffff814f7fca>] __netif_receive_skb+0x1a/0x80
[  251.179922]  [<ffffffff814f97d4>] process_backlog+0x94/0x120
[  251.180639]  [<ffffffff814f9612>] net_rx_action+0x1e2/0x310
[  251.181356]  [<ffffffff81051267>] __do_softirq+0xa7/0x290
[  251.182046]  [<ffffffff81051469>] run_ksoftirqd+0x19/0x30
[  251.182726]  [<ffffffff8106cc23>] smpboot_thread_fn+0x153/0x1d0
[  251.183485]  [<ffffffff8106cad0>] ? SyS_setgroups+0x130/0x130
[  251.184228]  [<ffffffff8106935e>] kthread+0xee/0x110
[  251.184871]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
[  251.185690]  [<ffffffff81631108>] ret_from_fork+0x58/0x90
[  251.186385]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
[  251.187216] ---[ end trace c947fc7b24e42ea1 ]---
[  259.542268] br0: port 1(eth0) entered forwarding state

Remove the double calls to reqsk_put()

[edumazet] :

I got confused because reqsk_timer_handler() _has_ to call
reqsk_put(req) after calling inet_csk_reqsk_queue_drop(), as
the timer handler holds a reference on req.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 21:40:48 -04:00
Arman Uguray
912098a630 Bluetooth: Add support for adv instance timeout
This patch implements support for the timeout parameter of the
Add Advertising command.

Signed-off-by: Arman Uguray <armansito@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-03-24 01:53:47 +01:00
Arman Uguray
4117ed70a5 Bluetooth: Add support for instance scan response
This patch implements setting the Scan Response data provided as part
of an advertising instance through the Add Advertising command.

Signed-off-by: Arman Uguray <armansito@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-03-24 01:53:47 +01:00
Arman Uguray
da929335f2 Bluetooth: Implement the Remove Advertising command
This patch implements the "Remove Advertising" mgmt command.

Signed-off-by: Arman Uguray <armansito@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-03-24 01:53:47 +01:00
Arman Uguray
24b4f38fc9 Bluetooth: Implement the Add Advertising command
This patch adds the most basic implementation for the
"Add Advertisement" command. All state updates between the
various HCI settings (POWERED, ADVERTISING, ADVERTISING_INSTANCE,
and LE_ENABLED) has been implemented. The command currently
supports only setting the advertising data fields, with no flags
and no scan response data.

Signed-off-by: Arman Uguray <armansito@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-03-24 01:53:46 +01:00
Arman Uguray
203fea0178 Bluetooth: Add data structure for advertising instance
This patch introduces a new data structure to represent advertising
instances that were added using the "Add Advertising" mgmt command.
Initially an hci_dev structure will support only one of these instances
at a time, so the current instance is simply stored as a direct member
of hci_dev.

Signed-off-by: Arman Uguray <armansito@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-03-24 01:53:46 +01:00
Alexander Duyck
b6f15f828d fib_trie: Fix regression in handling of inflate/halve failure
When I updated the code to address a possible null pointer dereference in
resize I ended up reverting an exception handling fix for the suffix length
in the event that inflate or halve failed.  This change is meant to correct
that by reverting the earlier fix and instead simply getting the parent
again after inflate has been completed to avoid the possible null pointer
issue.

Fixes: ddb4b9a13 ("fib_trie: Address possible NULL pointer dereference in resize")
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:58:32 -04:00
YOSHIFUJI Hideaki/吉藤英明
443b5991a7 net: Move the comment about unsettable socket-level options to default clause and update its reference.
We implement the SO_SNDLOWAT etc not to be settable and return
ENOPROTOOPT per 1003.1g 7.  Move the comment to appropriate
position and update the reference.

Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:54:34 -04:00
Eric Dumazet
52036a4305 ipv6: dccp: handle ICMP messages on DCCP_NEW_SYN_RECV request sockets
dccp_v6_err() can restrict lookups to ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
85645bab57 ipv4: dccp: handle ICMP messages on DCCP_NEW_SYN_RECV request sockets
dccp_v4_err() can restrict lookups to ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

New dccp_req_err() helper is exported so that we can use it in IPv6
in following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
2215089b22 ipv6: tcp: handle ICMP messages on TCP_NEW_SYN_RECV request sockets
tcp_v6_err() can restrict lookups to ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
26e3736090 ipv4: tcp: handle ICMP messages on TCP_NEW_SYN_RECV request sockets
tcp_v4_err() can restrict lookups to ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

New tcp_req_err() helper is exported so that we can use it in IPv6
in following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
b282705336 net: convert syn_wait_lock to a spinlock
This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually.

We hold syn_wait_lock for such small sections, that it makes no sense to use
a read/write lock. A spin lock is simply faster.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
8b929ab12f inet: remove some sk_listener dependencies
listener can be source of false sharing. request sock has some
useful information like : ireq->ir_iif, ireq->ir_num, ireq->ireq_net

This patch does not solve the major problem of having to read
sk->sk_protocol which is sharing a cache line with sk->sk_wmem_alloc.
(This same field is read later in ip_build_and_send_pkt())

One idea would be to move sk_protocol close to sk_family
(using 8 bits instead of 16 for sk_family seems enough)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
42cb80a235 inet: remove sk_listener parameter from syn_ack_timeout()
It is not needed, and req->sk_listener points to the listener anyway.
request_sock argument can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:25 -04:00
Eric Dumazet
2b41fab70f inet: cache listen_sock_qlen() and read rskq_defer_accept once
Cache listen_sock_qlen() to limit false sharing, and read
rskq_defer_accept once as it might change under us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:25 -04:00
Roopa Prabhu
558d51fa2f switchdev: fix stp update API to work with layered netdevices
make it same as the netdev_switch_port_bridge_setlink/dellink
api (ie traverse lowerdevs to get to the switch port).

removes "WARN_ON(!ops->ndo_switch_parent_id_get)" because
direct bridge ports can be stacked netdevices (like bonds
and team of switch ports) which may not implement this ndo.

v2 to v3:
	- remove changes to bond and team. Bring back the
	transparently following lowerdevs like i initially
	had for setlink/getlink
	(http://www.spinics.net/lists/netdev/msg313436.html)
	dave and scott feldman also seem to prefer it be that
	way and move to non-transparent way of doing things
	if we see a problem down the lane.

v3 to v4:
	- fix ret initialization

v4 to v5:
	- return err on first failure (scott feldman)

v5 to v6:
	- change variable name (err) and initialize to
	-EOPNOTSUPP (scott feldman).

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:44:56 -04:00
WANG Cong
08b4b8ea79 net: clear skb->priority when forwarding to another netns
skb->priority can be set for two purposes:

1) With respect to IP TOS field, which is computed by a mask.
Ususally used for priority qdisc's (pfifo, prio etc.), on TX
side (we only have ingress qdisc on RX side).

2) Used as a classid or flowid, works in the same way with tc
classid. What's more, this can even override the classid
of tc filters.

For case 1), it has been respected within its netns, I don't
see any point of keeping it for another netns, especially
when packets will be forwarded to Rx path (no matter from TX
path or RX path).

For case 2) we care, our applications run inside a netns,
and we classify the packets by our own filters outside,
If some application sets this priority, it could bypass
our filters, therefore clear it when moving out of a netns,
it makes no sense to bypass tc filters out of its netns.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:43:08 -04:00
tadeusz.struk@intel.com
0345f93138 net: socket: add support for async operations
Add support for async operations.

Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:41:36 -04:00
David S. Miller
c0e41fa76c Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Fix missing initialization of tuple structure in nfnetlink_cthelper
   to avoid mismatches when looking up to attach userspace helpers to
   flows, from Ian Wilson.

2) Fix potential crash in nft_hash when we hit -EAGAIN in
   nft_hash_walk(), from Herbert Xu.

3) We don't need to indicate the hook information to update the
   basechain default policy in nf_tables.

4) Restore tracing over nfnetlink_log due to recent rework to
   accomodate logging infrastructure into nf_tables.

5) Fix wrong IP6T_INV_PROTO check in xt_TPROXY.

6) Set IP6T_F_PROTO flag in nft_compat so we can use SYNPROXY6 and
   REJECT6 from xt over nftables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-22 16:57:07 -04:00
Pablo Neira Ayuso
e35158e401 netfilter: ip6t_REJECT: check for IP6T_F_PROTO
Make sure IP6T_F_PROTO is set to enforce layer 4 protocol matching from
the ip6_tables core.

Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-22 20:02:46 +01:00
Patrick McHardy
55df35d22f netfilter: nf_tables: reject NFT_SET_ELEM_INTERVAL_END flag for non-interval sets
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-22 19:50:35 +01:00
Patrick McHardy
16c45eda96 netfilter: nft_rbtree: fix locking
Fix a race condition and unnecessary locking:

* the root rb_node must only be accessed under the lock in nft_rbtree_lookup()
* the lock is not needed in lookup functions in netlink context

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-22 19:49:09 +01:00
Florian Westphal
8d0451638a netfilter: bridge: kill nf_bridge_pad
The br_netfilter frag output function calls skb_cow_head() so in
case it needs a larger headroom to e.g. re-add a previously stripped PPPOE
or VLAN header things will still work (at cost of reallocation).

We can then move nf_bridge_encap_header_len to br_netfilter.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-22 19:45:55 +01:00
Pablo Neira Ayuso
749177ccc7 netfilter: nft_compat: set IP6T_F_PROTO flag if protocol is set
ip6tables extensions check for this flag to restrict match/target to a
given protocol. Without this flag set, SYNPROXY6 returns an error.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
2015-03-22 19:32:05 +01:00
Johan Hedberg
baf880a968 Bluetooth: Fix memory leak in le_scan_disable_work_complete()
The hci_request in le_scan_disable_work_complete() was being initialized
in a general context but only used in a specific branch in the function
(when simultaneous discovery is not supported). This patch moves the
usage to be limited to the branch where hci_req_run() is actually
called.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-03-22 08:03:54 +01:00
Herbert Xu
8f2ddaac30 netlink: Remove netlink_compare_arg.trailer
Instead of computing the offset from trailer, this patch computes
netlink_compare_arg_len from the offset of portid and then adds 4
to it.  This allows trailer to be removed.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-21 00:16:39 -04:00
YOSHIFUJI Hideaki/吉藤英明
8da86466b8 net: neighbour: Add mcast_resolicit to configure the number of multicast resolicitations in PROBE state.
We send unicast neighbor (ARP or NDP) solicitations ucast_probes
times in PROBE state.  Zhu Yanjun reported that some implementation
does not reply against them and the entry will become FAILED, which
is undesirable.

We had been dealt with such nodes by sending multicast probes mcast_
solicit times after unicast probes in PROBE state.  In 2003, I made
a change not to send them to improve compatibility with IPv6 NDP.

Let's introduce per-protocol per-interface sysctl knob "mcast_
reprobe" to configure the number of multicast (re)solicitation for
reconfirmation in PROBE state.  The default is 0, since we have
been doing so for 10+ years.

Reported-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
CC: Ulf Samuelsson <ulf.samuelsson@ericsson.com>
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 21:47:40 -04:00
Mathieu Olivari
c6f15070e7 net: dsa: make NET_DSA manually selectable from the config
Change bd76a116707bd2381da36cf7c3183df11293f1d6 made all DSA drivers
depend on NET_DSA rather than selecting them. However, as the only way
to select this option was to actually select a driver, it made DSA
impossible to enable at all.

This patch adds an explicit entry which the user will have to enable
prior selecting a driver.

Signed-off-by: Mathieu Olivari <mathieu@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 21:37:58 -04:00
Eric Dumazet
d3593b5cef Revert "selinux: add a skb_owned_by() hook"
This reverts commit ca10b9e9a8ca7342ee07065289cbe74ac128c169.

No longer needed after commit eb8895debe1baba41fcb62c78a16f0c63c21662a
("tcp: tcp_make_synack() should use sock_wmalloc")

When under SYNFLOOD, we build lot of SYNACK and hit false sharing
because of multiple modifications done on sk_listener->sk_wmem_alloc

Since tcp_make_synack() uses sock_wmalloc(), there is no need
to call skb_set_owner_w() again, as this adds two atomic operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 21:36:53 -04:00
Daniel Borkmann
a8cb5f556b act_bpf: add initial eBPF support for actions
This work extends the "classic" BPF programmable tc action by extending
its scope also to native eBPF code!

Together with commit e2e9b6541dd4 ("cls_bpf: add initial eBPF support
for programmable classifiers") this adds the facility to implement fully
flexible classifier and actions for tc that can be implemented in a C
subset in user space, "safely" loaded into the kernel, and being run in
native speed when JITed.

Also, since eBPF maps can be shared between eBPF programs, it offers the
possibility that cls_bpf and act_bpf can share data 1) between themselves
and 2) between user space applications. That means that, f.e. customized
runtime statistics can be collected in user space, but also more importantly
classifier and action behaviour could be altered based on map input from
the user space application.

For the remaining details on the workflow and integration, see the cls_bpf
commit e2e9b6541dd4. Preliminary iproute2 part can be found under [1].

  [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf-act

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 19:10:44 -04:00
Daniel Borkmann
94caee8c31 ebpf: add sched_act_type and map it to sk_filter's verifier ops
In order to prepare eBPF support for tc action, we need to add
sched_act_type, so that the eBPF verifier is aware of what helper
function act_bpf may use, that it can load skb data and read out
currently available skb fields.

This is bascially analogous to 96be4325f443 ("ebpf: add sched_cls_type
and map it to sk_filter's verifier ops").

BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT need to be
separate since both will have a different set of functionality in
future (classifier vs action), thus we won't run into ABI troubles
when the point in time comes to diverge functionality from the
classifier.

The future plan for act_bpf would be that it will be able to write
into skb->data and alter selected fields mirrored in struct __sk_buff.

For an initial support, it's sufficient to map it to sk_filter_ops.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 19:10:44 -04:00
David S. Miller
0fa74a4be4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	net/core/sysctl_net_core.c
	net/ipv4/inet_diag.c

The be_main.c conflict resolution was really tricky.  The conflict
hunks generated by GIT were very unhelpful, to say the least.  It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().

So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged.  And this worked beautifully.

The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 18:51:09 -04:00
Al Viro
4de930efc2 net: validate the range we feed to iov_iter_init() in sys_sendto/sys_recvfrom
Cc: stable@vger.kernel.org # v3.19
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 16:38:06 -04:00
Catalin Marinas
91edd096e2 net: compat: Update get_compat_msghdr() to match copy_msghdr_from_user() behaviour
Commit db31c55a6fb2 (net: clamp ->msg_namelen instead of returning an
error) introduced the clamping of msg_namelen when the unsigned value
was larger than sizeof(struct sockaddr_storage). This caused a
msg_namelen of -1 to be valid. The native code was subsequently fixed by
commit dbb490b96584 (net: socket: error on a negative msg_namelen).

In addition, the native code sets msg_namelen to 0 when msg_name is
NULL. This was done in commit (6a2a2b3ae075 net:socket: set msg_namelen
to 0 if msg_name is passed as NULL in msghdr struct from userland) and
subsequently updated by 08adb7dabd48 (fold verify_iovec() into
copy_msghdr_from_user()).

This patch brings the get_compat_msghdr() in line with
copy_msghdr_from_user().

Fixes: db31c55a6fb2 (net: clamp ->msg_namelen instead of returning an error)
Cc: David S. Miller <davem@davemloft.net>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 16:31:09 -04:00
Herbert Xu
6cca7289d5 tipc: Use inlined rhashtable interface
This patch converts tipc to the inlined rhashtable interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 16:16:24 -04:00
Herbert Xu
fa3773211e netfilter: Convert nft_hash to inlined rhashtable
This patch converts nft_hash to the inlined rhashtable interface.

This patch also replaces the call to rhashtable_lookup_compare with
a straight rhashtable_lookup_fast because it's simply doing a memcmp
(in fact nft_hash_lookup already uses memcmp instead of nft_data_cmp).

Furthermore, the compare function is only meant to compare, it is not
supposed to have side-effects.  The current side-effect code can
simply be moved into the nft_hash_get.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 16:16:24 -04:00
Herbert Xu
c428ecd1a2 netlink: Move namespace into hash key
Currently the name space is a de facto key because it has to match
before we find an object in the hash table.  However, it isn't in
the hash value so all objects from different name spaces with the
same port ID hash to the same bucket.

This is bad as the number of name spaces is unbounded.

This patch fixes this by using the namespace when doing the hash.

Because the namespace field doesn't lie next to the portid field
in the netlink socket, this patch switches over to the rhashtable
interface without a fixed key.

This patch also uses the new inlined rhashtable interface where
possible.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 16:16:24 -04:00
Daniel Borkmann
0b8c707ddf ebpf, filter: do not convert skb->protocol to host endianess during runtime
Commit c24973957975 ("bpf: allow BPF programs access 'protocol' and 'vlan_tci'
fields") has added support for accessing protocol, vlan_present and vlan_tci
into the skb offset map.

As referenced in the below discussion, accessing skb->protocol from an eBPF
program should be converted without handling endianess.

The reason for this is that an eBPF program could simply do a check more
naturally, by f.e. testing skb->protocol == htons(ETH_P_IP), where the LLVM
compiler resolves htons() against a constant automatically during compilation
time, as opposed to an otherwise needed run time conversion.

After all, the way of programming both from a user perspective differs quite
a lot, i.e. bpf_asm ["ld proto"] versus a C subset/LLVM.

Reference: https://patchwork.ozlabs.org/patch/450819/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 15:24:26 -04:00
Johannes Berg
5d8325ecb9 cfg80211: add vlan to station add/change tracing
This helps debug issues with VLAN modifications that are otherwise
not really visible in any tracing/debugging.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-03-20 19:56:41 +01:00
Jakub Pawlowski
b55d1abf56 Bluetooth: Expose quirks through debugfs
This patch expose controller quirks through debugfs. It would be
useful for BlueZ tests using vhci. Currently there is no way to
test quirk dependent behaviour. It might be also useful for manual
testing.

Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-03-20 19:47:01 +01:00
Marcelo Ricardo Leitner
c4a6853d8f ipv6: invert join/leave anycast rtnl/socket locking order
Commit baf606d9c9b1 ("ipv4,ipv6: grab rtnl before locking the socket")
missed to update two setsockopt options, IPV6_JOIN_ANYCAST and
IPV6_LEAVE_ANYCAST, causing a lock inverstion regarding to the updated ones.

As ipv6_sock_ac_join and ipv6_sock_ac_leave are only called from
do_ipv6_setsockopt, we are good to just move the rtnl lock upper.

Fixes: baf606d9c9b1 ("ipv4,ipv6: grab rtnl before locking the socket")
Reported-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 13:32:38 -04:00
Josh Hunt
d22e153718 tcp: fix tcp fin memory accounting
tcp_send_fin() does not account for the memory it allocates properly, so
sk_forward_alloc can be negative in cases where we've sent a FIN:

ss example output (ss -amn | grep -B1 f4294):
tcp    FIN-WAIT-1 0      1            192.168.0.1:45520         192.0.2.1:8080
	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
Acked-by: Eric Dumazet <edumazet@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 13:18:52 -04:00
Steven Barth
73ba57bfae ipv6: fix backtracking for throw routes
for throw routes to trigger evaluation of other policy rules
EAGAIN needs to be propagated up to fib_rules_lookup
similar to how its done for IPv4

A simple testcase for verification is:

ip -6 rule add lookup 33333 priority 33333
ip -6 route add throw 2001:db8::1
ip -6 route add 2001:db8::1 via fe80::1 dev wlan0 table 33333
ip route get 2001:db8::1

Signed-off-by: Steven Barth <cyrus@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:57:23 -04:00
Sabrina Dubroca
8e199dfd82 ipv6: call ipv6_proxy_select_ident instead of ipv6_select_ident in udp6_ufo_fragment
Matt Grant reported frequent crashes in ipv6_select_ident when
udp6_ufo_fragment is called from openvswitch on a skb that doesn't
have a dst_entry set.

ipv6_proxy_select_ident generates the frag_id without using the dst
associated with the skb.  This approach was suggested by Vladislav
Yasevich.

Fixes: 0508c07f5e0c ("ipv6: Select fragment id during UFO segmentation if not set.")
Cc: Vladislav Yasevich <vyasevic@redhat.com>
Reported-by: Matt Grant <matt@mattgrant.net.nz>
Tested-by: Matt Grant <matt@mattgrant.net.nz>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:56:11 -04:00
Eric Dumazet
becb74f0ac net: increase sk_[max_]ack_backlog
sk_ack_backlog & sk_max_ack_backlog were 16bit fields, meaning
listen() backlog was limited to 65535.

It is time to increase the width to allow much bigger backlog,
if admins change /proc/sys/net/core/somaxconn &
/proc/sys/net/ipv4/tcp_max_syn_backlog default values.

Tested:

echo 5000000 >/proc/sys/net/core/somaxconn
echo 5000000 >/proc/sys/net/ipv4/tcp_max_syn_backlog

Ran a SYNFLOOD test against a listener using listen(fd, 5000000)

myhost~# grep request_sock_TCP /proc/slabinfo
request_sock_TCP  4185642 4411940    304   13    1 : tunables   54   27    8 : slabdata 339380 339380      0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:40:25 -04:00
Eric Dumazet
fa76ce7328 inet: get rid of central tcp/dccp listener timer
One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:40:25 -04:00