38176 Commits

Jozsef Kadlecsik
c4c997839c netfilter: ipset: Fix parallel resizing and listing of the same set
When elements are added to a hash:* type of set and a resize is triggered,
a parallel listing could start to list the original set (before resizing)
and "continue" with listing the new set. Fix it by reference counting and
by using the original hash table for the listing. As a consequence, the
original hash table may be destroyed from either the resizing or the
listing function.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14 10:40:15 +02:00
Jozsef Kadlecsik
f690cbaed9 netfilter: ipset: Fix cidr handling for hash:*net* types
Commit "Simplify cidr handling for hash:*net* types" broke the cidr
handling for the hash:*net* types when the sets were used by the SET
target: entries with invalid cidr values were added to the sets.
Reported by Jonathan Johnson.

Testsuite entry is added to verify the fix.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14 10:40:14 +02:00
Sergey Popovich
aff227581e netfilter: ipset: Check CIDR value only when attribute is given
There is no reason to check the CIDR value unless the attribute
specifying the CIDR is given.

Initialize the cidr array in the element structure at its declaration,
giving the compiler more freedom to optimize the initialization right
before the element structure is used.

Remove the local variables cidr and cidr2 for the netnet and netportnet
hashes, as we do not use a packed cidr value for these set types and
can store the value directly in e.cidr[].

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14 10:40:14 +02:00
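
A minimal sketch of the resulting pattern, assuming a hash:net,net-style
element with a two-slot cidr[] array (the surrounding context is
illustrative): the CIDR is validated only when IPSET_ATTR_CIDR is actually
present, otherwise the initializer stands.

    struct hash_netnet4_elem e = { .cidr = { HOST_MASK, HOST_MASK } };

    if (tb[IPSET_ATTR_CIDR]) {
            e.cidr[0] = nla_get_u8(tb[IPSET_ATTR_CIDR]);
            if (!e.cidr[0] || e.cidr[0] > HOST_MASK)
                    return -IPSET_ERR_INVALID_CIDR;
    }
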
Sergey Popovich
a212e08e8e netfilter: ipset: Make sure we always return line number on batch
Even if we return the generic IPSET_ERR_PROTOCOL, it is a good idea
to return the line number if we are called in batch mode.

Moreover, we do not always exit with IPSET_ERR_PROTOCOL. For
example, hash:ip,port,net may return IPSET_ERR_HASH_RANGE_UNSUPPORTED
or IPSET_ERR_INVALID_CIDR.

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14 10:40:13 +02:00
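
The batch line number travels in the IPSET_ATTR_LINENO netlink attribute;
extracting it before any validation (roughly as below, a sketch) ensures it
is reported even when the function bails out early with a protocol or range
error.

    if (tb[IPSET_ATTR_LINENO])
            *lineno = nla_get_u32(tb[IPSET_ATTR_LINENO]);
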
Sergey Popovich
2c227f278a netfilter: ipset: Permit CIDR equal to the host address CIDR in IPv6
Permit userspace to supply a CIDR length equal to the host address CIDR
length in the netlink message. Prohibit any other CIDR length for the
IPv6 variant of the set.

Also return -IPSET_ERR_HASH_RANGE_UNSUPPORTED instead of the generic
-IPSET_ERR_PROTOCOL in the IPv6 variant of hash:ip,port,net when the
IPSET_ATTR_IP_TO attribute is given.

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14 10:40:13 +02:00
Sergey Popovich
7dd37bc8e6 netfilter: ipset: Check extensions attributes before getting extensions.
Perform all extension attribute checks within ip_set_get_extensions()
and reduce the amount of duplicated code.

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14 10:40:13 +02:00
Sergey Popovich
edda079174 netfilter: ipset: Use SET_WITH_*() helpers to test set extensions
Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14 10:40:12 +02:00
David S. Miller
25c43bf13b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-06-13 23:56:52 -07:00
Eric Dumazet
a2f0fad32b tcp: tcp_v6_connect() cleanup
Remove dead code from tcp_v6_connect()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-12 21:59:25 -07:00
Eric Dumazet
1e98a0f08a flow_dissector: fix ipv6 dst, hop-by-hop and routing ext hdrs
__skb_header_pointer() returns a pointer that must be checked.

Fixes an infinite loop reported by Alexei, and adds __must_check to
catch these errors earlier.

Fixes: 6a74fcf426f5 ("flow_dissector: add support for dst, hop-by-hop and routing ext hdrs")
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-12 21:58:49 -07:00
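
The fix boils down to never trusting the returned pointer blindly; a hedged
sketch of the check as it appears in flow dissection (variable names are
illustrative):

    const struct ipv6_opt_hdr *opthdr;
    struct ipv6_opt_hdr _opthdr;

    opthdr = __skb_header_pointer(skb, nhoff, sizeof(_opthdr),
                                  data, hlen, &_opthdr);
    if (!opthdr)
            return false;   /* truncated header: stop dissecting */
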
Vincent Cuissard
34ac496641 NFC: nci: remove current SLEEP mode management
NCI deactivate management was modified to support all NCI
deactivation types, but the APIs are not all ready for it yet.

With the current code, when neard asks to deactivate the tag, it sends
a SLEEP deactivate but nobody then sends an IDLE deactivate. The IDLE
deactivate is mandatory, since the NFC controller can only be unlocked
by the DH.

Signed-off-by: Vincent Cuissard <cuissard@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-06-13 00:08:09 +02:00
Tom Herbert
6a74fcf426 flow_dissector: add support for dst, hop-by-hop and routing ext hdrs
If dst, hop-by-hop or routing extension headers are present, determine
the length of the options and skip over them in flow dissection.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-12 14:24:28 -07:00
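
For these three extension header types the length field counts 8-octet
units excluding the first 8 octets, so skipping one is roughly (a sketch;
the surrounding dissection loop is omitted):

    /* hdrlen is in units of 8 octets, not counting the first 8 octets */
    nhoff += (opthdr->hdrlen + 1) << 3;
    nexthdr = opthdr->nexthdr;
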
Tom Herbert
611d23c559 flow_dissector: Fix MPLS entropy label handling in flow dissector
We need to shift after masking to get the label value for comparison.

Fixes: b3baa0fbd02a1a9d493d8 ("mpls: Add MPLS entropy label in flow_keys")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-12 14:24:27 -07:00
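
In other words, the 20-bit label has to be masked out of the label stack
entry and then shifted down before it can be compared against the entropy
label value; roughly (a sketch, assuming the LSE has already been fetched
into a struct mpls_label and using the UAPI MPLS constants):

    u32 entry = ntohl(hdr->entry);
    u32 label = (entry & MPLS_LS_LABEL_MASK) >> MPLS_LS_LABEL_SHIFT;

    if (label == MPLS_LABEL_ENTROPY) {
            /* use the entropy label as flow hash input */
    }
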
Florian Westphal
b60f2f3d65 net: ipv4: un-inline ip_finish_output2
text    data     bss     dec     hex filename
old: 16527      44       0   16571    40bb net/ipv4/ip_output.o
new: 14935      44       0   14979    3a83 net/ipv4/ip_output.o

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-12 14:19:17 -07:00
Marcelo Ricardo Leitner
ae36806a62 sctp: allow authenticating DATA chunks that are bundled with COOKIE_ECHO
Currently, we can ask to authenticate DATA chunks and we can send DATA
chunks in the same packet as COOKIE_ECHO, but if you try to combine
both, the DATA chunk will be sent unauthenticated and the peer won't
accept it, leading to a communication failure.

This happens because even though the data was queued after it was
requested to authenticate DATA chunks, it was also queued before we
could know that the remote peer can handle authentication, so
sctp_auth_send_cid() returns false.

The fix: whenever we set up an active key, re-check the send queue for
chunks that should now be authenticated. As a result, such a packet will
now contain COOKIE_ECHO + AUTH + DATA chunks, in that order.

Reported-by: Liu Wei <weliu@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-12 14:18:20 -07:00
Matan Barak
8e37210b38 IB/core: Change ib_create_cq to use struct ib_cq_init_attr
Currently, ib_create_cq uses cqe and comp_vector instead
of the extensible ib_cq_init_attr struct.

Earlier patches already changed the vendors to work with
ib_cq_init_attr. This patch changes the consumers too.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-06-12 14:49:10 -04:00
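
A hedged sketch of what a consumer-side call looks like after the
conversion (the variable names and CQ parameters here are illustrative):

    struct ib_cq_init_attr cq_attr = {
            .cqe         = num_entries,
            .comp_vector = 0,
    };
    struct ib_cq *cq;

    cq = ib_create_cq(device, comp_handler, event_handler,
                      cq_context, &cq_attr);
    if (IS_ERR(cq))
            return PTR_ERR(cq);
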
Chuck Lever
40c6ed0c8a xprtrdma: Reduce per-transport MR allocation
Reduce resource consumption per-transport to make way for increasing
the credit limit and maximum r/wsize. Pre-allocate fewer MRs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
acb9da7a57 xprtrdma: Stack relief in fmr_op_map()
fmr_op_map() declares a 64 element array of u64 in automatic
storage. This is 512 bytes (8 * 64) on the stack.

Instead, when FMR memory registration is in use, pre-allocate a
physaddr array for each rpcrdma_mw.

This is a prerequisite for increasing the r/wsize maximum for
FMR on platforms with 4KB pages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
58d1dcf5a8 xprtrdma: Split rb_lock
/proc/lock_stat showed contention between rpcrdma_buffer_get/put
and the MR allocation functions during I/O intensive workloads.

Now that MRs are no longer allocated in rpcrdma_buffer_get(),
there's no reason the rb_mws list has to be managed using the
same lock as the send/receive buffers. Split that lock. The
new lock does not need to disable interrupts because buffer
get/put is never called in an interrupt context.

struct rpcrdma_buffer is re-arranged to ensure rb_mwlock and rb_mws
are always in a different cacheline than rb_lock and the buffer
pointers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
7e53df111b xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy
Clean up: This field is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
3269a94b62 xprtrdma: Remove ->ro_reset
An RPC can exit at any time. When it does so, xprt_rdma_free() is
called, and it calls ->op_unmap().

If ->ro_reset() is running due to a transport disconnect, the two
methods can race while processing the same rpcrdma_mw. The results
are unpredictable.

Because of this, in previous patches I've altered ->ro_map() to
handle MR reset. ->ro_reset() is no longer needed and can be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
06b00880b0 xprtrdma: Remove unused LOCAL_INV recovery logic
Clean up: Remove functions no longer used to recover broken FRMRs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
c14d86e591 xprtrdma: Acquire MRs in rpcrdma_register_external()
Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because most modern adapters
can transfer 100s of KBs of payload using just a single MR.

Instead, acquire MRs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
registration fails") is reverted: There is now a valid case where
registration can fail (with -ENOMEM) but the QP is still in RTS.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:37 -04:00
Chuck Lever
951e721ca0 xprtrdma: Introduce an FRMR recovery workqueue
After a transport disconnect, FRMRs can be left in an undetermined
state. In particular, the MR's rkey is no good.

Currently, FRMRs are fixed up by the transport connect worker, but
that can race with ->ro_unmap if an RPC happens to exit while the
transport connect worker is running.

A better way of dealing with broken FRMRs is to detect them before
they are re-used by ->ro_map. Such FRMRs are either already invalid
or are owned by the sending RPC, and thus no race with ->ro_unmap
is possible.

Introduce a mechanism for handing broken FRMRs to a workqueue to be
reset in a context that is appropriate for allocating resources
(i.e., an ib_alloc_fast_reg_mr() API call).

This mechanism is not yet used, but will be in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
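
A hedged sketch of the hand-off (the workqueue, the work-item field and the
worker names are assumptions for illustration, not necessarily the
identifiers used in the patch): a broken FRMR is queued to a process-context
worker, where an MR re-allocation such as ib_alloc_fast_reg_mr() is allowed
to sleep.

    /* created with alloc_workqueue() at module init (not shown) */
    static struct workqueue_struct *frwr_recovery_wq;

    /* runs in process context; safe to re-allocate the MR here */
    static void frwr_recovery_worker(struct work_struct *work)
    {
            struct rpcrdma_mw *mw = container_of(work, struct rpcrdma_mw,
                                                 mw_work);
            /* reset/re-allocate the FRMR backing this mw */
    }

    static void frwr_queue_recovery(struct rpcrdma_mw *mw)
    {
            INIT_WORK(&mw->mw_work, frwr_recovery_worker);
            queue_work(frwr_recovery_wq, &mw->mw_work);
    }
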
Chuck Lever
fc7fbb59e7 xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()
Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because FMR mode can
transfer up to a 1MB payload using just a single ib_fmr.

Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
Chuck Lever
346aa66b2a xprtrdma: Introduce helpers for allocating MWs
We eventually want to handle allocating MWs one at a time, as
needed, instead of grabbing 64 and throwing them at each RPC in the
pipeline.

Add a helper for grabbing an MW off rb_mws, and a helper for
returning an MW to rb_mws. These will be used in a subsequent patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
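
A hedged sketch of the two helpers (field names such as rx_buf, rb_mwlock,
rb_mws and mw_list follow the xprtrdma structures but should be treated as
illustrative here):

    struct rpcrdma_mw *
    rpcrdma_get_mw(struct rpcrdma_xprt *r_xprt)
    {
            struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
            struct rpcrdma_mw *mw = NULL;

            spin_lock(&buf->rb_mwlock);
            if (!list_empty(&buf->rb_mws)) {
                    mw = list_first_entry(&buf->rb_mws,
                                          struct rpcrdma_mw, mw_list);
                    list_del_init(&mw->mw_list);
            }
            spin_unlock(&buf->rb_mwlock);
            return mw;
    }

    void
    rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
    {
            struct rpcrdma_buffer *buf = &r_xprt->rx_buf;

            spin_lock(&buf->rb_mwlock);
            list_add_tail(&mw->mw_list, &buf->rb_mws);
            spin_unlock(&buf->rb_mwlock);
    }
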
Chuck Lever
89e0d11258 xprtrdma: Use ib_device pointer safely
The connect worker can replace ri_id, but prevents ri_id->device
from changing during the lifetime of a transport instance. The old
ID is kept around until a new ID is created and the ->device is
confirmed to be the same.

Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
The cached copy can be used safely in code that does not serialize
with the connect worker.

Other code can use it to save an extra address generation (one
pointer dereference instead of two).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
Chuck Lever
494ae30d2a xprtrdma: Remove rr_func
A posted rpcrdma_rep never has rr_func set to anything but
rpcrdma_reply_handler.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
Chuck Lever
fed171b35c xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt
Clean up: Instead of carrying a pointer to the buffer pool and
the rpc_xprt, carry a pointer to the controlling rpcrdma_xprt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
Chuck Lever
6d44698d54 xprtrdma: Warn when there are orphaned IB objects
WARN during transport destruction if ib_dealloc_pd() fails. This is
a sign that xprtrdma orphaned one or more RDMA API objects at some
point, which can pin lower layer kernel modules and cause shutdown
to hang.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-06-12 13:10:36 -04:00
Florian Westphal
482cfc3185 netfilter: xtables: avoid percpu ruleset duplication
We store the rule blob per (possible) cpu. Unfortunately this means we can
waste a lot of memory on big SMP machines. The ipt_entry structure ('rule
head') is 112 bytes, so e.g. with maxcpu=64 a single rule eats
close to 8k of RAM.

Since the previous patch made the counters percpu, it appears there is
nothing left in the rule blob that needs to be percpu.

On my test system (144 possible cpus, 400k dummy rules) this
change saves close to 9 gigabytes of RAM.

Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:27:10 +02:00
Florian Westphal
71ae0dff02 netfilter: xtables: use percpu rule counters
The binary arp/ip/ip6tables ruleset is stored per cpu.

The only reason left why we need the percpu duplication is the rule
counters embedded into ipt_entry et al. -- since each cpu has its own copy
of the rules, all counters can be lockless.

The downside is that the more cpus are supported, the more memory is
required. Rules are not just duplicated per online cpu but for each
possible cpu, i.e. if maxcpu is 144, then a rule is duplicated 144 times,
not just for the e.g. 64 cores actually present.

To save some memory and also improve utilization of shared caches it
would be preferable to only store the rule blob once.

So we first need to separate the counters from the rule blob.

Instead of using entry->counters, allocate the counters percpu and store
the percpu address in entry->counters.pcnt on CONFIG_SMP.

This change makes no sense as-is; it is merely an intermediate step to
remove the percpu duplication of the rule set in a followup patch.

Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:27:09 +02:00
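
A hedged sketch of the counter split (the helper name is illustrative):
on SMP the xt_counters are allocated percpu and the percpu address is
stashed in entry->counters.pcnt, which makes the fast-path increment
lockless via this_cpu_ptr().

    static inline int xt_counters_alloc_percpu(struct ipt_entry *e)
    {
    #ifdef CONFIG_SMP
            struct xt_counters __percpu *pcnt;

            pcnt = alloc_percpu(struct xt_counters);
            if (!pcnt)
                    return -ENOMEM;
            e->counters.pcnt = (__force unsigned long)pcnt;
    #endif
            return 0;
    }

    /* the per-packet update is then lockless: */
    struct xt_counters *counters;

    counters = this_cpu_ptr((struct xt_counters __percpu *)
                            (unsigned long)e->counters.pcnt);
    ADD_COUNTER(*counters, skb->len, 1);
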
Florian Westphal
d7b5974215 netfilter: bridge: restore vlan tag when refragmenting
If bridge netfilter is used with both
bridge-nf-call-iptables and bridge-nf-filter-vlan-tagged enabled,
then IP fragments in VLAN frames are sent without the vlan header.

This has never worked reliably. It turns out this relied on pre-3.5
behaviour where the skb frag_list was used to store ip fragments;
ip_fragment() then re-used these skbs.

But since commit 3cc4949269e01f39443d0fcfffb5bc6b47878d45
("ipv4: use skb coalescing in defragmentation") this is no longer
the case. ip_do_fragment now needs to allocate new skbs, but these
don't contain the vlan tag information anymore.

Fix it by storing the vlan information of the reassembled skb in the
br netfilter percpu frag area, and restoring it for each of the
fragments.

Fixes: 3cc4949269e01f3 ("ipv4: use skb coalescing in defragmentation")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:16:55 +02:00
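
A hedged sketch of the idea (the per-cpu frag-area struct and its fields
are illustrative): remember the reassembled skb's vlan tag before handing
it to ip_do_fragment(), and put the tag back on every fragment before
transmission.

    /* before fragmentation, in the per-cpu brnf frag data: */
    if (skb_vlan_tag_present(skb)) {
            data->vlan_proto = skb->vlan_proto;
            data->vlan_tci   = skb_vlan_tag_get(skb);
    } else {
            data->vlan_proto = 0;
    }

    /* for each fragment handed back by ip_do_fragment(): */
    if (data->vlan_proto)
            __vlan_hwaccel_put_tag(frag, data->vlan_proto, data->vlan_tci);
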
Florian Westphal
33b1f31392 net: ip_fragment: remove BRIDGE_NETFILTER mtu special handling
Since commit d6b915e29f4adea9
("ip_fragment: don't forward defragmented DF packet") the largest
fragment size is available in the IPCB.

Therefore we no longer need to care about the 'encapsulation'
overhead of stripped PPPoE/VLAN headers, since ip_do_fragment
doesn't use the device mtu in such cases.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:16:46 +02:00
Bernhard Thaler
efb6de9b4b netfilter: bridge: forward IPv6 fragmented packets
IPv6 fragmented packets are not forwarded on an ethernet bridge
with netfilter ip6_tables loaded. Steps to reproduce, e.g.:

1) create a simple bridge like this

        modprobe br_netfilter
        brctl addbr br0
        brctl addif br0 eth0
        brctl addif br0 eth2
        ifconfig eth0 up
        ifconfig eth2 up
        ifconfig br0 up

2) place a host with an IPv6 address on each side of the bridge

        set IPv6 address on host A:
        ip -6 addr add fd01:2345:6789:1::1/64 dev eth0

        set IPv6 address on host B:
        ip -6 addr add fd01:2345:6789:1::2/64 dev eth0

3) run a simple ping command on host A with packets > MTU

        ping6 -s 4000 fd01:2345:6789:1::2

4) wait some time and run e.g. "ip6tables -t nat -nvL" on the bridge

IPv6 fragmented packets traverse the bridge cleanly until somebody runs
"ip6tables -t nat -nvL". As soon as it is run (and the netfilter modules
are loaded), IPv6 fragmented packets no longer traverse the bridge (you
see no more responses in ping's output).

After applying this patch IPv6 fragmented packets traverse the bridge
cleanly in above scenario.

Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
[pablo@netfilter.org: small changes to br_nf_dev_queue_xmit]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:10:12 +02:00
Bernhard Thaler
a4611d3b74 netfilter: bridge: re-order check_hbh_len()
Prepare check_hbh_len() to be called from newly introduced
br_validate_ipv6() in next commit.

Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:09:46 +02:00
Bernhard Thaler
77d574e728 netfilter: bridge: rename br_parse_ip_options
br_parse_ip_options() does not parse any IP options; it validates IP
packets as a whole, so the function name is misleading.

Rename br_parse_ip_options() to br_validate_ipv4() and remove unneeded
comments.

Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:09:17 +02:00
Bernhard Thaler
411ffb4fde netfilter: bridge: refactor frag_max_size
Currently frag_max_size is a member of br_input_skb_cb and is copied back
and forth using IPCB(skb) and BR_INPUT_SKB_CB(skb) each time it is changed
or used.

Attach frag_max_size to nf_bridge_info and set its value in the
pre_routing and forward functions. Use its value in the forward and xmit
functions.

Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:08:51 +02:00
Bernhard Thaler
72b31f7271 netfilter: bridge: detect NAT66 correctly and change MAC address
IPv4 iptables allows one to REDIRECT/DNAT/SNAT any traffic over a bridge.

e.g. REDIRECT
$ sysctl -w net.bridge.bridge-nf-call-iptables=1
$ iptables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \
  -j REDIRECT --to-ports 81

This does not work with ip6tables on a bridge in a NAT66 scenario
because the REDIRECT/DNAT/SNAT is not correctly detected.

The bridge pre-routing (finish) netfilter hook has to check for a possible
redirect and then fix the destination mac address. This allows the use of
ip6tables rules for local REDIRECT/DNAT/SNAT similar to the IPv4
iptables version.

e.g. REDIRECT
$ sysctl -w net.bridge.bridge-nf-call-ip6tables=1
$ ip6tables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \
  -j REDIRECT --to-ports 81

This patch makes it possible to use IPv6 NAT66 on a bridge. It was tested
on a bridge with two interfaces using SNAT/DNAT NAT66 rules.

Reported-by: Artie Hamilton <artiemhamilton@yahoo.com>
Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
[bernhard.thaler@wvnet.at: rebased, add indirect call to ip6_route_input()]
[bernhard.thaler@wvnet.at: rebased, split into separate patches]
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:08:07 +02:00
Bernhard Thaler
8cae308d2b netfilter: bridge: re-order br_nf_pre_routing_finish_ipv6()
Put br_nf_pre_routing_finish_ipv6() after daddr_was_changed() and
br_nf_pre_routing_finish_bridge() to prepare for calling these functions
from there.

Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:07:56 +02:00
Bernhard Thaler
d39a33ed9b netfilter: bridge: refactor clearing BRNF_NF_BRIDGE_PREROUTING
Use a binary AND with the complement of BRNF_NF_BRIDGE_PREROUTING to
unset the bit in nf_bridge->mask.

Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:07:53 +02:00
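
That is, the clearing becomes a single mask operation, roughly:

    nf_bridge->mask &= ~BRNF_NF_BRIDGE_PREROUTING;
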
Marcelo Ricardo Leitner
779668450a netfilter: conntrack: warn the user if there is a better helper to use
After db29a9508a92 ("netfilter: conntrack: disable generic tracking for
known protocols"), if the specific helper is built but not loaded
(the standard for most distributions), systems with a restrictive
firewall but a weak configuration regarding which netfilter modules to
load will silently stop working.

This patch therefore adds a warning message so the sysadmin knows where
to start looking. It's a pr_warn_once regardless of the protocol itself,
but it should be enough to give a hint on where to look.

Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12 14:06:24 +02:00
Johan Hedberg
5d667ef6e0 Bluetooth: Remove redundant check for ACL_LINK
The encryption key size is only read for BR/EDR (ACL_LINK) connections,
so there's no need to check for it in the read_enc_key_size_complete()
callback.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-06-12 12:07:20 +02:00
Johan Hedberg
e3f6a257a7 Bluetooth: Use actual encryption key size for SMP over BR/EDR
When pairing over SMP over BR/EDR the generated LTK has by default the
same key size as the BR/EDR Link Key. Make sure we don't set our
Pairing Request/Response max value to anything higher than that.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-06-12 11:38:45 +02:00
Johan Hedberg
821f376668 Bluetooth: Read encryption key size for BR/EDR connections
Since Bluetooth 3.0 there's an HCI command available for reading the
encryption key size of a BR/EDR connection. This information is
essential e.g. for generating an LTK using SMP over BR/EDR, so store
it as part of struct hci_conn.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-06-12 11:38:45 +02:00
Johan Hedberg
035ad621b6 Bluetooth: Move SC-only check outside of BT_CONFIG branch
Checking for SC-only mode requirements when we get an encrypt change
event shouldn't be limited to the BT_CONFIG state but should be done any
time encryption changes.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-06-12 11:38:45 +02:00
Johan Hedberg
b1f663c91c Bluetooth: Add debugfs support for min LE encryption key size
This patch adds a debugfs control to set a different minimum LE
encryption key size. This is useful for testing that implementation of
the encryption key size handling is behaving correctly (e.g. that we
get appropriate 'Encryption Key Size' error responses when necessary).

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-06-12 11:38:45 +02:00
Johan Hedberg
2fd36558f0 Bluetooth: Add debugfs support for max LE encryption key size
This patch adds a debugfs control to set a different maximum LE
encryption key size. This is useful for testing that implementation of
the encryption key size handling is behaving correctly.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-06-12 11:38:45 +02:00
Shaohua Li
fb05e7a89f net: don't wait for order-3 page allocation
We saw excessive direct memory compaction triggered by skb_page_frag_refill.
This causes performance issues and adds latency. Commit 5640f7685831e0
introduced the order-3 allocation. According to the changelog, the order-3
allocation isn't a must-have but an attempt to improve performance. But
direct memory compaction has high overhead. The benefit of the order-3
allocation can't compensate for the overhead of direct memory compaction.

This patch makes the order-3 page allocation atomic. If there is no memory
pressure and memory isn't fragmented, the allocation will still succeed, so
we don't sacrifice the order-3 benefit here. If the atomic allocation fails,
direct memory compaction will not be triggered, and skb_page_frag_refill
will fall back to order-0 immediately, hence the direct memory compaction
overhead is avoided. In the allocation failure case, kswapd is woken up to
do compaction, so chances are the allocation will succeed next time.

alloc_skb_with_frags is changed in the same way.

The mellanox driver does a similar thing; if this is accepted, we must fix
that driver too.

V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
V2: make the changelog clearer

Cc: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Debabrata Banerjee <dbavatar@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11 17:33:44 -07:00
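
A hedged sketch of the resulting allocation strategy (the helper is
illustrative and the exact gfp mask in the patch may differ; in the 4.1
timeframe "atomic" here means not waiting, i.e. dropping __GFP_WAIT for the
high-order attempt):

    static struct page *frag_alloc(gfp_t gfp, unsigned int order)
    {
            struct page *page;

            if (order) {
                    /* opportunistic: no reclaim/compaction, no retries */
                    page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
                                       __GFP_NOWARN | __GFP_NORETRY, order);
                    if (page)
                            return page;
            }
            /* fall back to a single order-0 page */
            return alloc_pages(gfp, 0);
    }
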
Robert Shearman
0fae3bf018 mpls: handle device renames for per-device sysctls
If a device is renamed and the original name is subsequently reused
for a new device, the following warning is generated:

sysctl duplicate entry: /net/mpls/conf/veth0//input
CPU: 3 PID: 1379 Comm: ip Not tainted 4.1.0-rc4+ #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
 0000000000000000 0000000000000000 ffffffff81566aaf 0000000000000000
 ffffffff81236279 ffff88002f7d7f00 0000000000000000 ffff88000db336d8
 ffff88000db33698 0000000000000005 ffff88002e046000 ffff8800168c9280
Call Trace:
 [<ffffffff81566aaf>] ? dump_stack+0x40/0x50
 [<ffffffff81236279>] ? __register_sysctl_table+0x289/0x5a0
 [<ffffffffa051a24f>] ? mpls_dev_notify+0x1ff/0x300 [mpls_router]
 [<ffffffff8108db7f>] ? notifier_call_chain+0x4f/0x70
 [<ffffffff81470e72>] ? register_netdevice+0x2b2/0x480
 [<ffffffffa0524748>] ? veth_newlink+0x178/0x2d3 [veth]
 [<ffffffff8147f84c>] ? rtnl_newlink+0x73c/0x8e0
 [<ffffffff8147f27a>] ? rtnl_newlink+0x16a/0x8e0
 [<ffffffff81459ff2>] ? __kmalloc_reserve.isra.30+0x32/0x90
 [<ffffffff8147ccfd>] ? rtnetlink_rcv_msg+0x8d/0x250
 [<ffffffff8145b027>] ? __alloc_skb+0x47/0x1f0
 [<ffffffff8149badb>] ? __netlink_lookup+0xab/0xe0
 [<ffffffff8147cc70>] ? rtnetlink_rcv+0x30/0x30
 [<ffffffff8149e7a0>] ? netlink_rcv_skb+0xb0/0xd0
 [<ffffffff8147cc64>] ? rtnetlink_rcv+0x24/0x30
 [<ffffffff8149df17>] ? netlink_unicast+0x107/0x1a0
 [<ffffffff8149e4be>] ? netlink_sendmsg+0x50e/0x630
 [<ffffffff8145209c>] ? sock_sendmsg+0x3c/0x50
 [<ffffffff81452beb>] ? ___sys_sendmsg+0x27b/0x290
 [<ffffffff811bd258>] ? mem_cgroup_try_charge+0x88/0x110
 [<ffffffff811bd5b6>] ? mem_cgroup_commit_charge+0x56/0xa0
 [<ffffffff811d7700>] ? do_filp_open+0x30/0xa0
 [<ffffffff8145336e>] ? __sys_sendmsg+0x3e/0x80
 [<ffffffff8156c3f2>] ? system_call_fastpath+0x16/0x75

Fix this by unregistering the previous sysctl table (registered for
the path containing the original device name) and re-registering the
table for the path containing the new device name.

Fixes: 37bde79979c3 ("mpls: Per-device enabling of packet input")
Reported-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11 16:47:16 -07:00
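
A hedged sketch of the notifier handling (the helper names follow the
registration/unregistration routines af_mpls.c already provides; treat the
exact shape as illustrative): on NETDEV_CHANGENAME, tear down the table
registered under the old name and register a new one under the new name.

    case NETDEV_CHANGENAME:
            mdev = mpls_dev_get(dev);
            if (mdev) {
                    mpls_dev_sysctl_unregister(mdev);
                    err = mpls_dev_sysctl_register(dev, mdev);
                    if (err)
                            return notifier_from_errno(err);
            }
            break;
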