The original code performs two condition checks with dst_mtu(skb_dst(skb))
and in_mtu, and the last statement is "min(dst_mtu(skb_dst(skb)),
in_mtu) - minlen". This may leave the reader wondering whether the result
could be negative.
Now assign the result of min(dst_mtu(skb_dst(skb)), in_mtu) to a new
variable, so that only one condition check is performed and the code is
more readable.
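A minimal sketch of the refactor (variable names follow the description above, surrounding code omitted):

	unsigned int mtu = min(dst_mtu(skb_dst(skb)), in_mtu);

	/* A single check now makes it obvious that the subtraction
	 * below can never go negative. */
	if (mtu <= minlen)
		return -1;

	return mtu - minlen;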
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already checked for !found just a bit before:
	if (!found) {
		regs->verdict.code = NFT_BREAK;
		return;
	}

	if (found && set->flags & NFT_SET_MAP)
	    ^^^^^
So this redundant check can just go away.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It's better to use sizeof(info->name)-1 as the index to force
NUL-termination of the string, instead of the literal number '29'.
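A minimal sketch of the pattern (the copy call and field are illustrative):

	strncpy(info->name, name, sizeof(info->name));
	/* sizeof(info->name) - 1 instead of the magic number 29 */
	info->name[sizeof(info->name) - 1] = '\0';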
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is some code in netfilter that obtains a random value just once.
We can use net_get_random_once to simplify it.
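A minimal sketch of the simplification, assuming a hash helper that needs a one-time random seed (names are illustrative):

	#include <linux/net.h>
	#include <linux/jhash.h>

	static u32 example_rnd __read_mostly;

	static u32 example_hash(const void *data, u32 len)
	{
		/* Replaces the open-coded "if (unlikely(!rnd_inited)) {...}"
		 * pattern: the seed is initialized exactly once. */
		net_get_random_once(&example_rnd, sizeof(example_rnd));
		return jhash(data, len, example_rnd);
	}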
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
pkt->xt.thoff is not always set properly, but we use it without any check.
For the payload expr, this will cause wrong results. For nftrace, we may
report the wrong network or transport header to user space; furthermore,
after adding the following nft rule, a warning message will be printed out:
# nft add rule arp filter output meta nftrace set 1
WARNING: CPU: 0 PID: 13428 at net/netfilter/nf_tables_trace.c:263
nft_trace_notify+0x4a3/0x5e0 [nf_tables]
Call Trace:
[<ffffffff813d58ae>] dump_stack+0x63/0x85
[<ffffffff810a4c0b>] __warn+0xcb/0xf0
[<ffffffff810a4d3d>] warn_slowpath_null+0x1d/0x20
[<ffffffffa0589703>] nft_trace_notify+0x4a3/0x5e0 [nf_tables]
[ ... ]
[<ffffffffa05690a8>] nft_do_chain_arp+0x78/0x90 [nf_tables_arp]
[<ffffffff816f4aa2>] nf_iterate+0x62/0x80
[<ffffffff816f4b33>] nf_hook_slow+0x73/0xd0
[<ffffffff81732bbf>] arp_xmit+0x8f/0xb0
[ ... ]
[<ffffffff81732d36>] arp_solicit+0x106/0x2c0
So check tprot_set first, before we use pkt->xt.thoff.
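A sketch of the guard, assuming the nft_pktinfo fields described in this series:

	/* tprot and xt.thoff are only meaningful once tprot_set is true,
	 * e.g. they are never filled in for the arp family. */
	if (!pkt->tprot_set)
		goto err;	/* break rule evaluation / skip the header */

	thoff = pkt->xt.thoff;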
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There's an off-by-one issue in nft_payload_fast_eval: skb_tail_pointer()
and ptr + priv->len both point one byte past the last valid address. So if
they are equal, we can still fetch valid data, and it is unnecessary to
fall back to nft_payload_eval.
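A sketch of the adjusted bounds check in nft_payload_fast_eval() (surrounding lines abbreviated):

	ptr += priv->offset;

	/* '>' instead of '>=': when ptr + priv->len == skb_tail_pointer(skb)
	 * all priv->len bytes still lie inside the linear data. */
	if (unlikely(ptr + priv->len > skb_tail_pointer(skb)))
		return false;	/* fall back to nft_payload_eval() */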
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, the user can specify the queue numbers via the _QUEUE_NUM and
_QUEUE_TOTAL attributes; this is enough in most situations.
But actually, it is not very flexible. For example:
tcp dport 80 mapped to queue0
tcp dport 81 mapped to queue1
tcp dport 82 mapped to queue2
To do this, we must add 3 nft rules, and more mappings mean more
rules ...
So take one register to select the queue number; then we can add one
simple rule to map queues, maybe like this:
queue num tcp dport map { 80:0, 81:1, 82:2 ... }
Florian Westphal also proposed wider usage scenarios:
queue num jhash ip saddr . ip daddr mod ...
queue num meta cpu ...
queue num meta mark ...
The last point is how to load the queue number from the sreg. We could
use *(u16 *)&regs->data[reg] to load it, just like the nat expr does to
load its l4 port.
But we will cooperate with the hash expr, meta cpu, meta mark expr and so
on, and they all store their result as a u32. Casting to a u16 pointer and
dereferencing it would generate the wrong result on big-endian systems.
So just keep it simple: we treat the queue number as a u32, although a
u16 would already be enough.
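A sketch of the endian-safe load (the register field name is assumed):

	/* Wrong on big endian, the low 16 bits sit in the second half:
	 *	queue = *(u16 *)&regs->data[priv->sreg_qnum];
	 * Safe: read the whole u32 register and truncate only where a
	 * 16bit queue id is finally needed. */
	u32 queue = regs->data[priv->sreg_qnum];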
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fetch the value of a u32 netlink attribute and validate it. This
validation is usually required when a u32 netlink attribute is being
stored in a field whose size is smaller.
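A sketch of such a helper; the name nft_parse_u32_check() and the body below are an approximation based on the description:

	static int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest)
	{
		u32 val;

		val = ntohl(nla_get_be32(attr));
		if (val > max)
			return -ERANGE;

		*dest = val;
		return 0;
	}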
This patch revisits 4da449ae1d ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").
Fixes: 96518518cc ("netfilter: add nftables")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support for an offset value for the incremental counter and random
generators. With this option the sysadmin is able to start the counter at
a certain value and then apply the generated number.
Example:
meta mark set numgen inc mod 2 offset 100
This will generate marks following the series 100, 101, 100, 101, ...
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The overflow validation in the init() function only establishes that the
maximum value the hash could reach is less than U32_MAX, which is almost
always true, so it catches nothing.
The fix detects the overflow by checking whether the maximum hash value
wraps around and becomes less than the offset itself.
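A sketch of the wrap-around test (field names assume the nft_hash private data):

	/* priv->offset + priv->modulus - 1 is the largest value we can emit;
	 * if the sum wrapped around, it ends up smaller than the offset. */
	if (priv->offset + priv->modulus - 1 < priv->offset)
		return -EOVERFLOW;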
Fixes: 70ca767ea1 ("netfilter: nft_hash: Add hash offset value")
Reported-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After we generate a new number, we still read priv->counter and store
that to the dreg. This is not correct: another cpu may already have
changed it to a new number. So we must use the number we just generated,
not priv->counter itself.
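A sketch of the corrected eval path (struct and field names assumed, offset handling omitted):

	u32 nval, oval;

	do {
		oval = atomic_read(&priv->counter);
		nval = (oval + 1) % priv->modulus;
	} while (atomic_cmpxchg(&priv->counter, oval, nval) != oval);

	/* Store the value we just generated; a fresh read of priv->counter
	 * could already have been advanced by another cpu. */
	regs->data[priv->dreg] = nval;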
Fixes: 91dbc6be0a ("netfilter: nf_tables: add number generator expression")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These counters sit in the hot path and do show up in perf; this is
especially true for 'found' and 'searched', which get incremented for
every packet processed.
Information like
searched=212030105
new=623431
found=333613
delete=623327
does not seem too helpful nowadays:
- on busy systems found and searched will overflow every few hours
(these are 32bit integers), other more busy ones every few days.
- for debugging there are better methods, such as the iptables TRACE target
and the conntrack log sysctls. Nowadays we also have the perf tool.
This removes packet path stat counters except those that
are expected to be 0 (or close to 0) on a normal system, e.g.
'insert_failed' (race happened) or 'invalid' (proto tracker rejects).
The insert stat is retained for the ctnetlink case.
The found stat is retained for the tuple-is-taken check when NAT has to
determine if it needs to pick a different source address.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is some code in the netfilter modules that does not check the return
value of nft_register_chain_type. Add the checks now.
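A minimal sketch of the pattern in a module init path (names are illustrative):

	static int __init nft_example_module_init(void)
	{
		int ret;

		ret = nft_register_chain_type(&example_chain_type);
		if (ret < 0)
			return ret;	/* propagate instead of ignoring it */

		return 0;
	}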
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is some code in the netfilter modules that does not check the return
value of register_netdevice_notifier. Add the checks now.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These functions are extracted from the netdev family; they initialize
the pktinfo structure and validate that the IPv4 and IPv6 headers are
well-formed, given that they are called from a path where layer 3
sanitization has not happened yet.
These functions are placed in include/net/netfilter/nf_tables_ipv{4,6}.h
so they can be reused by a follow-up patch from the bridge family too.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces nft_set_pktinfo_unspec() that ensures proper
initialization of all pktinfo fields for non-IP traffic. This is used
by the bridge, netdev and arp families.
This new function relies on nft_set_pktinfo_proto_unspec() to set a new
tprot_set field that indicates whether transport protocol information is
available. The remaining fields are zeroed.
The meta expression has also been updated to check tprot_set in the first
place, given that zero is a valid tprot value. Even a handcrafted packet
may come with the IPPROTO_RAW (255) protocol number, so we can't rely on
any particular value to mean that tprot is unset.
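A sketch of the new helpers, matching the description above (the nft_pktinfo layout is assumed):

	static inline void nft_set_pktinfo_proto_unspec(struct nft_pktinfo *pkt,
							struct sk_buff *skb)
	{
		/* Transport protocol information is not available. */
		pkt->tprot_set = false;
		pkt->tprot = 0;
		pkt->xt.thoff = 0;
		pkt->xt.fragoff = 0;
	}

	static inline void nft_set_pktinfo_unspec(struct nft_pktinfo *pkt,
						  struct sk_buff *skb,
						  const struct nf_hook_state *state)
	{
		nft_set_pktinfo(pkt, skb, state);
		nft_set_pktinfo_proto_unspec(pkt, skb);
	}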
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The dynset expression matches if we can fit a new entry into the set.
If there is no room for it, then it breaks the rule evaluation.
This patch introduces the inversion flag so you can add rules to
explicitly drop packets that don't fit into the set. For example:
# nft filter input flow table xyz size 4 { ip saddr timeout 120s counter } overflow drop
This is useful to provide a replacement for connlimit.
For the rule above, every new entry uses the IPv4 address as the key in
the set; the entry gets a timeout of 120 seconds that is refreshed on
every packet seen. If we get a new flow and our set already contains 4
entries, then this packet is dropped.
You can already express this in positive logic, assuming a default policy
of drop:
# nft filter input flow table xyz size 4 { ip saddr timeout 10s counter } accept
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support for passing an offset to the hash value. With this feature,
the sysadmin is able to generate a hash with a given offset value.
Example:
meta mark set jhash ip saddr mod 2 seed 0xabcd offset 100
This option generates marks based on the source address in the range 100
to 101.
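A sketch of the eval step with the offset applied (field names assumed):

	u32 h = reciprocal_scale(jhash(data, priv->len, priv->seed),
				 priv->modulus);

	regs->data[priv->dreg] = h + priv->offset;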
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Although queues_total and queuenum are validated in the nft utility, the
user can add nft rules via nfnetlink directly, so it is necessary to do
the validation in the nft_queue expr init routine too.
Tested by running ./nft-test.py any/queue.t:
any/queue.t: 6 unit tests, 0 error, 0 warning
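A sketch of the added init-time checks (field names assumed):

	if (priv->queues_total == 0)
		return -EINVAL;

	/* queuenum and queues_total are 16bit on the wire; do the sum in
	 * 32 bits so a wrap-around is caught. */
	if ((u32)priv->queuenum + priv->queues_total - 1 > U16_MAX)
		return -ERANGE;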
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Current parsing methods for SIP headers do not allow the presence of
tab characters between header name and header value. As a result Call-ID
SIP headers like the following are discarded by the IPVS SIP persistence
engine:
"Call-ID\t: mycallid@abcde"
"Call-ID:\tmycallid@abcde"
In the above examples the Call-IDs are represented as C language strings;
in the real message there is a 0x09 byte before/after the colon (":").
The proposed fix is in the nf_conntrack_sip module:
sip_skip_whitespace() should skip tabs in addition to spaces, since in
the SIP grammar whitespace (WSP) corresponds to a space or a tab.
Below is an extract of relevant SIP ABNF syntax.
Call-ID  =  ( "Call-ID" / "i" ) HCOLON callid
callid   =  word [ "@" word ]
HCOLON   =  *( SP / HTAB ) ":" SWS
SWS      =  [LWS] ; sep whitespace
LWS      =  [*WSP CRLF] 1*WSP ; linear whitespace
WSP      =  SP / HTAB
word     =  1*(alphanum / "-" / "." / "!" / "%" / "*" /
               "_" / "+" / "`" / "'" / "~" /
               "(" / ")" / "<" / ">" /
               ":" / "\" / DQUOTE /
               "/" / "[" / "]" / "?" /
               "{" / "}" )
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch renames the existing function to nft_overquota() and makes
it return a boolean that tells us whether we have exceeded our byte quota.
Just a cleanup.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use xor to decide whether to break further rule evaluation or not, since
the existing logic doesn't achieve the expected inversion.
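A sketch of the inverted check (helper and field names follow the previous patch and are assumed):

	if (nft_overquota(priv, pkt->skb) ^ priv->invert)
		regs->verdict.code = NFT_BREAK;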
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The _until_ attribute is renamed to _modulus_ as the behaviour is similar to
other expressions with number limits (e.g. nft_hash).
Renaming is possible because there isn't a kernel release yet with these
changes.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is some debug code in find_pattern that is commented out with #if 0.
Remove it now.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The caller function "help" already makes sure that datalen cannot be zero
before it invokes find_pattern, with the following code:

	if (dataoff >= skb->len) {
		pr_debug("ftp: dataoff(%u) >= skblen(%u)\n", dataoff,
			 skb->len);
		return NF_ACCEPT;
	}
	datalen = skb->len - dataoff;

And the later statement "ends_in_nl = (fb_ptr[datalen - 1] == '\n');" uses
datalen directly without checking whether it is zero.
So it is unnecessary to check it again in find_pattern.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Current parsing methods for the SIP header Call-ID do not correctly check
all characters allowed by RFC 3261. In particular the "," character is
accepted while the "'" character is not, which is the opposite of what
the RFC allows. As a result Call-ID headers like the following are
discarded by the IPVS SIP persistence engine.
Call-ID: -.!%*_+`'~()<>:\"/[]?{}
The above example is composed using all the non-alphanumeric characters
listed in RFC 3261 for the Call-ID header syntax.
The proposed fix is in the nf_conntrack_sip module: the function iswordc()
checks the range (c >= '(' && c <= '/'), which includes the characters
()*+,-./; all of them are allowed except ",". On the other hand, "'" is
not included in the accepted list.
Below is an extract of relevant SIP ABNF syntax.
Call-ID  =  ( "Call-ID" / "i" ) HCOLON callid
callid   =  word [ "@" word ]
HCOLON   =  *( SP / HTAB ) ":" SWS
SWS      =  [LWS] ; sep whitespace
LWS      =  [*WSP CRLF] 1*WSP ; linear whitespace
WSP      =  SP / HTAB
word     =  1*(alphanum / "-" / "." / "!" / "%" / "*" /
               "_" / "+" / "`" / "'" / "~" /
               "(" / ")" / "<" / ">" /
               ":" / "\" / DQUOTE /
               "/" / "[" / "]" / "?" /
               "{" / "}" )
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Current parsing methods for SIP headers do not properly manage
continuation lines: in the case of the Call-ID header, the first character
of the header value is truncated. As a result the IPVS SIP persistence
engine hashes over a call-id that is not exactly the one present in
the original message.
Example: "Call-ID: \r\n abcdeABCDE1234"
results in an extracted call-id equal to "bcdeABCDE1234".
In the above example the Call-ID is represented as a C language string;
in the real message the first bytes after the colon (":") are
"20 0d 0a 20".
The proposed fix is in the nf_conntrack_sip module.
Since sip_follow_continuation() walks past the leading spaces or tabs of
the continuation line, sip_skip_whitespace() should simply return the
output of sip_follow_continuation(). Otherwise another iteration of the
for loop is done and dptr is incremented by one, pointing to the second
character of the first word in the header.
Below is an extract of relevant SIP ABNF syntax.
Call-ID  =  ( "Call-ID" / "i" ) HCOLON callid
callid   =  word [ "@" word ]
HCOLON   =  *( SP / HTAB ) ":" SWS
SWS      =  [LWS] ; sep whitespace
LWS      =  [*WSP CRLF] 1*WSP ; linear whitespace
WSP      =  SP / HTAB
word     =  1*(alphanum / "-" / "." / "!" / "%" / "*" /
               "_" / "+" / "`" / "'" / "~" /
               "(" / ")" / "<" / ">" /
               ":" / "\" / DQUOTE /
               "/" / "[" / "]" / "?" /
               "{" / "}" )
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are two existing structures which define the GRE and PPTP headers.
Use these two structures instead of the ones defined by netfilter, to
keep consistent with other code.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are already some GRE_* macros in the kernel, so it is unnecessary
to define these macros again. Also remove some useless macros.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.
More specifically, they are:
1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
transport area in dccp, sctp, tcp, udp and udplite protocol
conntrackers, from Gao Feng.
2) Missing whitespace on error message in physdev match, from Hangbin Liu.
3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.
4) Add nf_ct_expires() helper function and use it, from Florian Westphal.
5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
from Florian.
6) Rename nf_tables set implementation to nft_set_{name}.c
7) Introduce the hash expression to allow arbitrary hashing of selector
concatenations, from Laura Garcia Liebana.
8) Remove ip_conntrack sysctl backward compatibility code, this code has
been around for long time already, and we have two interfaces to do
this already: nf_conntrack sysctl and ctnetlink.
9) Use nf_conntrack_get_ht() helper function whenever possible, instead
of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.
10) Add quota expression for nf_tables.
11) Add number generator expression for nf_tables, this supports
incremental and random generators that can be combined with maps,
very useful for load balancing purpose, again from Laura Garcia Liebana.
12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.
13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
configuration, this is used by a follow up patch to perform better chain
update validation.
14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
nft_set_hash implementation to honor the NLM_F_EXCL flag.
15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
patch from Florian Westphal.
16) Don't use the DYING bit to know if the conntrack event has already
been delivered; use a state variable to track event re-delivery
states instead, also from Florian.
17) Remove the per-conntrack timer, use the workqueue approach that was
discussed during the NFWS, from Florian Westphal.
18) Use the netlink conntrack table dump path to kill stale entries,
again from Florian.
19) Add a garbage collector to get rid of stale conntracks, from
Florian.
20) Reschedule garbage collector if eviction rate is high.
21) Get rid of the __nf_ct_kill_acct() helper.
22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.
23) Make nf_log_set() interface assertive on unsupported families.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_log_set is an interface function, so it should do strict sanity checks
on its parameters. Convert the return value of nf_log_set to int instead
of void. When the pf is invalid, return -EOPNOTSUPP.
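A sketch of the stricter entry check at the top of nf_log_set(), which now returns int (the exact bound is an assumption):

	if (pf == NFPROTO_UNSPEC || pf >= NFPROTO_NUMPROTO)
		return -EOPNOTSUPP;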
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After the timer removal this just calls nf_ct_delete, so remove the
__-prefixed version and make nf_ct_kill a shorthand for nf_ct_delete.
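A sketch of the resulting shorthand (nf_ct_delete's portid/report arguments are zero when no netlink requester is involved):

	static inline bool nf_ct_kill(struct nf_conn *ct)
	{
		return nf_ct_delete(ct, 0, 0);
	}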
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If we evicted a large fraction of the scanned conntrack entries,
re-schedule the next gc cycle for immediate execution.
This triggers during tests where load is high and then drops to zero,
leaving many connections in TW/CLOSE state with < 30 second timeouts.
Without this change it will take several minutes until the conntrack
count comes back to normal.
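A sketch of the heuristic in the gc worker (the threshold and variable names are assumptions):

	/* If most of what we scanned was evicted, do not wait for the
	 * next periodic run; reschedule immediately. */
	ratio = scanned ? expired_count * 100 / scanned : 0;
	if (ratio >= 90)
		next_run = 0;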
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conntrack gc worker to evict stale entries.
GC happens once every 5 seconds, but we only scan at most 1/64th of the
table (and not more than 8k buckets) to avoid hogging the cpu.
This means that a complete scan of the table will take several minutes
of wall-clock time.
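A sketch of the scan budget (the macro names are assumptions matching the numbers above):

	#define GC_MAX_BUCKETS_DIV	64u	/* at most 1/64th of the table */
	#define GC_MAX_BUCKETS		8192u	/* and never more than 8k buckets */

	goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);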
Considering that the gc run will never have to evict any entries
during normal operation, because evictions happen from the packet path,
this should be fine.
We only need gc to make sure userspace (conntrack event listeners)
eventually learn of the timeout, and for resource reclaim in case the
system becomes idle.
We do not disable BH and cond_resched for every bucket so this should
not introduce noticeable latencies either.
A followup patch will add a small change to speed up GC for the extreme
case where most entries are timed out on an otherwise idle system.
v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on
nulls value change in gc worker, suggested by Eric Dumazet.
v3: don't call cancel_delayed_work_sync twice (again, Eric).
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When dumping we already have to look at the entire table, so we might
as well toss those entries whose timeout value is in the past.
We also look at every entry during resize operations.
However, eviction there is not as simple because we hold the
global resize lock, so we can't evict without adding an 'expired' list
to drop from later. Considering that resizes are very rare, it doesn't
seem worth doing.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.
Eric also says: "Another reason was the fact that Thomas was about to
change max timer range [..]" (500462a9de, 'timers: Switch to
a non-cascading wheel').
Remove the timer and use a 32bit jiffies value containing the timestamp
until which the entry is valid.
During conntrack lookup, even before doing the tuple comparison, check
the timeout value and evict the entry in case it is too old.
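A sketch of the expiry test used during lookup, assuming ct->timeout now holds an absolute 32bit jiffies value:

	static inline bool nf_ct_is_expired(const struct nf_conn *ct)
	{
		/* Signed comparison handles jiffies wrap-around. */
		return (__s32)(ct->timeout - (u32)jiffies) <= 0;
	}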
The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.
Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict an already-dead entry that
is being recycled.
This is the standard/expected way when conntrack entries are destroyed.
Followup patches will introduce garbage collection via a work queue
and further places where we can reap obsolete entries (e.g. during
netlink dumps); this is needed to keep expired conntracks from hanging
around for too long when the lookup rate is low after a busy period.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The reliable event delivery mode currently (ab)uses the DYING bit to
detect which entries on the dying list have to be skipped when
re-delivering events from the ecache worker in reliable event mode.
Currently, when we delete the conntrack from the main table we only set
this bit if we could also deliver the netlink destroy event to userspace.
If we fail, we move it to the dying list; the ecache worker will
reattempt event delivery for all confirmed conntracks on the dying list
that do not have the DYING bit set.
Once the timer is gone, we can no longer use if (del_timer()) to detect
when we 'stole' the reference count owned by the timer/hash entry, so
we need some other way to avoid racing with another cpu.
Pablo suggested to add a marker in the ecache extension that skips
entries that have been unhashed from main table but are still waiting
for the last reference count to be dropped (e.g. because one skb waiting
on nfqueue verdict still holds a reference).
We do this by adding a tristate.
If we fail to deliver the destroy event, make a note of this in the
ecache extension. The worker can then skip all entries that are in
a different state: either they never delivered a destroy event,
e.g. because the netlink backend was not loaded, or redelivery already
took place.
Once the conntrack timer is removed we will be able to replace the
del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
racing with another cpu that tries to evict the same conntrack.
Because DYING will then be set right before we report the destroy event,
we can no longer skip event reporting when the dying bit is set.
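A sketch of the tristate kept in the ecache extension (the names are an assumption based on the description):

	enum nf_ct_ecache_state {
		NFCT_ECACHE_UNKNOWN,		/* destroy event not sent yet */
		NFCT_ECACHE_DESTROY_FAIL,	/* tried and failed: worker retries */
		NFCT_ECACHE_DESTROY_SENT,	/* delivered: worker skips the entry */
	};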
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In case nf_conntrack_tuple_taken did not find a conflicting entry,
check that all entries in this hash slot were tested, and restart
in case an entry was moved to another chain.
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: ea781f197d ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_dump_register() should only be used with registers, not with
immediates.
Fixes: cb1b69b0b1 ("netfilter: nf_tables: add hash expression")
Fixes: 91dbc6be0a62 ("netfilter: nf_tables: add number generator expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the NLM_F_EXCL flag is set, then new elements that clash with an
existing one return EEXIST. In case you try to add an element whose
data area differs from what we have, then this returns EBUSY. If no
flag is specified at all, then this returns success to userspace.
This patch also updates the set insert operation so we can fetch the
existing element that clashes with the one you want to add; we need
this to make sure the element data doesn't differ.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, if you add a base chain whose name clashes with an existing
non-base chain, nf_tables doesn't complain about this. The same happens
if you update the chain type, the hook number or the priority.
With this patch, nf_tables bails out with EBUSY in case any of these
unsupported operations occurs.
# nft add table x
# nft add chain x y
# nft add chain x y { type nat hook input priority 0\; }
<cmdline>:1:1-49: Error: Could not process rule: Device or resource busy
add chain x y { type nat hook input priority 0; }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce a new function to wrap the code that parses the chain hook
configuration so we can reuse this code to validate chain updates.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes the following sparse warning:
net/netfilter/nft_hash.c:40:25: warning:
symbol 'nft_hash_policy' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
trivial fix to spelling mistake in pr_debug message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the numgen expression that allows us to generate
incremental and random numbers; the generator is bound to an upper limit
that is specified by userspace.
This expression is useful to distribute packets in a round-robin fashion
as well as randomly.
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the quota expression. This new stateful expression
integrates easily into the dynset expression to build 'hashquota' flow
tables.
Arguably, we could use "counter bytes > 1000" instead, but this
approach has several problems:
1) We only support one single stateful expression in dynamic set
definitions, and the expression above is a composite of two
expressions: get counter + comparison.
2) We would need to restore the packed counter representation (that we
used to have) based on seqlock to synchronize this, since per-cpu is
not suitable for this.
So instead of bloating the counter expression back with the seqlock
representation and extending the existing set infrastructure to make it
more complex for the composite described above, let's follow the more
simple approach of adding a quota expression that we can plug into our
existing infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is required to iterate over the hash table in cttimeout, ctnetlink
and nf_conntrack_ipv4.
>> ERROR: "nf_conntrack_htable_size" [net/netfilter/nfnetlink_cttimeout.ko] undefined!
ERROR: "nf_conntrack_htable_size" [net/netfilter/nf_conntrack_netlink.ko] undefined!
ERROR: "nf_conntrack_htable_size" [net/ipv4/netfilter/nf_conntrack_ipv4.ko] undefined!
Fixes: adf0516845 ("netfilter: remove ip_conntrack* sysctl compat code")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In general, when we want to delete a netns, cttimeout_net_exit will
be called before ipt_unregister_table, i.e. before ctnl_timeout_put.
But after calling kfree_rcu in cttimeout_net_exit, we will still decrease
the timeout object's refcnt in ctnl_timeout_put; this is incorrect
and will cause a use-after-free error.
It is easy to reproduce this problem:
# while : ; do
      ip netns add xxx
      ip netns exec xxx nfct add timeout testx inet icmp timeout 200
      ip netns exec xxx iptables -t raw -p icmp -I OUTPUT -j CT --timeout testx
      ip netns del xxx
  done
=======================================================================
BUG kmalloc-96 (Tainted: G B E ): Poison overwritten
-----------------------------------------------------------------------
INFO: 0xffff88002b5161e8-0xffff88002b5161e8. First byte 0x6a instead of
0x6b
INFO: Allocated in cttimeout_new_timeout+0xd4/0x240 [nfnetlink_cttimeout]
age=104 cpu=0 pid=3330
___slab_alloc+0x4da/0x540
__slab_alloc+0x20/0x40
__kmalloc+0x1c8/0x240
cttimeout_new_timeout+0xd4/0x240 [nfnetlink_cttimeout]
nfnetlink_rcv_msg+0x21a/0x230 [nfnetlink]
[ ... ]
So we call kfree_rcu to free the timeout object only when the refcnt
drops to 0. And, like nfnetlink_acct does, use atomic_cmpxchg to
avoid the race between ctnl_timeout_try_del and ctnl_timeout_put.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Suppose that we input the following commands at first:
# nfacct add test
# iptables -A INPUT -m nfacct --nfacct-name test
And now "test" acct's refcnt is 2, but later when we try to delete the
"test" nfacct and the related iptables rule at the same time, race maybe
happen:
 CPU0                                   CPU1
 nfnl_acct_try_del                      nfnl_acct_put
  atomic_dec_and_test //ref=1,testfail   -
  -                                      atomic_dec_and_test //ref=0,testok
  -                                      kfree_rcu
  atomic_inc //ref=1                     -
So after the rcu grace period, the nf_acct will be freed, but it is still
linked in the nfnl_acct_list and we can access it later, so an oops will
happen.
Convert the atomic_dec_and_test and atomic_inc combination into one atomic
operation, atomic_cmpxchg, to fix this problem.
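A sketch of the cmpxchg-based deletion side (struct and list handling abbreviated):

	/* Only succeed if we are the last user (refcnt == 1), atomically
	 * dropping it to 0 so a concurrent nfnl_acct_put() can no longer
	 * revive the object. */
	if (atomic_cmpxchg(&cur->refcnt, 1, 0) != 1)
		return -EBUSY;

	list_del_rcu(&cur->head);
	kfree_rcu(cur, rcu_head);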
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>