The SCTP socket extensions API document describes the v4mapping option as
follows:
8.1.15. Set/Clear IPv4 Mapped Addresses (SCTP_I_WANT_MAPPED_V4_ADDR)
This socket option is a Boolean flag which turns on or off the
mapping of IPv4 addresses. If this option is turned on, then IPv4
addresses will be mapped to V6 representation. If this option is
turned off, then no mapping will be done of V4 addresses and a user
will receive both PF_INET6 and PF_INET type addresses on the socket.
See [RFC3542] for more details on mapped V6 addresses.
This description isn't really in line with what the code does though.
Introduce addr_to_user (renamed addr_v4map), which should be called
before any sockaddr is passed back to user space. The new function
places the sockaddr into the correct format depending on the
SCTP_I_WANT_MAPPED_V4_ADDR option.
Audit all places that touched v4mapped and either sanely construct
a v4 or v6 address then call addr_to_user, or drop the
unnecessary v4mapped check entirely.
Audit all places that call addr_to_user and verify they are on a sycall
return path.
Add a custom getname that formats the address properly.
Several bugs are addressed:
- SCTP_I_WANT_MAPPED_V4_ADDR=0 often returned garbage for
addresses to user space
- The addr_len returned from recvmsg was not correct when
returning AF_INET on a v6 socket
- flowlabel and scope_id were not zerod when promoting
a v4 to v6
- Some syscalls like bind and connect behaved differently
depending on v4mapped
Tested bind, getpeername, getsockname, connect, and recvmsg for proper
behaviour in v4mapped = 1 and 0 cases.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason reported an oops caused by SCTP on his ARM machine with
SCTP authentication enabled:
Internal error: Oops: 17 [#1] ARM
CPU: 0 PID: 104 Comm: sctp-test Not tainted 3.13.0-68744-g3632f30c9b20-dirty #1
task: c6eefa40 ti: c6f52000 task.ti: c6f52000
PC is at sctp_auth_calculate_hmac+0xc4/0x10c
LR is at sg_init_table+0x20/0x38
pc : [<c024bb80>] lr : [<c00f32dc>] psr: 40000013
sp : c6f538e8 ip : 00000000 fp : c6f53924
r10: c6f50d80 r9 : 00000000 r8 : 00010000
r7 : 00000000 r6 : c7be4000 r5 : 00000000 r4 : c6f56254
r3 : c00c8170 r2 : 00000001 r1 : 00000008 r0 : c6f1e660
Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 0005397f Table: 06f28000 DAC: 00000015
Process sctp-test (pid: 104, stack limit = 0xc6f521c0)
Stack: (0xc6f538e8 to 0xc6f54000)
[...]
Backtrace:
[<c024babc>] (sctp_auth_calculate_hmac+0x0/0x10c) from [<c0249af8>] (sctp_packet_transmit+0x33c/0x5c8)
[<c02497bc>] (sctp_packet_transmit+0x0/0x5c8) from [<c023e96c>] (sctp_outq_flush+0x7fc/0x844)
[<c023e170>] (sctp_outq_flush+0x0/0x844) from [<c023ef78>] (sctp_outq_uncork+0x24/0x28)
[<c023ef54>] (sctp_outq_uncork+0x0/0x28) from [<c0234364>] (sctp_side_effects+0x1134/0x1220)
[<c0233230>] (sctp_side_effects+0x0/0x1220) from [<c02330b0>] (sctp_do_sm+0xac/0xd4)
[<c0233004>] (sctp_do_sm+0x0/0xd4) from [<c023675c>] (sctp_assoc_bh_rcv+0x118/0x160)
[<c0236644>] (sctp_assoc_bh_rcv+0x0/0x160) from [<c023d5bc>] (sctp_inq_push+0x6c/0x74)
[<c023d550>] (sctp_inq_push+0x0/0x74) from [<c024a6b0>] (sctp_rcv+0x7d8/0x888)
While we already had various kind of bugs in that area
ec0223ec48a9 ("net: sctp: fix sctp_sf_do_5_1D_ce to verify if
we/peer is AUTH capable") and b14878ccb7fa ("net: sctp: cache
auth_enable per endpoint"), this one is a bit of a different
kind.
Giving a bit more background on why SCTP authentication is
needed can be found in RFC4895:
SCTP uses 32-bit verification tags to protect itself against
blind attackers. These values are not changed during the
lifetime of an SCTP association.
Looking at new SCTP extensions, there is the need to have a
method of proving that an SCTP chunk(s) was really sent by
the original peer that started the association and not by a
malicious attacker.
To cause this bug, we're triggering an INIT collision between
peers; normal SCTP handshake where both sides intent to
authenticate packets contains RANDOM; CHUNKS; HMAC-ALGO
parameters that are being negotiated among peers:
---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
<------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
-------------------- COOKIE-ECHO -------------------->
<-------------------- COOKIE-ACK ---------------------
RFC4895 says that each endpoint therefore knows its own random
number and the peer's random number *after* the association
has been established. The local and peer's random number along
with the shared key are then part of the secret used for
calculating the HMAC in the AUTH chunk.
Now, in our scenario, we have 2 threads with 1 non-blocking
SEQ_PACKET socket each, setting up common shared SCTP_AUTH_KEY
and SCTP_AUTH_ACTIVE_KEY properly, and each of them calling
sctp_bindx(3), listen(2) and connect(2) against each other,
thus the handshake looks similar to this, e.g.:
---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
<------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
<--------- INIT[RANDOM; CHUNKS; HMAC-ALGO] -----------
-------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] -------->
...
Since such collisions can also happen with verification tags,
the RFC4895 for AUTH rather vaguely says under section 6.1:
In case of INIT collision, the rules governing the handling
of this Random Number follow the same pattern as those for
the Verification Tag, as explained in Section 5.2.4 of
RFC 2960 [5]. Therefore, each endpoint knows its own Random
Number and the peer's Random Number after the association
has been established.
In RFC2960, section 5.2.4, we're eventually hitting Action B:
B) In this case, both sides may be attempting to start an
association at about the same time but the peer endpoint
started its INIT after responding to the local endpoint's
INIT. Thus it may have picked a new Verification Tag not
being aware of the previous Tag it had sent this endpoint.
The endpoint should stay in or enter the ESTABLISHED
state but it MUST update its peer's Verification Tag from
the State Cookie, stop any init or cookie timers that may
running and send a COOKIE ACK.
In other words, the handling of the Random parameter is the
same as behavior for the Verification Tag as described in
Action B of section 5.2.4.
Looking at the code, we exactly hit the sctp_sf_do_dupcook_b()
case which triggers an SCTP_CMD_UPDATE_ASSOC command to the
side effect interpreter, and in fact it properly copies over
peer_{random, hmacs, chunks} parameters from the newly created
association to update the existing one.
Also, the old asoc_shared_key is being released and based on
the new params, sctp_auth_asoc_init_active_key() updated.
However, the issue observed in this case is that the previous
asoc->peer.auth_capable was 0, and has *not* been updated, so
that instead of creating a new secret, we're doing an early
return from the function sctp_auth_asoc_init_active_key()
leaving asoc->asoc_shared_key as NULL. However, we now have to
authenticate chunks from the updated chunk list (e.g. COOKIE-ACK).
That in fact causes the server side when responding with ...
<------------------ AUTH; COOKIE-ACK -----------------
... to trigger a NULL pointer dereference, since in
sctp_packet_transmit(), it discovers that an AUTH chunk is
being queued for xmit, and thus it calls sctp_auth_calculate_hmac().
Since the asoc->active_key_id is still inherited from the
endpoint, and the same as encoded into the chunk, it uses
asoc->asoc_shared_key, which is still NULL, as an asoc_key
and dereferences it in ...
crypto_hash_setkey(desc.tfm, &asoc_key->data[0], asoc_key->len)
... causing an oops. All this happens because sctp_make_cookie_ack()
called with the *new* association has the peer.auth_capable=1
and therefore marks the chunk with auth=1 after checking
sctp_auth_send_cid(), but it is *actually* sent later on over
the then *updated* association's transport that didn't initialize
its shared key due to peer.auth_capable=0. Since control chunks
in that case are not sent by the temporary association which
are scheduled for deletion, they are issued for xmit via
SCTP_CMD_REPLY in the interpreter with the context of the
*updated* association. peer.auth_capable was 0 in the updated
association (which went from COOKIE_WAIT into ESTABLISHED state),
since all previous processing that performed sctp_process_init()
was being done on temporary associations, that we eventually
throw away each time.
The correct fix is to update to the new peer.auth_capable
value as well in the collision case via sctp_assoc_update(),
so that in case the collision migrated from 0 -> 1,
sctp_auth_asoc_init_active_key() can properly recalculate
the secret. This therefore fixes the observed server panic.
Fixes: 730fc3d05cd4 ("[SCTP]: Implete SCTP-AUTH parameter processing")
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MSG_MORE and 'corking' a socket would require that the transmit of
a data chunk be delayed.
Rename the return value to be less specific.
Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check for Nagle contains 6 separate checks all of which must be true
before a data packet is delayed.
Separate out each into its own 'if (test) return SCTP_XMIT_OK' so that
the reasons can be individually described.
Also return directly with SCTP_XMIT_RWND_FULL.
Delete the now-unused 'retval' variable and 'finish' label from
sctp_packet_can_append_data().
Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With support of SCTP_SNDINFO/SCTP_RCVINFO as described in RFC6458,
5.3.4/5.3.5, we can now deprecate SCTP_SNDRCV. The RFC already
declares it as deprecated:
This structure mixes the send and receive path. SCTP_SNDINFO
(described in Section 5.3.4) and SCTP_RCVINFO (described in
Section 5.3.5) split this information. These structures should
be used, when possible, since SCTP_SNDRCV is deprecated.
So whenever a user tries to subscribe to sctp_data_io_event via
setsockopt(2) which triggers inclusion of SCTP_SNDRCV cmsg_type,
issue a warning in the log.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements section 8.1.31. of RFC6458, which adds support
for setting/retrieving SCTP_DEFAULT_SNDINFO:
Applications that wish to use the sendto() system call may wish
to specify a default set of parameters that would normally be
supplied through the inclusion of ancillary data. This socket
option allows such an application to set the default sctp_sndinfo
structure. The application that wishes to use this socket option
simply passes the sctp_sndinfo structure (defined in Section 5.3.4)
to this call. The input parameters accepted by this call include
snd_sid, snd_flags, snd_ppid, and snd_context. The snd_flags
parameter is composed of a bitwise OR of SCTP_UNORDERED, SCTP_EOF,
and SCTP_SENDALL. The snd_assoc_id field specifies the association
to which to apply the parameters. For a one-to-many style socket,
any of the predefined constants are also allowed in this field.
The field is ignored for one-to-one style sockets.
Joint work with Daniel Borkmann.
Signed-off-by: Geir Ola Vaagland <geirola@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements section 5.3.6. of RFC6458, that is, support
for 'SCTP Next Receive Information Structure' (SCTP_NXTINFO) which
is placed into ancillary data cmsghdr structure for each recvmsg()
call, if this information is already available when delivering the
current message.
This option can be enabled/disabled via setsockopt(2) on SOL_SCTP
level by setting an int value with 1/0 for SCTP_RECVNXTINFO in
user space applications as per RFC6458, section 8.1.30.
The sctp_nxtinfo structure is defined as per RFC as below ...
struct sctp_nxtinfo {
uint16_t nxt_sid;
uint16_t nxt_flags;
uint32_t nxt_ppid;
uint32_t nxt_length;
sctp_assoc_t nxt_assoc_id;
};
... and provided under cmsg_level IPPROTO_SCTP, cmsg_type
SCTP_NXTINFO, while cmsg_data[] contains struct sctp_nxtinfo.
Joint work with Daniel Borkmann.
Signed-off-by: Geir Ola Vaagland <geirola@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements section 5.3.5. of RFC6458, that is, support
for 'SCTP Receive Information Structure' (SCTP_RCVINFO) which is
placed into ancillary data cmsghdr structure for each recvmsg()
call.
This option can be enabled/disabled via setsockopt(2) on SOL_SCTP
level by setting an int value with 1/0 for SCTP_RECVRCVINFO in user
space applications as per RFC6458, section 8.1.29.
The sctp_rcvinfo structure is defined as per RFC as below ...
struct sctp_rcvinfo {
uint16_t rcv_sid;
uint16_t rcv_ssn;
uint16_t rcv_flags;
<-- 2 bytes hole -->
uint32_t rcv_ppid;
uint32_t rcv_tsn;
uint32_t rcv_cumtsn;
uint32_t rcv_context;
sctp_assoc_t rcv_assoc_id;
};
... and provided under cmsg_level IPPROTO_SCTP, cmsg_type
SCTP_RCVINFO, while cmsg_data[] contains struct sctp_rcvinfo.
An sctp_rcvinfo item always corresponds to the data in msg_iov.
Joint work with Daniel Borkmann.
Signed-off-by: Geir Ola Vaagland <geirola@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements section 5.3.4. of RFC6458, that is, support
for 'SCTP Send Information Structure' (SCTP_SNDINFO) which can be
placed into ancillary data cmsghdr structure for sendmsg() calls.
The sctp_sndinfo structure is defined as per RFC as below ...
struct sctp_sndinfo {
uint16_t snd_sid;
uint16_t snd_flags;
uint32_t snd_ppid;
uint32_t snd_context;
sctp_assoc_t snd_assoc_id;
};
... and supplied under cmsg_level IPPROTO_SCTP, cmsg_type
SCTP_SNDINFO, while cmsg_data[] contains struct sctp_sndinfo.
An sctp_sndinfo item always corresponds to the data in msg_iov.
Joint work with Daniel Borkmann.
Signed-off-by: Geir Ola Vaagland <geirola@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on some other SCTP code, I noticed that some
structures shared with user space are leaking uninitialized
stack or heap buffer. In particular, struct sctp_sndrcvinfo
has a 2 bytes hole between .sinfo_flags and .sinfo_ppid that
remains unfilled by us in sctp_ulpevent_read_sndrcvinfo() when
putting this into cmsg. But also struct sctp_remote_error
contains a 2 bytes hole that we don't fill but place into a skb
through skb_copy_expand() via sctp_ulpevent_make_remote_error().
Both structures are defined by the IETF in RFC6458:
* Section 5.3.2. SCTP Header Information Structure:
The sctp_sndrcvinfo structure is defined below:
struct sctp_sndrcvinfo {
uint16_t sinfo_stream;
uint16_t sinfo_ssn;
uint16_t sinfo_flags;
<-- 2 bytes hole -->
uint32_t sinfo_ppid;
uint32_t sinfo_context;
uint32_t sinfo_timetolive;
uint32_t sinfo_tsn;
uint32_t sinfo_cumtsn;
sctp_assoc_t sinfo_assoc_id;
};
* 6.1.3. SCTP_REMOTE_ERROR:
A remote peer may send an Operation Error message to its peer.
This message indicates a variety of error conditions on an
association. The entire ERROR chunk as it appears on the wire
is included in an SCTP_REMOTE_ERROR event. Please refer to the
SCTP specification [RFC4960] and any extensions for a list of
possible error formats. An SCTP error notification has the
following format:
struct sctp_remote_error {
uint16_t sre_type;
uint16_t sre_flags;
uint32_t sre_length;
uint16_t sre_error;
<-- 2 bytes hole -->
sctp_assoc_t sre_assoc_id;
uint8_t sre_data[];
};
Fix this by setting both to 0 before filling them out. We also
have other structures shared between user and kernel space in
SCTP that contains holes (e.g. struct sctp_paddrthlds), but we
copy that buffer over from user space first and thus don't need
to care about it in that cases.
While at it, we can also remove lengthy comments copied from
the draft, instead, we update the comment with the correct RFC
number where one can look it up.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_init_cmd_seq() and sctp_next_cmd() are only called from one place.
The call sequence for sctp_add_cmd_sf() is likely to be longer than
the inlined code.
With sctp_add_cmd_sf() inlined the compiler can optimise repeated calls.
Signed-off-by: David Laight <david.laight@aculab.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only warn if the value is written to alpha or beta. We don't care
emitting a one-time warning when only reading it.
Reported-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC4960, section 8.3 says:
On an idle destination address that is allowed to heartbeat,
it is recommended that a HEARTBEAT chunk is sent once per RTO
of that destination address plus the protocol parameter
'HB.interval', with jittering of +/- 50% of the RTO value,
and exponential backoff of the RTO if the previous HEARTBEAT
is unanswered.
Currently, we calculate jitter via sctp_jitter() function first,
and then add its result to the current RTO for the new timeout:
TMO = RTO + (RAND() % RTO) - (RTO / 2)
`------------------------^-=> sctp_jitter()
Instead, we can just simplify all this by directly calculating:
TMO = (RTO / 2) + (RAND() % RTO)
With the help of prandom_u32_max(), we don't need to open code
our own global PRNG, but can instead just make use of the per
CPU implementation of prandom with better quality numbers. Also,
we can now spare us the conditional for divide by zero check
since no div or mod operation needs to be used. Note that
prandom_u32_max() won't emit the same result as a mod operation,
but we really don't care here as we only want to have a random
number scaled into RTO interval.
Note, exponential RTO backoff is handeled elsewhere, namely in
sctp_do_8_2_transport_strike().
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When writing to the sysctl field net.sctp.auth_enable, it can well
be that the user buffer we handed over to proc_dointvec() via
proc_sctp_do_auth() handler contains something other than integers.
In that case, we would set an uninitialized 4-byte value from the
stack to net->sctp.auth_enable that can be leaked back when reading
the sysctl variable, and it can unintentionally turn auth_enable
on/off based on the stack content since auth_enable is interpreted
as a boolean.
Fix it up by making sure proc_dointvec() returned sucessfully.
Fixes: b14878ccb7fa ("net: sctp: cache auth_enable per endpoint")
Reported-by: Florian Westphal <fwestpha@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sysctl handler proc_sctp_do_hmac_alg(), proc_sctp_do_rto_min() and
proc_sctp_do_rto_max() do not properly reflect some error cases
when writing values via sysctl from internal proc functions such
as proc_dointvec() and proc_dostring().
In all these cases we pass the test for write != 0 and partially
do additional work just to notice that additional sanity checks
fail and we return with hard-coded -EINVAL while proc_do*
functions might also return different errors. So fix this up by
simply testing a successful return of proc_do* right after
calling it.
This also allows to propagate its return value onwards to the user.
While touching this, also fix up some minor style issues.
Fixes: 4f3fdf3bc59c ("sctp: add check rto_min and rto_max in sysctl")
Fixes: 3c68198e7511 ("sctp: Make hmac algorithm selection for cookie generation dynamic")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 3fd091e73b81 ("[SCTP]: Remove multiple levels of msecs
to jiffies conversions.") has silently changed permissions for
rto_alpha and rto_beta knobs from 0644 to 0444. The purpose of
this was to discourage users from tweaking rto_alpha and
rto_beta knobs in production environments since they are key
to correctly compute rtt/srtt.
RFC4960 under section 6.3.1. RTO Calculation says regarding
rto_alpha and rto_beta under rule C3 and C4:
[...]
C3) When a new RTT measurement R' is made, set
RTTVAR <- (1 - RTO.Beta) * RTTVAR + RTO.Beta * |SRTT - R'|
and
SRTT <- (1 - RTO.Alpha) * SRTT + RTO.Alpha * R'
Note: The value of SRTT used in the update to RTTVAR
is its value before updating SRTT itself using the
second assignment. After the computation, update
RTO <- SRTT + 4 * RTTVAR.
C4) When data is in flight and when allowed by rule C5
below, a new RTT measurement MUST be made each round
trip. Furthermore, new RTT measurements SHOULD be
made no more than once per round trip for a given
destination transport address. There are two reasons
for this recommendation: First, it appears that
measuring more frequently often does not in practice
yield any significant benefit [ALLMAN99]; second,
if measurements are made more often, then the values
of RTO.Alpha and RTO.Beta in rule C3 above should be
adjusted so that SRTT and RTTVAR still adjust to
changes at roughly the same rate (in terms of how many
round trips it takes them to reflect new values) as
they would if making only one measurement per
round-trip and using RTO.Alpha and RTO.Beta as given
in rule C3. However, the exact nature of these
adjustments remains a research issue.
[...]
While it is discouraged to adjust rto_alpha and rto_beta
and not further specified how to adjust them, the RFC also
doesn't explicitly forbid it, but rather gives a RECOMMENDED
default value (rto_alpha=3, rto_beta=2). We have a couple
of users relying on the old permissions before they got
changed. That said, if someone really has the urge to adjust
them, we could allow it with a warning in the log.
Fixes: 3fd091e73b81 ("[SCTP]: Remove multiple levels of msecs to jiffies conversions.")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consider the scenario:
For a TCP-style socket, while processing the COOKIE_ECHO chunk in
sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check,
a new association would be created in sctp_unpack_cookie(), but afterwards,
some processing maybe failed, and sctp_association_free() will be called to
free the previously allocated association, in sctp_association_free(),
sk_ack_backlog value is decremented for this socket, since the initial
value for sk_ack_backlog is 0, after the decrement, it will be 65535,
a wrap-around problem happens, and if we want to establish new associations
afterward in the same socket, ABORT would be triggered since sctp deem the
accept queue as full.
Fix this issue by only decrementing sk_ack_backlog for associations in
the endpoint's list.
Fix-suggested-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes the following sparse warning:
net/sctp/associola.c:1556:29: warning: incorrect type in initializer (different base types)
net/sctp/associola.c:1556:29: expected bool [unsigned] [usertype] preload
net/sctp/associola.c:1556:29: got restricted gfp_t
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In function sctp_select_active_and_retran_path(), we walk the
transport list in order to look for the two most recently used
ACTIVE transports (trans_pri, trans_sec). In case we didn't find
anything ACTIVE, we currently just camp on a possibly PF or
INACTIVE transport that is primary path; this behavior actually
dates back to linux-history tree of the very early days of
lksctp, and can yield a behavior that chooses suboptimal
transport paths.
Instead, be a bit more clever by reusing and extending the
recently introduced sctp_trans_elect_best() handler. In case
both transports are evaluated to have the same score resulting
from their states, break the tie by looking at: 1) transport
patch error count 2) last_time_heard value from each transport.
This is analogous to Nishida's Quick Failover draft [1],
section 5.1, 3:
The sender SHOULD avoid data transmission to PF destinations.
When all destinations are in either PF or Inactive state,
the sender MAY either move the destination from PF to active
state (and transmit data to the active destination) or the
sender MAY transmit data to a PF destination. In the former
scenario, (i) the sender MUST NOT notify the ULP about the
state transition, and (ii) MUST NOT clear the destination's
error counter. It is recommended that the sender picks the
PF destination with least error count (fewest consecutive
timeouts) for data transmission. In case of a tie (multiple PF
destinations with same error count), the sender MAY choose the
last active destination.
Thus for sctp_select_active_and_retran_path(), we keep track of
the best, if any, transport that is in PF state and in case no
ACTIVE transport has been found (hence trans_{pri,sec} is NULL),
we select the best out of the three: current primary_path and
retran_path as well as a possible PF transport.
The secondary may still camp on the original primary_path as
before. The change in sctp_trans_elect_best() with a more fine
grained tie selection also improves at the same time path selection
for sctp_assoc_update_retran_path() in case of non-ACTIVE states.
[1] http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Be more precise in transport path selection and use ktime
helpers instead of jiffies to compare and pick the better
primary and secondary recently used transports. This also
avoids any side-effects during a possible roll-over, and
could lead to better path decision-making.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch just refactors and moves the code for the active
path selection into its own helper function outside of
sctp_assoc_control_transport() which is already big enough.
No functional changes here.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two minimal helper functions analogous to time_before() and
time_after() that will later on both be needed by SCTP code.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define separate fields in the sock structure for configuring disabling
checksums in both TX and RX-- sk_no_check_tx and sk_no_check_rx.
The SO_NO_CHECK socket option only affects sk_no_check_tx. Also,
removed UDP_CSUM_* defines since they are no longer necessary.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It doesn't seem like an protocols are setting anything other
than the default, and allowing to arbitrarily disable checksums
for a whole protocol seems dangerous. This can be done on a per
socket basis.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fengguang reported the following sparse warning:
>> net/ipv6/proc.c:198:41: sparse: incorrect type in argument 1 (different address spaces)
net/ipv6/proc.c:198:41: expected void [noderef] <asn:3>*mib
net/ipv6/proc.c:198:41: got void [noderef] <asn:3>**pcpumib
Fixes: commit 698365fa1874aa7635d51667a3 (net: clean up snmp stats code)
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_local_port_range is already per netns, so should ip_local_reserved_ports
be. And since it is none by default we don't actually need it when we don't
enable CONFIG_SYSCTL.
By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port()
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by several people, rename local_df to ignore_df,
since it means "ignore df bit if it is set".
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/altera/altera_sgdma.c
net/netlink/af_netlink.c
net/sched/cls_api.c
net/sched/sch_api.c
The netlink conflict dealt with moving to netlink_capable() and
netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
in non-init namespaces. These were simple transformations from
netlink_capable to netlink_ns_capable.
The Altera driver conflict was simply code removal overlapping some
void pointer cast cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
When register_net_sysctl failed, we should free the
sysctl_table.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 8f0ea0fe3a036a47767f9c80e (snmp: reduce percpu needs by 50%)
reduced snmp array size to 1, so technically it doesn't have to be
an array any more. What's more, after the following commit:
commit 933393f58fef9963eac61db8093689544e29a600
Date: Thu Dec 22 11:58:51 2011 -0600
percpu: Remove irqsafe_cpu_xxx variants
We simply say that regular this_cpu use must be safe regardless of
preemption and interrupt state. That has no material change for x86
and s390 implementations of this_cpu operations. However, arches that
do not provide their own implementation for this_cpu operations will
now get code generated that disables interrupts instead of preemption.
probably no arch wants to have SNMP_ARRAY_SZ == 2. At least after
almost 3 years, no one complains.
So, just convert the array to a single pointer and remove snmp_mib_init()
and snmp_mib_free() as well.
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't transition to the PF state on every strike after 'Path.Max.Retrans'.
Per draft-ietf-tsvwg-sctp-failover-03 Section 5.1.6:
Additional (PMR - PFMR) consecutive timeouts on a PF destination
confirm the path failure, upon which the destination transitions to the
Inactive state. As described in [RFC4960], the sender (i) SHOULD notify
ULP about this state transition, and (ii) transmit heartbeats to the
Inactive destination at a lower frequency as described in Section 8.3 of
[RFC4960].
This also prevents sending SCTP_ADDR_UNREACHABLE to the user as the state
bounces between SCTP_INACTIVE and SCTP_PF for each subsequent strike.
Signed-off-by: Karl Heiss <kheiss@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 813b3b5db83 (ipv4: Use caller's on-stack flowi as-is
in output route lookups.) introduces another regression which
is very similar to the problem of commit e6b45241c (ipv4: reset
flowi parameters on route connect) wants to fix:
Before we call ip_route_output_key() in sctp_v4_get_dst() to
get a dst that matches a bind address as the source address,
we have already called this function previously and the flowi
parameters have been initialized including flowi4_oif, so when
we call this function again, the process in __ip_route_output_key()
will be different because of the setting of flowi4_oif, and we'll
get a networking device which corresponds to the inputted flowi4_oif
as the output device, this is wrong because we'll never hit this
place if the previously returned source address of dst match one
of the bound addresses.
To reproduce this problem, a vlan setting is enough:
# ifconfig eth0 up
# route del default
# vconfig add eth0 2
# vconfig add eth0 3
# ifconfig eth0.2 10.0.1.14 netmask 255.255.255.0
# route add default gw 10.0.1.254 dev eth0.2
# ifconfig eth0.3 10.0.0.14 netmask 255.255.255.0
# ip rule add from 10.0.0.14 table 4
# ip route add table 4 default via 10.0.0.254 src 10.0.0.14 dev eth0.3
# sctp_darn -H 10.0.0.14 -P 36422 -h 10.1.4.134 -p 36422 -s -I
You'll detect that all the flow are routed to eth0.2(10.0.1.254).
Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The busy polling socket option adds support for sockets to busy wait on data
arriving on the napi queue from which they have most recently received a frame.
Currently only tcp and udp support this feature, but theres no reason sctp can't
do so as well. Add it in so appliations can take advantage of it
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Daniel Borkmann <dborkman@redhat.com>
CC: netdev@vger.kernel.org
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, it is possible to create an SCTP socket, then switch
auth_enable via sysctl setting to 1 and crash the system on connect:
Oops[#1]:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1
task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000
[...]
Call Trace:
[<ffffffff8043c4e8>] sctp_auth_asoc_set_default_hmac+0x68/0x80
[<ffffffff8042b300>] sctp_process_init+0x5e0/0x8a4
[<ffffffff8042188c>] sctp_sf_do_5_1B_init+0x234/0x34c
[<ffffffff804228c8>] sctp_do_sm+0xb4/0x1e8
[<ffffffff80425a08>] sctp_endpoint_bh_rcv+0x1c4/0x214
[<ffffffff8043af68>] sctp_rcv+0x588/0x630
[<ffffffff8043e8e8>] sctp6_rcv+0x10/0x24
[<ffffffff803acb50>] ip6_input+0x2c0/0x440
[<ffffffff8030fc00>] __netif_receive_skb_core+0x4a8/0x564
[<ffffffff80310650>] process_backlog+0xb4/0x18c
[<ffffffff80313cbc>] net_rx_action+0x12c/0x210
[<ffffffff80034254>] __do_softirq+0x17c/0x2ac
[<ffffffff800345e0>] irq_exit+0x54/0xb0
[<ffffffff800075a4>] ret_from_irq+0x0/0x4
[<ffffffff800090ec>] rm7k_wait_irqoff+0x24/0x48
[<ffffffff8005e388>] cpu_startup_entry+0xc0/0x148
[<ffffffff805a88b0>] start_kernel+0x37c/0x398
Code: dd0900b8 000330f8 0126302d <dcc60000> 50c0fff1 0047182a a48306a0
03e00008 00000000
---[ end trace b530b0551467f2fd ]---
Kernel panic - not syncing: Fatal exception in interrupt
What happens while auth_enable=0 in that case is, that
ep->auth_hmacs is initialized to NULL in sctp_auth_init_hmacs()
when endpoint is being created.
After that point, if an admin switches over to auth_enable=1,
the machine can crash due to NULL pointer dereference during
reception of an INIT chunk. When we enter sctp_process_init()
via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk,
the INIT verification succeeds and while we walk and process
all INIT params via sctp_process_param() we find that
net->sctp.auth_enable is set, therefore do not fall through,
but invoke sctp_auth_asoc_set_default_hmac() instead, and thus,
dereference what we have set to NULL during endpoint
initialization phase.
The fix is to make auth_enable immutable by caching its value
during endpoint initialization, so that its original value is
being carried along until destruction. The bug seems to originate
from the very first days.
Fix in joint work with Daniel Borkmann.
Reported-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_queue_xmit() assumes the skb it has to transmit is attached to an
inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership")
changed l2tp to not change skb ownership and thus broke this assumption.
One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
so that we do not assume skb->sk points to the socket used by l2tp
tunnel.
Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
Reported-by: Zhan Jianyu <nasa4836@gmail.com>
Tested-by: Zhan Jianyu <nasa4836@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several spots in the kernel perform a sequence like:
skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);
But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.
Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.
And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.
So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.
Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.
Signed-off-by: David S. Miller <davem@davemloft.net>
In function sctp_wake_up_waiters(), we need to involve a test
if the association is declared dead. If so, we don't have any
reference to a possible sibling association anymore and need
to invoke sctp_write_space() instead, and normally walk the
socket's associations and notify them of new wmem space. The
reason for special casing is that otherwise, we could run
into the following issue when a sctp_primitive_SEND() call
from sctp_sendmsg() fails, and tries to flush an association's
outq, i.e. in the following way:
sctp_association_free()
`-> list_del(&asoc->asocs) <-- poisons list pointer
asoc->base.dead = true
sctp_outq_free(&asoc->outqueue)
`-> __sctp_outq_teardown()
`-> sctp_chunk_free()
`-> consume_skb()
`-> sctp_wfree()
`-> sctp_wake_up_waiters() <-- dereferences poisoned pointers
if asoc->ep->sndbuf_policy=0
Therefore, only walk the list in an 'optimized' way if we find
that the current association is still active. We could also use
list_del_init() in addition when we call sctp_association_free(),
but as Vlad suggests, we want to trap such bugs and thus leave
it poisoned as is.
Why is it safe to resolve the issue by testing for asoc->base.dead?
Parallel calls to sctp_sendmsg() are protected under socket lock,
that is lock_sock()/release_sock(). Only within that path under
lock held, we're setting skb/chunk owner via sctp_set_owner_w().
Eventually, chunks are freed directly by an association still
under that lock. So when traversing association list on destruction
time from sctp_wake_up_waiters() via sctp_wfree(), a different
CPU can't be running sctp_wfree() while another one calls
sctp_association_free() as both happens under the same lock.
Therefore, this can also not race with setting/testing against
asoc->base.dead as we are guaranteed for this to happen in order,
under lock. Further, Vlad says: the times we check asoc->base.dead
is when we've cached an association pointer for later processing.
In between cache and processing, the association may have been
freed and is simply still around due to reference counts. We check
asoc->base.dead under a lock, so it should always be safe to check
and not race against sctp_association_free(). Stress-testing seems
fine now, too.
Fixes: cd253f9f357d ("net: sctp: wake up all assocs if sndbuf policy is per socket")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP charges chunks for wmem accounting via skb->truesize in
sctp_set_owner_w(), and sctp_wfree() respectively as the
reverse operation. If a sender runs out of wmem, it needs to
wait via sctp_wait_for_sndbuf(), and gets woken up by a call
to __sctp_write_space() mostly via sctp_wfree().
__sctp_write_space() is being called per association. Although
we assign sk->sk_write_space() to sctp_write_space(), which
is then being done per socket, it is only used if send space
is increased per socket option (SO_SNDBUF), as SOCK_USE_WRITE_QUEUE
is set and therefore not invoked in sock_wfree().
Commit 4c3a5bdae293 ("sctp: Don't charge for data in sndbuf
again when transmitting packet") fixed an issue where in case
sctp_packet_transmit() manages to queue up more than sndbuf
bytes, sctp_wait_for_sndbuf() will never be woken up again
unless it is interrupted by a signal. However, a still
remaining issue is that if net.sctp.sndbuf_policy=0, that is
accounting per socket, and one-to-many sockets are in use,
the reclaimed write space from sctp_wfree() is 'unfairly'
handed back on the server to the association that is the lucky
one to be woken up again via __sctp_write_space(), while
the remaining associations are never be woken up again
(unless by a signal).
The effect disappears with net.sctp.sndbuf_policy=1, that
is wmem accounting per association, as it guarantees a fair
share of wmem among associations.
Therefore, if we have reclaimed memory in case of per socket
accounting, wake all related associations to a socket in a
fair manner, that is, traverse the socket association list
starting from the current neighbour of the association and
issue a __sctp_write_space() to everyone until we end up
waking ourselves. This guarantees that no association is
preferred over another and even if more associations are
taken into the one-to-many session, all receivers will get
messages from the server and are not stalled forever on
high load. This setting still leaves the advantage of per
socket accounting in touch as an association can still use
up global limits if unused by others.
Fixes: 4eb701dfc618 ("[SCTP] Fix SCTP sendbuffer accouting.")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/r8152.c
drivers/net/xen-netback/netback.c
Both the r8152 and netback conflicts were simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is basically just to let Coverity et al shut up. Remove an
unneeded NULL check in sctp_assoc_update_retran_path().
It is safe to remove it, because in sctp_assoc_update_retran_path()
we iterate over the list of transports, our own transport which is
asoc->peer.retran_path included. In the iteration, we skip the
list head element and transports in state SCTP_UNCONFIRMED.
Such transports came from peer addresses received in INIT/INIT-ACK
address parameters. They are not yet confirmed by a heartbeat and
not available for data transfers.
We know however that in the list of transports, even if it contains
such elements, it at least contains our asoc->peer.retran_path as
well, so even if next to that element, we only encounter
SCTP_UNCONFIRMED transports, we are always going to fall back to
asoc->peer.retran_path through sctp_trans_elect_best(), as that is
for sure not SCTP_UNCONFIRMED as per fbdf501c9374 ("sctp: Do no
select unconfirmed transports for retransmissions").
Whenever we call sctp_trans_elect_best() it will give us a non-NULL
element back, and therefore when we break out of the loop, we are
guaranteed to have a non-NULL transport pointer, and can remove
the NULL check.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on ec0223ec48a9 ("net: sctp: fix sctp_sf_do_5_1D_ce to
verify if we/peer is AUTH capable"), we noticed that there's a skb
memory leakage in the error path.
Running the same reproducer as in ec0223ec48a9 and by unconditionally
jumping to the error label (to simulate an error condition) in
sctp_sf_do_5_1D_ce() receive path lets kmemleak detector bark about
the unfreed chunk->auth_chunk skb clone:
Unreferenced object 0xffff8800b8f3a000 (size 256):
comm "softirq", pid 0, jiffies 4294769856 (age 110.757s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
89 ab 75 5e d4 01 58 13 00 00 00 00 00 00 00 00 ..u^..X.........
backtrace:
[<ffffffff816660be>] kmemleak_alloc+0x4e/0xb0
[<ffffffff8119f328>] kmem_cache_alloc+0xc8/0x210
[<ffffffff81566929>] skb_clone+0x49/0xb0
[<ffffffffa0467459>] sctp_endpoint_bh_rcv+0x1d9/0x230 [sctp]
[<ffffffffa046fdbc>] sctp_inq_push+0x4c/0x70 [sctp]
[<ffffffffa047e8de>] sctp_rcv+0x82e/0x9a0 [sctp]
[<ffffffff815abd38>] ip_local_deliver_finish+0xa8/0x210
[<ffffffff815a64af>] nf_reinject+0xbf/0x180
[<ffffffffa04b4762>] nfqnl_recv_verdict+0x1d2/0x2b0 [nfnetlink_queue]
[<ffffffffa04aa40b>] nfnetlink_rcv_msg+0x14b/0x250 [nfnetlink]
[<ffffffff815a3269>] netlink_rcv_skb+0xa9/0xc0
[<ffffffffa04aa7cf>] nfnetlink_rcv+0x23f/0x408 [nfnetlink]
[<ffffffff815a2bd8>] netlink_unicast+0x168/0x250
[<ffffffff815a2fa1>] netlink_sendmsg+0x2e1/0x3f0
[<ffffffff8155cc6b>] sock_sendmsg+0x8b/0xc0
[<ffffffff8155d449>] ___sys_sendmsg+0x369/0x380
What happens is that commit bbd0d59809f9 clones the skb containing
the AUTH chunk in sctp_endpoint_bh_rcv() when having the edge case
that an endpoint requires COOKIE-ECHO chunks to be authenticated:
---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
<------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
------------------ AUTH; COOKIE-ECHO ---------------->
<-------------------- COOKIE-ACK ---------------------
When we enter sctp_sf_do_5_1D_ce() and before we actually get to
the point where we process (and subsequently free) a non-NULL
chunk->auth_chunk, we could hit the "goto nomem_init" path from
an error condition and thus leave the cloned skb around w/o
freeing it.
The fix is to centrally free such clones in sctp_chunk_destroy()
handler that is invoked from sctp_chunk_free() after all refs have
dropped; and also move both kfree_skb(chunk->auth_chunk) there,
so that chunk->auth_chunk is either NULL (since sctp_chunkify()
allocs new chunks through kmem_cache_zalloc()) or non-NULL with
a valid skb pointer. chunk->skb and chunk->auth_chunk are the
only skbs in the sctp_chunk structure that need to be handeled.
While at it, we should use consume_skb() for both. It is the same
as dev_kfree_skb() but more appropriately named as we are not
a device but a protocol. Also, this effectively replaces the
kfree_skb() from both invocations into consume_skb(). Functions
are the same only that kfree_skb() assumes that the frame was
being dropped after a failure (e.g. for tools like drop monitor),
usage of consume_skb() seems more appropriate in function
sctp_chunk_destroy() though.
Fixes: bbd0d59809f9 ("[SCTP]: Implement the receive and verification of AUTH chunk")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Vlad Yasevich <yasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/recv.c
drivers/net/wireless/mwifiex/pcie.c
net/ipv6/sit.c
The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.
The two wireless conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC4895 introduced AUTH chunks for SCTP; during the SCTP
handshake RANDOM; CHUNKS; HMAC-ALGO are negotiated (CHUNKS
being optional though):
---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
<------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
-------------------- COOKIE-ECHO -------------------->
<-------------------- COOKIE-ACK ---------------------
A special case is when an endpoint requires COOKIE-ECHO
chunks to be authenticated:
---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
<------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
------------------ AUTH; COOKIE-ECHO ---------------->
<-------------------- COOKIE-ACK ---------------------
RFC4895, section 6.3. Receiving Authenticated Chunks says:
The receiver MUST use the HMAC algorithm indicated in
the HMAC Identifier field. If this algorithm was not
specified by the receiver in the HMAC-ALGO parameter in
the INIT or INIT-ACK chunk during association setup, the
AUTH chunk and all the chunks after it MUST be discarded
and an ERROR chunk SHOULD be sent with the error cause
defined in Section 4.1. [...] If no endpoint pair shared
key has been configured for that Shared Key Identifier,
all authenticated chunks MUST be silently discarded. [...]
When an endpoint requires COOKIE-ECHO chunks to be
authenticated, some special procedures have to be followed
because the reception of a COOKIE-ECHO chunk might result
in the creation of an SCTP association. If a packet arrives
containing an AUTH chunk as a first chunk, a COOKIE-ECHO
chunk as the second chunk, and possibly more chunks after
them, and the receiver does not have an STCB for that
packet, then authentication is based on the contents of
the COOKIE-ECHO chunk. In this situation, the receiver MUST
authenticate the chunks in the packet by using the RANDOM
parameters, CHUNKS parameters and HMAC_ALGO parameters
obtained from the COOKIE-ECHO chunk, and possibly a local
shared secret as inputs to the authentication procedure
specified in Section 6.3. If authentication fails, then
the packet is discarded. If the authentication is successful,
the COOKIE-ECHO and all the chunks after the COOKIE-ECHO
MUST be processed. If the receiver has an STCB, it MUST
process the AUTH chunk as described above using the STCB
from the existing association to authenticate the
COOKIE-ECHO chunk and all the chunks after it. [...]
Commit bbd0d59809f9 introduced the possibility to receive
and verification of AUTH chunk, including the edge case for
authenticated COOKIE-ECHO. On reception of COOKIE-ECHO,
the function sctp_sf_do_5_1D_ce() handles processing,
unpacks and creates a new association if it passed sanity
checks and also tests for authentication chunks being
present. After a new association has been processed, it
invokes sctp_process_init() on the new association and
walks through the parameter list it received from the INIT
chunk. It checks SCTP_PARAM_RANDOM, SCTP_PARAM_HMAC_ALGO
and SCTP_PARAM_CHUNKS, and copies them into asoc->peer
meta data (peer_random, peer_hmacs, peer_chunks) in case
sysctl -w net.sctp.auth_enable=1 is set. If in INIT's
SCTP_PARAM_SUPPORTED_EXT parameter SCTP_CID_AUTH is set,
peer_random != NULL and peer_hmacs != NULL the peer is to be
assumed asoc->peer.auth_capable=1, in any other case
asoc->peer.auth_capable=0.
Now, if in sctp_sf_do_5_1D_ce() chunk->auth_chunk is
available, we set up a fake auth chunk and pass that on to
sctp_sf_authenticate(), which at latest in
sctp_auth_calculate_hmac() reliably dereferences a NULL pointer
at position 0..0008 when setting up the crypto key in
crypto_hash_setkey() by using asoc->asoc_shared_key that is
NULL as condition key_id == asoc->active_key_id is true if
the AUTH chunk was injected correctly from remote. This
happens no matter what net.sctp.auth_enable sysctl says.
The fix is to check for net->sctp.auth_enable and for
asoc->peer.auth_capable before doing any operations like
sctp_sf_authenticate() as no key is activated in
sctp_auth_asoc_init_active_key() for each case.
Now as RFC4895 section 6.3 states that if the used HMAC-ALGO
passed from the INIT chunk was not used in the AUTH chunk, we
SHOULD send an error; however in this case it would be better
to just silently discard such a maliciously prepared handshake
as we didn't even receive a parameter at all. Also, as our
endpoint has no shared key configured, section 6.3 says that
MUST silently discard, which we are doing from now onwards.
Before calling sctp_sf_pdiscard(), we need not only to free
the association, but also the chunk->auth_chunk skb, as
commit bbd0d59809f9 created a skb clone in that case.
I have tested this locally by using netfilter's nfqueue and
re-injecting packets into the local stack after maliciously
modifying the INIT chunk (removing RANDOM; HMAC-ALGO param)
and the SCTP packet containing the COOKIE_ECHO (injecting
AUTH chunk before COOKIE_ECHO). Fixed with this patch applied.
Fixes: bbd0d59809f9 ("[SCTP]: Implement the receive and verification of AUTH chunk")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Vlad Yasevich <yasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Problem statement: 1) both paths (primary path1 and alternate
path2) are up after the association has been established i.e.,
HB packets are normally exchanged, 2) path2 gets inactive after
path_max_retrans * max_rto timed out (i.e. path2 is down completely),
3) now, if a transmission times out on the only surviving/active
path1 (any ~1sec network service impact could cause this like
a channel bonding failover), then the retransmitted packets are
sent over the inactive path2; this happens with partial failover
and without it.
Besides not being optimal in the above scenario, a small failure
or timeout in the only existing path has the potential to cause
long delays in the retransmission (depending on RTO_MAX) until
the still active path is reselected. Further, when the T3-timeout
occurs, we have active_patch == retrans_path, and even though the
timeout occurred on the initial transmission of data, not a
retransmit, we end up updating retransmit path.
RFC4960, section 6.4. "Multi-Homed SCTP Endpoints" states under
6.4.1. "Failover from an Inactive Destination Address" the
following:
Some of the transport addresses of a multi-homed SCTP endpoint
may become inactive due to either the occurrence of certain
error conditions (see Section 8.2) or adjustments from the
SCTP user.
When there is outbound data to send and the primary path
becomes inactive (e.g., due to failures), or where the SCTP
user explicitly requests to send data to an inactive
destination transport address, before reporting an error to
its ULP, the SCTP endpoint should try to send the data to an
alternate __active__ destination transport address if one
exists.
When retransmitting data that timed out, if the endpoint is
multihomed, it should consider each source-destination address
pair in its retransmission selection policy. When retransmitting
timed-out data, the endpoint should attempt to pick the most
divergent source-destination pair from the original
source-destination pair to which the packet was transmitted.
Note: Rules for picking the most divergent source-destination
pair are an implementation decision and are not specified
within this document.
So, we should first reconsider to take the current active
retransmission transport if we cannot find an alternative
active one. If all of that fails, we can still round robin
through unkown, partial failover, and inactive ones in the
hope to find something still suitable.
Commit 4141ddc02a92 ("sctp: retran_path update bug fix") broke
that behaviour by selecting the next inactive transport when
no other active transport was found besides the current assoc's
peer.retran_path. Before commit 4141ddc02a92, we would have
traversed through the list until we reach our peer.retran_path
again, and in case that is still in state SCTP_ACTIVE, we would
take it and return. Only if that is not the case either, we
take the next inactive transport.
Besides all that, another issue is that transports in state
SCTP_UNKNOWN could be preferred over transports in state
SCTP_ACTIVE in case a SCTP_ACTIVE transport appears after
SCTP_UNKNOWN in the transport list yielding a weaker transport
state to be used in retransmission.
This patch mostly reverts 4141ddc02a92, but also rewrites
this function to introduce more clarity and strictness into
the code. A strict priority of transport states is enforced
in this patch, hence selection is active > unkown > partial
failover > inactive.
Fixes: 4141ddc02a92 ("sctp: retran_path update bug fix")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Acked-by: Vlad Yasevich <yasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In current implementation it is possible to reach PF state from unconfirmed.
We can interpret sctp-failover-02 in a way that PF state is meant to be reached
only from active state, in the end, this is when entering PF state makes sense.
Here are few quotes from sctp-failover-02, but regardless of these, same
understanding can be reached from whole section 5:
Section 5.1, quickfailover guide:
"The PF state is an intermediate state between Active and Failed states."
"Each time the T3-rtx timer expires on an active or idle
destination, the error counter of that destination address will
be incremented. When the value in the error counter exceeds
PFMR, the endpoint should mark the destination transport address as PF."
There are several concrete reasons for such interpretation. For start, rfc4960
does not take into concern quickfailover algorithm. Therefore, quickfailover
must comply to 4960. Point where this compliance can be argued is following
behavior:
When PF is entered, association overall error counter is incremented for each
missed HB. This is contradictory to rfc4960, as address, while in unconfirmed
state, is subjected to probing, and while it is probed, it should not increment
association overall error counter. This has as a consequence that we might end
up in situation in which we drop association due path failure on unconfirmed
address, in case we have wrong configuration in a way:
Association.Max.Retrans == Path.Max.Retrans.
Another reason is that entering PF from unconfirmed will cause a loss of address
confirmed event when address is once (if) confirmed. This is fine from failover
guide point of view, but it is not consistent with behavior preceding failover
implementation and recommendation from 4960:
5.4. Path Verification
Whenever a path is confirmed, an indication MAY be given to the upper
layer.
Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>