Commit Graph

6477 Commits

Author SHA1 Message Date
Vlad Yasevich
555d3d5d2b SCTP: Fix chunk acceptance when no authenticated chunks were listed.
In the case where no autheticated chunks were specified, we were still
trying to verify that a given chunk needs authentication and doing so
incorrectly.  Add a check for parameter length to make sure we don't
try to use an empty auth_chunks parameter to verify against.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-29 10:17:42 -05:00
Vlad Yasevich
8ee4be37e8 SCTP: Fix the supported extensions paramter
Supported extensions parameter was not coded right and ended up
over-writing memory or causing skb overflows.  First, remove
the FWD_TSN support from as it shouldn't be there and also fix
the paramter encoding.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-29 10:17:41 -05:00
Vlad Yasevich
9baffaa689 SCTP: Fix SCTP-AUTH to correctly add HMACS paramter.
There was a typo that cleared the HMACS parameters when no
authenticated chunks were specified.  We whould be clearing
the chunks pointer instead of the hmacs.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-29 10:17:41 -05:00
Vlad Yasevich
fd10279bc7 SCTP: Fix the number of HB transmissions.
Our treatment of Heartbeats is special in that the inital HB chunk
counts against the error count for the association, where as for
other chunks, only retransmissions or timeouts count against us.
As a result, we had an off-by-1 situation with a number of
Heartbeats we could send.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-29 10:17:41 -05:00
Stephen Hemminger
a357dde9df [TCP] illinois: Incorrect beta usage
Lachlan Andrew observed that my TCP-Illinois implementation uses the
beta value incorrectly:
  The parameter  beta  in the paper specifies the amount to decrease
  *by*:  that is, on loss,
     W <-  W -  beta*W
  but in   tcp_illinois_ssthresh() uses  beta  as the amount
  to decrease  *to*: W <- beta*W

This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network,
hurting performance. Note: since the base beta value is .5, it has no
impact on a congested network.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-30 01:10:55 +11:00
Herbert Xu
5e5234ff17 [IPSEC]: Fix uninitialised dst warning in __xfrm_lookup
Andrew Morton reported that __xfrm_lookup generates this warning:

net/xfrm/xfrm_policy.c: In function '__xfrm_lookup':
net/xfrm/xfrm_policy.c:1449: warning: 'dst' may be used uninitialized in this function

This is because if policy->action is of an unexpected value then dst will
not be initialised.  Of course, in practice this should never happen since
the input layer xfrm_user/af_key will filter out all illegal values.  But
the compiler doesn't know that of course.

So this patch fixes this by taking the conservative approach and treat all
unknown actions the same as a blocking action.

Thanks to Andrew for finding this and providing an initial fix.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-30 00:50:31 +11:00
Pavel Emelyanov
076931989f [INET]: Fix inet_diag register vs rcv race
The following race is possible when one cpu unregisters the handler
while other one is trying to receive a message and call this one:

CPU1:                                                 CPU2:
inet_diag_rcv()                                       inet_diag_unregister()
  mutex_lock(&inet_diag_mutex);
  netlink_rcv_skb(skb, &inet_diag_rcv_msg);
    if (inet_diag_table[nlh->nlmsg_type] == 
                               NULL) /* false handler is still registered */
    ...
    netlink_dump_start(idiagnl, skb, nlh,
                           inet_diag_dump, NULL);
           cb = kzalloc(sizeof(*cb), GFP_KERNEL);
                   /* sleep here freeing memory 
                    * or preempt
                    * or sleep later on nlk->cb_mutex
                    */
                                                         spin_lock(&inet_diag_register_lock);
                                                         inet_diag_table[type] = NULL;
    ...                                                  spin_unlock(&inet_diag_register_lock);
                                                         synchronize_rcu();
                                                         /* CPU1 is sleeping - RCU quiescent
                                                          * state is passed
                                                          */
                                                         return;
    /* inet_diag_dump is finally called: */
    inet_diag_dump()
      handler = inet_diag_table[cb->nlh->nlmsg_type];
      BUG_ON(handler == NULL); 
      /* OOPS! While we slept the unregister has set
       * handler to NULL :(
       */

Grep showed, that the register/unregister functions are called
from init/fini module callbacks for tcp_/dccp_diag, so it's OK
to use the inet_diag_mutex to synchronize manipulations with the
inet_diag_table and the access to it.

Besides, as Herbert pointed out, asynchronous dumps should hold 
this mutex as well, and thus, we provide the mutex as cb_mutex one.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-30 00:08:14 +11:00
Pavel Emelyanov
82de382ce8 [BRIDGE]: Properly dereference the br_should_route_hook
This hook is protected with the RCU, so simple

	if (br_should_route_hook)
		br_should_route_hook(...)

is not enough on some architectures.

Use the rcu_dereference/rcu_assign_pointer in this case.

Fixed Stephen's comment concerning using the typeof().

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-29 23:58:58 +11:00
Pavel Emelyanov
17efdd4575 [BRIDGE]: Lost call to br_fdb_fini() in br_init() error path
In case the br_netfilter_init() (or any subsequent call) 
fails, the br_fdb_fini() must be called to free the allocated
in br_fdb_init() br_fdb_cache kmem cache.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-29 23:41:43 +11:00
Florian Zumbiehl
0a11225887 [UNIX]: EOF on non-blocking SOCK_SEQPACKET
I am not absolutely sure whether this actually is a bug (as in: I've got
no clue what the standards say or what other implementations do), but at
least I was pretty surprised when I noticed that a recv() on a
non-blocking unix domain socket of type SOCK_SEQPACKET (which is connection
oriented, after all) where the remote end has closed the connection
returned -1 (EAGAIN) rather than 0 to indicate end of file.

This is a test case:

| #include <sys/types.h>
| #include <unistd.h>
| #include <sys/socket.h>
| #include <sys/un.h>
| #include <fcntl.h>
| #include <string.h>
| #include <stdlib.h>
| 
| int main(){
| 	int sock;
| 	struct sockaddr_un addr;
| 	char buf[4096];
| 	int pfds[2];
| 
| 	pipe(pfds);
| 	sock=socket(PF_UNIX,SOCK_SEQPACKET,0);
| 	addr.sun_family=AF_UNIX;
| 	strcpy(addr.sun_path,"/tmp/foobar_testsock");
| 	bind(sock,(struct sockaddr *)&addr,sizeof(addr));
| 	listen(sock,1);
| 	if(fork()){
| 		close(sock);
| 		sock=socket(PF_UNIX,SOCK_SEQPACKET,0);
| 		connect(sock,(struct sockaddr *)&addr,sizeof(addr));
| 		fcntl(sock,F_SETFL,fcntl(sock,F_GETFL)|O_NONBLOCK);
| 		close(pfds[1]);
| 		read(pfds[0],buf,sizeof(buf));
| 		recv(sock,buf,sizeof(buf),0); // <-- this one
| 	}else accept(sock,NULL,NULL);
| 	exit(0);
| }

If you try it, make sure /tmp/foobar_testsock doesn't exist.

The marked recv() returns -1 (EAGAIN) on 2.6.23.9. Below you find a
patch that fixes that.

Signed-off-by: Florian Zumbiehl <florz@florz.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-29 23:19:23 +11:00
Joonwoo Park
6ab3b487db [VLAN]: Fix nested VLAN transmit bug
Fix misbehavior of vlan_dev_hard_start_xmit() for recursive encapsulations.

Signed-off-by: Joonwoo Park <joonwpark81@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-29 22:16:41 +11:00
Linus Torvalds
8c27eba549 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/net-2.6: (41 commits)
  [XFRM]: Fix leak of expired xfrm_states
  [ATM]: [he] initialize lock and tasklet earlier
  [IPV4]: Remove bogus ifdef mess in arp_process
  [SKBUFF]: Free old skb properly in skb_morph
  [IPV4]: Fix memory leak in inet_hashtables.h when NUMA is on
  [IPSEC]: Temporarily remove locks around copying of non-atomic fields
  [TCP] MTUprobe: Cleanup send queue check (no need to loop)
  [TCP]: MTUprobe: receiver window & data available checks fixed
  [MAINTAINERS]: tlan list is subscribers-only
  [SUNRPC]: Remove SPIN_LOCK_UNLOCKED
  [SUNRPC]: Make xprtsock.c:xs_setup_{udp,tcp}() static
  [PFKEY]: Sending an SADB_GET responds with an SADB_GET
  [IRDA]: Compilation for CONFIG_INET=n case
  [IPVS]: Fix compiler warning about unused register_ip_vs_protocol
  [ARP]: Fix arp reply when sender ip 0
  [IPV6] TCPMD5: Fix deleting key operation.
  [IPV6] TCPMD5: Check return value of tcp_alloc_md5sig_pool().
  [IPV4] TCPMD5: Use memmove() instead of memcpy() because we have overlaps.
  [IPV4] TCPMD5: Omit redundant NULL check for kfree() argument.
  ieee80211: Stop net_ratelimit/IEEE80211_DEBUG_DROP log pollution
  ...
2007-11-26 20:09:07 -08:00
Linus Torvalds
423eaf8f00 Merge git://git.linux-nfs.org/pub/linux/nfs-2.6
* git://git.linux-nfs.org/pub/linux/nfs-2.6:
  NFS: Clean up new multi-segment direct I/O changes
  NFS: Ensure we return zero if applications attempt to write zero bytes
  NFS: Support multiple segment iovecs in the NFS direct I/O path
  NFS: Introduce iovec I/O helpers to fs/nfs/direct.c
  SUNRPC: Add missing "space" to net/sunrpc/auth_gss.c
  SUNRPC: make sunrpc/xprtsock.c:xs_setup_{udp,tcp}() static
  NFS: fs/nfs/dir.c should #include "internal.h"
  NFS: make nfs_wb_page_priority() static
  NFS: mount failure causes bad page state
  SUNRPC: remove NFS/RDMA client's binary sysctls
  kernel BUG at fs/nfs/namespace.c:108! - can be triggered by bad server
  sunrpc: rpc_pipe_poll may miss available data in some cases
  sunrpc: return error if unsupported enctype or cksumtype is encountered
  sunrpc: gss_pipe_downcall(), don't assume all errors are transient
  NFS: Fix the ustat() regression
2007-11-26 19:42:59 -08:00
Patrick McHardy
5dba479711 [XFRM]: Fix leak of expired xfrm_states
The xfrm_timer calls __xfrm_state_delete, which drops the final reference
manually without triggering destruction of the state. Change it to use
xfrm_state_put to add the state to the gc list when we're dropping the
last reference. The timer function may still continue to use the state
safely since the final destruction does a del_timer_sync().

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-27 11:10:07 +08:00
Joe Perches
014313a9d6 SUNRPC: Add missing "space" to net/sunrpc/auth_gss.c
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-11-26 16:24:59 -05:00
Adrian Bunk
483066d62e SUNRPC: make sunrpc/xprtsock.c:xs_setup_{udp,tcp}() static
xs_setup_{udp,tcp}() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-11-26 16:24:50 -05:00
James Lentini
cfcb43ff7c SUNRPC: remove NFS/RDMA client's binary sysctls
Support for binary sysctls is being deprecated in 2.6.24. Since there
are no applications using the NFS/RDMA client's binary sysctls, it
makes sense to remove them. The patch below does this while leaving
the /proc/sys interface unchanged.

Please consider this for 2.6.24.

Signed-off-by: James Lentini <jlentini@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-11-26 16:21:19 -05:00
Adrian Bunk
3660019e5f [IPV4]: Remove bogus ifdef mess in arp_process
The #ifdef's in arp_process() were not only a mess, they were also wrong 
in the CONFIG_NET_ETHERNET=n and (CONFIG_NETDEV_1000=y or 
CONFIG_NETDEV_10000=y) cases.

Since they are not required this patch removes them.

Also removed are some #ifdef's around #include's that caused compile 
errors after this change.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-26 23:17:53 +08:00
Herbert Xu
2d4baff8da [SKBUFF]: Free old skb properly in skb_morph
The skb_morph function only freed the data part of the dst skb, but leaked
the auxiliary data such as the netfilter fields.  This patch fixes this by
moving the relevant parts from __kfree_skb to skb_release_all and calling
it in skb_morph.

It also makes kfree_skbmem static since it's no longer called anywhere else
and it now no longer does skb_release_data.

Thanks to Yasuyuki KOZAKAI for finding this problem and posting a patch for
it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-26 23:11:19 +08:00
Herbert Xu
8053fc3de7 [IPSEC]: Temporarily remove locks around copying of non-atomic fields
The change 050f009e16

	[IPSEC]: Lock state when copying non-atomic fields to user-space

caused a regression.

Ingo Molnar reports that it causes a potential dead-lock found by the
lock validator as it tries to take x->lock within xfrm_state_lock while
numerous other sites take the locks in opposite order.

For 2.6.24, the best fix is to simply remove the added locks as that puts
us back in the same state as we've been in for years.  For later kernels
a proper fix would be to reverse the locking order for every xfrm state
user such that if x->lock is taken together with xfrm_state_lock then
it is to be taken within it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-26 19:07:34 +08:00
Ilpo Järvinen
7f9c33e515 [TCP] MTUprobe: Cleanup send queue check (no need to loop)
The original code has striking complexity to perform a query
which can be reduced to a very simple compare.

FIN seqno may be included to write_seq but it should not make
any significant difference here compared to skb->len which was
used previously. One won't end up there with SYN still queued.

Use of write_seq check guarantees that there's a valid skb in
send_head so I removed the extra check.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: John Heffner <jheffner@psc.edu>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-23 19:10:56 +08:00
Ilpo Järvinen
91cc17c0e5 [TCP]: MTUprobe: receiver window & data available checks fixed
It seems that the checked range for receiver window check should
begin from the first rather than from the last skb that is going
to be included to the probe. And that can be achieved without
reference to skbs at all, snd_nxt and write_seq provides the
correct seqno already. Plus, it SHOULD account packets that are
necessary to trigger fast retransmit [RFC4821].

Location of snd_wnd < probe_size/size_needed check is bogus
because it will cause the other if() match as well (due to
snd_nxt >= snd_una invariant).

Removed dead obvious comment.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-23 19:08:16 +08:00
Jiri Slaby
5ba03e82b3 [SUNRPC]: Remove SPIN_LOCK_UNLOCKED
SPIN_LOCK_UNLOCKED is deprecated, use DEFINE_SPINLOCK instead

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-22 19:40:22 +08:00
Adrian Bunk
5fe4a33430 [SUNRPC]: Make xprtsock.c:xs_setup_{udp,tcp}() static
xs_setup_{udp,tcp}() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-22 19:38:25 +08:00
Charles Hardin
435000bebd [PFKEY]: Sending an SADB_GET responds with an SADB_GET
From: Charles Hardin <chardin@2wire.com>

Kernel needs to respond to an SADB_GET with the same message type to
conform to the RFC 2367 Section 3.1.5

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-22 19:35:15 +08:00
Pavel Emelyanov
8c92e6b0bf [IRDA]: Compilation for CONFIG_INET=n case
Found this occasionally. 

The CONFIG_INET=n is hardly ever set, but if it is the 
irlan_eth_send_gratuitous_arp() compilation should produce a 
warning about unused variable in_dev.

Too pedantic? :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-22 19:15:56 +08:00
Pavel Emelyanov
d535a916cd [IPVS]: Fix compiler warning about unused register_ip_vs_protocol
This is silly, but I have turned the CONFIG_IP_VS to m,
to check the compilation of one (recently sent) fix
and set all the CONFIG_IP_VS_PROTO_XXX options to n to
speed up the compilation.

In this configuration the compiler warns me about

  CC [M]  net/ipv4/ipvs/ip_vs_proto.o
net/ipv4/ipvs/ip_vs_proto.c:49: warning: 'register_ip_vs_protocol' defined but not used

Indeed. With no protocols selected there are no
calls to this function - all are compiled out with
ifdefs.

Maybe the best fix would be to surround this call with
ifdef-s or tune the Kconfig dependences, but I think that
marking this register function as __used is enough. No?

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:44:01 -08:00
Jonas Danielsson
b4a9811c42 [ARP]: Fix arp reply when sender ip 0
Fix arp reply when received arp probe with sender ip 0.

Send arp reply with target ip address 0.0.0.0 and target hardware
address set to hardware address of requester. Previously sent reply
with target ip address and target hardware address set to same as
source fields.

Signed-off-by: Jonas Danielsson <the.sator@gmail.com>
Acked-by: Alexey Kuznetov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:38:16 -08:00
YOSHIFUJI Hideaki
77adefdc98 [IPV6] TCPMD5: Fix deleting key operation.
Due to the bug, refcnt for md5sig pool was leaked when
an user try to delete a key if we have more than one key.
In addition to the leakage, we returned incorrect return
result value for userspace.

This fix should close Bug #9418, reported by <ming-baini@163.com>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:31:23 -08:00
YOSHIFUJI Hideaki
aacbe8c880 [IPV6] TCPMD5: Check return value of tcp_alloc_md5sig_pool().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:30:56 -08:00
YOSHIFUJI Hideaki
354faf0977 [IPV4] TCPMD5: Use memmove() instead of memcpy() because we have overlaps.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:30:31 -08:00
YOSHIFUJI Hideaki
a80cc20da4 [IPV4] TCPMD5: Omit redundant NULL check for kfree() argument.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 17:30:06 -08:00
David S. Miller
53438e5d04 Merge branch 'fixes-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 2007-11-20 17:24:29 -08:00
Guillaume Chazarain
92468c53cf ieee80211: Stop net_ratelimit/IEEE80211_DEBUG_DROP log pollution
if (net_ratelimit())
	IEEE80211_DEBUG_DROP(...)

can pollute the logs with messages like:

printk: 1 messages suppressed.
printk: 2 messages suppressed.
printk: 7 messages suppressed.

if debugging information is disabled. These messages are printed by
net_ratelimit(). Add a wrapper to net_ratelimit() that takes into account
the log level, so that net_ratelimit() is called only when we really want
to print something.

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-20 16:43:17 -05:00
Bruno Randolf
4b50e388f8 mac80211: add missing space in error message
Signed-off-by: Bruno Randolf <bruno@thinktube.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-20 16:43:17 -05:00
Johannes Berg
c1428b3f45 mac80211: fix allmulti/promisc behaviour
When an interface with promisc/allmulti bit is taken down,
the mac80211 state can become confused. This fixes it by
making mac80211 keep track of all *active* interfaces that
have the promisc/allmulti bit set in the sdata, we sync
the interface bit into sdata at set_multicast_list() time
so this works.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-20 16:20:31 -05:00
Johannes Berg
b52f2198ac mac80211: fix ieee80211_set_multicast_list
I recently experienced unexplainable behaviour with the b43
driver when I had broken firmware uploaded. The cause may have
been that promisc mode was not correctly enabled or disabled
and this bug may have been the cause.

Note how the values are compared later in the function so
just doing the & will result in the wrong thing being
compared and the test being false almost always.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-20 16:20:30 -05:00
Evgeniy Polyakov
1f305323ff [NETFILTER]: Fix kernel panic with REDIRECT target.
When connection tracking entry (nf_conn) is about to copy itself it can
have some of its extension users (like nat) as being already freed and
thus not required to be copied.

Actually looking at this function I suspect it was copied from
nf_nat_setup_info() and thus bug was introduced.

Report and testing from David <david@unsolicited.net>.

[ Patrick McHardy states:

	I now understand whats happening:

	- new connection is allocated without helper
	- connection is REDIRECTed to localhost
	- nf_nat_setup_info adds NAT extension, but doesn't initialize it yet
	- nf_conntrack_alter_reply performs a helper lookup based on the
	   new tuple, finds the SIP helper and allocates a helper extension,
	   causing reallocation because of too little space
	- nf_nat_move_storage is called with the uninitialized nat extension

	So your fix is entirely correct, thanks a lot :)  ]

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 04:27:35 -08:00
David S. Miller
0a06ea8718 [WIRELESS] WEXT: Fix userspace corruption on 64-bit.
On 64-bit systems sizeof(struct ifreq) is 8 bytes larger than
sizeof(struct iwreq).

For GET calls, the wireless extension code copies back into userspace
using sizeof(struct ifreq) but userspace and elsewhere only allocates
a "struct iwreq".  Thus, this copy writes past the end of the iwreq
object and corrupts whatever sits after it in memory.

Fix the copy_to_user() length.

This particularly hurts the compat case because the wireless compat
code uses compat_alloc_userspace() and right after this allocated
buffer is the current bottom of the user stack, and that's what gets
overwritten by the copy_to_user() call.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-20 03:29:53 -08:00
Christoph Lameter
70cf5035de [S390] Explicitly code allocpercpu calls in iucv
The iucv is the only user of the various functions that are used to bring
parts of cpus up and down. Its the only allocpercpu user that will do
I/O on per cpu objects (which is difficult to do with virtually mapped memory).
And its the only use of allocpercpu where a GFP_DMA allocation is done.

Remove the allocpercpu calls from iucv and code the allocation and freeing
manually. After this patch it is possible to remove a large part of
the allocpercpu API.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2007-11-20 11:13:47 +01:00
Joe Perches
a572da4373 [IRDA]: Add missing "space"
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:48:30 -08:00
Joe Perches
c3c9d45828 [SUNRPC]: Add missing "space"
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:48:08 -08:00
Joe Perches
9ee46f1d31 [SCTP]: Add missing "space"
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:47:47 -08:00
Joe Perches
3b6d821c4f [IPV6]: Add missing "space"
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:47:25 -08:00
Joe Perches
3a47a68b05 [BRIDGE]: Add missing "space"
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:46:55 -08:00
Joe Perches
464c4f184a [IPV4]: Add missing "space"
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:46:29 -08:00
Joe Perches
4756daa3b6 [DCCP]: Add missing "space"
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:46:02 -08:00
Sam Jansen
5487796f0c [TCP]: Problem bug with sysctl_tcp_congestion_control function
From: "Sam Jansen" <sjansen@google.com>

sysctl_tcp_congestion_control seems to have a bug that prevents it
from actually calling the tcp_set_default_congestion_control
function. This is not so apparent because it does not return an error
and generally the /proc interface is used to configure the default TCP
congestion control algorithm.  This is present in 2.6.18 onwards and
probably earlier, though I have not inspected 2.6.15--2.6.17.

sysctl_tcp_congestion_control calls sysctl_string and expects a successful
return code of 0. In such a case it actually sets the congestion control
algorithm with tcp_set_default_congestion_control. Otherwise, it returns the
value returned by sysctl_string. This was correct in 2.6.14, as sysctl_string
returned 0 on success. However, sysctl_string was updated to return 1 on
success around about 2.6.15 and sysctl_tcp_congestion_control was not updated.
Even though sysctl_tcp_congestion_control returns 1, do_sysctl_strategy
converts this return code to '0', so the caller never notices the error.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:28:21 -08:00
Ilpo Järvinen
6e42141009 [TCP] MTUprobe: fix potential sk_send_head corruption
When the abstraction functions got added, conversion here was
made incorrectly. As a result, the skb may end up pointing
to skb which got included to the probe skb and then was freed.
For it to trigger, however, skb_transmit must fail sending as
well.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 23:24:09 -08:00
Pavel Emelyanov
1f8170b0ec [PKTGEN]: Fix double unlock of xfrm_state->lock
The pktgen_output_ipsec() function can unlock this lock twice
due to merged error and plain paths. Remove one of the calls
to spin_unlock.

Other possible solution would be to place "return 0" right 
after the first unlock, but at this place the err is known 
to be 0, so these solutions are the same except for this one
makes the code shorter.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 22:51:24 -08:00
Simon Horman
9055fa1f3d [IPVS]: Move remaining sysctl handlers over to CTL_UNNUMBERED
Switch the remaining IPVS sysctl entries over to to use CTL_UNNUMBERED,
I stronly doubt that anyone is using the sys_sysctl interface to
these variables.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 21:51:13 -08:00
Simon Horman
9e103fa6bd [IPVS]: Fix sysctl warnings about missing strategy in schedulers
sysctl table check failed: /net/ipv4/vs/lblc_expiration .3.5.21.19 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/lblcr_expiration .3.5.21.20 Missing strategy

Switch these entried over to use CTL_UNNUMBERED as clearly
the sys_syscal portion wasn't working.

This is along the same lines as Christian Borntraeger's patch that fixes
up entries with no stratergy in net/ipv4/ipvs/ip_vs_ctl.c

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 21:50:21 -08:00
Christian Borntraeger
611cd55b15 [IPVS]: Fix sysctl warnings about missing strategy
Running the latest git code I get the following messages during boot:
sysctl table check failed: /net/ipv4/vs/drop_entry .3.5.21.4 Missing strategy
[...]		  
sysctl table check failed: /net/ipv4/vs/drop_packet .3.5.21.5 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/secure_tcp .3.5.21.6 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/sync_threshold .3.5.21.24 Missing strategy

I removed the binary sysctl handler for those messages and also removed
the definitions in ip_vs.h. The alternative would be to implement a 
proper strategy handler, but syscall sysctl is deprecated.

There are other sysctl definitions that are commented out or work with 
the default sysctl_data strategy. I did not touch these. 

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 21:49:25 -08:00
Eric Dumazet
483b23ffa3 [NET]: Corrects a bug in ip_rt_acct_read()
It seems that stats of cpu 0 are counted twice, since
for_each_possible_cpu() is looping on all possible cpus, including 0

Before percpu conversion of ip_rt_acct, we should also remove the
assumption that CPU 0 is online (or even possible)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-18 18:47:38 -08:00
J. Bruce Fields
eda4f9b799 sunrpc: rpc_pipe_poll may miss available data in some cases
Pipe messages start out life on a queue on the inode, but when first
read they're moved to the filp's private pointer.  So it's possible for
a poll here to return null even though there's a partially read message
available.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-11-17 13:08:47 -05:00
Kevin Coffman
ef338bee3f sunrpc: return error if unsupported enctype or cksumtype is encountered
Return an error from gss_import_sec_context_kerberos if the
negotiated context contains encryption or checksum types not
supported by the kernel code.

This fixes an Oops because success was assumed and later code found
no internal_ctx_id.

Signed-off-by: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-11-17 13:08:46 -05:00
Kevin Coffman
ffc40f5692 sunrpc: gss_pipe_downcall(), don't assume all errors are transient
Instead of mapping all errors except EACCES to EAGAIN, map all errors
except EAGAIN to EACCES.

An example is user-land negotiating a Kerberos context with an encryption
type that is not supported by the kernel code.  (This can happen due to
mis-configuration or a bug in the Kerberos code that does not honor our
request to limit the encryption types negotiated.)  This failure is not
transient, and returning EAGAIN causes mount to continuously retry rather
than giving up.

Signed-off-by: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2007-11-17 13:08:45 -05:00
Evgeniy Polyakov
7799652557 [NETFILTER]: Fix NULL pointer dereference in nf_nat_move_storage()
Reported by Chuck Ebbert as:

	https://bugzilla.redhat.com/show_bug.cgi?id=259501#c14

This routine is called each time hash should be replaced, nf_conn has
extension list which contains pointers to connection tracking users
(like nat, which is right now the only such user), so when replace takes
place it should copy own extensions. Loop above checks for own
extension, but tries to move higer-layer one, which can lead to above
oops.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-15 15:52:32 -08:00
Patrick McHardy
6452a5fde0 [NETFILTER]: fix compat_nf_sockopt typo
It should pass opt to the ->get/->set functions, not ops.

Tested-by: Luca Tettamanti <kronos.it@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-15 14:29:21 -08:00
Pavel Emelyanov
dab6ba3688 [INET]: Fix potential kfree on vmalloc-ed area of request_sock_queue
The request_sock_queue's listen_opt is either vmalloc-ed or
kmalloc-ed depending on the number of table entries. Thus it 
is expected to be handled properly on free, which is done in 
the reqsk_queue_destroy().

However the error path in inet_csk_listen_start() calls 
the lite version of reqsk_queue_destroy, called 
__reqsk_queue_destroy, which calls the kfree unconditionally. 

Fix this and move the __reqsk_queue_destroy into a .c file as 
it looks too big to be inline.

As David also noticed, this is an error recovery path only,
so no locking is required and the lopt is known to be not NULL.

reqsk_queue_yank_listen_sk is also now only used in
net/core/request_sock.c so we should move it there too.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-15 02:57:06 -08:00
David S. Miller
d06fc1d9b5 Merge branch 'fixes-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 2007-11-14 19:44:02 -08:00
Linus Torvalds
6f37ac793d Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [NET]: rt_check_expire() can take a long time, add a cond_resched()
  [ISDN] sc: Really, really fix warning
  [ISDN] sc: Fix sndpkt to have the correct number of arguments
  [TCP] FRTO: Clear frto_highmark only after process_frto that uses it
  [NET]: Remove notifier block from chain when register_netdevice_notifier fails
  [FS_ENET]: Fix module build.
  [TCP]: Make sure write_queue_from does not begin with NULL ptr
  [TCP]: Fix size calculation in sk_stream_alloc_pskb
  [S2IO]: Fixed memory leak when MSI-X vector allocation fails
  [BONDING]: Fix resource use after free
  [SYSCTL]: Fix warning for token-ring from sysctl checker
  [NET] random : secure_tcp_sequence_number should not assume CONFIG_KTIME_SCALAR
  [IWLWIFI]: Not correctly dealing with hotunplug.
  [TCP] FRTO: Plug potential LOST-bit leak
  [TCP] FRTO: Limit snd_cwnd if TCP was application limited
  [E1000]: Fix schedule while atomic when called from mii-tool.
  [NETX]: Fix build failure added by 2.6.24 statistics cleanup.
  [EP93xx_ETH]: Build fix after 2.6.24 NAPI changes.
  [PKT_SCHED]: Check subqueue status before calling hard_start_xmit
2007-11-14 18:51:48 -08:00
Adrian Bunk
d5cd97872d sunrpc/xprtrdma/transport.c: fix use-after-free
Fix an obvious use-after-free spotted by the Coverity checker.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:41 -08:00
Helmut Schaa
a0af5f1454 mac80211: Fix queuing of scan containing a SSID
This patch fixes scanning for specific ssid's which is broken due to the
scan being queued up without respecting the ssid to scan for.

Signed-off-by: Helmut Schaa <hschaa@suse.de>
Signed-off-by: Jiri Benc <jbenc@suse.cz>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-14 21:16:46 -05:00
Eric Dumazet
d90bf5a976 [NET]: rt_check_expire() can take a long time, add a cond_resched()
On commit 39c90ece75:

	[IPV4]: Convert rt_check_expire() from softirq processing to workqueue.

we converted rt_check_expire() from softirq to workqueue, allowing the
function to perform all work it was supposed to do.

When the IP route cache is big, rt_check_expire() can take a long time
to run.  (default settings : 20% of the hash table is scanned at each
invocation)

Adding cond_resched() helps giving cpu to higher priority tasks if
necessary.

Using a "if (need_resched())" test before calling "cond_resched();" is
necessary to avoid spending too much time doing the resched check.
(My tests gave a time reduction from 88 ms to 25 ms per
rt_check_expire() run on my i686 test machine)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-14 16:14:05 -08:00
Ilpo Järvinen
e1cd8f78f8 [TCP] FRTO: Clear frto_highmark only after process_frto that uses it
I broke this in commit 3de96471bd:

	[TCP]: Wrap-safed reordering detection FRTO check

tcp_process_frto should always see a valid frto_highmark. An invalid
frto_highmark (zero) is very likely what ultimately caused a seqno
compare in tcp_frto_enter_loss to do the wrong leading to the LOST-bit
leak.

Having LOST-bits integry ensured like done after commit
23aeeec365:

	[TCP] FRTO: Plug potential LOST-bit leak

won't hurt. It may still be useful in some other, possibly legimate,
scenario.

Reported by Chazarain Guillaume <guichaz@yahoo.fr>.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-14 15:55:09 -08:00
Pavel Emelyanov
c67625a1ec [NET]: Remove notifier block from chain when register_netdevice_notifier fails
Commit fcc5a03ac4:

	[NET]: Allow netdev REGISTER/CHANGENAME events to fail

makes the register_netdevice_notifier() handle the error from the
NETDEV_REGISTER event, sent to the registering block.

The bad news is that in this case the notifier block is 
not removed from the list, but the error is returned to the 
caller. In case the caller is in module init function and 
handles this error this can abort the module loading. The
notifier block will be then removed from the kernel, but 
will be left in the list. Oops :(

I think that the notifier block should be removed from the
chain in case of error, regardless whether this error is 
handled by the caller or not. In the worst case (the error 
is _not_ handled) module will not receive the events any 
longer.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-14 15:53:16 -08:00
Ilpo Järvinen
96a2d41a3e [TCP]: Make sure write_queue_from does not begin with NULL ptr
NULL ptr can be returned from tcp_write_queue_head to cached_skb
and then assigned to skb if packets_out was zero. Without this,
system is vulnerable to a carefully crafted ACKs which obviously
is remotely triggerable.

Besides, there's very little that needs to be done in sacktag
if there weren't any packets outstanding, just skipping the rest
doesn't hurt.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-14 15:47:18 -08:00
Ilpo Järvinen
23aeeec365 [TCP] FRTO: Plug potential LOST-bit leak
It might be possible that, in some extreme scenario that
I just cannot now construct in my mind, end_seq <=
frto_highmark check does not match causing the lost_out
and LOST bits become out-of-sync due to clearing and
recounting in the loop.

This may fix LOST-bit leak reported by Chazarain Guillaume
<guichaz@yahoo.fr>.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 21:03:13 -08:00
Ilpo Järvinen
746aa32d28 [TCP] FRTO: Limit snd_cwnd if TCP was application limited
Otherwise TCP might violate packet ordering principles that FRTO
is based on. If conventional recovery path is chosen, this won't
be significant at all. In practice, any small enough value will
be sufficient to provide proper operation for FRTO, yet other
users of snd_cwnd might benefit from a "close enough" value.

FRTO's formula is now equal to what tcp_enter_cwr() uses.

FRTO used to check application limitedness a bit differently but
I changed that in commit 575ee7140d
and as a result checking for application limitedness became
completely non-existing.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 21:01:23 -08:00
Peter P Waskiewicz Jr
5f1a485d59 [PKT_SCHED]: Check subqueue status before calling hard_start_xmit
The only qdiscs that check subqueue state before dequeue'ing are PRIO
and RR.  The other qdiscs, including the default pfifo_fast qdisc,
will allow traffic bound for subqueue 0 through to hard_start_xmit.
The check for netif_queue_stopped() is done above in pkt_sched.h, so
it is unnecessary for qdisc_restart().  However, if the underlying
driver is multiqueue capable, and only sets queue states on subqueues,
this will allow packets to enter the driver when it's currently unable
to process packets, resulting in expensive requeues and driver
entries.  This patch re-adds the check for the subqueue status before
calling hard_start_xmit, so we can try and avoid the driver entry when
the queues are stopped.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 20:40:55 -08:00
Eric Dumazet
53756524e4 [NETFILTER]: xt_time should not assume CONFIG_KTIME_SCALAR
It is not correct to assume one can get nsec from a ktime directly by
using .tv64 field.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 03:49:53 -08:00
Denis V. Lunev
022cbae611 [NET]: Move unneeded data to initdata section.
This patch reverts Eric's commit 2b008b0a8e

It diets .text & .data section of the kernel if CONFIG_NET_NS is not set.
This is safe after list operations cleanup.

Signed-of-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 03:23:50 -08:00
Denis V. Lunev
ed160e839d [NET]: Cleanup pernet operation without CONFIG_NET_NS
If CONFIG_NET_NS is not set, the only namespace is possible.

This patch removes list of pernet_operations and cleanups code a bit.
This list is not needed if there are no namespaces. We should just call
->init method.

Additionally, the ->exit will be called on module unloading only. This
case is safe - the code is not discarded. For the in/kernel code, ->exit
should never be called.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 03:23:21 -08:00
Patrick McHardy
81d9ddae85 [NETFILTER]: bridge: fix double POSTROUTING hook invocation
Packets routed between bridges have the POST_ROUTING hook invoked
twice since bridging mistakes them for bridged packets because
they have skb->nf_bridge set.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 02:58:44 -08:00
Pavel Emelyanov
4ce5ba6aec [NETFILTER]: Consolidate nf_sockopt and compat_nf_sockopt
Both lookup the nf_sockopt_ops object to call the get/set callbacks
from, but they perform it in a completely similar way.

Introduce the helper for finding the ops.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 02:58:09 -08:00
Li Zefan
e0bf9cf15f [NETFILTER]: nf_nat: fix memset error
The size passing to memset is the size of a pointer.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 02:57:16 -08:00
Pavel Emelyanov
d71209ded2 [INET]: Use list_head-s in inetpeer.c
The inetpeer.c tracks the LRU list of inet_perr-s, but makes
it by hands. Use the list_head-s for this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 21:27:28 -08:00
Adrian Bunk
22649d1afb [IPVS]: Remove unused exports.
This patch removes the following unused EXPORT_SYMBOL's:
- ip_vs_try_bind_dest
- ip_vs_find_dest

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 21:25:24 -08:00
Adrian Bunk
6aed42159d [NET]: Unexport sysctl_{r,w}mem_max.
sysctl_{r,w}mem_max can now be unexported.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 21:24:14 -08:00
Urs Thuermann
be85d4ad8a [AF_PACKET]: Fix minor code duplication
Simplify some code by eliminating duplicate if-else clauses in
packet_do_bind().

Signed-off-by: Urs Thuermann <urs@isnogud.escape.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 21:05:20 -08:00
David S. Miller
bce943278d Merge branch 'pending' of master.kernel.org:/pub/scm/linux/kernel/git/vxy/lksctp-dev 2007-11-12 18:16:13 -08:00
Trond Myklebust
91cf45f02a [NET]: Add the helper kernel_sock_shutdown()
...and fix a couple of bugs in the NBD, CIFS and OCFS2 socket handlers.

Looking at the sock->op->shutdown() handlers, it looks as if all of them
take a SHUT_RD/SHUT_WR/SHUT_RDWR argument instead of the
RCV_SHUTDOWN/SEND_SHUTDOWN arguments.
Add a helper, and then define the SHUT_* enum to ensure that kernel users
of shutdown() don't get confused.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Mark Fasheh <mark.fasheh@oracle.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 18:10:39 -08:00
Pierre Ynard
dbb2ed2485 [IPV6]: Add ifindex field to ND user option messages.
Userland neighbor discovery options are typically heavily involved with
the interface on which thay are received: add a missing ifindex field to
the original struct. Thanks to Rémi Denis-Courmont.

Signed-off-by: Pierre Ynard <linkfanel@yahoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 17:58:35 -08:00
Jesper Juhl
9abed245a6 Fix memory leak in discard case of sctp_sf_abort_violation()
In net/sctp/sm_statefuns.c::sctp_sf_abort_violation() we may leak
the storage allocated for 'abort' by returning from the function
without using or freeing it. This happens in case
"sctp_auth_recv_cid(SCTP_CID_ABORT, asoc)" is true and we jump to
the 'discard' label.
Spotted by the Coverity checker.

The simple fix is to simply move the creation of the "abort chunk"
to after the possible jump to the 'discard' label. This way we don't
even have to allocate the memory at all in the problem case.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-12 10:13:24 -05:00
Denis V. Lunev
2994c63863 [INET]: Small possible memory leak in FIB rules
This patch fixes a small memory leak. Default fib rules can be deleted by
the user if the rule does not carry FIB_RULE_PERMANENT flag, f.e. by
	ip rule flush

Such a rule will not be freed as the ref-counter has 2 on start and becomes
clearly unreachable after removal.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 22:12:03 -08:00
Alexey Dobriyan
33d36bb83c [NETNS]: init dev_base_lock only once
* it already statically initialized
* reinitializing live global spinlock every time netns is
  setup is also wrong

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 22:09:25 -08:00
Pavel Emelyanov
284b327be2 [UNIX]: The unix_nr_socks limit can be exceeded
The unix_nr_socks value is limited with the 2 * get_max_files() value,
as seen from the unix_create1(). However, the check and the actual
increment are separated with the GFP_KERNEL allocation, so this limit
can be exceeded under a memory pressure - task may go to sleep freeing
the pages and some other task will be allowed to allocate a new sock
and so on and so forth.

So make the increment before the check (similar thing is done in the
sock_kmalloc) and go to kmalloc after this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 22:08:30 -08:00
Pavel Emelyanov
5c80f1ae98 [AF_UNIX]: Convert socks to unix_socks in scan_inflight, not in callbacks
The scan_inflight() routine scans through the unix sockets and calls
some passed callback. The fact is that all these callbacks work with
the unix_sock objects, not the sock ones, so make this conversion in
the scan_inflight() before calling the callbacks.

This removes one unneeded variable from the inc_inflight_move_tail().

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 22:07:13 -08:00
Pavel Emelyanov
9305cfa444 [AF_UNIX]: Make unix_tot_inflight counter non-atomic
This counter is _always_ modified under the unix_gc_lock spinlock, 
so its atomicity can be provided w/o additional efforts.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 22:06:01 -08:00
Peter P Waskiewicz Jr
8032b46489 [AF_PACKET]: Allow multicast traffic to be caught by ORIGDEV when bonded
The socket option for packet sockets to return the original ifindex instead
of the bonded ifindex will not match multicast traffic.  Since this socket
option is the most useful for layer 2 traffic and multicast traffic, make
the option multicast-aware.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 22:03:25 -08:00
Johannes Berg
d52a60ad38 mac80211: fix MAC80211_RCSIMPLE Kconfig
I meant for this to be selectable only with EMBEDDED, not enabled only
with EMBEDDED. This does it that way. Sorry.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 22:01:42 -08:00
John W. Linville
7f3ad8943e mac80211: make "decrypt failed" messages conditional upon MAC80211_DEBUG
Make "decrypt failed" and "have no key" debugging messages compile
conditionally upon CONFIG_MAC80211_DEBUG.  They have been useful for
finding certain problems in the past, but in many cases they just
clutter a user's logs.

A typical example is an enviornment where multiple SSIDs are using a
single BSSID but with different protection schemes or different keys
for each SSID.  In such an environment these messages are just noise.
Let's just leave them for those interested enough to turn-on debugging.

Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 22:01:34 -08:00
Johannes Berg
5b98b1f7da mac80211: use IW_AUTH_PRIVACY_INVOKED rather than IW_AUTH_KEY_MGMT
In the long bug-hunt for why dynamic WEP networks didn't work it
turned out that mac80211 incorrectly uses IW_AUTH_KEY_MGMT while
it should use IW_AUTH_PRIVACY_INVOKED to determine whether to
associate to protected networks or not.

This patch changes the behaviour to be that way and clarifies the
existing code.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Jouni Malinen <j@w1.fi>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 22:01:25 -08:00
Johannes Berg
56db6c52bb mac80211: remove unused driver ops
The driver operations set_ieee8021x(), set_port_auth() and
set_privacy_invoked() are not used by any drivers, except
set_privacy_invoked() they aren't even used by mac80211.
Remove them at least until we need to support drivers with
mac80211 that require getting this information.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Michael Wu <flamingice@sourmilk.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 22:01:15 -08:00
Johannes Berg
8636bf6513 mac80211: remove ieee80211_common.h
Robert pointed out that I missed this file when removing the management
interface. Do it now.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 22:01:04 -08:00
Michael Buesch
2736622344 rfkill: Fix sparse warning
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 22:00:28 -08:00
Michael Buesch
7319f1e6bc rfkill: Use mutex_lock() at register and add sanity check
Replace mutex_lock_interruptible() by mutex_lock() in rfkill_register(),
as interruptible doesn't make sense there.

Add a sanity check for rfkill->type, as that's used for an unchecked dereference
in an array and might cause hard to debug crashes if the driver sets this
to an invalid value.

Signed-off-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 22:00:15 -08:00
Johannes Berg
830f903866 mac80211: allow driver to ask for a rate control algorithm
This allows a driver to ask for a specific rate control algorithm.
The rate control algorithm asked for must be registered and be
available as a module or built-in.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 21:59:54 -08:00
Johannes Berg
999acd9c33 mac80211: don't allow registering the same rate control twice
Previously, mac80211 would allow registering the same rate control
algorithm twice. This is a programming error in the registration
and should not happen; additionally the second version could never
be selected. Disallow this and warn about it.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 21:59:43 -08:00
Michael Buesch
2bf236d55e rfkill: Use subsys_initcall
We must use subsys_initcall, because we must initialize before a
driver calls rfkill_register().

Signed-off-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 21:59:33 -08:00
Johannes Berg
ac71c691e6 mac80211: make simple rate control algorithm built-in
Too frequently people do not have module autoloading enabled
or fail to install the rate control module correctly, hence
their hardware probing fails due to no rate control algorithm
being available. This makes the 'simple' algorithm built into
the mac80211 module unless EMBEDDED is enabled in which case
it can be disabled (eg. if the wanted driver requires another
rate control algorithm.)

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 21:59:23 -08:00
Michael Buesch
8a8f1c0437 rfkill: Register LED triggers before registering switch
Registering the switch triggers a LED event, so we must register
LED triggers before the switch.
This has a potential to fix a crash, depending on how the device
driver initializes the rfkill data structure.

Signed-off-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 21:59:11 -08:00
Johannes Berg
94e10bfb8a softmac: fix wext MLME request reason code endianness
The MLME request reason code is host-endian and our passing
it to the low level functions is host-endian as well since
they do the swapping. I noticed that the reason code 768 was
sent (0x300) rather than 3 when wpa_supplicant terminates.
This removes the superfluous cpu_to_le16() call.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-11-10 21:58:41 -08:00
Radu Rendec
b226801676 [PKT_SCHED] CLS_U32: Use ffs() instead of C code on hash mask to get first set bit.
Computing the rank of the first set bit in the hash mask (for using later
in u32_hash_fold()) was done with plain C code. Using ffs() instead makes
the code more readable and improves performance (since ffs() is better
optimized in assembler).

Using the conditional operator on hash mask before applying ntohl() also
saves one ntohl() call if mask is 0.

Signed-off-by: Radu Rendec <radu.rendec@ines.ro>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:54:50 -08:00
Patrick McHardy
39aaac114e [VLAN]: Allow setting mac address while device is up
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:52:35 -08:00
Patrick McHardy
d932e04a5e [VLAN]: Don't synchronize addresses while the vlan device is down
While the VLAN device is down, the unicast addresses are not configured
on the underlying device, so we shouldn't attempt to sync them.

Noticed by Dmitry Butskoy <buc@odusz.so-cdu.ru>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:51:40 -08:00
Pavel Emelyanov
358352b8b8 [INET]: Cleanup the xfrm4_tunnel_(un)register
Both check for the family to select an appropriate tunnel list.
Consolidate this check and make the for() loop more readable.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:48:54 -08:00
Pavel Emelyanov
99f933263a [INET]: Add missed tunnel64_err handler
The tunnel64_protocol uses the tunnel4_protocol's err_handler and
thus calls the tunnel4_protocol's handlers.

This is not very good, as in case of (icmp) error the wrong error
handlers will be called (e.g. ipip ones instead of sit) and this
won't be noticed at all, because the error is not reported.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:47:39 -08:00
Pavel Emelyanov
c2b42336f4 [IPX]: Use existing sock refcnt debugging infrastructure
Just like in the af_packet.c, the ipx_sock_nr variable is used
for debugging purposes.

Switch to using existing infrastructure. Thanks to Arnaldo for
pointing this out.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:39:26 -08:00
Pavel Emelyanov
17ab56a260 [PACKET]: Use existing sock refcnt debugging infrastructure
The packet_socks_nr variable is used purely for debugging
the number of sockets.

As Arnaldo pointed out, there's already an infrastructure
for this purposes, so switch to using it.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:38:48 -08:00
Joe Perches
e9671fcb3b [NET]: Fix infinite loop in dev_mc_unsync().
From: Joe Perches <joe@perches.com>

Based upon an initial patch and report by Luis R. Rodriguez.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:36:04 -08:00
Pavel Emelyanov
03f49f3457 [NET]: Make helper to get dst entry and "use" it
There are many places that get the dst entry, increase the
__use counter and set the "lastuse" time stamp.

Make a helper for this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:28:34 -08:00
Pavel Emelyanov
b1667609cd [IPV4]: Remove bugus goto-s from ip_route_input_slow
Both places look like

        if (err == XXX) 
               goto yyy;
   done:

while both yyy targets look like

        err = XXX;
        goto done;

so this is ok to remove the above if-s.

yyy labels are used in other places and are not removed.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:26:41 -08:00
Ilpo Järvinen
fbd52eb2bd [TCP]: Split SACK FRTO flag clearing (fixes FRTO corner case bug)
In case we run out of mem when fragmenting, the clearing of
FLAG_ONLY_ORIG_SACKED might get missed which then feeds FRTO
with false information. Move clearing outside skb processing
loop so that it will get executed even if the skb loop
terminates prematurely due to out-of-mem.

Besides, now the core of the loop truly deals with a single
skb only, which also enables creation a more self-contained
of tcp_sacktag_one later on.

In addition, small reorganization of if branches was made.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:24:19 -08:00
Ilpo Järvinen
e49aa5d456 [TCP]: Add unlikely() to sacktag out-of-mem in fragment case
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:23:08 -08:00
Ilpo Järvinen
c7caf8d3ed [TCP]: Fix reord detection due to snd_una covered holes
Fixes subtle bug like the one with fastpath_cnt_hint happening
due to the way the GSO and hints interact. Because hints are not
reset when just a GSOed skb is partially ACKed, there's no
guarantee that the relevant part of the write queue is going to
be processed in sacktag at all (skbs below snd_una) because
fastpath hint can fast forward the entrypoint.

This was also on the way of future reductions in sacktag's skb
processing. Also future cleanups in sacktag can be made after
this (in 2.6.25).

This may make reordering update in tcp_try_undo_partial
redundant but I'm not too sure so I left it there.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:22:18 -08:00
Ilpo Järvinen
8dd71c5d28 [TCP]: Consider GSO while counting reord in sacktag
Reordering detection fails to take account that the reordered
skb may have pcount larger than 1. In such case the lowest of
them had the largest reordering, the old formula used the
highest of them which is pcount - 1 packets less reordered.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:20:59 -08:00
Vlad Yasevich
7d54dc6876 SCTP: Always flush the queue when uncorcking.
When the code calls uncork, trigger a queue flush, even
if the queue was not corked.  Most callers that explicitely
cork the queue will have additinal checks to see if they 
corked it.  Callers who do not cork the queue expect packets
to flow when they call uncork.

The scneario that showcased this bug happend when we were not
able to bundle DATA with outgoing COOKIE-ECHO.  As a result
the data just sat in the outqueue and did not get transmitted.
The application expected a response, but nothing happened.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-09 11:43:41 -05:00
Vlad Yasevich
cd3ae8e615 SCTP: Fix PR-SCTP to deliver all the accumulated ordered chunks
There is a small bug when we process a FWD-TSN.  We'll deliver
anything upto the current next expected SSN.  However, if the
next expected is already in the queue, it will take another
chunk to trigger its delivery.  The fix is to simply check
the current queued SSN is the next expected one.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-09 11:43:41 -05:00
Vlad Yasevich
7ab9080467 SCTP: Make sctp_verify_param return multiple indications.
SCTP-AUTH and future ADD-IP updates have a requirement to
do additional verification of parameters and an ability to
ABORT the association if verification fails.  So, introduce
additional return code so that we can clear signal a required
action.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-09 11:43:41 -05:00
Vlad Yasevich
d970dbf845 SCTP: Convert custom hash lists to use hlist.
Convert the custom hash list traversals to use hlist functions.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-09 11:43:40 -05:00
Vlad Yasevich
123ed979ea SCTP: Use hashed lookup when looking for an association.
A SCTP endpoint may have a lot of associations on them and walking
the list is fairly inefficient.  Instead, use a hashed lookup,
and filter out the hash list based on the endopoing we already have.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-09 11:41:36 -05:00
Vlad Yasevich
027f6e1ad3 SCTP: Fix a potential race between timers and receive path.
There is a possible race condition where the timer code will
free the association and the next packet in the queue will also
attempt to free the same association.

The example is, when we receive an ABORT at about the same time
as the retransmission timer fires.  If the timer wins the race,
it will free the association.  Once it releases the lock, the
queue processing will recieve the ABORT and will try to free
the association again.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-07 11:39:27 -05:00
Vlad Yasevich
73d9c4fd1a SCTP: Allow ADD_IP to work with AUTH for backward compatibility.
This patch adds a tunable that will allow ADD_IP to work without
AUTH for backward compatibility.  The default value is off since
the default value for ADD_IP is off as well.  People who need
to use ADD-IP with older implementations take risks of connection
hijacking and should consider upgrading or turning this tunable on.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-07 11:39:27 -05:00
Vlad Yasevich
88799fe5ec SCTP: Correctly disable ADD-IP when AUTH is not supported.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-07 11:39:27 -05:00
Vlad Yasevich
0ed90fb0f6 SCTP: Update RCU handling during the ADD-IP case
After learning more about rcu, it looks like the ADD-IP hadling
doesn't need to call call_rcu_bh.  All the rcu critical sections
use rcu_read_lock, so using call_rcu_bh is wrong here.
Now, restore the local_bh_disable() code blocks and use normal
call_rcu() calls.  Also restore the missing return statement.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-07 11:39:27 -05:00
Vlad Yasevich
b6157d8e03 SCTP: Fix difference cases of retransmit.
Commit d0ce92910b broke several retransmit
cases including fast retransmit.  The reason is that we should
only delay by rto while doing retranmists as a result of a timeout.
Retransmit as a result of path mtu discover, fast retransmit, or
other evernts that should trigger immidiate retransmissions got broken.

Also, since rto is doubled prior to marking of packets elegable for
retransmission, we never marked correct chunks anyway.

The fix is provide a reason for a given retransmission so that we
can mark chunks appropriately and to save the old rto value to do
comparisons against.

All regressions tests passed with this code.

Spotted by Wei Yongjun <yjwei@cn.fujitsu.com>

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-07 11:39:27 -05:00
Wei Yongjun
f3830ccc2e SCTP : Fix to process bundled ASCONF chunk correctly
If ASCONF chunk is bundled with other chunks as the first chunk, when
process the ASCONF parameters, full packet data will be process as the
parameters of the ASCONF chunk, not only the real parameters. So if you
send a ASCONF chunk bundled with other chunks, you will get an unexpect
result.
This problem also exists when ASCONF-ACK chunk is bundled with other chunks.

This patch fix this problem.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-07 11:39:27 -05:00
Wei Yongjun
64b0812b6d SCTP : Fix bad formatted comment in outqueue.c
Just fix the bad format of the comment in outqueue.c.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
2007-11-07 11:39:26 -05:00
Patrick McHardy
c3d8d1e30c [NETLINK]: Fix unicast timeouts
Commit ed6dcf4a in the history.git tree broke netlink_unicast timeouts
by moving the schedule_timeout() call to a new function that doesn't
propagate the remaining timeout back to the caller. This means on each
retry we start with the full timeout again.

ipc/mqueue.c seems to actually want to wait indefinitely so this
behaviour is retained.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:12 -08:00
Eric Dumazet
230140cffa [INET]: Remove per bucket rwlock in tcp/dccp ehash table.
As done two years ago on IP route cache table (commit
22c047ccbc) , we can avoid using one
lock per hash bucket for the huge TCP/DCCP hash tables.

On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
litle performance differences. (we hit a different cache line for the
rwlock, but then the bucket cache line have a better sharing factor
among cpus, since we dirty it less often). For netstat or ss commands
that want a full scan of hash table, we perform fewer memory accesses.

Using a 'small' table of hashed rwlocks should be more than enough to
provide correct SMP concurrency between different buckets, without
using too much memory. Sizing of this table depends on
num_possible_cpus() and various CONFIG settings.

This patch provides some locking abstraction that may ease a future
work using a different model for TCP/DCCP table.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:11 -08:00
Rumen G. Bogdanovski
efac52762b [IPVS]: Synchronize closing of Connections
This patch makes the master daemon to sync the connection when it is about
to close.  This makes the connections on the backup to close or timeout
according their state.  Before the sync was performed only if the
connection is in ESTABLISHED state which always made the connections to
timeout in the hard coded 3 minutes. However the Andy Gospodarek's patch
([IPVS]: use proper timeout instead of fixed value) effectively did nothing
more than increasing this to 15 minutes (Established state timeout).  So
this patch makes use of proper timeout since it syncs the connections on
status changes to FIN_WAIT (2min timeout) and CLOSE (10sec timeout).
However if the backup misses CLOSE hopefully it did not miss FIN_WAIT.
Otherwise we will just have to wait for the ESTABLISHED state timeout. As
it is without this patch.  This way the number of the hanging connections
on the backup is kept to minimum. And very few of them will be left to
timeout with a long timeout.

This is important if we want to make use of the fix for the real server
overcommit on master/backup fail-over.

Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:10 -08:00
Rumen G. Bogdanovski
1e356f9cdf [IPVS]: Bind connections on stanby if the destination exists
This patch fixes the problem with node overload on director fail-over.
Given the scenario: 2 nodes each accepting 3 connections at a time and 2
directors, director failover occurs when the nodes are fully loaded (6
connections to the cluster) in this case the new director will assign
another 6 connections to the cluster, If the same real servers exist
there.

The problem turned to be in not binding the inherited connections to
the real servers (destinations) on the backup director. Therefore:
"ipvsadm -l" reports 0 connections:
root@test2:~# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  test2.local:5999 wlc
  -> node473.local:5999           Route   1000   0          0
  -> node484.local:5999           Route   1000   0          0

while "ipvs -lnc" is right
root@test2:~# ipvsadm -lnc
IPVS connection entries
pro expire state       source             virtual            destination
TCP 14:56  ESTABLISHED 192.168.0.10:39164 192.168.0.222:5999
192.168.0.51:5999
TCP 14:59  ESTABLISHED 192.168.0.10:39165 192.168.0.222:5999
192.168.0.52:5999

So the patch I am sending fixes the problem by binding the received
connections to the appropriate service on the backup director, if it
exists, else the connection will be handled the old way. So if the
master and the backup directors are synchronized in terms of real
services there will be no problem with server over-committing since
new connections will not be created on the nonexistent real services
on the backup. However if the service is created later on the backup,
the binding will be performed when the next connection update is
received. With this patch the inherited connections will show as
inactive on the backup:

root@test2:~# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  test2.local:5999 wlc
  -> node473.local:5999           Route   1000   0          1
  -> node484.local:5999           Route   1000   0          1

rumen@test2:~$ cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP  C0A800DE:176F wlc
  -> C0A80033:176F      Route   1000   0          1
  -> C0A80032:176F      Route   1000   0          1

Regards,
Rumen Bogdanovski

Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
2007-11-07 04:15:09 -08:00
Pavel Emelyanov
b733c007ed [NET]: Clean proto_(un)register from in-code ifdefs
The struct proto has the per-cpu "inuse" counter, which is handled
with a special care. All the handling code hides under the ifdef
CONFIG_SMP and it introduces some code duplication and makes it
look worse than it could.

Clean this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:04 -08:00
Herbert Xu
4999f3621f [IPSEC]: Fix crypto_alloc_comp error checking
The function crypto_alloc_comp returns an errno instead of NULL
to indicate error.  So it needs to be tested with IS_ERR.

This is based on a patch by Vicenç Beltran Querol.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:03 -08:00
Patrick McHardy
fffe470a80 [VLAN]: Fix SET_VLAN_INGRESS_PRIORITY_CMD ioctl
Based on report and patch by Doug Kehn <rdkehn@yahoo.com>:

vconfig returns the following error when attempting to execute the
set_ingress_map command:

vconfig: socket or ioctl error for set_ingress_map: Operation not permitted

In vlan.c, vlan_ioctl_handler for SET_VLAN_INGRESS_PRIORITY_CMD
sets err = -EPERM and calls vlan_dev_set_ingress_priority.
vlan_dev_set_ingress_priority is a void function so err remains
at -EPERM and results in the vconfig error (even though the ingress
map was set).

Fix by setting err = 0 after the vlan_dev_set_ingress_priority call.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:02 -08:00
Johann Felix Soden
45a19b0a72 [NETNS]: Fix compiler error in net_namespace.c
Because net_free is called by copy_net_ns before its declaration, the
compiler gives an error. This patch puts net_free before copy_net_ns
to fix this.

The compiler error:
net/core/net_namespace.c: In function 'copy_net_ns':
net/core/net_namespace.c:97: error: implicit declaration of function 'net_free'
net/core/net_namespace.c: At top level:
net/core/net_namespace.c:104: warning: conflicting types for 'net_free'
net/core/net_namespace.c:104: error: static declaration of 'net_free' follows non-static declaration
net/core/net_namespace.c:97: error: previous implicit declaration of 'net_free' was here

The error was introduced by the '[NET]: Hide the dead code in the
net_namespace.c' patch (6a1a3b9f68).

Signed-off-by: Johann Felix Soden <johfel@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:02 -08:00
Radu Rendec
543821c6f5 [PKT_SCHED] CLS_U32: Fix endianness problem with u32 classifier hash masks.
While trying to implement u32 hashes in my shaping machine I ran into
a possible bug in the u32 hash/bucket computing algorithm
(net/sched/cls_u32.c).

The problem occurs only with hash masks that extend over the octet
boundary, on little endian machines (where htonl() actually does
something).

Let's say that I would like to use 0x3fc0 as the hash mask. This means
8 contiguous "1" bits starting at b6. With such a mask, the expected
(and logical) behavior is to hash any address in, for instance,
192.168.0.0/26 in bucket 0, then any address in 192.168.0.64/26 in
bucket 1, then 192.168.0.128/26 in bucket 2 and so on.

This is exactly what would happen on a big endian machine, but on
little endian machines, what would actually happen with current
implementation is 0x3fc0 being reversed (into 0xc03f0000) by htonl()
in the userspace tool and then applied to 192.168.x.x in the u32
classifier. When shifting right by 16 bits (rank of first "1" bit in
the reversed mask) and applying the divisor mask (0xff for divisor
256), what would actually remain is 0x3f applied on the "168" octet of
the address.

One could say is this can be easily worked around by taking endianness
into account in userspace and supplying an appropriate mask (0xfc03)
that would be turned into contiguous "1" bits when reversed
(0x03fc0000). But the actual problem is the network address (inside
the packet) not being converted to host order, but used as a
host-order value when computing the bucket.

Let's say the network address is written as n31 n30 ... n0, with n0
being the least significant bit. When used directly (without any
conversion) on a little endian machine, it becomes n7 ... n0 n8 ..n15
etc in the machine's registers. Thus bits n7 and n8 would no longer be
adjacent and 192.168.64.0/26 and 192.168.128.0/26 would no longer be
consecutive.

The fix is to apply ntohl() on the hmask before computing fshift,
and in u32_hash_fold() convert the packet data to host order before
shifting down by fshift.

With helpful feedback from Jamal Hadi Salim and Jarek Poplawski.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:11:45 -08:00
Jiri Olsa
40208d71e0 [NET]: Removing duplicit #includes
Removing duplicit #includes for net/

Signed-off-by: Jiri Olsa <olsajiri@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:11:44 -08:00
Pavel Emelyanov
c3e9a353d8 [IPV4]: Compact some ifdefs in the fib code.
There are places that check for CONFIG_IP_MULTIPLE_TABLES
twice in the same file, but the internals of these #ifdefs
can be merged.

As a side effect - remove one ifdef from inside a function.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:11:41 -08:00
Alexey Dobriyan
33120b30cc [IPV6]: Convert /proc/net/ipv6_route to seq_file interface
This removes last proc_net_create() user. Kudos to Benjamin Thery and
Stephen Hemminger for comments on previous version.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:09:18 -08:00
Evgeniy Polyakov
4f9f8311a0 [PKT_SCHED]: Fix OOPS when removing devices from a teql queuing discipline
tecl_reset() is called from deactivate and qdisc is set to noop already,
but subsequent teql_xmit does not know about it and dereference private
data as teql qdisc and thus oopses.
not catch it first :)

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:09:17 -08:00
David S. Miller
c62cf5cb17 [DCCP]: Use DEFINE_PROTO_INUSE infrastructure.
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:09:01 -08:00
Eric Dumazet
8295b6d9e6 [SCTP]: Use the {DEFINE|REF}_PROTO_INUSE infrastructure
Trivial patch to make "sctcp,sctpv6" protocols uses the fast "inuse
sockets" infrastructure

Each protocol use then a static percpu var, instead of a dynamic one.
This saves some ram and some cpu cycles

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:09:00 -08:00
Eric Dumazet
c5a432f1a1 [IPV6]: Use the {DEFINE|REF}_PROTO_INUSE infrastructure
Trivial patch to make "tcpv6,udpv6,udplitev6,rawv6" protocols uses the
fast "inuse sockets" infrastructure

Each protocol use then a static percpu var, instead of a dynamic one.
This saves some ram and some cpu cycles

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:59 -08:00
Eric Dumazet
47a31a6ffc [IPV4]: Use the {DEFINE|REF}_PROTO_INUSE infrastructure
Trivial patch to make "tcp,udp,udplite,raw" protocols uses the fast
"inuse sockets" infrastructure

Each protocol use then a static percpu var, instead of a dynamic one.
This saves some ram and some cpu cycles

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:58 -08:00
Eric Dumazet
286ab3d460 [NET]: Define infrastructure to keep 'inuse' changes in an efficent SMP/NUMA way.
"struct proto" currently uses an array stats[NR_CPUS] to track change on
'inuse' sockets per protocol.

If NR_CPUS is big, this means we use a big memory area for this.
Moreover, all this memory area is located on a single node on NUMA
machines, increasing memory pressure on the boot node.

In this patch, I tried to :

- Keep a fast !CONFIG_SMP implementation
- Keep a fast CONFIG_SMP implementation for often used protocols
(tcp,udp,raw,...)
- Introduce a NUMA efficient implementation

Some helper macros are defined in include/net/sock.h
These macros take into account CONFIG_SMP

If a "struct proto" is declared without using DEFINE_PROTO_INUSE /
REF_PROTO_INUSE
macros, it will automatically use a default implementation, using a
dynamically allocated percpu zone.
This default implementation will be NUMA efficient, but might use 32/64
bytes per possible cpu
because of current alloc_percpu() implementation.
However it still should be better than previous implementation based on
stats[NR_CPUS] field.

When a "struct proto" is changed to use the new macros, we use a single
static "int" percpu variable,
lowering the memory and cpu costs, still preserving NUMA efficiency.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:57 -08:00
Pavel Emelyanov
6a9fb9479f [IPV4]: Clean the ip_sockglue.c from some ugly ifdefs
The #idfed CONFIG_IP_MROUTE is sometimes places inside the if-s,
which looks completely bad. Similar ifdefs inside the functions
looks a bit better, but they are also not recommended to be used.

Provide an ifdef-ed ip_mroute_opt() helper to cleanup the code.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:55 -08:00
Alexey Dobriyan
4e058063f4 [DECNET]: "addr" module param can't be __initdata
sysfs keeps references to module parameters via /sys/module/*/parameters,
so marking them as __initdata can't work.

Steps to reproduce:

	modprobe decnet
	cat /sys/module/decnet/parameters/addr

BUG: unable to handle kernel paging request at virtual address f88cd410
printing eip: c043dfd1 *pdpt = 0000000000004001 *pde = 0000000004408067 *pte = 0000000000000000
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: decnet sunrpc af_packet ipv6 binfmt_misc dm_mirror dm_multipath dm_mod sbs sbshc fan dock battery backlight ac power_supply parport loop rtc_cmos serio_raw rtc_core rtc_lib button amd_rng sr_mod cdrom shpchp pci_hotplug ehci_hcd ohci_hcd uhci_hcd usbcore
Pid: 2099, comm: cat Not tainted (2.6.24-rc1-b1d08ac064268d0ae2281e98bf5e82627e0f0c56-bloat #6)
EIP: 0060:[<c043dfd1>] EFLAGS: 00210286 CPU: 1
EIP is at param_get_int+0x6/0x20
EAX: c5c87000 EBX: 00000000 ECX: 000080d0 EDX: f88cd410
ESI: f8a108f8 EDI: c5c87000 EBP: 00000000 ESP: c5c97f00
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process cat (pid: 2099, ti=c5c97000 task=c641ee10 task.ti=c5c97000)
Stack: 00000000 f8a108f8 c5c87000 c043db6b f8a108f1 00000124 c043de1a c043db2f
       f88cd410 ffffffff c5c87000 f8a16bc8 f8a16bc8 c043dd69 c043dd54 c5dd5078
       c043dbc8 c5cc7580 c06ee64c c5d679f8 c04c431f c641f480 c641f484 00001000
Call Trace:
 [<c043db6b>] param_array_get+0x3c/0x62
 [<c043de1a>] param_array_set+0x0/0xdf
 [<c043db2f>] param_array_get+0x0/0x62
 [<c043dd69>] param_attr_show+0x15/0x2d
 [<c043dd54>] param_attr_show+0x0/0x2d
 [<c043dbc8>] module_attr_show+0x1a/0x1e
 [<c04c431f>] sysfs_read_file+0x7c/0xd9
 [<c04c42a3>] sysfs_read_file+0x0/0xd9
 [<c048d4b2>] vfs_read+0x88/0x134
 [<c042090b>] do_page_fault+0x0/0x7d5
 [<c048d920>] sys_read+0x41/0x67
 [<c04080fa>] sysenter_past_esp+0x6b/0xc1
 =======================
Code: 00 83 c4 0c c3 83 ec 0c 8b 52 10 8b 12 c7 44 24 04 27 dd 6c c0 89 04 24 89 54 24 08 e8 ea 01 0c 00 83 c4 0c c3 83 ec 0c 8b 52 10 <8b> 12 c7 44 24 04 58 8c 6a c0 89 04 24 89 54 24 08 e8 ca 01 0c
EIP: [<c043dfd1>] param_get_int+0x6/0x20 SS:ESP 0068:c5c97f00

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:55 -08:00