When we pull a received packet from a link's 'deferred packets' queue
for processing, its 'next' pointer is not cleared, and still refers to
the next packet in that queue, if any. This is incorrect, but caused
no harm before commit 40ba3cdf54 ("tipc:
message reassembly using fragment chain") was introduced. After that
commit, it may sometimes lead to the following oops:
general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: tipc
CPU: 4 PID: 0 Comm: swapper/4 Tainted: G W 3.13.0-rc2+ #6
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
task: ffff880017af4880 ti: ffff880017aee000 task.ti: ffff880017aee000
RIP: 0010:[<ffffffff81710694>] [<ffffffff81710694>] skb_try_coalesce+0x44/0x3d0
RSP: 0018:ffff880016603a78 EFLAGS: 00010212
RAX: 6b6b6b6bd6d6d6d6 RBX: ffff880013106ac0 RCX: ffff880016603ad0
RDX: ffff880016603ad7 RSI: ffff88001223ed00 RDI: ffff880013106ac0
RBP: ffff880016603ab8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88001223ed00
R13: ffff880016603ad0 R14: 000000000000058c R15: ffff880012297650
FS: 0000000000000000(0000) GS:ffff880016600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000805b000 CR3: 0000000011f5d000 CR4: 00000000000006e0
Stack:
ffff880016603a88 ffffffff810a38ed ffff880016603aa8 ffff88001223ed00
0000000000000001 ffff880012297648 ffff880016603b68 ffff880012297650
ffff880016603b08 ffffffffa0006c51 ffff880016603b08 00ffffffa00005fc
Call Trace:
<IRQ>
[<ffffffff810a38ed>] ? trace_hardirqs_on+0xd/0x10
[<ffffffffa0006c51>] tipc_link_recv_fragment+0xd1/0x1b0 [tipc]
[<ffffffffa0007214>] tipc_recv_msg+0x4e4/0x920 [tipc]
[<ffffffffa00016f0>] ? tipc_l2_rcv_msg+0x40/0x250 [tipc]
[<ffffffffa000177c>] tipc_l2_rcv_msg+0xcc/0x250 [tipc]
[<ffffffffa00016f0>] ? tipc_l2_rcv_msg+0x40/0x250 [tipc]
[<ffffffff8171e65b>] __netif_receive_skb_core+0x80b/0xd00
[<ffffffff8171df94>] ? __netif_receive_skb_core+0x144/0xd00
[<ffffffff8171eb76>] __netif_receive_skb+0x26/0x70
[<ffffffff8171ed6d>] netif_receive_skb+0x2d/0x200
[<ffffffff8171fe70>] napi_gro_receive+0xb0/0x130
[<ffffffff815647c2>] e1000_clean_rx_irq+0x2c2/0x530
[<ffffffff81565986>] e1000_clean+0x266/0x9c0
[<ffffffff81985f7b>] ? notifier_call_chain+0x2b/0x160
[<ffffffff8171f971>] net_rx_action+0x141/0x310
[<ffffffff81051c1b>] __do_softirq+0xeb/0x480
[<ffffffff819817bb>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff810b8c42>] ? handle_fasteoi_irq+0x72/0x100
[<ffffffff81052346>] irq_exit+0x96/0xc0
[<ffffffff8198cbc3>] do_IRQ+0x63/0xe0
[<ffffffff81981def>] common_interrupt+0x6f/0x6f
<EOI>
This happens when the last fragment of a message has passed through the
the receiving link's 'deferred packets' queue, and at least one other
packet was added to that queue while it was there. After the fragment
chain with the complete message has been successfully delivered to the
receiving socket, it is released. Since 'next' pointer of the last
fragment in the released chain now is non-NULL, we get the crash shown
above.
We fix this by clearing the 'next' pointer of all received packets,
including those being pulled from the 'deferred' queue, before they
undergo any further processing.
Fixes: 40ba3cdf54 ("tipc: message reassembly using fragment chain")
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reported-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A deadlock might occur if name table is withdrawn in socket release
routine, and while packets are still being received from bearer.
CPU0 CPU1
T0: recv_msg() release()
T1: tipc_recv_msg() tipc_withdraw()
T2: [grab node lock] [grab port lock]
T3: tipc_link_wakeup_ports() tipc_nametbl_withdraw()
T4: [grab port lock]* named_cluster_distribute()
T5: wakeupdispatch() tipc_link_send()
T6: [grab node lock]*
The opposite order of holding port lock and node lock on above two
different paths may result in a deadlock. If socket lock instead of
port lock is used to protect port instance in tipc_withdraw(), the
reverse order of holding port lock and node lock will be eliminated,
as a result, the deadlock is killed as well.
Reported-by: Lars Everbrand <lars.everbrand@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'handler_enabled' is a global flag indicating whether the TIPC
signal handling service is enabled or not. The lack of lock
protection for this flag incurs a risk for contention, so that
a tipc_k_signal() call might queue a signal handler to a destroyed
signal queue, with unpredictable results. To correct this, we let
the already existing 'qitem_lock' protect the flag, as it already
does with the queue itself. This way, we ensure that the flag
always is consistent across all cores.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'signal handler' service in TIPC is a mechanism that makes it
possible to postpone execution of functions, by launcing them into
a job queue for execution in a separate tasklet, independent of
the launching execution thread.
When we do rmmod on the tipc module, this service is stopped after
the network service. At the same time, the stopping of the network
service may itself launch jobs for execution, with the risk that these
functions may be scheduled for execution after the data structures
meant to be accessed by the job have already been deleted. We have
seen this happen, most often resulting in an oops.
This commit ensures that the signal handler is the very first to be
stopped when TIPC is shut down, so there are no surprises during
the cleanup of the other services.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
to return msg_name to the user.
This prevents numerous uninitialized memory leaks we had in the
recvmsg handlers and makes it harder for new code to accidentally leak
uninitialized memory.
Optimize for the case recvfrom is called with NULL as address. We don't
need to copy the address at all, so set it to NULL before invoking the
recvmsg handler. We can do so, because all the recvmsg handlers must
cope with the case a plain read() is called on them. read() also sets
msg_name to NULL.
Also document these changes in include/linux/net.h as suggested by David
Miller.
Changes since RFC:
Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address. It also more naturally reflects the logic by the callers of
verify_iovec.
With this change in place I could remove "
if (!uaddr || msg_sys->msg_namelen == 0)
msg->msg_name = NULL
".
This change does not alter the user visible error logic as we ignore
msg_namelen as long as msg_name is NULL.
Also remove two unnecessary curly brackets in ___sys_recvmsg and change
comments to netdev style.
Cc: David Miller <davem@davemloft.net>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by David Miller, make genl_register_family_with_ops()
a macro and pass only the array, evaluating ARRAY_SIZE() in the
macro, this is a little safer.
The openvswitch has some indirection, assing ops/n_ops directly in
that code. This might ultimately just assign the pointers in the
family initializations, saving the struct genl_family_and_ops and
code (once mcast groups are handled differently.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes the following Smatch warning:
net/tipc/link.c:2364 tipc_link_recv_fragment()
warn: variable dereferenced before check '*head' (see line 2361)
A null pointer might be passed to skb_try_coalesce if
a malicious sender injects orphan fragments on a link.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If appending a received fragment to the pending fragment chain
in a unicast link fails, the current code tries to force a retransmission
of the fragment by decrementing the 'next received sequence number'
field in the link. This is done under the assumption that the failure
is caused by an out-of-memory situation, an assumption that does
not hold true after the previous patch in this series.
A failure to append a fragment can now only be caused by a protocol
violation by the sending peer, and it must hence be assumed that it
is either malicious or buggy. Either way, the correct behavior is now
to reset the link instead of trying to revert its sequence number.
So, this is what we do in this commit.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the first fragment of a long data data message is received on a link, a
reassembly buffer large enough to hold the data from this and all subsequent
fragments of the message is allocated. The payload of each new fragment is
copied into this buffer upon arrival. When the last fragment is received, the
reassembled message is delivered upwards to the port/socket layer.
Not only is this an inefficient approach, but it may also cause bursts of
reassembly failures in low memory situations. since we may fail to allocate
the necessary large buffer in the first place. Furthermore, after 100 subsequent
such failures the link will be reset, something that in reality aggravates the
situation.
To remedy this problem, this patch introduces a different approach. Instead of
allocating a big reassembly buffer, we now append the arriving fragments
to a reassembly chain on the link, and deliver the whole chain up to the
socket layer once the last fragment has been received. This is safe because
the retransmission layer of a TIPC link always delivers packets in strict
uninterrupted order, to the reassembly layer as to all other upper layers.
Hence there can never be more than one fragment chain pending reassembly at
any given time in a link, and we can trust (but still verify) that the
fragments will be chained up in the correct order.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a message fragment is received in a broadcast or unicast link,
the reception code will append the fragment payload to a big reassembly
buffer through a call to the function tipc_recv_fragm(). However, after
the return of that call, the logics goes on and passes the fragment
buffer to the function tipc_net_route_msg(), which will simply drop it.
This behavior is a remnant from the now obsolete multi-cluster
functionality, and has no relevance in the current code base.
Although currently harmless, this unnecessary call would be fatal
after applying the next patch in this series, which introduces
a completely new reassembly algorithm. So we change the code to
eliminate the redundant call.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The message dispatching part of tipc_recv_msg() is wrapped layers of
while/if/if/switch, causing out-of-control indentation and does not
look very good. We reduce two indentation levels by separating the
message dispatching from the blocks that checks link state and
sequence numbers, allowing longer function and arg names to be
consistently indented without wrapping. Additionally we also rename
"cont" label to "discard" and add one new label called "unlock_discard"
to make code clearer. In all, these are cosmetic changes that do not
alter the operation of TIPC in any way.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: David Laight <david.laight@aculab.com>
Cc: Andreas Bofjäll <andreas.bofjall@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When checking statistics or changing parameters on a link, the
link_find_link function is used to locate the link with a given
name. The complex method of deconstructing the name into local
and remote address/interface is error prone and may fail if the
interface names contains special characters. We change the lookup
method to iterate over the list of nodes and compare the link
names.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
link_cmd_set_value() takes commands for link, bearer and media related
configuration. Genereally the function returns 0 when a command is
recognized, and -EINVAL when it is not. However, in the switch for link
related commands it returns 0 even when the command is unrecognized. This
will sometimes make it look as if a failed configuration command has been
successful, but has otherwise no negative effects.
We remove this anomaly by returning -EINVAL even for link commands. We also
rework all three switches to make them conforming to common kernel coding
style.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, rcv_msg() always returns zero on a packet delivery upcall
from net_device.
To make its behavior more compliant with the way this API should be
used, we change this to let it return NET_RX_SUCCESS (which is zero
anyway) when it is able to handle the packet, and NET_RX_DROP otherwise.
The latter does not imply any functional change, it only enables the
driver to keep more accurate statistics about the fate of delivered
packets.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tipc_block_bearer() currently takes a bearer name (const char*)
as argument. This requires the function to make a lookup to find
the pointer to the corresponding bearer struct. In the current
code base this is not necessary, since the only two callers
(tipc_continue(),recv_notification()) already have validated
copies of this pointer, and hence can pass it directly in the
function call.
We change tipc_block_bearer() to directly take struct tipc_bearer*
as argument instead.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TIPC 'bearer' exists as an abstract concept, while 'media'
is deemed a specific implementation of a bearer, such as Ethernet
or Infiniband media. When a component inside TIPC wants to control
a specific media, it only needs to access the generic bearer API
to achieve this. However, in the current media implementations,
the 'bearer' name is also extensively used in media specific
function and variable names.
This may create confusion, so we choose to replace the term 'bearer'
with 'media' in all function names, variable names, and prefixes
where this is what really is meant.
Note that this change is cosmetic only, and no runtime behaviour
changes are made here.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tipc_msg_build() now copies message data from iovec to skb_buff
using memcpy_fromiovecend(), which doesn't need to be passed the
iovec length to perform the copying.
So we remove the parameter indicating iovec length in all
functions where TIPC messages are built and sent.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tipc_msg_build() calls skb_copy_to_linear_data_offset() to copy data
from user space to kernel space. However, the latter function does
in its turn call memcpy() to perform the actual copying. This poses
an obvious security and robustness risk, since memcpy() never makes
any validity check on the pointer it is copying from.
To correct this, we the replace the offending function call with
a call to memcpy_fromiovecend(), which uses copy_from_user() to
perform the copying.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Should a connect fail, if the publication/server is unavailable or
due to some other error, a positive value will be returned and errno
is never set. If the application code checks for an explicit zero
return from connect (success) or a negative return (failure), it
will not catch the error and subsequent send() calls will fail as
shown from the strace snippet below.
socket(0x1e /* PF_??? */, SOCK_SEQPACKET, 0) = 3
connect(3, {sa_family=0x1e /* AF_??? */, sa_data="\2\1\322\4\0\0\322\4\0\0\0\0\0\0"}, 16) = 111
sendto(3, "test", 4, 0, NULL, 0) = -1 EPIPE (Broken pipe)
The reason for this behaviour is that TIPC wrongly inverts error
codes set in sk_err.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert enable_bearer() to RCU locking with dev_get_by_name().
Based on a similar changeset in commit 840a185d ["aoe: remove
dev_base_lock use from aoecmd_cfg_pkts()"] -- quoting that:
"dev_base_lock is the legacy way to lock the device list,
and is planned to disappear. (writers hold RTNL, readers
hold RCU lock)"
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>