Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net'
overlapped the renaming of a netlink attribute in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
The initial session number when a link is created is based on a random
value, taken from struct tipc_net->random. It is then incremented for
each link reset to avoid mixing protocol messages from different link
sessions.
However, when a bearer is reset all its links are deleted, and will
later be re-created using the same random value as the first time.
This means that if the link never went down between creation and
deletion we will still sometimes have two subsequent sessions with
the same session number. In virtual environments with potentially
long transmission times this has turned out to be a real problem.
We now fix this by randomizing the session number each time a link
is created.
With a session number size of 16 bits this gives a risk of session
collision of 1/64k. To reduce this further, we also introduce a sanity
check on the very first STATE message arriving at a link. If this has
an acknowledge value differing from 0, which is logically impossible,
we ignore the message. The final risk for session collision is hence
reduced to 1/4G, which should be sufficient.
Signed-off-by: LUU Duc Canh <canh.d.luu@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We see the following scenario:
1) Link endpoint B on node 1 discovers that its peer endpoint is gone.
Since there is a second working link, failover procedure is started.
2) Link endpoint A on node 1 sends a FAILOVER message to peer endpoint
A on node 2. The node item 1->2 goes to state FAILINGOVER.
3) Linke endpoint A/2 receives the failover, and is supposed to take
down its parallell link endpoint B/2, while producing a FAILOVER
message to send back to A/1.
4) However, B/2 has already been deleted, so no FAILOVER message can
created.
5) Node 1->2 remains in state FAILINGOVER forever, refusing to receive
any messages that can bring B/1 up again. We are left with a non-
redundant link between node 1 and 2.
We fix this with letting endpoint A/2 build a dummy FAILOVER message
to send to back to A/1, so that the situation can be resolved.
Signed-off-by: LUU Duc Canh <canh.d.luu@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Default socket receive buffer size for a listener socket is 2Mb. For
each arriving empty SYN, the linux kernel allocates a 768 bytes buffer.
This means that a listener socket can serve maximum 2700 simultaneous
empty connection setup requests before it hits a receive buffer
overflow, and much fewer if the SYN is carrying any significant
amount of data.
When this happens the setup request is rejected, and the client
receives an ECONNREFUSED error.
This commit mitigates this problem by letting the client socket try to
retransmit the SYN message multiple times when it sees it rejected with
the code TIPC_ERR_OVERLOAD. Retransmission is done at random intervals
in the range of [100 ms, setup_timeout / 4], as many times as there is
room for within the setup timeout limit.
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Messages intended for intitating a connection are currently
indistinguishable from regular datagram messages. The TIPC
protocol specification defines bit 17 in word 0 as a SYN bit
to allow sanity check of such messages in the listening socket,
but this has so far never been implemented.
We do that in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We refactor the function tipc_sk_filter_connect(), both to make it
more readable and as a preparation for the next commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function tipc_msg_reverse() is reversing the header of a message
while reusing the original buffer. We have seen at several occasions
that this may have unfortunate side effects when the buffer to be
reversed is a clone.
In one of the following commits we will again need to reverse cloned
buffers, so this is the right time to permanently eliminate this
problem. In this commit we let the said function always consume the
original buffer and replace it with a new one when applicable.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tipc_link_reset() we copy the wakeup queue to input queue using
skb_queue_splice_init(link->wakeupq, link->inputq).
This is performed without holding any locks. The lists might be
simultaneously be accessed by other cpu threads in tipc_sk_rcv(),
something leading to to random missing packets.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case of implicit connect message with data > 1K, the flow
control accounting is incorrect. At this state, the socket does not
know the peer nodes capability and falls back to legacy flow control
by return 1, however the receiver of this message will perform the
new block accounting. This leads to a slack and eventually traffic
disturbance.
In this commit, we perform tipc_node_get_capabilities() at implicit
connect and perform accounting based on the peer's capability.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An SKB is not on a list if skb->next is NULL.
Codify this convention into a helper function and use it
where we are dequeueing an SKB and need to mark it as such.
Signed-off-by: David S. Miller <davem@davemloft.net>
__tipc_nl_compat_dumpit() uses a netlink_callback on stack,
so the only way to align it with other ->dumpit() call path
is calling tipc_dump_start() and tipc_dump_done() directly
inside it. Otherwise ->dumpit() would always get NULL from
cb->args[].
But tipc_dump_start() uses sock_net(cb->skb->sk) to retrieve
net pointer, the cb->skb here doesn't set skb->sk, the net pointer
is saved in msg->net instead, so introduce a helper function
__tipc_dump_start() to pass in msg->net.
Ying pointed out cb->args[0...3] are already used by other
callbacks on this call path, so we can't use cb->args[0] any
more, use cb->args[4] instead.
Fixes: 9a07efa9ae ("tipc: switch to rhashtable iterator")
Reported-and-tested-by: syzbot+e93a2c41f91b8e2c7d9b@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the following obsolete parameter comments of tipc_topsrv struct:
@rcvbuf_cache
@tipc_conn_new
@tipc_conn_release
@tipc_conn_recvmsg
@imp
@type
Add the comments for the missing parameters below of tipc_topsrv struct:
@awork
@listener
Remove the unused or duplicated parameter comments of tipc_conn struct:
@outqueue_lock
@rx_action
Signed-off-by: Zhenbo Gao <zhenbo.gao@windriver.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported a use-after-free in tipc_group_fill_sock_diag(),
where tipc_group_fill_sock_diag() still reads tsk->group meanwhile
tipc_group_delete() just deletes it in tipc_release().
tipc_nl_sk_walk() aims to lock this sock when walking each sock
in the hash table to close race conditions with sock changes like
this one, by acquiring tsk->sk.sk_lock.slock spinlock, unfortunately
this doesn't work at all. All non-BH call path should take
lock_sock() instead to make it work.
tipc_nl_sk_walk() brutally iterates with raw rht_for_each_entry_rcu()
where RCU read lock is required, this is the reason why lock_sock()
can't be taken on this path. This could be resolved by switching to
rhashtable iterator API's, where taking a sleepable lock is possible.
Also, the iterator API's are friendly for restartable calls like
diag dump, the last position is remembered behind the scence,
all we need to do here is saving the iterator into cb->args[].
I tested this with parallel tipc diag dump and thousands of tipc
socket creation and release, no crash or memory leak.
Reported-by: syzbot+b9c8f3ab2994b7cd1625@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In function tipc_dest_push, the 32bit variables 'node' and 'port'
are stored separately in uppper and lower part of 64bit 'value'.
Then this value is assigned to dst->value which is a union like:
union
{
struct {
u32 port;
u32 node;
};
u64 value;
}
This works on little-endian machines like x86 but fails on big-endian
machines.
The fix remove the 'value' stack parameter and even the 'value'
member of the union in tipc_dest, assign the 'node' and 'port' member
directly with the input parameter to avoid the endian issue.
Fixes: a80ae5306a ("tipc: improve destination linked list")
Signed-off-by: Zhenbo Gao <zhenbo.gao@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Haiqing Bai <Haiqing.Bai@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9faa89d4ed ("tipc: make function tipc_net_finalize() thread
safe") tries to make it thread safe to set node address, so it uses
node_list_lock lock to serialize the whole process of setting node
address in tipc_net_finalize(). But it causes the following interrupt
unsafe locking scenario:
CPU0 CPU1
---- ----
rht_deferred_worker()
rhashtable_rehash_table()
lock(&(&ht->lock)->rlock)
tipc_nl_compat_doit()
tipc_net_finalize()
local_irq_disable();
lock(&(&tn->node_list_lock)->rlock);
tipc_sk_reinit()
rhashtable_walk_enter()
lock(&(&ht->lock)->rlock);
<Interrupt>
tipc_disc_rcv()
tipc_node_check_dest()
tipc_node_create()
lock(&(&tn->node_list_lock)->rlock);
*** DEADLOCK ***
When rhashtable_rehash_table() holds ht->lock on CPU0, it doesn't
disable BH. So if an interrupt happens after the lock, it can create
an inverse lock ordering between ht->lock and tn->node_list_lock. As
a consequence, deadlock might happen.
The reason causing the inverse lock ordering scenario above is because
the initial purpose of node_list_lock is not designed to do the
serialization of node address setting.
As cmpxchg() can guarantee CAS (compare-and-swap) process is atomic,
we use it to replace node_list_lock to ensure setting node address can
be atomically finished. It turns out the potential deadlock can be
avoided as well.
Fixes: 9faa89d4ed ("tipc: make function tipc_net_finalize() thread safe")
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <maloy@donjonn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>