kernel

mirror of https://github.com/ukui/kernel.git synced 2026-03-09 10:07:04 -07:00

Author	SHA1	Message	Date
Sowmini Varadhan	26e4e6bb68	RDS: TCP: Remove dead logic around c_passive in rds-tcp The c_passive bit is only intended for the IB transport and will never be encountered in rds-tcp, so remove the dead logic that predicates on this bit. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:45:17 -04:00
Sowmini Varadhan	226f7a7d97	RDS: Rework path specific indirections Refactor code to avoid separate indirections for single-path and multipath transports. All transports (both single and mp-capable) will get a pointer to the rds_conn_path, and can trivially derive the rds_connection from the ->cp_conn. Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:45:17 -04:00
David S. Miller	dc9a20020a	Merge branch 'bpf-cgroup2' Martin KaFai Lau says: ==================== cgroup: bpf: cgroup2 membership test on skb This series is to implement a bpf-way to check the cgroup2 membership of a skb (sk_buff). It is similar to the feature added in netfilter: `c38c4597e4` ("netfilter: implement xt_cgroup cgroup2 path match") The current target is the tc-like usage. v3: - Remove WARN_ON_ONCE(!rcu_read_lock_held()) - Stop BPF_MAP_TYPE_CGROUP_ARRAY usage in patch 2/4 - Avoid mounting bpf fs manually in patch 4/4 - Thanks for Daniel's review and the above suggestions - Check CONFIG_SOCK_CGROUP_DATA instead of CONFIG_CGROUPS. Thanks to the kbuild bot's report. Patch 2/4 only needs CONFIG_CGROUPS while patch 3/4 needs CONFIG_SOCK_CGROUP_DATA. Since a single bpf cgrp2 array alone is not useful for now, CONFIG_SOCK_CGROUP_DATA is also used in patch 2/4. We can fine tune it later if we find other use cases for the cgrp2 array. - Return EAGAIN instead of ENOENT if the cgrp2 array entry is NULL. It is to distinguish these two cases: 1) the userland has not populated this array entry yet. or 2) not finding cgrp2 from the skb. - Be-lated thanks to Alexei and Tejun on reviewing v1 and giving advice on this work. v2: - Fix two return cases in cgroup_get_from_fd() - Fix compilation errors when CONFIG_CGROUPS is not used: - arraymap.c: avoid registering BPF_MAP_TYPE_CGROUP_ARRAY - filter.c: tc_cls_act_func_proto() returns NULL on BPF_FUNC_skb_in_cgroup - Add comments to BPF_FUNC_skb_in_cgroup and cgroup_get_from_fd() ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:32:27 -04:00
Martin KaFai Lau	a3f7461734	cgroup: bpf: Add an example to do cgroup checking in BPF test_cgrp2_array_pin.c: A userland program that creates a bpf_map (BPF_MAP_TYPE_GROUP_ARRAY), pouplates/updates it with a cgroup2's backed fd and pins it to a bpf-fs's file. The pinned file can be loaded by tc and then used by the bpf prog later. This program can also update an existing pinned array and it could be useful for debugging/testing purpose. test_cgrp2_tc_kern.c: A bpf prog which should be loaded by tc. It is to demonstrate the usage of bpf_skb_in_cgroup. test_cgrp2_tc.sh: A script that glues the test_cgrp2_array_pin.c and test_cgrp2_tc_kern.c together. The idea is like: 1. Load the test_cgrp2_tc_kern.o by tc 2. Use test_cgrp2_array_pin.c to populate a BPF_MAP_TYPE_CGROUP_ARRAY with a cgroup fd 3. Do a 'ping -6 ff02::1%ve' to ensure the packet has been dropped because of a match on the cgroup Most of the lines in test_cgrp2_tc.sh is the boilerplate to setup the cgroup/bpf-fs/net-devices/netns...etc. It is not bulletproof on errors but should work well enough and give enough debug info if things did not go well. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:32:13 -04:00
Martin KaFai Lau	4a482f34af	cgroup: bpf: Add bpf_skb_in_cgroup_proto Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk belongs to a descendant of a cgroup2. It is similar to the feature added in netfilter: commit `c38c4597e4` ("netfilter: implement xt_cgroup cgroup2 path match") The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY which will be used by the bpf_skb_in_cgroup. Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY and bpf_skb_in_cgroup() are always used together. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:32:13 -04:00
Martin KaFai Lau	4ed8ec521e	cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations. To update an element, the caller is expected to obtain a cgroup2 backed fd by open(cgroup2_dir) and then update the array with that fd. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:30:38 -04:00
Martin KaFai Lau	1f3fe7ebf6	cgroup: Add cgroup_get_from_fd Add a helper function to get a cgroup2 from a fd. It will be stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will be introduced in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:30:38 -04:00
David S. Miller	6bd3847bdc	Merge branch 'bpf-robustify' Daniel Borkmann says: ==================== Further robustify putting BPF progs This series addresses a potential issue reported to us by Jann Horn with regards to putting progs. First patch moves progs generally under RCU destruction and second patch refactors getting of progs to simplify code a bit. For details, please see individual patches. Note, we think that addressing this one in net-next should be sufficient. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:00:52 -04:00
Daniel Borkmann	113214be7f	bpf: refactor bpf_prog_get and type check into helper Since bpf_prog_get() and program type check is used in a couple of places, refactor this into a small helper function that we can make use of. Since the non RO prog->aux part is not used in performance critical paths and a program destruction via RCU is rather very unlikley when doing the put, we shouldn't have an issue just doing the bpf_prog_get() + prog->type != type check, but actually not taking the ref at all (due to being in fdget() / fdput() section of the bpf fd) is even cleaner and makes the diff smaller as well, so just go for that. Callsites are changed to make use of the new helper where possible. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:00:47 -04:00
Daniel Borkmann	1aacde3d22	bpf: generally move prog destruction to RCU deferral Jann Horn reported following analysis that could potentially result in a very hard to trigger (if not impossible) UAF race, to quote his event timeline: - Set up a process with threads T1, T2 and T3 - Let T1 set up a socket filter F1 that invokes another filter F2 through a BPF map [tail call] - Let T1 trigger the socket filter via a unix domain socket write, don't wait for completion - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put() - Let T3 close the file descriptor for F2, dropping the reference count of F2 to 2 - At this point, T1 should have looked up F2 from the map, but not finished executing it - Let T3 remove F2 from the BPF map, dropping the reference count of F2 to 1 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping the reference count of F2 to 0 and scheduling bpf_prog_free_deferred() via schedule_work() - At this point, the BPF program could be freed - BPF execution is still running in a freed BPF program While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf event fd we're doing the syscall on doesn't disappear from underneath us for whole syscall time, it may not be the case for the bpf fd used as an argument only after we did the put. It needs to be a valid fd pointing to a BPF program at the time of the call to make the bpf_prog_get() and while T2 gets preempted, F2 must have dropped reference to 1 on the other CPU. The fput() from the close() in T3 should also add additionally delay to the reference drop via exit_task_work() when bpf_prog_release() gets called as well as scheduling bpf_prog_free_deferred(). That said, it makes nevertheless sense to move the BPF prog destruction generally after RCU grace period to guarantee that such scenario above, but also others as recently fixed in `ceb5607035` ("bpf, perf: delay release of BPF prog after grace period") with regards to tail calls won't happen. Integrating bpf_prog_free_deferred() directly into the RCU callback is not allowed since the invocation might happen from either softirq or process context, so we're not permitted to block. Reviewing all bpf_prog_put() invocations from eBPF side (note, cBPF -> eBPF progs don't use this for their destruction) with call_rcu() look good to me. Since we don't know whether at the time of attaching the program, we're already part of a tail call map, we need to use RCU variant. However, due to this, there won't be severely more stress on the RCU callback queue: situations with above bpf_prog_get() and bpf_prog_put() combo in practice normally won't lead to releases, but even if they would, enough effort/ cycles have to be put into loading a BPF program into the kernel already. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:00:47 -04:00
Amitoj Kaur Chawla	466fc793da	atm: horizon: Use setup_timer Convert a call to init_timer and accompanying intializations of the timer's data and function fields to a call to setup_timer. The Coccinelle semantic patch that fixes this problem is as follows: @@ expression t,d,f,e1; identifier x1; statement S1; @@ ( -t.data = d; \| -t.function = f; \| -init_timer(&t); +setup_timer(&t,f,d); \| -init_timer_on_stack(&t); +setup_timer_on_stack(&t,f,d); ) <... when != S1 t.x1 = e1; ...> Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:58:26 -04:00
David S. Miller	e3cc6e37a7	Merge branch 'qed-next' Manish Chopra says: ==================== qede: Enhancements This patch series have few small fastpath features support and code refactoring. Note - regarding get/set tunable configuration via ethtool Surprisingly, there is NO ethtool application support for such configuration given that we have kernel support. Do let us know if we need to add support for that in user ethtool. Please consider applying this series to "net-next". ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:58 -04:00
Manish Chopra	831a8e6c40	qede: Bump up driver version to 8.10.1.20 Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:53 -04:00
Manish Chopra	3d789994b0	qede: Add get/set rx copy break tunable support Signed-off-by: Manish <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:53 -04:00
Manish Chopra	312e06761c	qede: Utilize xmit_more This patch uses xmit_more optimization to reduce number of TX doorbells write per packet. Signed-off-by: Manish <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:53 -04:00
Manish Chopra	c774169d8f	qede: qede_poll refactoring This patch cleanups qede_poll() routine a bit and allows qede_poll() to do single iteration to handle TX completion [As under heavy TX load qede_poll() might run for indefinite time in the while(1) loop for TX completion processing and cause CPU stuck]. Signed-off-by: Manish <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:53 -04:00
Manish Chopra	c72a6125d0	qede: Add support for handling IP fragmented packets. When handling IP fragmented packets with csum in their transport header, the csum isn't changed as part of the fragmentation. As a result, the packet containing the transport headers would have the correct csum of the original packet, but one that mismatches the actual packet that passes on the wire. As a result, on receive path HW would give an indication that the packet has incorrect csum, which would cause qede to discard the incoming packet. Since HW also delivers a notification of IP fragments, change driver behavior to pass such incoming packets to stack and let it make the decision whether it needs to be dropped. Signed-off-by: Manish <manish.chopra@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:40:53 -04:00
David S. Miller	beb528d01f	Merge branch 'tun-skb_array' Jason Wang says: ==================== switch to use tx skb array in tun This series tries to switch to use skb array in tun. This is used to eliminate the spinlock contention between producer and consumer. The conversion was straightforward: just introdce a tx skb array and use it instead of sk_receive_queue. A minor issue is to keep the tx_queue_len behaviour, since tun used to use it for the length of sk_receive_queue. This is done through: - add the ability to resize multiple rings at once to avoid handling partial resize failure for mutiple rings. - add the support for zero length ring. - introduce a notifier which was triggered when tx_queue_len was changed for a netdev. - resize all queues during the tx_queue_len changing. Tests shows about 15% improvement on guest rx pps: Before: ~1300000pps After : ~1500000pps Changes from V3: - fix kbuild warnings - call NETDEV_CHANGE_TX_QUEUE_LEN on IFLA_TXQLEN Changes from V2: - add multiple rings resizing support for ptr_ring/skb_array - add zero length ring support - introdce a NETDEV_CHANGE_TX_QUEUE_LEN - drop new flags Changes from V1: - switch to use skb array instead of a customized circular buffer - add non-blocking support - rename .peek to .peek_len - drop lockless peeking since test show very minor improvement ==================== Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-from-altitude: 34697 feet. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:28 -04:00
Jason Wang	1576d98605	tun: switch to use skb array for tx We used to queue tx packets in sk_receive_queue, this is less efficient since it requires spinlocks to synchronize between producer and consumer. This patch tries to address this by: - switch from sk_receive_queue to a skb_array, and resize it when tx_queue_len was changed. - introduce a new proto_ops peek_len which was used for peeking the skb length. - implement a tun version of peek_len for vhost_net to use and convert vhost_net to use peek_len if possible. Pktgen test shows about 15.3% improvement on guest receiving pps for small buffers: Before: ~1300000pps After : ~1500000pps Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:17 -04:00
Jason Wang	08294a26e1	net: introduce NETDEV_CHANGE_TX_QUEUE_LEN This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this will be triggered when tx_queue_len. It could be used by net device who want to do some processing at that time. An example is tun who may want to resize tx array when tx_queue_len is changed. Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:17 -04:00
Jason Wang	bf900b3dbe	skb_array: add wrappers for resizing Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:17 -04:00
Michael S. Tsirkin	59e6ae5324	ptr_ring: support resizing multiple queues Sometimes, we need support resizing multiple queues at once. This is because it was not easy to recover to recover from a partial failure of multiple queues resizing. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:17 -04:00
Jason Wang	fd68adec9d	skb_array: minor tweak Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:16 -04:00
Jason Wang	982fb490c2	ptr_ring: support zero length ring Sometimes, we need zero length ring. But current code will crash since we don't do any check before accessing the ring. This patch fixes this. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:16 -04:00
David S. Miller	8dc7243abb	Merge branch 'sch_hfsc-fixes-cleanups' Michal Soltys says: ==================== HFSC patches, part 1 It's revised version of part of the patches I submitted really, really long time ago (back then I asked Patrick to ignore them as I found some issues shortly after submitting). Anyway this is the first set with very simple fixes/changes though some of them relatively subtle (I tried to do very exhaustive commit messages explaining what and why with those). The patches are against net-next tree. The second set will be heavier - or rather with more complex explanations, among those I have: - a fix to subtle issue introduced in http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281 along with simplifying related stuff - update times to 96 bits (which allows to "just" use 32 bit shifts and improves curve definition accuracy at more extreme low/high speeds) - add curve "merging" instead of just selecting in convex case (computations mirror those from concave intersection) But these are eventually for later. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:51 -04:00

1 2 3 4 5 ...

603882 Commits