What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.
There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for
migration, bug work-arounds in userspace)
- write logging is supported (good for migration)
- support memory table and not just an offset (needed for kvm)
common virtio related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part : this supplied
me with witty comments I wouldn't be able to write myself.
What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.
How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a tap
device. Backend is also configured by userspace, including vlan/mac
etc.
Status: This works for me, and I haven't see any crashes.
Compared to userspace, people reported improved latency (as I save up to
4 system calls per packet), as well as better bandwidth and CPU
utilization.
Features that I plan to look at in the future:
- mergeable buffers
- zero copy
- scalability tuning: figure out the best threading model to use
Note on RCU usage (this is also documented in vhost.h, near
private_pointer which is the value protected by this variant of RCU):
what is happening is that the rcu_dereference() is being used in a
workqueue item. The role of rcu_read_lock() is taken on by the start of
execution of the workqueue item, of rcu_read_unlock() by the end of
execution of the workqueue item, and of synchronize_rcu() by
flush_workqueue()/flush_work(). In the future we might need to apply
some gcc attribute or sparse annotation to the function passed to
INIT_WORK(). Paul's ack below is for this RCU usage.
(Includes fixes by Alan Cox <alan@linux.intel.com>,
David L Stevens <dlstevens@us.ibm.com>,
Chris Wright <chrisw@redhat.com>)
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tun device looks similar to a packet socket
in that both pass complete frames from/to userspace.
This patch fills in enough fields in the socket underlying tun driver
to support sendmsg/recvmsg operations, and message flags
MSG_TRUNC and MSG_DONTWAIT, and exports access to this socket
to modules. Regular read/write behaviour is unchanged.
This way, code using raw sockets to inject packets
into a physical device, can support injecting
packets into host network stack almost without modification.
First user of this interface will be vhost virtualization
accelerator.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds error checking of ctrlmode values for CAN devices. As
an example all availabe bits are implemented in the mcp251x driver.
Signed-off-by: Christian Pellegrin <chripell@fsfe.org>
Acked-by: Wolfgang Grandegger <wg@grandegger.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert code away from ->read_proc/->write_proc interfaces. Switch to
proc_create()/proc_create_data() which make addition of proc entries
reliable wrt NULL ->proc_fops, NULL ->data and so on.
Problem with ->read_proc et al is described here commit
786d7e1612 "Fix rmmod/read/write races in
/proc entries"
[akpm@linux-foundation.org: CONFIG_PROC_FS=n build fix]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: Karsten Keil <keil@b1-systems.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
To prevent the CAN drivers to operate on invalid socketbuffers the skbs are
now checked and silently dropped at the xmit-function consistently.
Also the netdev stats are consistently using the CAN data length code (dlc)
for [rx|tx]_bytes now.
Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net>
Acked-by: Wolfgang Grandegger <wg@grandegger.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the kernel portions needed to implement
RFC 5082 Generalized TTL Security Mechanism (GTSM).
It is a lightweight security measure against forged
packets causing DoS attacks (for BGP).
This is already implemented the same way in BSD kernels.
For the necessary Quagga patch
http://www.gossamer-threads.com/lists/quagga/dev/17389
Description from Cisco
http://www.cisco.com/en/US/docs/ios/12_3t/12_3t7/feature/guide/gt_btsh.html
It does add one byte to each socket structure, but I did
a little rearrangement to reuse a hole (on 64 bit), but it
does grow the structure on 32 bit
This should be documented on ip(4) man page and the Glibc in.h
file also needs update. IPV6_MINHOPLIMIT should also be added
(although BSD doesn't support that).
Only TCP is supported, but could also be added to UDP, DCCP, SCTP
if desired.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to be used together with switch technologies, like RFC3069,
that where the individual ports are not allowed to communicate with
each other, but they are allowed to talk to the upstream router. As
described in RFC 3069, it is possible to allow these hosts to
communicate through the upstream router by proxy_arp'ing.
This patch basically allow proxy arp replies back to the same
interface (from which the ARP request/solicitation was received).
Tunable per device via proc "proxy_arp_pvlan":
/proc/sys/net/ipv4/conf/*/proxy_arp_pvlan
This switch technology is known by different vendor names:
- In RFC 3069 it is called VLAN Aggregation.
- Cisco and Allied Telesyn call it Private VLAN.
- Hewlett-Packard call it Source-Port filtering or port-isolation.
- Ericsson call it MAC-Forced Forwarding (RFC Draft).
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the flag CAN_CTRLMODE_ONE_SHOT. It is used as mask
or flag in the "struct can_ctrlmode".
It allows userspace via netlink to set a CAN controller into the special
"one-shot" mode. In this mode, if supported by the CAN controller, it
tries only once to deliver a CAN frame and aborts it if an error
(e.g.: arbitration lost) happens.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Wolfgang Grandegger <wg@grandegger.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since hibernation assumes power loss, we should fully reinitialize
PHYs (including platform fixups), as if PHYs were just attached.
This patch factors phy_init_hw() out of phy_attach_direct(), then
converts mdio_bus to dev_pm_ops and adds an appropriate restore()
callback.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new commands for requesting the driver to remain awake
on a specified channel for the specified amount of time
(and another command to cancel such an operation). This
can be used to implement userspace-controlled off-channel
operations, like Public Action frame exchange on another
channel than the operation channel.
The off-channel operation should behave similarly to scan,
i.e. the local station (if associated) moves into power
save mode to request the AP to buffer frames for it and
then moves to the other channel to allow the off-channel
operation to be completed. The duration parameter can be
used to request enough time to receive a response from
the target station.
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Currently, we insert all user-specified IEs before the HT
IE for association, and after the HT IE for probe requests.
For association, that's correct only if the user-specified
IEs are RSN only, incorrect in all other cases including
WPA. Change this to split apart the user-specified IEs in
two places for association: before the HT IE (e.g. RSN),
after the HT IE (generally empty right now I think?) and
after WMM (all other vendor-specific IEs). For probes,
split the IEs in different places to be correct according
to the spec.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
when using policy routing and the skb mark:
there are cases where a back path validation requires us
to use a different routing table for src ip validation than
the one used for mapping ingress dst ip.
One such a case is transparent proxying where we pretend to be
the destination system and therefore the local table
is used for incoming packets but possibly a main table would
be used on outbound.
Make the default behavior to allow the above and if users
need to turn on the symmetry via sysctl src_valid_mark
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add rtnetlink init_rcvwnd to set the TCP initial receive window size
advertised by passive and active TCP connections.
The current Linux TCP implementation limits the advertised TCP initial
receive window to the one prescribed by slow start. For short lived
TCP connections used for transaction type of traffic (i.e. http
requests), bounding the advertised TCP initial receive window results
in increased latency to complete the transaction.
Support for setting initial congestion window is already supported
using rtnetlink init_cwnd, but the feature is useless without the
ability to set a larger TCP initial receive window.
The rtnetlink init_rcvwnd allows increasing the TCP initial receive
window, allowing TCP connection to advertise larger TCP receive window
than the ones bounded by slow start.
Signed-off-by: Laurent Chavey <chavey@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
devtmpfs: unlock mutex in case of string allocation error
Driver core: export platform_device_register_data as a GPL symbol
driver core: Prevent reference to freed memory on error path
Driver-core: Fix bogus 0 error return in device_add()
Driver core: driver_attribute parameters can often be const*
Driver core: bin_attribute parameters can often be const*
Driver core: device_attribute parameters can often be const*
Doc/stable rules: add new cherry-pick logic
vfs: get_sb_single() - do not pass options twice
devtmpfs: Convert dirlock to a mutex
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6:
Staging/vt66*: kconfig, depends on WLAN
Staging: batman-adv: introduce missing kfree
Staging: batman-adv: Add Kconfig dependancies on PROC_FS and PACKET.
Staging: panel: Adjust range for PANEL_KEYPAD in Kconfig
Staging: panel: Fix compilation error with custom lcd charset
Staging: ramzswap: remove ARM specific d-cache hack
Staging: rtl8192x: fix printk formats
Staging: wlan-ng: fix Correct size given to memset
staging: rtl8192su: add USB VID/PID for HWNUm-300
staging: fix rtl8192su compilation errors with mac80211
staging: fix rtl8192e compilation errors with mac80211
Staging: fix rtl8187se compilation errors with mac80211
Staging: rtl8192su: fix test for negative error in rtl8192_rx_isr()
Staging: comedi: jr3_pci: Don't ioremap too much space. Check result.
Staging: comedi: removed "depricated" from COMEDI_CB_BLOCK
Staging: comedi: usbdux.c: fix locking up of the driver when the comedi ringbuffer runs empty
Staging: dst: remove from the tree
Staging: sm7xx: add a new framebuffer driver
Staging: batman: fix debug Kconfig option
DST is dead, no one is using it and upstream
has abandoned it, so remove it from the tree because
it is not going anywhere.
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Many struct driver_attribute descriptors are purely read-only
structures, and there's no need to change them. Therefore make
the promise not to, which will let those descriptors be put in
a ro section.
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>