Pull the v3.5 RCU tree from Paul E. McKenney:
1) A set of improvements and fixes to the RCU_FAST_NO_HZ feature
(with more on the way for 3.6). Posted to LKML:
https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
https://lkml.org/lkml/2012/4/16/611 (commit 4),
https://lkml.org/lkml/2012/4/30/390 (commit 6), and
https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
the other commits for the convenience of the tester).
2) Changes to make rcu_barrier() avoid disrupting execution of CPUs
that have no RCU callbacks. Posted to LKML:
https://lkml.org/lkml/2012/4/23/322.
3) A couple of commits that improve the efficiency of the interaction
between preemptible RCU and the scheduler, these two being all
that survived an abortive attempt to allow preemptible RCU's
__rcu_read_lock() to be inlined. The full set was posted to
LKML at https://lkml.org/lkml/2012/4/14/143, and the first and
third patches of that set remain.
4) Lai Jiangshan's algorithmic implementation of SRCU, which includes
call_srcu() and srcu_barrier(). A major feature of this new
implementation is that synchronize_srcu() no longer disturbs
the execution of other CPUs. This work is based on earlier
implementations by Peter Zijlstra and Paul E. McKenney. Posted to
LKML: https://lkml.org/lkml/2012/2/22/82.
5) A number of miscellaneous bug fixes and improvements which were
posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
subsequent updates posted to LKML.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David S. Miller:
1) Since we do RCU lookups on ipv4 FIB entries, we have to test if the
entry is dead before returning it to our caller.
2) openvswitch locking and packet validation fixes from Ansis Atteka,
Jesse Gross, and Pravin B Shelar.
3) Fix PM resume locking in IGB driver, from Benjamin Poirier.
4) Fix VLAN header handling in vhost-net and macvtap, from Basil Gor.
5) Revert a bogus network namespace isolation change that was causing
regressions on S390 networking devices.
6) If bonding decides to process and handle a LACPDU frame, we
shouldn't bump the rx_dropped counter. From Jiri Bohac.
7) Fix mis-calculation of available TX space in r8169 driver when doing
TSO, which can lead to crashes and/or hung device. From Julien
Ducourthial.
8) SCTP does not validate cached routes properly in all cases, from
Nicolas Dichtel.
9) Link status interrupt needs to be handled in ks8851 driver, from
Stephen Boyd.
10) Use capable(), not cap_raised(), in connector/userns netlink code.
From Eric W. Biederman via Andrew Morton.
11) Fix pktgen OOPS on module unload, from Eric Dumazet.
12) iwlwifi under-estimates SKB truesizes, also from Eric Dumazet.
13) Cure division by zero in SFC driver, from Ben Hutchings.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
ks8851: Update link status during link change interrupt
macvtap: restore vlan header on user read
vhost-net: fix handle_rx buffer size
bonding: don't increase rx_dropped after processing LACPDUs
connector/userns: replace netlink uses of cap_raised() with capable()
sctp: check cached dst before using it
pktgen: fix crash at module unload
Revert "net: maintain namespace isolation between vlan and real device"
ehea: fix losing of NEQ events when one event occurred early
igb: fix rtnl race in PM resume path
ipv4: Do not use dead fib_info entries.
r8169: fix unsigned int wraparound with TSO
sfc: Fix division by zero when using one RX channel and no SR-IOV
openvswitch: Validation of IPv6 set port action uses IPv4 header
net: compare_ether_addr[_64bits]() has no ordering
cdc_ether: Ignore bogus union descriptor for RNDIS devices
bnx2x: bug fix when loading after SAN boot
e1000: Silence sparse warnings by correcting type
igb, ixgbe: netdev_tx_reset_queue incorrectly called from tx init path
openvswitch: Release rtnl_lock if ovs_vport_cmd_build_info() failed.
...
barrier: Reduce the amount of disturbance by rcu_barrier() to the rest of
the system. This branch also includes improvements to
RCU_FAST_NO_HZ, which are included here due to conflicts.
fixes: Miscellaneous fixes.
inline: Remaining changes from an abortive attempt to inline
preemptible RCU's __rcu_read_lock(). These are (1) making
exit_rcu() avoid unnecessary work and (2) avoiding having
preemptible RCU record a blocked thread when the scheduler
declines to do a context switch.
srcu: Lai Jiangshan's algorithmic implementation of SRCU, including
call_srcu().
dst_check() will take care of SA (and obsolete field), hence
IPsec rekeying scenario is taken into account.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Vlad Yaseivch <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 8a83a00b07.
It causes regressions for S390 devices, because it does an
unconditional DST drop on SKBs for vlans and the QETH device
needs the neighbour entry hung off the DST for certain things
on transmit.
Arnd can't remember exactly why he even needed this change.
Conflicts:
drivers/net/macvlan.c
net/8021q/vlan_dev.c
net/core/dev.c
Signed-off-by: David S. Miller <davem@davemloft.net>
The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a
CPU goes offline, in which case it assumes that the CPU will have to come
out of dyntick-idle mode (cancelling the timer) in order to go offline.
This is important because when RCU_FAST_NO_HZ permits a CPU to enter
dyntick-idle mode despite having RCU callbacks pending, it posts a timer
on that CPU to force a wakeup on that CPU. This wakeup ensures that the
CPU will eventually handle the end of the grace period, including invoking
its RCU callbacks.
However, Pascal Chapperon's test setup shows that the timer handler
rcu_idle_gp_timer_func() really does get invoked in some cases. This is
problematic because this can cause the CPU that entered dyntick-idle
mode despite still having RCU callbacks pending to remain in
dyntick-idle mode indefinitely, which means that its RCU callbacks might
never be invoked. This situation can result in grace-period delays or
even system hangs, which matches Pascal's observations of slow boot-up
and shutdown (https://lkml.org/lkml/2012/4/5/142). See also the bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=806548
This commit therefore causes the "should never be invoked" timer handler
rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up
the CPU for which the timer was intended, allowing that CPU to invoke
its RCU callbacks in a timely manner.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Neither compare_ether_addr() nor compare_ether_addr_64bits()
(as it can fall back to the former) have comparison semantics
like memcmp() where the sign of the return value indicates sort
order. We had a bug in the wireless code due to a blind memcmp
replacement because of this.
A cursory look suggests that the wireless bug was the only one
due to this semantic difference.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 fixes form Peter Anvin
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
intel_mid_powerbtn: mark irq as IRQF_NO_SUSPEND
arch/x86/platform/geode/net5501.c: change active_low to 0 for LED driver
x86, relocs: Remove an unused variable
asm-generic: Use __BITS_PER_LONG in statfs.h
x86/amd: Re-enable CPU topology extensions in case BIOS has disabled it
Pull an ACPI patch from Len Brown:
"It fixes a D3 issue new in 3.4-rc1."
By Lin Ming via Len Brown:
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
ACPI: Fix D3hot v D3cold confusion
Before this patch, ACPI_STATE_D3 incorrectly referenced D3hot
in some places, but D3cold in other places.
After this patch, ACPI_STATE_D3 always means ACPI_STATE_D3_COLD;
and all references to D3hot use ACPI_STATE_D3_HOT.
ACPI's _PR3 method is used to enter both D3hot and D3cold states.
What distinguishes D3hot from D3cold is the presence _PR3
(Power Resources for D3hot) If these resources are all ON,
then the state is D3hot. If _PR3 is not present,
or all _PR0 resources for the devices are OFF,
then the state is D3cold.
This patch applies after Linux-3.4-rc1.
A future syntax cleanup may remove ACPI_STATE_D3
to emphasize that it always means ACPI_STATE_D3_COLD.
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Aaron Lu <aaron.lu@amd.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The normal read_seqcount_begin() function will wait for any current
writers to exit their critical region by looping until the sequence
count is even.
That "wait for sequence count to stabilize" is the right thing to do if
the read-locker will just retry the whole operation on contention: no
point in doing a potentially expensive reader sequence if we know at the
beginning that we'll just end up re-doing it all.
HOWEVER. Some users don't actually retry the operation, but instead
will abort and do the operation with proper locking. So the sequence
count case may be the optimistic quick case, but in the presense of
writers you may want to do full locking in order to guarantee forward
progress. The prime example of this would be the RCU name lookup.
And in that case, you may well be better off without the "retry early",
and are in a rush to instead get to the failure handling. Thus this
"raw" interface that just returns the sequence number without testing it
- it just forces the low bit to zero so that read_seqcount_retry() will
always fail such a "active concurrent writer" scenario.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We really need to use a ACCESS_ONCE() on the sequence value read in
__read_seqcount_begin(), because otherwise the compiler might end up
reloading the value in between the test and the return of it. As a
result, it might end up returning an odd value (which means that a write
is in progress).
If the reader is then fast enough that that odd value is still the
current one when the read_seqcount_retry() is done, we might end up with
a "successful" read sequence, even despite the concurrent write being
active.
In practice this probably never really happens - there just isn't
anything else going on around the read of the sequence count, and the
common case is that we end up having a read barrier immediately
afterwards.
So the code sequence in which gcc might decide to reaload from memory is
small, and there's no reason to believe it would ever actually do the
reload. But if the compiler ever were to decide to do so, it would be
incredibly annoying to debug. Let's just make sure.
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) Transfer padding was wrong for full-speed USB in ASIX driver, fix
from Ingo van Lil.
2) Propagate the negative packet offset fix into the PowerPC BPF JIT.
From Jan Seiffert.
3) dl2k driver's private ioctls were letting unprivileged tasks make
MII writes and other ugly bits like that. Fix from Jeff Mahoney.
4) Fix TX VLAN and RX packet drops in ucc_geth, from Joakim Tjernlund.
5) OOPS and network namespace fixes in IPVS from Hans Schillstrom and
Julian Anastasov.
6) Fix races and sleeping in locked context bugs in drop_monitor, from
Neil Horman.
7) Fix link status indication in smsc95xx driver, from Paolo Pisati.
8) Fix bridge netfilter OOPS, from Peter Huang.
9) L2TP sendmsg can return on error conditions with the socket lock
held, oops. Fix from Sasha Levin.
10) udp_diag should return meaningful values for socket memory usage,
from Shan Wei.
11) Eric Dumazet is so awesome he gets his own section:
Socket memory cgroup code (I never should have applied those
patches, grumble...) made erroneous changes to
sk_sockets_allocated_read_positive(). It was changed to
use percpu_counter_sum_positive (which requires BH disabling)
instead of percpu_counter_read_positive (which does not).
Revert back to avoid crashes and lockdep warnings.
Adjust the default tcp_adv_win_scale and tcp_rmem[2] values
to fix throughput regressions. This is necessary as a result
of our more precise skb->truesize tracking.
Fix SKB leak in netem packet scheduler.
12) New device IDs for various bluetooth devices, from Manoj Iyer,
AceLan Kao, and Steven Harms.
13) Fix command completion race in ipw2200, from Stanislav Yakovlev.
14) Fix rtlwifi oops on unload, from Larry Finger.
15) Fix hard_mtu when adjusting hard_header_len in smsc95xx driver.
From Stephane Fillod.
16) ehea driver registers it's IRQ before all the necessary state is
setup, resulting in crashes. Fix from Thadeu Lima de Souza
Cascardo.
17) Fix PHY connection failures in davinci_emac driver, from Anatolij
Gustschin.
18) Missing break; in switch statement in bluetooth's
hci_cmd_complete_evt(). Fix from Szymon Janc.
19) Fix queue programming in iwlwifi, from Johannes Berg.
20) Interrupt throttling defaults not being actually programmed into the
hardware, fix from Jeff Kirsher and Ying Cai.
21) TLAN driver SKB encoding in descriptor busted on 64-bit, fix from
Benjamin Poirier.
22) Fix blind status block RX producer pointer deref in TG3 driver, from
Matt Carlson.
23) Promisc and multicast are busted on ehea, fixes from Thadeu Lima de
Souza Cascardo.
24) Fix crashes in 6lowpan, from Alexander Smirnov.
25) tcp_complete_cwr() needs to be careful to not rewind the CWND to
ssthresh if ssthresh has the "infinite" value. Fix from Yuchung
Cheng.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
sungem: Fix WakeOnLan
tcp: change tcp_adv_win_scale and tcp_rmem[2]
net: l2tp: unlock socket lock before returning from l2tp_ip_sendmsg
drop_monitor: prevent init path from scheduling on the wrong cpu
usbnet: fix failure handling in usbnet_probe
usbnet: fix leak of transfer buffer of dev->interrupt
ucc_geth: Add 16 bytes to max TX frame for VLANs
net: ucc_geth, increase no. of HW RX descriptors
netem: fix possible skb leak
sky2: fix receive length error in mixed non-VLAN/VLAN traffic
sky2: propogate rx hash when packet is copied
net: fix two typos in skbuff.h
cxgb3: Don't call cxgb_vlan_mode until q locks are initialized
ixgbe: fix calling skb_put on nonlinear skb assertion bug
ixgbe: Fix a memory leak in IEEE DCB
igbvf: fix the bug when initializing the igbvf
smsc75xx: enable mac to detect speed/duplex from phy
smsc75xx: declare smsc75xx's MII as GMII capable
smsc75xx: fix phy interrupt acknowledge
smsc75xx: fix phy init reset loop
...
When running preemptible RCU, if a task exits in an RCU read-side
critical section having blocked within that same RCU read-side critical
section, the task must be removed from the list of tasks blocking a
grace period (perhaps the current grace period, perhaps the next grace
period, depending on timing). The exit() path invokes exit_rcu() to
do this cleanup.
However, the current implementation of exit_rcu() needlessly does the
cleanup even if the task did not block within the current RCU read-side
critical section, which wastes time and needlessly increases the size
of the state space. Fix this by only doing the cleanup if the current
task is actually on the list of tasks blocking some grace period.
While we are at it, consolidate the two identical exit_rcu() functions
into a single function.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
kernel/rcupdate.c
Currently, PREEMPT_RCU readers are enqueued upon entry to the scheduler.
This is inefficient because enqueuing is required only if there is a
context switch, and entry to the scheduler does not guarantee a context
switch.
The commit therefore moves the enqueuing to immediately precede the
call to switch_to() from the scheduler.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull SCSI fixes from James Bottomley:
"This is a set of SAS and SATA fixes; there are one or two longstanding
bug fixes, but most of this is regression fixes."
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
[SCSI] libfc: update mfs boundry checking
[SCSI] Revert "[SCSI] libsas: fix sas port naming"
[SCSI] libsas: fix false positive 'device attached' conditions
[SCSI] libsas, libata: fix start of life for a sas ata_port
[SCSI] libsas: fix ata_eh clobbering ex_phys via smp_ata_check_ready
[SCSI] libsas: unify domain_device sas_rphy lifetimes
[SCSI] libsas: fix sas_get_port_device regression
[SCSI] libsas: fix sas_find_bcast_phy() in the presence of 'vacant' phys
[SCSI] libsas: introduce sas_work to fix sas_drain_work vs sas_queue_work
[SCSI] libata: Pass correct DMA device to scsi host
[SCSI] scsi_lib: use correct DMA device in __scsi_alloc_queue
This commit implements an SRCU state machine in support of call_srcu().
The state machine is preemptible, light-weight, and single-threaded,
minimizing synchronization overhead. In particular, there is no longer
any need for synchronize_srcu() to be guarded by a mutex.
Expedited processing is handled, at least in the absence of concurrent
grace-period operations on that same srcu_struct structure, by having
the synchronize_srcu_expedited() thread take on the role of the
workqueue thread for one iteration.
There is a reasonable probability that a given SRCU callback will
be invoked on the same CPU that registered it, however, there is no
guarantee. Concurrent SRCU grace-period primitives can cause callbacks
to be executed elsewhere, even in absence of CPU-hotplug operations.
Callbacks execute in process context, but under the influence of
local_bh_disable(), so it is illegal to sleep in an SRCU callback
function.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The old srcu_barrier() macro is now unused. This commit removes it so
that it may be used for the SRCU flavor of rcu_barrier(), which will in
turn be needed to allow the upcoming call_srcu() to be used from within
modules.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit implements a variant of Peter's algorithm, which may be found
at https://lkml.org/lkml/2012/2/1/119.
o Make the checking lock-free to enable parallel checking.
Parallel checking is required when (1) the original checking
task is preempted for a long time, (2) sychronize_srcu_expedited()
starts during an ongoing SRCU grace period, or (3) we wish to
avoid acquiring a lock.
o Since the checking is lock-free, we avoid a mutex in state machine
for call_srcu().
o Remove the SRCU_REF_MASK and remove the coupling with the flipping.
This might allow us to remove the preempt_disable() in future
versions, though such removal will need great care because it
rescinds the one-old-reader-per-CPU guarantee.
o Remove a smp_mb(), simplify the comments and make the smp_mb() pairs
more intuitive.
Inspired-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The purpose of the upper bit of SRCU's per-CPU counters is to guarantee
that no reasonable series of srcu_read_lock() and srcu_read_unlock()
operations can return the value of the counter to its original value.
This guarantee is require only after the index has been switched to
the other set of counters, so at most one srcu_read_lock() can affect
a given CPU's counter. The number of srcu_read_unlock() operations
on a given counter is limited to the number of tasks in the system,
which given the Linux kernel's current structure is limited to far less
than 2^30 on 32-bit systems and far less than 2^62 on 64-bit systems.
(Something about a limited number of bytes in the kernel's address space.)
Therefore, if srcu_read_lock() increments the upper bits, then
srcu_read_unlock() need not do so. In this case, an srcu_read_lock() and
an srcu_read_unlock() will flip the lower bit of the upper field of the
counter. An unreasonably large additional number of srcu_read_unlock()
operations would be required to return the counter to its initial value,
thus preserving the guarantee.
This commit takes this approach, which further allows it to shrink
the size of the upper field to one bit, making the number of
srcu_read_unlock() operations required to return the counter to its
initial value even more unreasonable than before.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current implementation of synchronize_srcu_expedited() can cause
severe OS jitter due to its use of synchronize_sched(), which in turn
invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
This can result in severe performance degradation for real-time workloads
and especially for short-interation-length HPC workloads. Furthermore,
because only one instance of try_stop_cpus() can be making forward progress
at a given time, only one instance of synchronize_srcu_expedited() can
make forward progress at a time, even if they are all operating on
distinct srcu_struct structures.
This commit, inspired by an earlier implementation by Peter Zijlstra
(https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
takes a strictly algorithmic bits-in-memory approach. This has the
disadvantage of requiring one explicit memory-barrier instruction in
each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
completely dispenses with OS jitter and furthermore allows SRCU to be
used freely by CPUs that RCU believes to be idle or offline.
The update-side implementation handles the single read-side memory
barrier by rechecking the per-CPU counters after summing them and
by running through the update-side state machine twice.
This implementation has passed moderate rcutorture testing on both
x86 and Power. Also updated to use this_cpu_ptr() instead of per_cpu_ptr(),
as suggested by Peter Zijlstra.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>