Pull core locking updates from Ingo Molnar:
"The main updates in this cycle were:
- mutex MCS refactoring finishing touches: improve comments, refactor
and clean up code, reduce debug data structure footprint, etc.
- qrwlock finishing touches: remove old code, self-test updates.
- small rwsem optimization
- various smaller fixes/cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Revert qrwlock recusive stuff
locking/rwsem: Avoid double checking before try acquiring write lock
locking/rwsem: Move EXPORT_SYMBOL() lines to follow function definition
locking/rwlock, x86: Delete unused asm/rwlock.h and rwlock.S
locking/rwlock, x86: Clean up asm/spinlock*.h to remove old rwlock code
locking/semaphore: Resolve some shadow warnings
locking/selftest: Support queued rwlock
locking/lockdep: Restrict the use of recursive read_lock() with qrwlock
locking/spinlocks: Always evaluate the second argument of spin_lock_nested()
locking/Documentation: Update locking/mutex-design.txt disadvantages
locking/Documentation: Move locking related docs into Documentation/locking/
locking/mutexes: Use MUTEX_SPIN_ON_OWNER when appropriate
locking/mutexes: Refactor optimistic spinning code
locking/mcs: Remove obsolete comment
locking/mutexes: Document quick lock release when unlocking
locking/mutexes: Standardize arguments in lock/unlock slowpaths
locking: Remove deprecated smp_mb__() barriers
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- fix an NFSv4.1 state renewal regression
- fix open/lock state recovery error handling
- fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
- fix statd when reconnection fails
- don't wake tasks during connection abort
- don't start reboot recovery if lease check fails
- fix duplicate proc entries
Features:
- pNFS block driver fixes and clean ups from Christoph
- More code cleanups from Anna
- Improve mmap() writeback performance
- Replace use of PF_TRANS with a more generic mechanism for avoiding
deadlocks in nfs_release_page"
* tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits)
NFSv4.1: Fix an NFSv4.1 state renewal regression
NFSv4: fix open/lock state recovery error handling
NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
NFS: Fabricate fscache server index key correctly
SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
NFSv3: Fix missing includes of nfs3_fs.h
NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
NFS: avoid waiting at all in nfs_release_page when congested.
NFS: avoid deadlocks with loop-back mounted NFS filesystems.
MM: export page_wakeup functions
SCHED: add some "wait..on_bit...timeout()" interfaces.
NFS: don't use STABLE writes during writeback.
NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
rpc: Add -EPERM processing for xs_udp_send_request()
rpc: return sent and err from xs_sendpages()
lockd: Try to reconnect if statd has moved
SUNRPC: Don't wake tasks during connection abort
Fixing lease renewal
nfs: fix duplicate proc entries
pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
...
In commit c1221321b7
sched: Allow wait_on_bit_action() functions to support a timeout
I suggested that a "wait_on_bit_timeout()" interface would not meet my
need. This isn't true - I was just over-engineering.
Including a 'private' field in wait_bit_key instead of a focused
"timeout" field was just premature generalization. If some other
use is ever found, it can be generalized or added later.
So this patch renames "private" to "timeout" with a meaning "stop
waiting when "jiffies" reaches or passes "timeout",
and adds two of the many possible wait..bit..timeout() interfaces:
wait_on_page_bit_killable_timeout(), which is the one I want to use,
and out_of_line_wait_on_bit_timeout() which is a reasonably general
example. Others can be added as needed.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This function will help an async task processing batched jobs from
workqueue decide if it wants to keep processing on more chunks of batched
work that can be delayed, or to accumulate more work for more efficient
batched processing later.
If no other tasks are running on the cpu, the batching process can take
advantgae of the available cpu cycles to a make decision to continue
processing the existing accumulated work to minimize delay,
otherwise it will yield.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* pm-sleep:
PM / hibernate: avoid unsafe pages in e820 reserved regions
* pm-cpufreq:
cpufreq: arm_big_little: fix module license spec
cpufreq: speedstep-smi: fix decimal printf specifiers
cpufreq: OPP: Avoid sleeping while atomic
cpufreq: cpu0: Do not print error message when deferring
cpufreq: integrator: Use set_cpus_allowed_ptr
* pm-cpuidle:
cpuidle: menu: Lookup CPU runqueues less
cpuidle: menu: Call nr_iowait_cpu less times
cpuidle: menu: Use ktime_to_us instead of reinventing the wheel
cpuidle: menu: Use shifts when calculating averages where possible
Pull ACPI and power management updates from Rafael Wysocki:
"Again, ACPICA leads the pack (47 commits), followed by cpufreq (18
commits) and system suspend/hibernation (9 commits).
From the new code perspective, the ACPICA update brings ACPI 5.1 to
the table, including a new device configuration object called _DSD
(Device Specific Data) that will hopefully help us to operate device
properties like Device Trees do (at least to some extent) and changes
related to supporting ACPI on ARM.
Apart from that we have hibernation changes making it use radix trees
to store memory bitmaps which should speed up some operations carried
out by it quite significantly. We also have some power management
changes related to suspend-to-idle (the "freeze" sleep state) support
and more preliminary changes needed to support ACPI on ARM (outside of
ACPICA).
The rest is fixes and cleanups pretty much everywhere.
Specifics:
- ACPICA update to upstream version 20140724. That includes ACPI 5.1
material (support for the _CCA and _DSD predefined names, changes
related to the DMAR and PCCT tables and ARM support among other
things) and cleanups related to using ACPICA's header files. A
major part of it is related to acpidump and the core code used by
that utility. Changes from Bob Moore, David E Box, Lv Zheng,
Sascha Wildner, Tomasz Nowicki, Hanjun Guo.
- Radix trees for memory bitmaps used by the hibernation core from
Joerg Roedel.
- Support for waking up the system from suspend-to-idle (also known
as the "freeze" sleep state) using ACPI-based PCI wakeup signaling
(Rafael J Wysocki).
- Fixes for issues related to ACPI button events (Rafael J Wysocki).
- New device ID for an ACPI-enumerated device included into the
Wildcat Point PCH from Jie Yang.
- ACPI video updates related to backlight handling from Hans de Goede
and Linus Torvalds.
- Preliminary changes needed to support ACPI on ARM from Hanjun Guo
and Graeme Gregory.
- ACPI PNP core cleanups from Arjun Sreedharan and Zhang Rui.
- Cleanups related to ACPI_COMPANION() and ACPI_HANDLE() macros
(Rafael J Wysocki).
- ACPI-based device hotplug cleanups from Wei Yongjun and Rafael J
Wysocki.
- Cleanups and improvements related to system suspend from Lan
Tianyu, Randy Dunlap and Rafael J Wysocki.
- ACPI battery cleanup from Wei Yongjun.
- cpufreq core fixes from Viresh Kumar.
- Elimination of a deadband effect from the cpufreq ondemand governor
and intel_pstate driver cleanups from Stratos Karafotis.
- 350MHz CPU support for the powernow-k6 cpufreq driver from Mikulas
Patocka.
- Fix for the imx6 cpufreq driver from Anson Huang.
- cpuidle core and governor cleanups from Daniel Lezcano, Sandeep
Tripathy and Mohammad Merajul Islam Molla.
- Build fix for the big_little cpuidle driver from Sachin Kamat.
- Configuration fix for the Operation Performance Points (OPP)
framework from Mark Brown.
- APM cleanup from Jean Delvare.
- cpupower utility fixes and cleanups from Peter Senna Tschudin,
Andrey Utkin, Himangi Saraogi, Rickard Strandqvist, Thomas
Renninger"
* tag 'pm+acpi-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (118 commits)
ACPI / LPSS: add LPSS device for Wildcat Point PCH
ACPI / PNP: Replace faulty is_hex_digit() by isxdigit()
ACPICA: Update version to 20140724.
ACPICA: ACPI 5.1: Update for PCCT table changes.
ACPICA/ARM: ACPI 5.1: Update for GTDT table changes.
ACPICA/ARM: ACPI 5.1: Update for MADT changes.
ACPICA/ARM: ACPI 5.1: Update for FADT changes.
ACPICA: ACPI 5.1: Support for the _CCA predifined name.
ACPICA: ACPI 5.1: New notify value for System Affinity Update.
ACPICA: ACPI 5.1: Support for the _DSD predefined name.
ACPICA: Debug object: Add current value of Timer() to debug line prefix.
ACPICA: acpihelp: Add UUID support, restructure some existing files.
ACPICA: Utilities: Fix local printf issue.
ACPICA: Tables: Update for DMAR table changes.
ACPICA: Remove some extraneous printf arguments.
ACPICA: Update for comments/formatting. No functional changes.
ACPICA: Disassembler: Add support for the ToUUID opererator (macro).
ACPICA: Remove a redundant cast to acpi_size for ACPI_OFFSET() macro.
ACPICA: Work around an ancient GCC bug.
ACPI / processor: Make it possible to get local x2apic id via _MAT
...
The menu governer makes separate lookups of the CPU runqueue to get
load and number of IO waiters but it can be done with a single lookup.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpuidle:
cpuidle: Remove time measurement in poll state
cpuidle: Remove manual selection of the multiple driver support
cpuidle: ladder governor - use macro instead of hardcoded value
cpuidle: big_little: Fix build error
cpuidle: menu governor - remove unused macro STDDEV_THRESH
cpuidle: fix permission for driver name sysfs node
cpuidle: move idle traces to cpuidle_enter_state()
Pull scheduler updates from Ingo Molnar:
- Move the nohz kick code out of the scheduler tick to a dedicated IPI,
from Frederic Weisbecker.
This necessiated quite some background infrastructure rework,
including:
* Clean up some irq-work internals
* Implement remote irq-work
* Implement nohz kick on top of remote irq-work
* Move full dynticks timer enqueue notification to new kick
* Move multi-task notification to new kick
* Remove unecessary barriers on multi-task notification
- Remove proliferation of wait_on_bit() action functions and allow
wait_on_bit_action() functions to support a timeout. (Neil Brown)
- Another round of sched/numa improvements, cleanups and fixes. (Rik
van Riel)
- Implement fast idling of CPUs when the system is partially loaded,
for better scalability. (Tim Chen)
- Restructure and fix the CPU hotplug handling code that may leave
cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
cpu. (Kirill Tkhai)
- Robustify the sched topology setup code. (Peterz Zijlstra)
- Improve sched_feat() handling wrt. static_keys (Jason Baron)
- Misc fixes.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
sched/fair: Fix 'make xmldocs' warning caused by missing description
sched: Use macro for magic number of -1 for setparam
sched: Robustify topology setup
sched: Fix sched_setparam() policy == -1 logic
sched: Allow wait_on_bit_action() functions to support a timeout
sched: Remove proliferation of wait_on_bit() action functions
sched/numa: Revert "Use effective_load() to balance NUMA loads"
sched: Fix static_key race with sched_feat()
sched: Remove extra static_key*() function indirection
sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
sched: Transform resched_task() into resched_curr()
sched/deadline: Kill task_struct->pi_top_task
sched: Rework check_for_tasks()
sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
sched/fair: Disable runtime_enabled on dying rq
sched/numa: Change scan period code to match intent
sched/numa: Rework best node setting in task_numa_migrate()
sched/numa: Examine a task move when examining a task swap
sched/numa: Simplify task_numa_compare()
sched/numa: Use effective_load() to balance NUMA loads
...
Pull cgroup changes from Tejun Heo:
"Mostly changes to get the v2 interface ready. The core features are
mostly ready now and I think it's reasonable to expect to drop the
devel mask in one or two devel cycles at least for a subset of
controllers.
- cgroup added a controller dependency mechanism so that block cgroup
can depend on memory cgroup. This will be used to finally support
IO provisioning on the writeback traffic, which is currently being
implemented.
- The v2 interface now uses a separate table so that the interface
files for the new interface are explicitly declared in one place.
Each controller will explicitly review and add the files for the
new interface.
- cpuset is getting ready for the hierarchical behavior which is in
the similar style with other controllers so that an ancestor's
configuration change doesn't change the descendants' configurations
irreversibly and processes aren't silently migrated when a CPU or
node goes down.
All the changes are to the new interface and no behavior changed for
the multiple hierarchies"
* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
cpuset: fix the WARN_ON() in update_nodemasks_hier()
cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
cgroup: distinguish the default and legacy hierarchies when handling cftypes
cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
cpuset: export effective masks to userspace
cpuset: allow writing offlined masks to cpuset.cpus/mems
cpuset: enable onlined cpu/node in effective masks
cpuset: refactor cpuset_hotplug_update_tasks()
cpuset: make cs->{cpus, mems}_allowed as user-configured masks
cpuset: apply cs->effective_{cpus,mems}
cpuset: initialize top_cpuset's configured masks at mount
cpuset: use effective cpumask to build sched domains
cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
cpuset: update cs->effective_{cpus, mems} when config changes
cpuset: update cpuset->effective_{cpus,mems} at hotplug
cpuset: add cs->effective_cpus and cs->effective_mems
cgroup: clean up sane_behavior handling
...
The scheduler uses policy == -1 to preserve the current policy state to
implement sched_setparam(). But, as (int) -1 is equals to 0xffffffff,
it's matching the if (policy & SCHED_RESET_ON_FORK) on
_sched_setscheduler(). This match changes the policy value to an
invalid value, breaking the sched_setparam() syscall.
This patch checks policy == -1 before check the SCHED_RESET_ON_FORK flag.
The following program shows the bug:
int main(void)
{
struct sched_param param = {
.sched_priority = 5,
};
sched_setscheduler(0, SCHED_FIFO, ¶m);
param.sched_priority = 1;
sched_setparam(0, ¶m);
param.sched_priority = 0;
sched_getparam(0, ¶m);
if (param.sched_priority != 1)
printf("failed priority setting (found %d instead of 1)\n",
param.sched_priority);
else
printf("priority setting fine\n");
}
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org> # 3.14+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Fixes: 7479f3c9cf "sched: Move SCHED_RESET_ON_FORK into attr::sched_flags"
Link: http://lkml.kernel.org/r/9ebe0566a08dbbb3999759d3f20d6004bb2dbcfa.1406079891.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fix from Thomas Gleixner:
"Prevent a possible divide by zero in the debugging code"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix possible divide by zero in avg_atom() calculation
It is currently not possible for various wait_on_bit functions
to implement a timeout.
While the "action" function that is called to do the waiting
could certainly use schedule_timeout(), there is no way to carry
forward the remaining timeout after a false wake-up.
As false-wakeups a clearly possible at least due to possible
hash collisions in bit_waitqueue(), this is a real problem.
The 'action' function is currently passed a pointer to the word
containing the bit being waited on. No current action functions
use this pointer. So changing it to something else will be a
little noisy but will have no immediate effect.
This patch changes the 'action' function to take a pointer to
the "struct wait_bit_key", which contains a pointer to the word
containing the bit so nothing is really lost.
It also adds a 'private' field to "struct wait_bit_key", which
is initialized to zero.
An action function can now implement a timeout with something
like
static int timed_out_waiter(struct wait_bit_key *key)
{
unsigned long waited;
if (key->private == 0) {
key->private = jiffies;
if (key->private == 0)
key->private -= 1;
}
waited = jiffies - key->private;
if (waited > 10 * HZ)
return -EAGAIN;
schedule_timeout(waited - 10 * HZ);
return 0;
}
If any other need for context in a waiter were found it would be
easy to use ->private for some other purpose, or even extend
"struct wait_bit_key".
My particular need is to support timeouts in nfs_release_page()
to avoid deadlocks with loopback mounted NFS.
While wait_on_bit_timeout() would be a cleaner interface, it
will not meet my need. I need the timeout to be sensitive to
the state of the connection with the server, which could change.
So I need to use an 'action' interface.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051604.28027.41257.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We always use resched_task() with rq->curr argument.
It's not possible to reschedule any task but rq's current.
The patch introduces resched_curr(struct rq *) to
replace all of the repeating patterns. The main aim
is cleanup, but there is a little size profit too:
(before)
$ size kernel/sched/built-in.o
text data bss dec hex filename
155274 16445 7042 178761 2ba49 kernel/sched/built-in.o
$ size vmlinux
text data bss dec hex filename
7411490 1178376 991232 9581098 92322a vmlinux
(after)
$ size kernel/sched/built-in.o
text data bss dec hex filename
155130 16445 7042 178617 2b9b9 kernel/sched/built-in.o
$ size vmlinux
text data bss dec hex filename
7411362 1178376 991232 9580970 9231aa vmlinux
I was choosing between resched_curr() and resched_rq(),
and the first name looks better for me.
A little lie in Documentation/trace/ftrace.txt. I have not
actually collected the tracing again. With a hope the patch
won't make execution times much worse :)
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>