Pull perf fixes from Ingo Molnar:
"An ABI documentation fix, and a mixed-PMU perf-info-corruption fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Document the new transaction sample type
perf: Disable all pmus on unthrottling and rescheduling
Merge patches from Andrew Morton:
"23 fixes and a MAINTAINERS update"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits)
mm/hugetlb: check for pte NULL pointer in __page_check_address()
fix build with make 3.80
mm/mempolicy: fix !vma in new_vma_page()
MAINTAINERS: add Davidlohr as GPT maintainer
mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
mm/compaction: respect ignore_skip_hint in update_pageblock_skip
mm/mempolicy: correct putback method for isolate pages if failed
mm: add missing dependency in Kconfig
sh: always link in helper functions extracted from libgcc
mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
mm: numa: defer TLB flush for THP migration as long as possible
mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates
mm: fix TLB flush race between migration, and change_protection_range
mm: numa: avoid unnecessary disruption of NUMA hinting during migration
mm: numa: clear numa hinting information on mprotect
sched: numa: skip inaccessible VMAs
mm: numa: avoid unnecessary work on the failure path
mm: numa: ensure anon_vma is locked to prevent parallel THP splits
mm: numa: do not clear PTE for pte_numa update
mm: numa: do not clear PMD during PTE update scan
...
There are a few subtle races between change_protection_range() (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE is
made non-present (PROT_NONE or NUMA) and when the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time; however, compaction or the NUMA migration
code may come in and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible().
All other architectures appear to be safe, and will either always flush
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A                       CPU B                       CPU C

                                                        load TLB entry
make entry PTE/PMD_NUMA
                            fault on entry
                                                        read/write old page
                            start migrating page
                            change PTE/PMD to new page
                                                        read/write old page [*]
flush TLB
                                                        reload TLB from new entry
                                                        read/write new page
                            lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making pte_accessible()
aware that PROT_NONE and PROT_NUMA memory may still be accessible while
a TLB flush is pending for the mm.
This should fix both NUMA migration and compaction.
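A minimal sketch of that idea on x86 (assuming the pending-flush helper is
called mm_tlb_flush_pending() and the usual _PAGE_* flag names; details may
differ from the final patch):

    /* arch/x86/include/asm/pgtable.h (sketch) */
    static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
    {
            if (pte_flags(a) & _PAGE_PRESENT)
                    return true;

            /* A PROT_NONE/NUMA pte may still be cached in a remote
             * TLB while a flush for this mm is pending, so treat it
             * as accessible and flush it before migrating the page. */
            if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
                mm_tlb_flush_pending(mm))
                    return true;

            return false;
    }

change_protection_range() then sets the pending flag before it starts
clearing PTEs and clears it after the TLB flush, so migration can see the
window.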
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1b3a5d02ee ("reboot: move arch/x86 reboot= handling to generic
kernel") moved reboot= handling to generic code. In the process it also
removed the code in native_machine_shutdown() which moved the reboot
process to reboot_cpu/cpu0.
I guess the thought must have been that all reboot paths call
migrate_to_reboot_cpu(), so we don't need this special handling. But the
kexec reboot path (kernel_kexec()) does not call migrate_to_reboot_cpu(),
so the above change broke kexec. Now the reboot can happen on a non-boot
CPU, and when INIT is sent in the second kernel to bring up the BP, it
brings down the machine.
So start calling migrate_to_reboot_cpu() in the kexec reboot path to avoid
this problem.
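A sketch of the resulting kernel_kexec() reboot branch, trimmed to the
relevant calls (surrounding locking and the crash path are omitted):

    /* kernel/kexec.c: kernel_kexec(), normal (non-crash) reboot path */
    kernel_restart_prepare(NULL);
    migrate_to_reboot_cpu();        /* move to reboot_cpu/cpu0 first */
    printk(KERN_EMERG "Starting new kernel\n");
    machine_shutdown();
    machine_kexec(kexec_image);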
Bisected by WANG Chao.
Reported-by: Matthew Whitehead <mwhitehe@redhat.com>
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Tested-by: WANG Chao <chaowang@redhat.com>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull crypto key patches from David Howells:
"There are four items:
- A patch to fix X.509 certificate gathering. The problem was that I
was coming up with a different path for signing_key.x509 in the
build directory depending on whether or not it already existed. This
meant that the X.509 cert container object file would be rebuilt on
the second rebuild in a build directory and the kernel would get
relinked.
- Unconditionally remove files generated by SYSTEM_TRUSTED_KEYRING=y
when doing make mrproper.
- Actually initialise the persistent-keyring semaphore for
init_user_ns. I have no idea why this works at all for users in
the base user namespace unless it's something to do with systemd
containerising the system.
- Documentation for module signing"
* 'keys-devel' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
Add Documentation/module-signing.txt file
KEYS: fix uninitialized persistent_keyring_register_sem
KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y
X.509: Fix certificate gathering
Pull scheduler fixes from Ingo Molnar:
"Three fixes for scheduler crashes, each triggers in relatively rare,
hardware environment dependent situations"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Rework sched_fair time accounting
math64: Add mul_u64_u32_shr()
sched: Remove PREEMPT_NEED_RESCHED from generic code
sched: Initialize power_orig for overlapping groups
When mutex debugging is enabled and an imbalanced mutex_unlock()
is called, we get the following, slightly confusing warning:
[ 364.208284] DEBUG_LOCKS_WARN_ON(lock->owner != current)
But in that case the warning is due to an imbalanced mutex_unlock() call,
and the lock->owner is NULL - so the message is misleading.
So improve the message by testing for this case specifically:
DEBUG_LOCKS_WARN_ON(!lock->owner)
Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1386136693.3650.48.camel@cliu38-desktop-build
[ Improved the changelog, changed the patch to use !lock->owner consistently. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince Weaver reports that, on all architectures apart from ARM,
PERF_EVENT_IOC_PERIOD doesn't actually update the period until the next
event fires. This is counter-intuitive behaviour and is better dealt
with in the core code.
This patch ensures that the period is forcefully reset when dealing with
such a request in the core code. A subsequent patch removes the
equivalent hack from the ARM back-end.
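The forced reset looks roughly like this in the core ioctl path (a sketch;
it assumes the usual pmu->stop()/pmu->start() callbacks and that the new
value has already been written to event->attr / event->hw):

    /* kernel/events/core.c: perf_event_period() (sketch) */
    int active = (event->state == PERF_EVENT_STATE_ACTIVE);

    if (active) {
            perf_pmu_disable(event->pmu);
            event->pmu->stop(event, PERF_EF_UPDATE);
    }

    local64_set(&event->hw.period_left, 0);  /* force the new period */

    if (active) {
            event->pmu->start(event, PERF_EF_RELOAD);
            perf_pmu_enable(event->pmu);
    }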
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1385560479-11014-1-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch touches the RT group scheduling case.
Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global) rq's
priority, while the rt_rq passed to them may not be the top-level rt_rq.
This is wrong, because changing the priority on a child level does not
guarantee that the priority is the highest over the whole rq. So this
leak makes RT balancing unusable.
A short example: the task with the highest priority among all of the rq's
RT tasks (no other task has the same priority) wakes up on a throttled
rt_rq. The rq's cpupri is set to the task's priority equivalent, but the
real rq->rt.highest_prio.curr is lower.
The patch below fixes the problem.
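Sketched against kernel/sched/rt.c, the idea is to update cpupri only from
the top-level rt_rq; dec_rt_prio_smp() gets the symmetric check:

    static void
    inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
    {
            struct rq *rq = rq_of_rt_rq(rt_rq);

    #ifdef CONFIG_RT_GROUP_SCHED
            /* Change rq's cpupri only if rt_rq is the top queue. */
            if (&rq->rt != rt_rq)
                    return;
    #endif
            if (rq->online && prio < prev_prio)
                    cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
    }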
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 42eb088e ("sched: Avoid NULL dereference on sd_busy") corrected a
NULL dereference on sd_busy, but the fix also altered which scheduling
domain it used for the 'sd_llc' percpu variable.
One impact of this is that a task selecting a runqueue may consider
idle CPUs that are not cache siblings as candidates for running.
Tasks are then running on CPUs that are not cache hot.
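The bug and the fix both live in update_top_cache_domain(); a simplified
sketch (the busy_sd temporary is illustrative):

    /* kernel/sched/core.c: update_top_cache_domain() (sketch) */
    struct sched_domain *sd, *busy_sd = NULL;
    int id = cpu, size = 1;

    sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
    if (sd) {
            id = cpumask_first(sched_domain_span(sd));
            size = cpumask_weight(sched_domain_span(sd));
            busy_sd = sd->parent;   /* sd_busy */
    }
    rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);

    /* sd_llc must keep pointing at the cache-sibling domain;
     * the earlier fix left it pointing at sd->parent instead. */
    rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);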
This was found through bisection: ebizzy threads were not seeing equal
performance and it looked like a scheduling fairness issue. This patch
mitigates but does not completely fix the problem on all machines tested,
implying there may be an additional bug or a common root cause. Here is
the average range of performance seen by individual ebizzy threads. It
was tested on top of candidate patches related to x86 TLB range flushing.
4-core machine

                     3.13.0-rc3            3.13.0-rc3
                        vanilla            fixsd-v3r3
Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
Mean   2        0.34 (  0.00%)        0.10 ( 70.59%)
Mean   3        1.29 (  0.00%)        0.93 ( 27.91%)
Mean   4        7.08 (  0.00%)        0.77 ( 89.12%)
Mean   5      193.54 (  0.00%)        2.14 ( 98.89%)
Mean   6      151.12 (  0.00%)        2.06 ( 98.64%)
Mean   7      115.38 (  0.00%)        2.04 ( 98.23%)
Mean   8      108.65 (  0.00%)        1.92 ( 98.23%)

8-core machine

                     3.13.0-rc3            3.13.0-rc3
                        vanilla            fixsd-v3r3
Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
Mean   2        0.40 (  0.00%)        0.21 ( 47.50%)
Mean   3       23.73 (  0.00%)        0.89 ( 96.25%)
Mean   4       12.79 (  0.00%)        1.04 ( 91.87%)
Mean   5       13.08 (  0.00%)        2.42 ( 81.50%)
Mean   6       23.21 (  0.00%)       69.46 (-199.27%)
Mean   7       15.85 (  0.00%)      101.72 (-541.77%)
Mean   8      109.37 (  0.00%)       19.13 ( 82.51%)
Mean  12      124.84 (  0.00%)       28.62 ( 77.07%)
Mean  16      113.50 (  0.00%)       24.16 ( 78.71%)
The problem is eliminated on one machine and reduced on the other.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: H Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Hugh reported this bug:
> CONFIG_MEMCG_SWAP is broken in 3.13-rc. Try something like this:
>
> mkdir -p /tmp/tmpfs /tmp/memcg
> mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> mount -t cgroup -o memory memcg /tmp/memcg
> mkdir /tmp/memcg/old
> echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> echo $$ >/tmp/memcg/old/tasks
> cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> echo $$ >/tmp/memcg/tasks
> rmdir /tmp/memcg/old
> sleep 1 # let rmdir work complete
> mkdir /tmp/memcg/new
> umount /tmp/tmpfs
> dmesg | grep WARNING
> rmdir /tmp/memcg/new
> umount /tmp/memcg
>
> Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
> res_counter_uncharge_locked+0x1f/0x2f()
>
> Breakage comes from 34c00c319c ("memcg: convert to use cgroup id").
>
> The lifetime of a cgroup id is different from the lifetime of the
> css id it replaced: memsw's css_get()s do nothing to hold on to the
> old cgroup id, it soon gets recycled to a new cgroup, which then
> mysteriously inherits the old's swap, without any charge for it.
Instead of removing the cgroup id right after all the csses have been
offlined, we should do that after the csses have been destroyed.
To make sure an invalid css pointer won't be returned after the css
is destroyed, make sure css_from_id() returns NULL in this case.
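A sketch of the css_from_id() side (simplified against the 3.13-era
cgroup_idr lookup; the locking comment stands in for the actual lockdep
assertion):

    struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
    {
            struct cgroup *cgrp;

            /* Caller must hold rcu_read_lock() or cgroup_mutex.
             * cgrp->id is now removed only at cgroup destruction, so
             * a freed id can no longer resolve to a stale css. */
            cgrp = idr_find(&ss->root->cgroup_idr, id);
            if (cgrp)
                    return cgroup_css(cgrp, ss);
            return NULL;
    }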
tj: Updated comment to note planned changes for cgrp->id.
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Ftrace currently initializes the function profiler only for the online
CPUs. This implementation has two problems:
- If we online a CPU after we enable the function profile, and then run the
test, we will lose the trace information on that CPU.
Steps to reproduce:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# echo 1 > /sys/devices/system/cpu/cpu1/online
# run test
- If we offline a CPU before we enable the function profile, we will not
clear the trace information when we enable the function profile. This
will confuse the users.
Steps to reproduce:
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# run test
# cat trace_stat/function*
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > function_profile_enabled
# echo 1 > function_profile_enabled
# cat trace_stat/function*
# run test
# cat trace_stat/function*
So it is better to initialize the ftrace profiler for each possible CPU
every time we enable the function profile, instead of just for the online
ones.
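That amounts to switching the iteration in ftrace_profile_init() (a sketch
of the fixed function):

    /* kernel/trace/ftrace.c (sketch) */
    int ftrace_profile_init(void)
    {
            int cpu;
            int ret = 0;

            /* Cover offline CPUs too, so a later online does not
             * expose missing or stale per-cpu profile data. */
            for_each_possible_cpu(cpu) {
                    ret = ftrace_profile_init_cpu(cpu);
                    if (ret)
                            break;
            }

            return ret;
    }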
Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com
Cc: stable@vger.kernel.org # 2.6.31+
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull PCI updates from Bjorn Helgaas:
"PCI device hotplug
- Move device_del() from pci_stop_dev() to pci_destroy_dev() (Rafael
Wysocki)
Host bridge drivers
- Update maintainers for DesignWare, i.MX6, Armada, R-Car (Bjorn
Helgaas)
- mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin
(Jason Gunthorpe)
Miscellaneous
- Avoid unnecessary CPU switch when calling .probe() (Alexander
Duyck)
- Revert "workqueue: allow work_on_cpu() to be called recursively"
(Bjorn Helgaas)
- Disable Bus Master only on kexec reboot (Khalid Aziz)
- Omit PCI ID macro strings to shorten quirk names for LTO (Michal
Marek)"
* tag 'pci-v3.13-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
MAINTAINERS: Add DesignWare, i.MX6, Armada, R-Car PCI host maintainers
PCI: Disable Bus Master only on kexec reboot
PCI: mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin
PCI: Omit PCI ID macro strings to shorten quirk names
PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev()
Revert "workqueue: allow work_on_cpu() to be called recursively"
PCI: Avoid unnecessary CPU switch when calling driver .probe() method