Merge more updates from Andrew Morton:
"55 patches.
Subsystems affected by this patch series: percpu, procfs, sysctl,
misc, core-kernel, get_maintainer, lib, checkpatch, binfmt, nilfs2,
hfs, fat, adfs, panic, delayacct, kconfig, kcov, and ubsan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (55 commits)
lib: remove redundant assignment to variable ret
ubsan: remove CONFIG_UBSAN_OBJECT_SIZE
kcov: fix generic Kconfig dependencies if ARCH_WANTS_NO_INSTR
lib/Kconfig.debug: make TEST_KMOD depend on PAGE_SIZE_LESS_THAN_256KB
btrfs: use generic Kconfig option for 256kB page size limit
arch/Kconfig: split PAGE_SIZE_LESS_THAN_256KB from PAGE_SIZE_LESS_THAN_64KB
configs: introduce debug.config for CI-like setup
delayacct: track delays from memory compact
Documentation/accounting/delay-accounting.rst: add thrashing page cache and direct compact
delayacct: cleanup flags in struct task_delay_info and functions use it
delayacct: fix incomplete disable operation when switch enable to disable
delayacct: support swapin delay accounting for swapping without blkio
panic: remove oops_id
panic: use error_report_end tracepoint on warnings
fs/adfs: remove unneeded variable make code cleaner
FAT: use io_schedule_timeout() instead of congestion_wait()
hfsplus: use struct_group_attr() for memcpy() region
nilfs2: remove redundant pointer sbufs
fs/binfmt_elf: use PT_LOAD p_align values for static PIE
const_structs.checkpatch: add frequently used ops structs
...
When I was implementing a new per-cpu kthread cfs_migration, I found the
comm of it "cfs_migration/%u" is truncated due to the limitation of
TASK_COMM_LEN. For example, the comm of the percpu thread on CPU10~19
all have the same name "cfs_migration/1", which will confuse the user.
This issue is not critical, because we can get the corresponding CPU
from the task's Cpus_allowed. But for kthreads corresponding to other
hardware devices, it is not easy to get the detailed device info from
task comm, for example,
jbd2/nvme0n1p2-
xfs-reclaim/sdf
Currently there are so many truncated kthreads:
rcu_tasks_kthre
rcu_tasks_rude_
rcu_tasks_trace
poll_mpt3sas0_s
ext4-rsv-conver
xfs-reclaim/sd{a, b, c, ...}
xfs-blockgc/sd{a, b, c, ...}
xfs-inodegc/sd{a, b, c, ...}
audit_send_repl
ecryptfs-kthrea
vfio-irqfd-clea
jbd2/nvme0n1p2-
...
We can shorten these names to work around this problem, but it may be
not applied to all of the truncated kthreads. Take 'jbd2/nvme0n1p2-'
for example, it is a nice name, and it is not a good idea to shorten it.
One possible way to fix this issue is extending the task comm size, but
as task->comm is used in lots of places, that may cause some potential
buffer overflows. Another more conservative approach is introducing a
new pointer to store kthread's full name if it is truncated, which won't
introduce too much overhead as it is in the non-critical path. Finally
we make a dicision to use the second approach. See also the discussions
in this thread:
https://lore.kernel.org/lkml/20211101060419.4682-1-laoar.shao@gmail.com/
After this change, the full name of these truncated kthreads will be
displayed via /proc/[pid]/comm:
rcu_tasks_kthread
rcu_tasks_rude_kthread
rcu_tasks_trace_kthread
poll_mpt3sas0_statu
ext4-rsv-conversion
xfs-reclaim/sdf1
xfs-blockgc/sdf1
xfs-inodegc/sdf1
audit_send_reply
ecryptfs-kthread
vfio-irqfd-cleanup
jbd2/nvme0n1p2-8
Link: https://lkml.kernel.org/r/20211120112850.46047-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull signal/exit/ptrace updates from Eric Biederman:
"This set of changes deletes some dead code, makes a lot of cleanups
which hopefully make the code easier to follow, and fixes bugs found
along the way.
The end-game which I have not yet reached yet is for fatal signals
that generate coredumps to be short-circuit deliverable from
complete_signal, for force_siginfo_to_task not to require changing
userspace configured signal delivery state, and for the ptrace stops
to always happen in locations where we can guarantee on all
architectures that the all of the registers are saved and available on
the stack.
Removal of profile_task_ext, profile_munmap, and profile_handoff_task
are the big successes for dead code removal this round.
A bunch of small bug fixes are included, as most of the issues
reported were small enough that they would not affect bisection so I
simply added the fixes and did not fold the fixes into the changes
they were fixing.
There was a bug that broke coredumps piped to systemd-coredump. I
dropped the change that caused that bug and replaced it entirely with
something much more restrained. Unfortunately that required some
rebasing.
Some successes after this set of changes: There are few enough calls
to do_exit to audit in a reasonable amount of time. The lifetime of
struct kthread now matches the lifetime of struct task, and the
pointer to struct kthread is no longer stored in set_child_tid. The
flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is
removed. Issues where task->exit_code was examined with
signal->group_exit_code should been examined were fixed.
There are several loosely related changes included because I am
cleaning up and if I don't include them they will probably get lost.
The original postings of these changes can be found at:
https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.org
https://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.org
https://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org
I trimmed back the last set of changes to only the obviously correct
once. Simply because there was less time for review than I had hoped"
* 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits)
ptrace/m68k: Stop open coding ptrace_report_syscall
ptrace: Remove unused regs argument from ptrace_report_syscall
ptrace: Remove second setting of PT_SEIZED in ptrace_attach
taskstats: Cleanup the use of task->exit_code
exit: Use the correct exit_code in /proc/<pid>/stat
exit: Fix the exit_code for wait_task_zombie
exit: Coredumps reach do_group_exit
exit: Remove profile_handoff_task
exit: Remove profile_task_exit & profile_munmap
signal: clean up kernel-doc comments
signal: Remove the helper signal_group_exit
signal: Rename group_exit_task group_exec_task
coredump: Stop setting signal->group_exit_task
signal: Remove SIGNAL_GROUP_COREDUMP
signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
signal: Make coredump handling explicit in complete_signal
signal: Have prepare_signal detect coredumps using signal->core_state
signal: Have the oom killer detect coredumps using signal->core_state
exit: Move force_uaccess back into do_exit
exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit
...
Today the rules are a bit iffy and arbitrary about which kernel
threads have struct kthread present. Both idle threads and thread
started with create_kthread want struct kthread present so that is
effectively all kernel threads. Make the rule that if PF_KTHREAD
and the task is running then struct kthread is present.
This will allow the kernel thread code to using tsk->exit_code
with different semantics from ordinary processes.
To make ensure that struct kthread is present for all
kernel threads move it's allocation into copy_process.
Add a deallocation of struct kthread in exec for processes
that were kernel threads.
Move the allocation of struct kthread for the initial thread
earlier so that it is not repeated for each additional idle
thread.
Move the initialization of struct kthread into set_kthread_struct
so that the structure is always and reliably initailized.
Clear set_child_tid in free_kthread_struct to ensure the kthread
struct is reliably freed during exec. The function
free_kthread_struct does not need to clear vfork_done during exec as
exec_mm_release called from exec_mmap has already cleared vfork_done.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Update complete_and_exit to call kthread_exit instead of do_exit.
Change the name to reflect this change in functionality. All of the
users of complete_and_exit are causing the current kthread to exit so
this change makes it clear what is happening.
Move the implementation of kthread_complete_and_exit from
kernel/exit.c to to kernel/kthread.c. As this function is kthread
specific it makes most sense to live with the kthread functions.
There are no functional change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The way the per task_struct exit_code is used by kernel threads is not
quite compatible how it is used by userspace applications. The low
byte of the userspace exit_code value encodes the exit signal. While
kthreads just use the value as an int holding ordinary kernel function
exit status like -EPERM.
Add kthread_exit to clearly separate the two kinds of uses.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Merge misc updates from Andrew Morton:
"191 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
pagealloc, and memory-failure)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
mm,hwpoison: make get_hwpoison_page() call get_any_page()
mm,hwpoison: send SIGBUS with error virutal address
mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
docs: remove description of DISCONTIGMEM
arch, mm: remove stale mentions of DISCONIGMEM
mm: remove CONFIG_DISCONTIGMEM
m68k: remove support for DISCONTIGMEM
arc: remove support for DISCONTIGMEM
arc: update comment about HIGHMEM implementation
alpha: remove DISCONTIGMEM and NUMA
mm/page_alloc: move free_the_page
mm/page_alloc: fix counting of managed_pages
mm/page_alloc: improve memmap_pages dbg msg
mm: drop SECTION_SHIFT in code comments
mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
mm/page_alloc: scale the number of pages that are batch freed
...
For all intents and purposes, the idle task is a per-CPU kthread. It isn't
created via the same route as other pcpu kthreads however, and as a result
it is missing a few bells and whistles: it fails kthread_is_per_cpu() and
it doesn't have PF_NO_SETAFFINITY set.
Fix the former by giving the idle task a kthread struct along with the
KTHREAD_IS_PER_CPU flag. This requires some extra iffery as init_idle()
call be called more than once on the same idle task.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210510151024.2448573-2-valentin.schneider@arm.com
There is a need to distinguish geniune per-cpu kthreads from kthreads
that happen to have a single CPU affinity.
Geniune per-cpu kthreads are kthreads that are CPU affine for
correctness, these will obviously have PF_KTHREAD set, but must also
have PF_NO_SETAFFINITY set, lest userspace modify their affinity and
ruins things.
However, these two things are not sufficient, PF_NO_SETAFFINITY is
also set on other tasks that have their affinities controlled through
other means, like for instance workqueues.
Therefore another bit is needed; it turns out kthread_create_per_cpu()
already has such a bit: KTHREAD_IS_PER_CPU, which is used to make
kthread_park()/kthread_unpark() work correctly.
Expose this flag and remove the implicit setting of it from
kthread_create_on_cpu(); the io_uring usage of it seems dubious at
best.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.557620262@infradead.org
Merge some more updates from Andrew Morton:
- various hotfixes and minor things
- hch's use_mm/unuse_mm clearnups
Subsystems affected by this patch series: mm/hugetlb, scripts, kcov,
lib, nilfs, checkpatch, lib, mm/debug, ocfs2, lib, misc.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kernel: set USER_DS in kthread_use_mm
kernel: better document the use_mm/unuse_mm API contract
kernel: move use_mm/unuse_mm to kthread.c
kernel: move use_mm/unuse_mm to kthread.c
stacktrace: cleanup inconsistent variable type
lib: test get_count_order/long in test_bitops.c
mm: add comments on pglist_data zones
ocfs2: fix spelling mistake and grammar
mm/debug_vm_pgtable: fix kernel crash by checking for THP support
lib: fix bitmap_parse() on 64-bit big endian archs
checkpatch: correct check for kernel parameters doc
nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
lib/lz4/lz4_decompress.c: document deliberate use of `&'
kcov: check kcov_softirq in kcov_remote_stop()
scripts/spelling: add a few more typos
khugepaged: selftests: fix timeout condition in wait_for_scan()
It's handy to keep the kthread_fn just as a unique cookie to identify
classes of kthreads. E.g. if you can verify that a given task is
running your thread_fn, then you may know what sort of type kthread_data
points to.
We'll use this in nfsd to pass some information into the vfs. Note it
will need kthread_data() exported too.
Original-patch-by: Tejun Heo <tj@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- refcount conversions
- Solve the rq->leaf_cfs_rq_list can of worms for real.
- improve power-aware scheduling
- add sysctl knob for Energy Aware Scheduling
- documentation updates
- misc other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
kthread: Do not use TIMER_IRQSAFE
kthread: Convert worker lock to raw spinlock
sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
sched/fair: Remove unused 'sd' parameter from select_idle_smt()
sched/wait: Use freezable_schedule() when possible
sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
sched/fair: Explain LLC nohz kick condition
sched/fair: Simplify nohz_balancer_kick()
sched/topology: Fix percpu data types in struct sd_data & struct s_data
sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
sched/fair: Fix O(nr_cgroups) in the load balancing path
sched/fair: Optimize update_blocked_averages()
sched/fair: Fix insertion in rq->leaf_cfs_rq_list
sched/fair: Add tmp_alone_branch assertion
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
sched/fair: Update scale invariance of PELT
sched/fair: Move the rq_of() helper function
sched/core: Convert task_struct.stack_refcount to refcount_t
...
The TIMER_IRQSAFE usage was introduced in commit 22597dc3d9 ("kthread:
initial support for delayed kthread work") which modelled the delayed
kthread code after workqueue's code. The workqueue code requires the flag
TIMER_IRQSAFE for synchronisation purpose. This is not true for kthread's
delay timer since all operations occur under a lock.
Remove TIMER_IRQSAFE from the timer initialisation and use timer_setup()
for initialisation purpose which is the official function.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lkml.kernel.org/r/20190212162554.19779-2-bigeasy@linutronix.de
kthread_should_park() is used to check if the calling kthread ('current')
should park, but there is no function to check whether an arbitrary kthread
should be parked. The latter is required to plug a CPU hotplug race vs. a
parking ksoftirqd thread.
The new __kthread_should_park() receives a task_struct as parameter to
check if the corresponding kernel thread should be parked.
Call __kthread_should_park() from kthread_should_park() to avoid code
duplication.
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Stephen Boyd <swboyd@chromium.org>
Link: https://lkml.kernel.org/r/20190128234625.78241-2-mka@chromium.org
Gaurav reports that commit:
85f1abe001 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
isn't working for him. Because of the following race:
> controller Thread CPUHP Thread
> takedown_cpu
> kthread_park
> kthread_parkme
> Set KTHREAD_SHOULD_PARK
> smpboot_thread_fn
> set Task interruptible
>
>
> wake_up_process
> if (!(p->state & state))
> goto out;
>
> Kthread_parkme
> SET TASK_PARKED
> schedule
> raw_spin_lock(&rq->lock)
> ttwu_remote
> waiting for __task_rq_lock
> context_switch
>
> finish_lock_switch
>
>
>
> Case TASK_PARKED
> kthread_park_complete
>
>
> SET Running
Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
handling is buggered because the TASK_DEAD thing is done with
preemption disabled, the current code can still complete early on
preemption :/
So basically revert that earlier fix and go with a variant of the
alternative mentioned in the commit. Promote TASK_PARKED to special
state to avoid the store-store issue on task->state leading to the
WARN in kthread_unpark() -> __kthread_bind().
But in addition, add wait_task_inactive() to kthread_park() to ensure
the task really is PARKED when we return from kthread_park(). This
avoids the whole kthread still gets migrated nonsense -- although it
would be really good to get this done differently.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 85f1abe001 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Even with the wait-loop fixed, there is a further issue with
kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
smpboot_park_threads() can return before all those threads are in fact
blocked, due to the placement of the complete() in __kthread_parkme().
When that happens, sched_cpu_dying() -> migrate_tasks() can end up
migrating such a still runnable task onto another CPU.
Normally the task will have hit schedule() and gone to sleep by the
time we do kthread_unpark(), which will then do __kthread_bind() to
re-bind the task to the correct CPU.
However, when we loose the initial TASK_PARKED store to the concurrent
wakeup issue described previously, do the complete(), get migrated, it
is possible to either:
- observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
the park and set TASK_RUNNING, or
- __kthread_bind()'s wait_task_inactive() to observe the competing
TASK_RUNNING store.
Either way the WARN() in __kthread_bind() will trigger and fail to
correctly set the CPU affinity.
Fix this by only issuing the complete() when the kthread has scheduled
out. This does away with all the icky 'still running' nonsense.
The alternative is to promote TASK_PARKED to a special state, this
guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
and we'll end up doing the right thing, but this preserves the whole
icky business of potentially migating the still runnable thing.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With all callbacks converted, and the timer callback prototype
switched over, the TIMER_FUNC_TYPE cast is no longer needed,
so remove it. Conversion was done with the following scripts:
perl -pi -e 's|\(TIMER_FUNC_TYPE\)||g' \
$(git grep TIMER_FUNC_TYPE | cut -d: -f1 | sort -u)
perl -pi -e 's|\(TIMER_DATA_TYPE\)||g' \
$(git grep TIMER_DATA_TYPE | cut -d: -f1 | sort -u)
The now unused macros are also dropped from include/linux/timer.h.
Signed-off-by: Kees Cook <keescook@chromium.org>