The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
flags into the corresponding GFP_* flags for memory allocation. In that
function, current->flags is accessed 3 times. That may lead to duplicated
access of the same memory location.
This is not usually a problem with minimal debug config options on as the
compiler can optimize away the duplicated memory accesses. With most of
the debug config options on, however, that may not be the case. For
example, the x86-64 object size of the __need_fs_reclaim() in a debug
kernel that calls current_gfp_context() was 309 bytes. With this patch
applied, the object size is reduced to 202 bytes. This is a saving of 107
bytes and will probably be slightly faster too.
Use READ_ONCE() to access current->flags to prevent the compiler from
possibly accessing current->flags multiple times.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michel Lespinasse <walken@google.com>
Link: http://lkml.kernel.org/r/20200618212936.9776-1-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, memalloc_nocma_{save/restore} API that prevents CMA area
in page allocation is implemented by using current_gfp_context(). However,
there are two problems of this implementation.
First, this doesn't work for allocation fastpath. In the fastpath,
original gfp_mask is used since current_gfp_context() is introduced in
order to control reclaim and it is on slowpath. So, CMA area can be
allocated through the allocation fastpath even if
memalloc_nocma_{save/restore} APIs are used. Currently, there is just
one user for these APIs and it has a fallback method to prevent actual
problem.
Second, clearing __GFP_MOVABLE in current_gfp_context() has a side effect
to exclude the memory on the ZONE_MOVABLE for allocation target.
To fix these problems, this patch changes the implementation to exclude
CMA area in page allocation. Main point of this change is using the
alloc_flags. alloc_flags is mainly used to control allocation so it fits
for excluding CMA area in allocation.
Fixes: d7fefcc8de (mm/cma: add PF flag to force non cma alloc)
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull fork cleanups from Christian Brauner:
"This is cleanup series from when we reworked a chunk of the process
creation paths in the kernel and switched to struct
{kernel_}clone_args.
High-level this does two main things:
- Remove the double export of both do_fork() and _do_fork() where
do_fork() used the incosistent legacy clone calling convention.
Now we only export _do_fork() which is based on struct
kernel_clone_args.
- Remove the copy_thread_tls()/copy_thread() split making the
architecture specific HAVE_COYP_THREAD_TLS config option obsolete.
This switches all remaining architectures to select
HAVE_COPY_THREAD_TLS and thus to the copy_thread_tls() calling
convention. The current split makes the process creation codepaths
more convoluted than they need to be. Each architecture has their own
copy_thread() function unless it selects HAVE_COPY_THREAD_TLS then it
has a copy_thread_tls() function.
The split is not needed anymore nowadays, all architectures support
CLONE_SETTLS but quite a few of them never bothered to select
HAVE_COPY_THREAD_TLS and instead simply continued to use copy_thread()
and use the old calling convention. Removing this split cleans up the
process creation codepaths and paves the way for implementing clone3()
on such architectures since it requires the copy_thread_tls() calling
convention.
After having made each architectures support copy_thread_tls() this
series simply renames that function back to copy_thread(). It also
switches all architectures that call do_fork() directly over to
_do_fork() and the struct kernel_clone_args calling convention. This
is a corollary of switching the architectures that did not yet support
it over to copy_thread_tls() since do_fork() is conditional on not
supporting copy_thread_tls() (Mostly because it lacks a separate
argument for tls which is trivial to fix but there's no need for this
function to exist.).
The do_fork() removal is in itself already useful as it allows to to
remove the export of both do_fork() and _do_fork() we currently have
in favor of only _do_fork(). This has already been discussed back when
we added clone3(). The legacy clone() calling convention is - as is
probably well-known - somewhat odd:
#
# ABI hall of shame
#
config CLONE_BACKWARDS
config CLONE_BACKWARDS2
config CLONE_BACKWARDS3
that is aggravated by the fact that some architectures such as sparc
follow the CLONE_BACKWARDSx calling convention but don't really select
the corresponding config option since they call do_fork() directly.
So do_fork() enforces a somewhat arbitrary calling convention in the
first place that doesn't really help the individual architectures that
deviate from it. They can thus simply be switched to _do_fork()
enforcing a single calling convention. (I really hope that any new
architectures will __not__ try to implement their own calling
conventions...)
Most architectures already have made a similar switch (m68k comes to
mind).
Overall this removes more code than it adds even with a good portion
of added comments. It simplifies a chunk of arch specific assembly
either by moving the code into C or by simply rewriting the assembly.
Architectures that have been touched in non-trivial ways have all been
actually boot and stress tested: sparc and ia64 have been tested with
Debian 9 images. They are the two architectures which have been
touched the most. All non-trivial changes to architectures have seen
acks from the relevant maintainers. nios2 with a custom built
buildroot image. h8300 I couldn't get something bootable to test on
but the changes have been fairly automatic and I'm sure we'll hear
people yell if I broke something there.
All other architectures that have been touched in trivial ways have
been compile tested for each single patch of the series via git rebase
-x "make ..." v5.8-rc2. arm{64} and x86{_64} have been boot tested
even though they have just been trivially touched (removal of the
HAVE_COPY_THREAD_TLS macro from their Kconfig) because well they are
basically "core architectures" and since it is trivial to get your
hands on a useable image"
* tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
arch: rename copy_thread_tls() back to copy_thread()
arch: remove HAVE_COPY_THREAD_TLS
unicore: switch to copy_thread_tls()
sh: switch to copy_thread_tls()
nds32: switch to copy_thread_tls()
microblaze: switch to copy_thread_tls()
hexagon: switch to copy_thread_tls()
c6x: switch to copy_thread_tls()
alpha: switch to copy_thread_tls()
fork: remove do_fork()
h8300: select HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
nios2: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
ia64: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
sparc: unconditionally enable HAVE_COPY_THREAD_TLS
sparc: share process creation helpers between sparc and sparc64
sparc64: enable HAVE_COPY_THREAD_TLS
fork: fold legacy_clone_args_valid() into _do_fork()
Pull execve updates from Eric Biederman:
"During the development of v5.7 I ran into bugs and quality of
implementation issues related to exec that could not be easily fixed
because of the way exec is implemented. So I have been diggin into
exec and cleaning up what I can.
This cycle I have been looking at different ideas and different
implementations to see what is possible to improve exec, and cleaning
the way exec interfaces with in kernel users. Only cleaning up the
interfaces of exec with rest of the kernel has managed to stabalize
and make it through review in time for v5.9-rc1 resulting in 2 sets of
changes this cycle.
- Implement kernel_execve
- Make the user mode driver code a better citizen
With kernel_execve the code size got a little larger as the copying of
parameters from userspace and copying of parameters from userspace is
now separate. The good news is kernel threads no longer need to play
games with set_fs to use exec. Which when combined with the rest of
Christophs set_fs changes should security bugs with set_fs much more
difficult"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
exec: Implement kernel_execve
exec: Factor bprm_stack_limits out of prepare_arg_pages
exec: Factor bprm_execve out of do_execve_common
exec: Move bprm_mm_init into alloc_bprm
exec: Move initialization of bprm->filename into alloc_bprm
exec: Factor out alloc_bprm
exec: Remove unnecessary spaces from binfmts.h
umd: Stop using split_argv
umd: Remove exit_umh
bpfilter: Take advantage of the facilities of struct pid
exit: Factor thread_group_exited out of pidfd_poll
umd: Track user space drivers with struct pid
bpfilter: Move bpfilter_umh back into init data
exec: Remove do_execve_file
umh: Stop calling do_execve_file
umd: Transform fork_usermode_blob into fork_usermode_driver
umd: Rename umd_info.cmdline umd_info.driver_name
umd: For clarity rename umh_info umd_info
umh: Separate the user mode driver and the user mode helper support
umh: Remove call_usermodehelper_setup_file.
...
Pull scheduler updates from Ingo Molnar:
- Improve uclamp performance by using a static key for the fast path
- Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
better power efficiency of RT tasks on battery powered devices.
(The default is to maximize performance & reduce RT latencies.)
- Improve utime and stime tracking accuracy, which had a fixed boundary
of error, which created larger and larger relative errors as the
values become larger. This is now replaced with more precise
arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h.
- Improve the deadline scheduler, such as making it capacity aware
- Improve frequency-invariant scheduling
- Misc cleanups in energy/power aware scheduling
- Add sched_update_nr_running tracepoint to track changes to nr_running
- Documentation additions and updates
- Misc cleanups and smaller fixes
* tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
sched/doc: Document capacity aware scheduling
sched: Document arch_scale_*_capacity()
arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
Documentation/sysctl: Document uclamp sysctl knobs
sched/uclamp: Add a new sysctl to control RT default boost value
sched/uclamp: Fix a deadlock when enabling uclamp static key
sched: Remove duplicated tick_nohz_full_enabled() check
sched: Fix a typo in a comment
sched/uclamp: Remove unnecessary mutex_init()
arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
arch_topology, sched/core: Cleanup thermal pressure definition
trace/events/sched.h: fix duplicated word
linux/sched/mm.h: drop duplicated words in comments
smp: Fix a potential usage of stale nr_cpus
sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
sched: nohz: stop passing around unused "ticks" parameter.
sched: Better document ttwu()
sched: Add a tracepoint to track rq->nr_running
...
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.
This is also referred to as 'the default boost value of RT tasks'.
See commit 1a00d99997 ("sched/uclamp: Set default clamps for RT tasks").
On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.
For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.
Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.
They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.
The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.
Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.
To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.
I anticipate this to be mostly in the form of modifying the init script
of a particular device.
To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.
Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.
With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
[1] 804d402fb6: ("sched/rt: Make RT capacity-aware")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
put_task_struct_many() is as put_task_struct() but puts several
references at once. Useful to batching it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The following commit:
14533a16c4 ("thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code")
moved the definition of arch_set_thermal_pressure() to sched/core.c, but
kept its declaration in linux/arch_topology.h. When building e.g. an x86
kernel with CONFIG_SCHED_THERMAL_PRESSURE=y, cpufreq_cooling.c ends up
getting the declaration of arch_set_thermal_pressure() from
include/linux/arch_topology.h, which is somewhat awkward.
On top of this, sched/core.c unconditionally defines
o The thermal_pressure percpu variable
o arch_set_thermal_pressure()
while arch_scale_thermal_pressure() does nothing unless redefined by the
architecture.
arch_*() functions are meant to be defined by architectures, so revert the
aforementioned commit and re-implement it in a way that keeps
arch_set_thermal_pressure() architecture-definable, and doesn't define the
thermal pressure percpu variable for kernels that don't need
it (CONFIG_SCHED_THERMAL_PRESSURE=n).
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200712165917.9168-2-valentin.schneider@arm.com
Create an independent helper thread_group_exited which returns true
when all threads have passed exit_notify in do_exit. AKA all of the
threads are at least zombies and might be dead or completely gone.
Create this helper by taking the logic out of pidfd_poll where it is
already tested, and adding a READ_ONCE on the read of
task->exit_state.
I will be changing the user mode driver code to use this same logic
to know when a user mode driver needs to be restarted.
Place the new helper thread_group_exited in kernel/exit.c and
EXPORT it so it can be used by modules.
Link: https://lkml.kernel.org/r/20200702164140.4468-13-ebiederm@xmission.com
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
All architectures support copy_thread_tls() now, so remove the legacy
copy_thread() function and the HAVE_COPY_THREAD_TLS config option. Everyone
uses the same process creation calling convention based on
copy_thread_tls() and struct kernel_clone_args. This will make it easier to
maintain the core process creation code under kernel/, simplifies the
callpaths and makes the identical for all architectures.
Cc: linux-arch@vger.kernel.org
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Now that all architectures have been switched to use _do_fork() and the new
struct kernel_clone_args calling convention we can remove the legacy
do_fork() helper completely. The calling convention used to be brittle and
do_fork() didn't buy us anything. The only calling convention accepted
should be based on struct kernel_clone_args going forward. It's cleaner and
uniform.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
So that the target task will exit the wait_event_interruptible-like
loop and call task_work_run() asap.
The patch turns "bool notify" into 0,TWA_RESUME,TWA_SIGNAL enum, the
new TWA_SIGNAL flag implies signal_wake_up(). However, it needs to
avoid the race with recalc_sigpending(), so the patch also adds the
new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK.
TODO: once this patch is merged we need to change all current users
of task_work_add(notify = true) to use TWA_RESUME.
Cc: stable@vger.kernel.org # v5.7
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This separate helper only existed to guarantee the mutual exclusivity of
CLONE_PIDFD and CLONE_PARENT_SETTID for legacy clone since CLONE_PIDFD
abuses the parent_tid field to return the pidfd. But we can actually handle
this uniformely thus removing the helper. For legacy clone we can detect
that CLONE_PIDFD is specified in conjunction with CLONE_PARENT_SETTID
because they will share the same memory which is invalid and for clone3()
setting the separate pidfd and parent_tid fields to the same memory is
bogus as well. So fold that helper directly into _do_fork() by detecting
this case.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: x86@kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This is a kernel enhancement that configures the cpu affinity of kernel
threads via kernel boot option nohz_full=.
When this option is specified, the cpumask is immediately applied upon
kthread launch. This does not affect kernel threads that specify cpu
and node.
This allows CPU isolation (that is not allowing certain threads
to execute on certain CPUs) without using the isolcpus=domain parameter,
making it possible to enable load balancing on such CPUs
during runtime (see kernel-parameters.txt).
Note-1: this is based off on Wind River's patch at
https://github.com/starlingx-staging/stx-integ/blob/master/kernel/kernel-std/centos/patches/affine-compute-kernel-threads.patch
Difference being that this patch is limited to modifying kernel thread
cpumask. Behaviour of other threads can be controlled via cgroups or
sched_setaffinity.
Note-2: Wind River's patch was based off Christoph Lameter's patch at
https://lwn.net/Articles/565932/ with the only difference being
the kernel parameter changed from kthread to kthread_cpus.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200527142909.23372-3-frederic@kernel.org