mirror of
https://github.com/Dasharo/linux.git
synced 2026-03-06 15:25:10 -08:00
ffd264bd152cbf88fcf5ced04d3d380c77020231
1092 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
622c72b651 |
Merge tag 'timers-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner: "A single fix for POSIX CPU timers to address a problem where POSIX CPU timer delivery stops working for a new child task because copy_process() copies state information which is only valid for the parent task" * tag 'timers-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Clear task::posix_cputimers_work in copy_process() |
||
|
|
59a2ceeef6 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "87 patches. Subsystems affected by this patch series: mm (pagecache and hugetlb), procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs, init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork, sysvfs, kcov, gdb, resource, selftests, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits) ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL ipc: check checkpoint_restore_ns_capable() to modify C/R proc files selftests/kselftest/runner/run_one(): allow running non-executable files virtio-mem: disallow mapping virtio-mem memory via /dev/mem kernel/resource: disallow access to exclusive system RAM regions kernel/resource: clean up and optimize iomem_is_exclusive() scripts/gdb: handle split debug for vmlinux kcov: replace local_irq_save() with a local_lock_t kcov: avoid enable+disable interrupts if !in_task() kcov: allocate per-CPU memory on the relevant node Documentation/kcov: define `ip' in the example Documentation/kcov: include types.h in the example sysv: use BUILD_BUG_ON instead of runtime check kernel/fork.c: unshare(): use swap() to make code cleaner seq_file: fix passing wrong private data seq_file: move seq_escape() to a header signal: remove duplicate include in signal.h crash_dump: remove duplicate include in crash_dump.h crash_dump: fix boolreturn.cocci warning hfs/hfsplus: use WARN_ON for sanity check ... |
||
|
|
ba1f70ddd1 |
kernel/fork.c: unshare(): use swap() to make code cleaner
Use swap() instead of reimplementing it. Link: https://lkml.kernel.org/r/20210909022046.8151-1-ran.xiaokai@zte.com.cn Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Gabriel Krisman Bertazi <krisman@collabora.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexey Gladkov <legion@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a602285ac1 |
Merge branch 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull per signal_struct coredumps from Eric Biederman: "Current coredumps are mixed up with the exit code, the signal handling code, and the ptrace code making coredumps much more complicated than necessary and difficult to follow. This series of changes starts with ptrace_stop and cleans it up, making it easier to follow what is happening in ptrace_stop. Then cleans up the exec interactions with coredumps. Then cleans up the coredump interactions with exit. Finally the coredump interactions with the signal handling code is cleaned up. The first and last changes are bug fixes for minor bugs. I believe the fact that vfork followed by execve can kill the process the called vfork if exec fails is sufficient justification to change the userspace visible behavior. In previous discussions some of these changes were organized differently and individually appeared to make the code base worse. As currently written I believe they all stand on their own as cleanups and bug fixes. Which means that even if the worst should happen and the last change needs to be reverted for some unimaginable reason, the code base will still be improved. If the worst does not happen there are a more cleanups that can be made. Signals that generate coredumps can easily become eligible for short circuit delivery in complete_signal. The entire rendezvous for generating a coredump can move into get_signal. The function force_sig_info_to_task be written in a way that does not modify the signal handling state of the target task (because coredumps are eligible for short circuit delivery). Many of these future cleanups can be done another way but nothing so cleanly as if coredumps become per signal_struct" * 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: coredump: Limit coredumps to a single thread group coredump: Don't perform any cleanups before dumping core exit: Factor coredump_exit_mm out of exit_mm exec: Check for a pending fatal signal instead of core_state ptrace: Remove the unnecessary arguments from arch_ptrace_stop signal: Remove the bogus sigkill_pending in ptrace_stop |
||
|
|
ca7752caea |
posix-cpu-timers: Clear task::posix_cputimers_work in copy_process()
copy_process currently copies task_struct.posix_cputimers_work as-is. If a
timer interrupt arrives while handling clone and before dup_task_struct
completes then the child task will have:
1. posix_cputimers_work.scheduled = true
2. posix_cputimers_work.work queued.
copy_process clears task_struct.task_works, so (2) will have no effect and
posix_cpu_timers_work will never run (not to mention it doesn't make sense
for two tasks to share a common linked list).
Since posix_cpu_timers_work never runs, posix_cputimers_work.scheduled is
never cleared. Since scheduled is set, future timer interrupts will skip
scheduling work, with the ultimate result that the task will never receive
timer expirations.
Together, the complete flow is:
1. Task 1 calls clone(), enters kernel.
2. Timer interrupt fires, schedules task work on Task 1.
2a. task_struct.posix_cputimers_work.scheduled = true
2b. task_struct.posix_cputimers_work.work added to
task_struct.task_works.
3. dup_task_struct() copies Task 1 to Task 2.
4. copy_process() clears task_struct.task_works for Task 2.
5. Future timer interrupts on Task 2 see
task_struct.posix_cputimers_work.scheduled = true and skip scheduling
work.
Fix this by explicitly clearing contents of task_struct.posix_cputimers_work
in copy_process(). This was never meant to be shared or inherited across
tasks in the first place.
Fixes:
|
||
|
|
9a7e0a90a4 |
Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner: - Revert the printk format based wchan() symbol resolution as it can leak the raw value in case that the symbol is not resolvable. - Make wchan() more robust and work with all kind of unwinders by enforcing that the task stays blocked while unwinding is in progress. - Prevent sched_fork() from accessing an invalid sched_task_group - Improve asymmetric packing logic - Extend scheduler statistics to RT and DL scheduling classes and add statistics for bandwith burst to the SCHED_FAIR class. - Properly account SCHED_IDLE entities - Prevent a potential deadlock when initial priority is assigned to a newly created kthread. A recent change to plug a race between cpuset and __sched_setscheduler() introduced a new lock dependency which is now triggered. Break the lock dependency chain by moving the priority assignment to the thread function. - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems. - Improve idle balancing in general and especially for NOHZ enabled systems. - Provide proper interfaces for live patching so it does not have to fiddle with scheduler internals. - Add cluster aware scheduling support. - A small set of tweaks for RT (irqwork, wait_task_inactive(), various scheduler options and delaying mmdrop) - The usual small tweaks and improvements all over the place * tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits) sched/fair: Cleanup newidle_balance sched/fair: Remove sysctl_sched_migration_cost condition sched/fair: Wait before decaying max_newidle_lb_cost sched/fair: Skip update_blocked_averages if we are defering load balance sched/fair: Account update_blocked_averages in newidle_balance cost x86: Fix __get_wchan() for !STACKTRACE sched,x86: Fix L2 cache mask sched/core: Remove rq_relock() sched: Improve wake_up_all_idle_cpus() take #2 irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support. sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ sched: Add cluster scheduler level for x86 sched: Add cluster scheduler level in core and related Kconfig for ARM64 topology: Represent clusters of CPUs within a die sched: Disable -Wunused-but-set-variable sched: Add wrapper for get_wchan() to keep task blocked x86: Fix get_wchan() to support the ORC unwinder proc: Use task_is_running() for wchan in /proc/$pid/stat ... |
||
|
|
545c6647d2 |
kernel: remove spurious blkdev.h includes
Various files have acquired spurious includes of <linux/blkdev.h> over time. Remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
4ef0c5c6b5 |
kernel/sched: Fix sched_fork() access an invalid sched_task_group
There is a small race between copy_process() and sched_fork()
where child->sched_task_group point to an already freed pointer.
parent doing fork() | someone moving the parent
| to another cgroup
-------------------------------+-------------------------------
copy_process()
+ dup_task_struct()<1>
parent move to another cgroup,
and free the old cgroup. <2>
+ sched_fork()
+ __set_task_cpu()<3>
+ task_fork_fair()
+ sched_slice()<4>
In the worst case, this bug can lead to "use-after-free" and
cause panic as shown above:
(1) parent copy its sched_task_group to child at <1>;
(2) someone move the parent to another cgroup and free the old
cgroup at <2>;
(3) the sched_task_group and cfs_rq that belong to the old cgroup
will be accessed at <3> and <4>, which cause a panic:
[] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0
[] Oops: 0000 [#1] SMP PTI
[] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G OE --------- - - 4.18.0.x86_64+ #1
[] RIP: 0010:sched_slice+0x84/0xc0
[] Call Trace:
[] task_fork_fair+0x81/0x120
[] sched_fork+0x132/0x240
[] copy_process.part.5+0x675/0x20e0
[] ? __handle_mm_fault+0x63f/0x690
[] _do_fork+0xcd/0x3b0
[] do_syscall_64+0x5d/0x1d0
[] entry_SYSCALL_64_after_hwframe+0x65/0xca
[] RIP: 0033:0x7f04418cd7e1
Between cgroup_can_fork() and cgroup_post_fork(), the cgroup
membership and thus sched_task_group can't change. So update child's
sched_task_group at sched_post_fork() and move task_fork() and
__set_task_cpu() (where accees the sched_task_group) from sched_fork()
to sched_post_fork().
Fixes:
|
||
|
|
0258b5fd7c |
coredump: Limit coredumps to a single thread group
Today when a signal is delivered with a handler of SIG_DFL whose
default behavior is to generate a core dump not only that process but
every process that shares the mm is killed.
In the case of vfork this looks like a real world problem. Consider
the following well defined sequence.
if (vfork() == 0) {
execve(...);
_exit(EXIT_FAILURE);
}
If a signal that generates a core dump is received after vfork but
before the execve changes the mm the process that called vfork will
also be killed (as the mm is shared).
Similarly if the execve fails after the point of no return the kernel
delivers SIGSEGV which will kill both the exec'ing process and because
the mm is shared the process that called vfork as well.
As far as I can tell this behavior is a violation of people's
reasonable expectations, POSIX, and is unnecessarily fragile when the
system is low on memory.
Solve this by making a userspace visible change to only kill a single
process/thread group. This is possible because Jann Horn recently
modified[1] the coredump code so that the mm can safely be modified
while the coredump is happening. With LinuxThreads long gone I don't
expect anyone to have a notice this behavior change in practice.
To accomplish this move the core_state pointer from mm_struct to
signal_struct, which allows different thread groups to coredump
simultatenously.
In zap_threads remove the work to kill anything except for the current
thread group.
v2: Remove core_state from the VM_BUG_ON_MM print to fix
compile failure when CONFIG_DEBUG_VM is enabled.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
[1]
|
||
|
|
9230738308 |
coredump: Don't perform any cleanups before dumping core
Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace "may_ptrace_stop()" with a simple test of "current->ptrace". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
13db8c5047 |
mm/hugetlb: initialize hugetlb_usage in mm_init
After fork, the child process will get incorrect (2x) hugetlb_usage. If
a process uses 5 2MB hugetlb pages in an anonymous mapping,
HugetlbPages: 10240 kB
and then forks, the child will show,
HugetlbPages: 20480 kB
The reason for double the amount is because hugetlb_usage will be copied
from the parent and then increased when we copy page tables from parent
to child. Child will have 2x actual usage.
Fix this by adding hugetlb_count_init in mm_init.
Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
Fixes:
|
||
|
|
2d338201d5 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
"147 patches, based on
|
||
|
|
05da8113c9 |
kernel/fork.c: unexport get_{mm,task}_exe_file
Only used by core code and the tomoyo which can't be a module either. Link: https://lkml.kernel.org/r/20210820095430.445242-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
49624efa65 |
Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux
Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
VM_DENYWRITE.
There are some (minor) user-visible changes:
- We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g. dlopen().
- We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often
is).
- We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
file cannot be denied, and write access to the file will remain
denied until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
s390x, ...) and verified via ltp that especially the relevant tests
(i.e., creat07 and execve04) continue working as expected"
* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
fs: update documentation of get_write_access() and friends
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
mm: remove VM_DENYWRITE
binfmt: remove in-tree usage of MAP_DENYWRITE
kernel/fork: always deny write access to current MM exe_file
kernel/fork: factor out replacing the current MM exe_file
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
|
||
|
|
8d0920bde5 |
mm: remove VM_DENYWRITE
All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be set from user space, so all users are gone; let's remove it. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com> |
||
|
|
fe69d560b5 |
kernel/fork: always deny write access to current MM exe_file
We want to remove VM_DENYWRITE only currently only used when mapping the executable during exec. During exec, we already deny_write_access() the executable, however, after exec completes the VMAs mapped with VM_DENYWRITE effectively keeps write access denied via deny_write_access(). Let's deny write access when setting or replacing the MM exe_file. With this change, we can remove VM_DENYWRITE for mapping executables. Make set_mm_exe_file() return an error in case deny_write_access() fails; note that this should never happen, because exec code does a deny_write_access() early and keeps write access denied when calling set_mm_exe_file. However, it makes the code easier to read and makes set_mm_exe_file() and replace_mm_exe_file() look more similar. This represents a minor user space visible change: sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file cannot be opened writable. Note that we can already fail with -EACCES if the file doesn't have execute permissions. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com> |
||
|
|
35d7bdc860 |
kernel/fork: factor out replacing the current MM exe_file
Let's factor the main logic out into replace_mm_exe_file(), such that all mm->exe_file logic is contained in kernel/fork.c. While at it, perform some simple cleanups that are possible now that we're simplifying the individual functions. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com> |
||
|
|
9e9fb7655e |
Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Enable memcg accounting for various networking objects.
BPF:
- Introduce bpf timers.
- Add perf link and opaque bpf_cookie which the program can read out
again, to be used in libbpf-based USDT library.
- Add bpf_task_pt_regs() helper to access user space pt_regs in
kprobes, to help user space stack unwinding.
- Add support for UNIX sockets for BPF sockmap.
- Extend BPF iterator support for UNIX domain sockets.
- Allow BPF TCP congestion control progs and bpf iterators to call
bpf_setsockopt(), e.g. to switch to another congestion control
algorithm.
Protocols:
- Support IOAM Pre-allocated Trace with IPv6.
- Support Management Component Transport Protocol.
- bridge: multicast: add vlan support.
- netfilter: add hooks for the SRv6 lightweight tunnel driver.
- tcp:
- enable mid-stream window clamping (by user space or BPF)
- allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
- more accurate DSACK processing for RACK-TLP
- mptcp:
- add full mesh path manager option
- add partial support for MP_FAIL
- improve use of backup subflows
- optimize option processing
- af_unix: add OOB notification support.
- ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by the
router.
- mac80211: Target Wake Time support in AP mode.
- can: j1939: extend UAPI to notify about RX status.
Driver APIs:
- Add page frag support in page pool API.
- Many improvements to the DSA (distributed switch) APIs.
- ethtool: extend IRQ coalesce uAPI with timer reset modes.
- devlink: control which auxiliary devices are created.
- Support CAN PHYs via the generic PHY subsystem.
- Proper cross-chip support for tag_8021q.
- Allow TX forwarding for the software bridge data path to be
offloaded to capable devices.
Drivers:
- veth: more flexible channels number configuration.
- openvswitch: introduce per-cpu upcall dispatch.
- Add internet mix (IMIX) mode to pktgen.
- Transparently handle XDP operations in the bonding driver.
- Add LiteETH network driver.
- Renesas (ravb):
- support Gigabit Ethernet IP
- NXP Ethernet switch (sja1105):
- fast aging support
- support for "H" switch topologies
- traffic termination for ports under VLAN-aware bridge
- Intel 1G Ethernet
- support getcrosststamp() with PCIe PTM (Precision Time
Measurement) for better time sync
- support Credit-Based Shaper (CBS) offload, enabling HW traffic
prioritization and bandwidth reservation
- Broadcom Ethernet (bnxt)
- support pulse-per-second output
- support larger Rx rings
- Mellanox Ethernet (mlx5)
- support ethtool RSS contexts and MQPRIO channel mode
- support LAG offload with bridging
- support devlink rate limit API
- support packet sampling on tunnels
- Huawei Ethernet (hns3):
- basic devlink support
- add extended IRQ coalescing support
- report extended link state
- Netronome Ethernet (nfp):
- add conntrack offload support
- Broadcom WiFi (brcmfmac):
- add WPA3 Personal with FT to supported cipher suites
- support 43752 SDIO device
- Intel WiFi (iwlwifi):
- support scanning hidden 6GHz networks
- support for a new hardware family (Bz)
- Xen pv driver:
- harden netfront against malicious backends
- Qualcomm mobile
- ipa: refactor power management and enable automatic suspend
- mhi: move MBIM to WWAN subsystem interfaces
Refactor:
- Ambient BPF run context and cgroup storage cleanup.
- Compat rework for ndo_ioctl.
Old code removal:
- prism54 remove the obsoleted driver, deprecated by the p54 driver.
- wan: remove sbni/granch driver"
* tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1715 commits)
net: Add depends on OF_NET for LiteX's LiteETH
ipv6: seg6: remove duplicated include
net: hns3: remove unnecessary spaces
net: hns3: add some required spaces
net: hns3: clean up a type mismatch warning
net: hns3: refine function hns3_set_default_feature()
ipv6: remove duplicated 'net/lwtunnel.h' include
net: w5100: check return value after calling platform_get_resource()
net/mlxbf_gige: Make use of devm_platform_ioremap_resourcexxx()
net: mdio: mscc-miim: Make use of the helper function devm_platform_ioremap_resource()
net: mdio-ipq4019: Make use of devm_platform_ioremap_resource()
fou: remove sparse errors
ipv4: fix endianness issue in inet_rtm_getroute_build_skb()
octeontx2-af: Set proper errorcode for IPv4 checksum errors
octeontx2-af: Fix static code analyzer reported issues
octeontx2-af: Fix mailbox errors in nix_rss_flowkey_cfg
octeontx2-af: Fix loop in free and unmap counter
af_unix: fix potential NULL deref in unix_dgram_connect()
dpaa2-eth: Replace strlcpy with strscpy
octeontx2-af: Use NDC TX for transmit packet data
...
|
||
|
|
5d3c0db459 |
Merge tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar: - The biggest change in this cycle is scheduler support for asymmetric scheduling affinity, to support the execution of legacy 32-bit tasks on AArch32 systems that also have 64-bit-only CPUs. Architectures can fill in this functionality by defining their own task_cpu_possible_mask(p). When this is done, the scheduler will make sure the task will only be scheduled on CPUs that support it. (The actual arm64 specific changes are not part of this tree.) For other architectures there will be no change in functionality. - Add cgroup SCHED_IDLE support - Increase node-distance flexibility & delay determining it until a CPU is brought online. (This enables platforms where node distance isn't final until the CPU is only.) - Deadline scheduler enhancements & fixes - Misc fixes & cleanups. * tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) eventfd: Make signal recursion protection a task bit sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case sched: Introduce dl_task_check_affinity() to check proposed affinity sched: Allow task CPU affinity to be restricted on asymmetric systems sched: Split the guts of sched_setaffinity() into a helper function sched: Introduce task_struct::user_cpus_ptr to track requested affinity sched: Reject CPU affinity changes based on task_cpu_possible_mask() cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq() cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus() cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1 sched: Introduce task_cpu_possible_mask() to limit fallback rq selection sched: Cgroup SCHED_IDLE support sched/topology: Skip updating masks for non-online nodes sched: Replace deprecated CPU-hotplug functions. sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS sched: Fix UCLAMP_FLAG_IDLE setting sched/deadline: Fix missing clock update in migrate_task_rq_dl() sched/fair: Avoid a second scan of target in select_idle_cpu sched/fair: Use prev instead of new target as recent_used_cpu sched: Don't report SCHED_FLAG_SUGOV in sched_getattr() ... |
||
|
|
97c78d0af5 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/wwan/mhi_wwan_mbim.c - drop the extra arg. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
5ddf994fa2 |
ucounts: Fix regression preventing increasing of rlimits in init_user_ns
"Ma, XinjianX" <xinjianx.ma@intel.com> reported: > When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests > in kselftest failed with following message. > > # selftests: mqueue: mq_perf_tests > # > # Initial system state: > # Using queue path: /mq_perf_tests > # RLIMIT_MSGQUEUE(soft): 819200 > # RLIMIT_MSGQUEUE(hard): 819200 > # Maximum Message Size: 8192 > # Maximum Queue Size: 10 > # Nice value: 0 > # > # Adjusted system state for testing: > # RLIMIT_MSGQUEUE(soft): (unlimited) > # RLIMIT_MSGQUEUE(hard): (unlimited) > # Maximum Message Size: 16777216 > # Maximum Queue Size: 65530 > # Nice value: -20 > # Continuous mode: (disabled) > # CPUs to pin: 3 > # ./mq_perf_tests: mq_open() at 296: Too many open files > not ok 2 selftests: mqueue: mq_perf_tests # exit=1 > ``` > > Test env: > rootfs: debian-10 > gcc version: 9 After investigation the problem turned out to be that ucount_max for the rlimits in init_user_ns was being set to the initial rlimit value. The practical problem is that ucount_max provides a limit that applications inside the user namespace can not exceed. Which means in practice that rlimits that have been converted to use the ucount infrastructure were not able to exceend their initial rlimits. Solve this by setting the relevant values of ucount_max to RLIM_INIFINITY. A limit in init_user_ns is pointless so the code should allow the values to grow as large as possible without riscking an underflow or an overflow. As the ltp test case was a bit of a pain I have reproduced the rlimit failure and tested the fix with the following little C program: > #include <stdio.h> > #include <fcntl.h> > #include <sys/stat.h> > #include <mqueue.h> > #include <sys/time.h> > #include <sys/resource.h> > #include <errno.h> > #include <string.h> > #include <stdlib.h> > #include <limits.h> > #include <unistd.h> > > int main(int argc, char **argv) > { > struct mq_attr mq_attr; > struct rlimit rlim; > mqd_t mqd; > int ret; > > ret = getrlimit(RLIMIT_MSGQUEUE, &rlim); > if (ret != 0) { > fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > printf("RLIMIT_MSGQUEUE %lu %lu\n", > rlim.rlim_cur, rlim.rlim_max); > rlim.rlim_cur = RLIM_INFINITY; > rlim.rlim_max = RLIM_INFINITY; > ret = setrlimit(RLIMIT_MSGQUEUE, &rlim); > if (ret != 0) { > fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > > memset(&mq_attr, 0, sizeof(struct mq_attr)); > mq_attr.mq_maxmsg = 65536 - 1; > mq_attr.mq_msgsize = 16*1024*1024 - 1; > > mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &mq_attr); > if (mqd == (mqd_t)-1) { > fprintf(stderr, "mq_open failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > ret = mq_close(mqd); > if (ret) { > fprintf(stderr, "mq_close failed; %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > > return EXIT_SUCCESS; > } Fixes: |
||
|
|
b90ca8badb |
sched: Introduce task_struct::user_cpus_ptr to track requested affinity
In preparation for saving and restoring the user-requested CPU affinity mask of a task, add a new cpumask_t pointer to 'struct task_struct'. If the pointer is non-NULL, then the mask is copied across fork() and freed on task exit. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Link: https://lore.kernel.org/r/20210730112443.23245-7-will@kernel.org |
||
|
|
c7603cfa04 |
bpf: Add ambient BPF runtime context stored in current
|
||
|
|
65090f30ab |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "191 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab, slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap, mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization, pagealloc, and memory-failure)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits) mm,hwpoison: make get_hwpoison_page() call get_any_page() mm,hwpoison: send SIGBUS with error virutal address mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes mm/page_alloc: allow high-order pages to be stored on the per-cpu lists mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA docs: remove description of DISCONTIGMEM arch, mm: remove stale mentions of DISCONIGMEM mm: remove CONFIG_DISCONTIGMEM m68k: remove support for DISCONTIGMEM arc: remove support for DISCONTIGMEM arc: update comment about HIGHMEM implementation alpha: remove DISCONTIGMEM and NUMA mm/page_alloc: move free_the_page mm/page_alloc: fix counting of managed_pages mm/page_alloc: improve memmap_pages dbg msg mm: drop SECTION_SHIFT in code comments mm/page_alloc: introduce vm.percpu_pagelist_high_fraction mm/page_alloc: limit the number of pages on PCP lists when reclaim is active mm/page_alloc: scale the number of pages that are batch freed ... |
||
|
|
a458b76a41 |
mm: gup: pack has_pinned in MMF_HAS_PINNED
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop cleanup. Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast would reintroduce a loss of SMP scalability to pin-fast, so there's no future potential usefulness to keep an atomic in the mm for this. set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like atomic_set after this commit) has to be still issued only once per "mm", so the difference between the two will be lost in the noise. will-it-scale "mmap2" shows no change in performance with enterprise config as expected. will-it-scale "pin_fast" retains the > 4000% SMP scalability performance improvement against upstream as expected. This is a noop as far as overall performance and SMP scalability are concerned. [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED] Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s [akpm@linux-foundation.org: coding style fixes] [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments] Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |