[ Upstream commit 7361d1bc307b926cbca214ab67b641123c2d6357 ]
The helper mpi_read_raw_from_sgl sets the number of entries in
the SG list according to nbytes. However, if the last entry
in the SG list contains more data than nbytes, then it may overrun
the buffer because it only allocates enough memory for nbytes.
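The fix, conceptually, is to clamp how much is taken from each SG entry
to the bytes still expected. A minimal sketch of that idea (not the
exact patch; variable names are illustrative):

    /* copy at most nbytes in total, however long the SG entries are */
    while (nbytes > 0) {
        unsigned int len = min_t(unsigned int, sg->length, nbytes);

        memcpy(buf, sg_virt(sg), len);
        buf += len;
        nbytes -= len;
        sg = sg_next(sg);
    }
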
Fixes: 2d4d1eea54 ("lib/mpi: Add mpi sgl helpers")
Reported-by: Roberto Sassu <roberto.sassu@huaweicloud.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 74e19ef0ff8061ef55957c3abd71614ef0f42f47 upstream.
The results of "access_ok()" can be mis-speculated. The result is that
you can end speculatively:
if (access_ok(from, size))
// Right here
even for bad from/size combinations. On first glance, it would be ideal
to just add a speculation barrier to "access_ok()" so that its results
can never be mis-speculated.
But there are lots of system calls just doing access_ok() via
"copy_to_user()" and friends (example: fstat() and friends). Those are
generally not problematic because they do not _consume_ data from
userspace other than the pointer. They are also very quick and common
system calls that should not be needlessly slowed down.
"copy_from_user()" on the other hand uses a user-controller pointer and
is frequently followed up with code that might affect caches. Take
something like this:
if (!copy_from_user(&kernelvar, uptr, size))
do_something_with(kernelvar);
If userspace passes in an evil 'uptr' that *actually* points to a kernel
addresses, and then do_something_with() has cache (or other)
side-effects, it could allow userspace to infer kernel data values.
Add a barrier to the common copy_from_user() code to prevent
mis-speculated values which happen after the copy.
Also add a stub for architectures that do not define barrier_nospec().
This makes the macro usable in generic code.
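The overall shape of the change is roughly the following (an
illustrative sketch, not the literal patch; the real barrier sits in
the generic copy_from_user()/usercopy paths):

    /* no-op stub so generic code can use the barrier unconditionally */
    #ifndef barrier_nospec
    # define barrier_nospec() do { } while (0)
    #endif

    static inline unsigned long
    copy_from_user(void *to, const void __user *from, unsigned long n)
    {
        if (!access_ok(from, n))
            return n;
        barrier_nospec();   /* kill speculation past a bad access_ok() */
        return raw_copy_from_user(to, from, n);
    }
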
Since the barrier is now usable in generic code, the x86 #ifdef in the
BPF code can also go away.
Reported-by: Jordy Zomer <jordyzomer@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Daniel Borkmann <daniel@iogearbox.net> # BPF bits
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f0950402e8c76e7dcb08563f1b4e8000fbc62455 ]
Most netlink attributes are parsed and validated from
__nla_validate_parse() or validate_nla()

    u16 type = nla_type(nla);

    if (type == 0 || type > maxtype) {
        /* error or continue */
    }

@type is then used as an array index and can be used
as a Spectre v1 gadget.
array_index_nospec() can be used to prevent leaking
content of kernel memory to malicious users.
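Concretely, the pattern applied in the netlink validation code looks
roughly like this (sketch; 'policy' and 'pt' stand in for the
per-attribute policy lookup):

    u16 type = nla_type(nla);

    if (type == 0 || type > maxtype)
        continue;   /* or report the error */

    /* clamp the index under speculation before using it */
    type = array_index_nospec(type, maxtype + 1);
    pt = &policy[type];
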
This should take care of the vast majority of netlink uses,
but an audit is needed to take care of others where
validation is not yet centralized in core netlink functions.
Fixes: bfa83a9e03 ("[NETLINK]: Type-safe netlink messages/attributes interface")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230119110150.2678537-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f5fe24ef17b5fbe6db49534163e77499fb10ae8c ]
On the x86-64 architecture even a failing cmpxchg grants exclusive
access to the cacheline, making it preferable to retry the failed op
immediately instead of stalling with the pause instruction.
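The general pattern this adopts is to retry the failed cmpxchg right
away, e.g. (a generic illustration, not the lockref code itself):

    /* 'v' is an atomic64_t; transform() is a placeholder */
    s64 old = atomic64_read(&v);
    s64 new;

    do {
        new = transform(old);   /* e.g. bump an embedded count */
        /* on failure try_cmpxchg updates 'old' and we retry
         * immediately -- no cpu_relax() in the loop */
    } while (!atomic64_try_cmpxchg(&v, &old, new));
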
To illustrate the impact, below are benchmark results obtained by
running various will-it-scale tests on top of the 6.2-rc3 kernel and
Cascade Lake (2 sockets * 24 cores * 2 threads) CPU.
All results in ops/s. Note there is some variance in re-runs, but the
code is consistently faster when contention is present.
open3 ("Same file open/close"):
proc stock no-pause
1 805603 814942 (+%1)
2 1054980 1054781 (-0%)
8 1544802 1822858 (+18%)
24 1191064 2199665 (+84%)
48 851582 1469860 (+72%)
96 609481 1427170 (+134%)
fstat2 ("Same file fstat"):
proc stock no-pause
1 3013872 3047636 (+1%)
2 4284687 4400421 (+2%)
8 3257721 5530156 (+69%)
24 2239819 5466127 (+144%)
48 1701072 5256609 (+209%)
96 1269157 6649326 (+423%)
Additionally, a kernel with a private patch to help access() scalability:
access2 ("Same file access"):
proc stock patched patched
+nopause
24 2378041 2005501 5370335 (-15% / +125%)
That is, fixing the problems in access() itself *reduces* scalability,
because the cacheline ping-pong then happens only in lockref's
pause-instruction loop.
Note that fstat and access benchmarks are not currently integrated into
will-it-scale, but interested parties can find them in pull requests to
said project.
Code at hand has a rather tortured history. First modification showed
up in commit d472d9d98b ("lockref: Relax in cmpxchg loop"), written
with Itanium in mind. Later it got patched up to use an arch-dependent
macro to stop doing it on s390 where it caused a significant regression.
Said macro had undergone revisions and was ultimately eliminated later,
going back to cpu_relax.
While I intended to only remove cpu_relax for x86-64, I got the
following comment from Linus:
    I would actually prefer just removing it entirely and see if
    somebody else hollers. You have the numbers to prove it hurts on
    real hardware, and I don't think we have any numbers to the
    contrary.

    So I think it's better to trust the numbers and remove it as a
    failure, than say "let's just remove it on x86-64 and leave
    everybody else with the potentially broken code"
Additionally, Will Deacon (maintainer of the arm64 port, one of the
architectures previously benchmarked):
    So, from the arm64 side of the fence, I'm perfectly happy just
    removing the cpu_relax() calls from lockref.
As such, come back full circle in history and whack it altogether.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/all/CAGudoHHx0Nqg6DE70zAVA75eV-HXfWyhVMWZ-aSeOofkA_=WdA@mail.gmail.com/
Acked-by: Tony Luck <tony.luck@intel.com> # ia64
Acked-by: Nicholas Piggin <npiggin@gmail.com> # powerpc
Acked-by: Will Deacon <will@kernel.org> # arm64
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bb90d4bc7b ]
Working through a conversion to kmap_local_page() instead of kmap()
revealed many places where the pattern kmap/memcpy/kunmap occurred.
Eric Biggers, Matthew Wilcox, Christoph Hellwig, Dan Williams, and Al
Viro all suggested putting this code into helper functions. Al Viro
further pointed out that these functions already existed in the iov_iter
code.[1]
Various locations for the lifted functions were considered.
Headers like mm.h or string.h seem ok but don't really portray the
functionality well. pagemap.h made some sense but is for page cache
functionality.[2]
Another alternative would be to create a new header for the promoted
memcpy functions, but it masks the fact that these are designed to copy
to/from pages using the kernel direct mappings and complicates matters
with a new header.
Placing these functions in 'highmem.h' is suboptimal especially with the
changes being proposed in the functionality of kmap. From a caller
perspective including/using 'highmem.h' implies that the functions
defined in that header are only required when highmem is in use which is
increasingly not the case with modern processors. However, highmem.h is
where all the current functions like this reside (zero_user(),
clear_highpage(), clear_user_highpage(), copy_user_highpage(), and
copy_highpage()). So it makes the most sense even though it is
distasteful for some.[3]
Lift memcpy_to_page() and memcpy_from_page() to highmem.h.
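For reference, the lifted helpers are the familiar pattern wrapped in
static inlines, roughly (sketch; the in-tree versions may differ in
details such as the mapping primitive used):

    static inline void memcpy_from_page(char *to, struct page *page,
                                        size_t offset, size_t len)
    {
        char *from = kmap_atomic(page);

        memcpy(to, from + offset, len);
        kunmap_atomic(from);
    }

    static inline void memcpy_to_page(struct page *page, size_t offset,
                                      const char *from, size_t len)
    {
        char *to = kmap_atomic(page);

        memcpy(to + offset, from, len);
        kunmap_atomic(to);
    }
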
[1] https://lore.kernel.org/lkml/20201013200149.GI3576660@ZenIV.linux.org.uk/
    https://lore.kernel.org/lkml/20201013112544.GA5249@infradead.org/
[2] https://lore.kernel.org/lkml/20201208122316.GH7338@casper.infradead.org/
[3] https://lore.kernel.org/lkml/20201013200149.GI3576660@ZenIV.linux.org.uk/#t
    https://lore.kernel.org/lkml/20201208163814.GN1563847@iweiny-DESK2.sc.intel.com/
Cc: Boris Pismenny <borisp@mellanox.com>
Cc: Or Gerlitz <gerlitz.or@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 956510c0c743 ("fs: ext4: initialize fsdata in pagecache_write()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fa69ee5aa4 ]
It turns out that usage of skb extensions can cause memory leaks. Ido
Schimmel reported: "[...] there are instances that blindly overwrite
'skb->extensions' by invoking skb_copy_header() after __alloc_skb()."
Therefore, give up on using skb extensions for KCOV handle, and instead
directly store kcov_handle in sk_buff.
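Roughly, that means a kcov_handle field in struct sk_buff (under
CONFIG_KCOV) plus trivial accessors, sketched here (the helper names
assume the existing skb_set/get_kcov_handle() API is kept):

    static inline void skb_set_kcov_handle(struct sk_buff *skb,
                                           const u64 kcov_handle)
    {
    #ifdef CONFIG_KCOV
        skb->kcov_handle = kcov_handle;
    #endif
    }

    static inline u64 skb_get_kcov_handle(const struct sk_buff *skb)
    {
    #ifdef CONFIG_KCOV
        return skb->kcov_handle;
    #else
        return 0;
    #endif
    }
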
Fixes: 6370cc3bbd ("net: add kcov handle to skb extensions")
Fixes: 85ce50d337 ("net: kcov: don't select SKB_EXTENSIONS when there is no NET")
Fixes: 97f53a08cb ("net: linux/skbuff.h: combine SKB_EXTENSIONS + KCOV handling")
Link: https://lore.kernel.org/linux-wireless/20201121160941.GA485907@shredder.lan/
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20201125224840.2014773-1-elver@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: db0b124f02ba ("igc: Enhance Qbv scheduling by using first flag bit")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit eabb7f1ace53e127309407b2b5e74e8199e85270 ]
1. The variable debug_objects_allocated tracks successful
kmem_cache_alloc() calls, so also update it in
debug_objects_replace_static_objects(). Do the same in
object_cpu_offline().
2. In debug_objects_mem_init(), there is no need to call
cpuhp_setup_state_nocalls() when debug_objects_enabled is 0 (i.e. the
pool allocation ran out of memory); see the sketch below.
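Point 2 amounts to an early bail-out, roughly (sketch; state constant
and callback names as in lib/debugobjects.c):

    /* in debug_objects_mem_init(), after the pool allocation: */
    if (!debug_objects_enabled)
        return;

    cpuhp_setup_state_nocalls(CPUHP_DEBUG_OBJ_DEAD, "object:offline",
                              NULL, object_cpu_offline);
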
Link: https://lkml.kernel.org/r/20220611130634.99741-1-wuchi.zero@gmail.com
Fixes: 634d61f45d ("debugobjects: Percpu pool lookahead freeing/allocation")
Fixes: c4b73aabd0 ("debugobjects: Track number of kmem_cache_alloc/kmem_cache_free done")
Signed-off-by: wuchi <wuchi.zero@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8fb0f47a9d ]
In an ideal world, when someone is passed an iov_iter and returns X bytes,
then X bytes would have been consumed/advanced from the iov_iter. But we
have use cases that always consume the entire iterator, a few examples
of that are iomap and bdev O_DIRECT. This means we cannot rely on the
state of the iov_iter once we've called ->read_iter() or ->write_iter().
This would be easier if we didn't always have to deal with truncate of
the iov_iter, as rewinding would be trivial without that. We recently
added a commit to track the truncate state, but that grew the iov_iter
by 8 bytes and wasn't the best solution.
Implement a helper to save enough of the iov_iter state to sanely restore
it after we've called the read/write iterator helpers. This currently
only works for IOVEC/BVEC/KVEC as that's all we need; support for other
iterator types is left as an exercise for the reader.
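Conceptually the saved state is just a handful of fields, e.g. (sketch
of the idea behind iov_iter_save_state()/iov_iter_restore()):

    struct iov_iter_state {
        size_t        iov_offset;
        size_t        count;
        unsigned long nr_segs;
    };

    static inline void iov_iter_save_state(struct iov_iter *iter,
                                           struct iov_iter_state *state)
    {
        state->iov_offset = iter->iov_offset;
        state->count = iter->count;
        state->nr_segs = iter->nr_segs;
    }

    /* iov_iter_restore() later rewinds the iterator to this point */
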
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8d192bec53 ]
PA-RISC uses a much bigger frame size for functions than other
architectures. So increase it to 2048 for 32- and 64-bit kernels.
This fixes e.g. a warning in lib/xxhash.c.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Helge Deller <deller@gmx.de>
Stable-dep-of: 152fe65f30 ("Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 55b70eed81 ]
parisc uses much bigger frames than other architectures, so increase the
stack frame check value to avoid compiler warnings.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Helge Deller <deller@gmx.de>
Stable-dep-of: 152fe65f30 ("Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled")
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit a4412fdd49 upstream.
The config to be able to inject error codes into any function annotated
with ALLOW_ERROR_INJECTION() is enabled when FUNCTION_ERROR_INJECTION is
enabled. But unfortunately, this is always enabled on x86 when KPROBES
is enabled, and there's no way to turn it off.
As kprobes is useful for observability of the kernel, it is useful to
have it enabled in production environments. But error injection should
be avoided. Add a prompt to the config to allow it to be disabled even
when kprobes is enabled, and get rid of the "def_bool y".
This is a kernel debug feature (it's in Kconfig.debug), and should have
never been something enabled by default.
Cc: stable@vger.kernel.org
Fixes: 540adea380 ("error-injection: Separate error-injection from kprobe")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e75ef56f74 ]
dyndbg's control parser, ddebug_parse_query(), requires that the search
terms module, func, file, and lineno are used only once in a query; a
thing cannot be named both foo and bar.
The cited commit added an overriding module modname, taken from the
module loader, which is authoritative. So it set query.module first,
which disallowed its use in the query-string.
But now it's useful to allow a module-load to enable classes across a
whole (or part of) a subsystem at once:
  # enable (dynamic-debug in) drm only
  modprobe drm dyndbg="class DRM_UT_CORE +p"

  # get drm_helper too
  modprobe drm dyndbg="class DRM_UT_CORE module drm* +p"

  # get everything that knows DRM_UT_CORE
  modprobe drm dyndbg="class DRM_UT_CORE module * +p"

  # also for boot-args:
  drm.dyndbg="class DRM_UT_CORE module * +p"

So convert the override into a default, by filling it only when/after
the query-string omitted the module.
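In code terms that is roughly (sketch): only fall back to the
loader-provided modname when the parsed query did not name a module
itself:

    /* after parsing the query words: */
    if (!query->module && modname)
        /* support $modname.dyndbg=<query> */
        query->module = modname;
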
NB: the query class FOO handling is forthcoming.
Fixes: 8e59b5cfb9 ("dynamic_debug: add modname arg to exec_query callchain")
Acked-by: Jason Baron <jbaron@akamai.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-8-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ee879be38b ]
In https://lore.kernel.org/lkml/20211209150910.GA23668@axis.com/
Vincent's patch commented on, and worked around, a bug in toggling
static_branches when a 2nd PRINTK-ish flag was added. The bug results
in a premature static_branch_disable() when the 1st of 2 flags is
disabled.
The cited commit computed newflags, but then in the JUMP_LABEL block,
failed to use that result, instead using just one of the terms in it.
Using newflags instead made the code work properly.
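I.e. the static-key toggling should key off newflags, roughly (sketch;
identifiers as in lib/dynamic_debug.c):

    #ifdef CONFIG_JUMP_LABEL
        if (dp->flags & _DPRINTK_FLAGS_PRINT) {
            if (!(newflags & _DPRINTK_FLAGS_PRINT))
                static_branch_disable(&dp->key.dd_key_true);
        } else if (newflags & _DPRINTK_FLAGS_PRINT) {
            static_branch_enable(&dp->key.dd_key_true);
        }
    #endif
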
This is Vincent's test-case, reduced. It needs the 2nd flag to
demonstrate the bug, but it's explanatory here.
pt_test() {
    echo 5 > /sys/module/dynamic_debug/verbose

    site="module tcp"                                  # just one callsite
    echo " $site =_ " > /proc/dynamic_debug/control    # clear it

    # A B ~A ~B
    for flg in +T +p "-T #broke here" -p; do
        echo " $site $flg " > /proc/dynamic_debug/control
    done;

    # A B ~B ~A
    for flg in +T +p "-p #broke here" -T; do
        echo " $site $flg " > /proc/dynamic_debug/control
    done
}
pt_test
Fixes: 84da83a6ff ("dyndbg: combine flags & mask into a struct, simplify with it")
CC: vincent.whitchurch@axis.com
Acked-by: Jason Baron <jbaron@akamai.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-2-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 62c07983be ]
Christophe Leroy reported a ~80ms latency spike
happening at first TCP connect() time.
This is because __inet_hash_connect() uses get_random_once()
to populate a perturbation table which became quite big
after commit 4c2c8f03a5 ("tcp: increase source port perturb table to 2^16")
get_random_once() uses DO_ONCE(), which blocks hard irqs for the
duration of the operation.
This patch adds DO_ONCE_SLOW() which uses a mutex instead of a spinlock
for operations where we prefer to stay in process context.
Then __inet_hash_connect() can use get_random_slow_once()
to populate its perturbation table.
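Usage-wise it mirrors get_random_once(); roughly (sketch; the slow
variant serializes on a mutex so the one-time init runs in process
context):

    #define get_random_slow_once(buf, nbytes) \
        DO_ONCE_SLOW(get_random_bytes, (buf), (nbytes))

    /* in __inet_hash_connect(): */
    get_random_slow_once(table_perturb,
                         INET_TABLE_PERTURB_SIZE * sizeof(*table_perturb));
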
Fixes: 4c2c8f03a5 ("tcp: increase source port perturb table to 2^16")
Fixes: 190cc82489 ("tcp: change source port randomizarion at connect() time")
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdRNQ5Tug@mail.gmail.com/T/#t
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 58efe9f696 ]
In the same spirit as commit c966533f8c ("lib/vdso: Mark do_hres()
and do_coarse() as __always_inline"), mark do_hres_timens() and
do_coarse_timens() __always_inline.
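The change itself is just the annotation, e.g. (sketch):

    /* before: static int do_hres_timens(const struct vdso_data *vdns, ...)
     * after (only the inline annotation changes, the body stays as-is): */
    static __always_inline int do_hres_timens(const struct vdso_data *vdns,
                                              clockid_t clk,
                                              struct __kernel_timespec *ts);
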
The measurement below is on a non-timens process, i.e. on the fastest path.
On powerpc32, without the patch:
clock-gettime-monotonic-raw: vdso: 1155 nsec/call
clock-gettime-monotonic-coarse: vdso: 813 nsec/call
clock-gettime-monotonic: vdso: 1076 nsec/call
With the patch:
clock-gettime-monotonic-raw: vdso: 1100 nsec/call
clock-gettime-monotonic-coarse: vdso: 667 nsec/call
clock-gettime-monotonic: vdso: 1025 nsec/call
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/90dcf45ebadfd5a07f24241551c62f619d1cb930.1617209142.git.christophe.leroy@csgroup.eu
Signed-off-by: Sasha Levin <sashal@kernel.org>