Commit Graph

90 Commits

Author SHA1 Message Date
Linus Torvalds
81ac1e8884 module: make waiting for a concurrent module loader interruptible
[ Upstream commit 2124d84db293ba164059077944e6b429ba530495 ]

The recursive aes-arm-bs module load situation reported by Russell King
is getting fixed in the crypto layer, but this in the meantime fixes the
"recursive load hangs forever" by just making the waiting for the first
module load be interruptible.

This should now match the old behavior before commit 9b9879fc03
("modules: catch concurrent module loads, treat them as idempotent"),
which used the different "wait for module to be ready" code in
module_patient_check_exists().

End result: a recursive module load will still block, but now a signal
will interrupt it and fail the second module load, at which point the
first module will successfully complete loading.

Fixes: 9b9879fc03 ("modules: catch concurrent module loads, treat them as idempotent")
Cc: Russell King <linux@armlinux.org.uk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-14 13:58:53 +02:00
Linus Torvalds
bdb3679cf3 module: warn about excessively long module waits
[ Upstream commit cb5b81bc9a448f8db817566f60f92e2ea788ea0f ]

Russell King reported that the arm cbc(aes) crypto module hangs when
loaded, and Herbert Xu bisected it to commit 9b9879fc03 ("modules:
catch concurrent module loads, treat them as idempotent"), and noted:

 "So what's happening here is that the first modprobe tries to load a
  fallback CBC implementation, in doing so it triggers a load of the
  exact same module due to module aliases.

  IOW we're loading aes-arm-bs which provides cbc(aes). However, this
  needs a fallback of cbc(aes) to operate, which is made out of the
  generic cbc module + any implementation of aes, or ecb(aes). The
  latter happens to also be provided by aes-arm-cb so that's why it
  tries to load the same module again"

So loading the aes-arm-bs module ends up wanting to recursively load
itself, and the recursive load then ends up waiting for the original
module load to complete.

This is a regression, in that it used to be that we just tried to load
the module multiple times, and then as we went on to install it the
second time we would instead just error out because the module name
already existed.

That is actually also exactly what the original "catch concurrent loads"
patch did in commit 9828ed3f69 ("module: error out early on concurrent
load of the same module file"), but it turns out that it ends up being
racy, in that erroring out before the module has been fully initialized
will cause failures in dependent module loading.

See commit ac2263b588 (which was the revert of that "error out early")
commit for details about why erroring out before the module has been
initialized is actually fundamentally racy.

Now, for the actual recursive module load (as opposed to just
concurrently loading the same module twice), the race is not an issue.

At the same time it's hard for the kernel to see that this is recursion,
because the module load is always done from a usermode helper, so the
recursion is not some simple callchain within the kernel.

End result: this is not the real fix, but this at least adds a warning
for the situation (admittedly much too late for all the debugging pain
that Russell and Herbert went through) and if we can come to a
resolution on how to detect the recursion properly, this re-organizes
the code to make that easier.

Link: https://lore.kernel.org/all/ZrFHLqvFqhzykuYw@shell.armlinux.org.uk/
Reported-by: Russell King <linux@armlinux.org.uk>
Debugged-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable-dep-of: 2124d84db293 ("module: make waiting for a concurrent module loader interruptible")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-14 13:58:52 +02:00
Changbin Du
8c61e3beb0 modules: wait do_free_init correctly
[ Upstream commit 8f8cd6c0a43ed637e620bbe45a8d0e0c2f4d5130 ]

The synchronization here is to ensure the ordering of freeing of a module
init so that it happens before W+X checking.  It is worth noting it is not
that the freeing was not happening, it is just that our sanity checkers
raced against the permission checkers which assume init memory is already
gone.

Commit 1a7b7d9220 ("modules: Use vmalloc special flag") moved calling
do_free_init() into a global workqueue instead of relying on it being
called through call_rcu(..., do_free_init), which used to allowed us call
do_free_init() asynchronously after the end of a subsequent grace period.
The move to a global workqueue broke the gaurantees for code which needed
to be sure the do_free_init() would complete with rcu_barrier().  To fix
this callers which used to rely on rcu_barrier() must now instead use
flush_work(&init_free_wq).

Without this fix, we still could encounter false positive reports in W+X
checking since the rcu_barrier() here can not ensure the ordering now.

Even worse, the rcu_barrier() can introduce significant delay.  Eric
Chanudet reported that the rcu_barrier introduces ~0.1s delay on a
PREEMPT_RT kernel.

  [    0.291444] Freeing unused kernel memory: 5568K
  [    0.402442] Run /sbin/init as init process

With this fix, the above delay can be eliminated.

Link: https://lkml.kernel.org/r/20240227023546.2490667-1-changbin.du@huawei.com
Fixes: 1a7b7d9220 ("modules: Use vmalloc special flag")
Signed-off-by: Changbin Du <changbin.du@huawei.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Xiaoyi Su <suxiaoyi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26 18:19:55 -04:00
James Morse
2abcc4b5a6 module: Expose module_init_layout_section()
module_init_layout_section() choses whether the core module loader
considers a section as init or not. This affects the placement of the
exit section when module unloading is disabled. This code will never run,
so it can be free()d once the module has been initialised.

arm and arm64 need to count the number of PLTs they need before applying
relocations based on the section name. The init PLTs are stored separately
so they can be free()d. arm and arm64 both use within_module_init() to
decide which list of PLTs to use when applying the relocation.

Because within_module_init()'s behaviour changes when module unloading
is disabled, both architecture would need to take this into account when
counting the PLTs.

Today neither architecture does this, meaning when module unloading is
disabled there are insufficient PLTs in the init section to load some
modules, resulting in warnings:
| WARNING: CPU: 2 PID: 51 at arch/arm64/kernel/module-plts.c:99 module_emit_plt_entry+0x184/0x1cc
| Modules linked in: crct10dif_common
| CPU: 2 PID: 51 Comm: modprobe Not tainted 6.5.0-rc4-yocto-standard-dirty #15208
| Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
| pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : module_emit_plt_entry+0x184/0x1cc
| lr : module_emit_plt_entry+0x94/0x1cc
| sp : ffffffc0803bba60
[...]
| Call trace:
|  module_emit_plt_entry+0x184/0x1cc
|  apply_relocate_add+0x2bc/0x8e4
|  load_module+0xe34/0x1bd4
|  init_module_from_file+0x84/0xc0
|  __arm64_sys_finit_module+0x1b8/0x27c
|  invoke_syscall.constprop.0+0x5c/0x104
|  do_el0_svc+0x58/0x160
|  el0_svc+0x38/0x110
|  el0t_64_sync_handler+0xc0/0xc4
|  el0t_64_sync+0x190/0x194

Instead of duplicating module_init_layout_section()s logic, expose it.

Reported-by: Adam Johnston <adam.johnston@arm.com>
Fixes: 055f23b74b ("module: check for exit sections in layout_sections() instead of module_init_section()")
Cc: stable@vger.kernel.org
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-03 13:42:02 -07:00
Christoph Hellwig
9011e49d54 modules: only allow symbol_get of EXPORT_SYMBOL_GPL modules
It has recently come to my attention that nvidia is circumventing the
protection added in 262e6ae708 ("modules: inherit
TAINT_PROPRIETARY_MODULE") by importing exports from their proprietary
modules into an allegedly GPL licensed module and then rexporting them.

Given that symbol_get was only ever intended for tightly cooperating
modules using very internal symbols it is logical to restrict it to
being used on EXPORT_SYMBOL_GPL and prevent nvidia from costly DMCA
Circumvention of Access Controls law suites.

All symbols except for four used through symbol_get were already exported
as EXPORT_SYMBOL_GPL, and the remaining four ones were switched over in
the preparation patches.

Fixes: 262e6ae708 ("modules: inherit TAINT_PROPRIETARY_MODULE")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-02 11:18:22 -07:00
Linus Torvalds
f196220715 module: fix init_module_from_file() error handling
Vegard Nossum pointed out two different problems with the error handling
in init_module_from_file():

 (a) the idempotent loading code didn't clean up properly in some error
     cases, leaving the on-stack 'struct idempotent' element still in
     the hash table

 (b) failure to read the module file would nonsensically update the
     'invalid_kread_bytes' stat counter with the error value

The first error is quite nasty, in that it can then cause subsequent
idempotent loads of that same file to access stale stack contents of the
previous failure.  The case may not happen in any normal situation
(explaining all the "Tested-by's on the original change), and requires
admin privileges, but syzkaller triggers random bad behavior as a
result:

    BUG: soft lockup in sys_finit_module
    BUG: unable to handle kernel paging request in init_module_from_file
    general protection fault in init_module_from_file
    INFO: task hung in init_module_from_file
    KASAN: out-of-bounds Read in init_module_from_file
    KASAN: slab-out-of-bounds Read in init_module_from_file
    ...

The second error is fairly benign and just leads to nonsensical stats
(and has been around since the debug stats were added).

Vegard also provided a patch for the idempotent loading issue, but I'd
rather re-organize the code and make it more legible using another level
of helper functions than add the usual "goto out" error handling.

Link: https://lore.kernel.org/lkml/20230704100852.23452-1-vegard.nossum@oracle.com/
Fixes: 9b9879fc03 ("modules: catch concurrent module loads, treat them as idempotent")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reported-by: syzbot+9c2bdc9d24e4a7abe741@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-04 10:17:11 -07:00
Linus Torvalds
4e3c09e954 Merge tag 'v6.5-rc1-modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
 "The changes queued up for modules are pretty tame, mostly code removal
  of moving of code.

  Only two minor functional changes are made, the only one which stands
  out is Sebastian Andrzej Siewior's simplification of module reference
  counting by removing preempt_disable() and that has been tested on
  linux-next for well over a month without no regressions.

  I'm now, I guess, also a kitchen sink for some kallsyms changes"

[ There was a mis-communication about the concurrent module load changes
  that I had expected to come through Luis despite me authoring the
  patch. So some of the module updates were left hanging in the email
  ether, and I just committed them separately.

  It's my bad - I should have made it more clear that I expected my
  own patches to come through the module tree too. Now they missed
  linux-next, but hopefully that won't cause any issues    - Linus ]

* tag 'v6.5-rc1-modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  kallsyms: make kallsyms_show_value() as generic function
  kallsyms: move kallsyms_show_value() out of kallsyms.c
  kallsyms: remove unsed API lookup_symbol_attrs
  kallsyms: remove unused arch_get_kallsym() helper
  module: Remove preempt_disable() from module reference counting.
2023-06-28 15:51:08 -07:00
Linus Torvalds
9b9879fc03 modules: catch concurrent module loads, treat them as idempotent
This is the new-and-improved attempt at avoiding huge memory load spikes
when the user space boot sequence tries to load hundreds (or even
thousands) of redundant duplicate modules in parallel.

See commit 9828ed3f69 ("module: error out early on concurrent load of
the same module file") for background and an earlier failed attempt that
was reverted.

That earlier attempt just said "concurrently loading the same module is
silly, just open the module file exclusively and return -ETXTBSY if
somebody else is already loading it".

While it is true that concurrent module loads of the same module is
silly, the reason that earlier attempt then failed was that the
concurrently loaded module would often be a prerequisite for another
module.

Thus failing to load the prerequisite would then cause cascading
failures of the other modules, rather than just short-circuiting that
one unnecessary module load.

At the same time, we still really don't want to load the contents of the
same module file hundreds of times, only to then wait for an eventually
successful load, and have everybody else return -EEXIST.

As a result, this takes another approach, and treats concurrent module
loads from the same file as "idempotent" in the inode.  So if one module
load is ongoing, we don't start a new one, but instead just wait for the
first one to complete and return the same return value as it did.

So unlike the first attempt, this does not return early: the intent is
not to speed up the boot, but to avoid a thundering herd problem in
allocating memory (both physical and virtual) for a module more than
once.

Also note that this does change behavior: it used to be that when you
had concurrent loads, you'd have one "winner" that would return success,
and everybody else would return -EEXIST.

In contrast, this idempotent logic goes all Oprah on the problem, and
says "You are a winner! And you are a winner! We are ALL winners".  But
since there's no possible actual real semantic difference between "you
loaded the module" and "somebody else already loaded the module", this
is more of a feel-good change than an actual honest-to-goodness semantic
change.

Of course, any true Johnny-come-latelies that don't get caught in the
concurrency filter will still return -EEXIST.  It's no different from
not even getting a seat at an Oprah taping.  That's life.

See the long thread on the kernel mailing list about this all, which
includes some numbers for memory use before and after the patch.

Link: https://lore.kernel.org/lkml/20230524213620.3509138-1-mcgrof@kernel.org/
Reviewed-by: Johan Hovold <johan@kernel.org>
Tested-by: Johan Hovold <johan@kernel.org>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Rudi Heitbaum <rudi@heitbaum..com>
Tested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-28 15:46:44 -07:00
Linus Torvalds
054a73009c module: split up 'finit_module()' into init_module_from_file() helper
This will simplify the next step, where we can then key off the inode to
do one idempotent module load.

Let's do the obvious re-organization in one step, and then the new code
in another.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-28 15:46:43 -07:00
Song Liu
db3e33dd8b module: fix module load for ia64
Frank reported boot regression in ia64 as:

ELILO v3.16 for EFI/IA-64
..
Uncompressing Linux... done
Loading file AC100221.initrd.img...done
[    0.000000] Linux version 6.4.0-rc3 (root@x4270) (ia64-linux-gcc
(GCC) 12.2.0, GNU ld (GNU Binutils) 2.39) #1 SMP Thu May 25 15:52:20
CEST 2023
[    0.000000] efi: EFI v1.1 by HP
[    0.000000] efi: SALsystab=0x3ee7a000 ACPI 2.0=0x3fe2a000
ESI=0x3ee7b000 SMBIOS=0x3ee7c000 HCDP=0x3fe28000
[    0.000000] PCDP: v3 at 0x3fe28000
[    0.000000] earlycon: uart8250 at MMIO 0x00000000f4050000 (options
'9600n8')
[    0.000000] printk: bootconsole [uart8250] enabled
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x000000003FE2A000 000028 (v02 HP    )
[    0.000000] ACPI: XSDT 0x000000003FE2A02C 0000CC (v01 HP     rx2620
00000000 HP   00000000)
[...]
[    3.793350] Run /init as init process
Loading, please wait...
Starting systemd-udevd version 252.6-1
[    3.951100] ------------[ cut here ]------------
[    3.951100] WARNING: CPU: 6 PID: 140 at kernel/module/main.c:1547
__layout_sections+0x370/0x3c0
[    3.949512] Unable to handle kernel paging request at virtual address
1000000000000000
[    3.951100] Modules linked in:
[    3.951100] CPU: 6 PID: 140 Comm: (udev-worker) Not tainted 6.4.0-rc3 #1
[    3.956161] (udev-worker)[142]: Oops 11003706212352 [1]
[    3.951774] Hardware name: hp server rx2620                   , BIOS
04.29
11/30/2007
[    3.951774]
[    3.951774] Call Trace:
[    3.958339] Unable to handle kernel paging request at virtual address
1000000000000000
[    3.956161] Modules linked in:
[    3.951774]  [<a0000001000156d0>] show_stack.part.0+0x30/0x60
[    3.951774]                                 sp=e000000183a67b20
bsp=e000000183a61628
[    3.956161]
[    3.956161]

which bisect to module_memory change [1].

Debug showed that ia64 uses some special sections:

__layout_sections: section .got (sh_flags 10000002) matched to MOD_INVALID
__layout_sections: section .sdata (sh_flags 10000003) matched to MOD_INVALID
__layout_sections: section .sbss (sh_flags 10000003) matched to MOD_INVALID

All these sections are loaded to module core memory before [1].

Fix ia64 boot by loading these sections to MOD_DATA (core rw data).

[1] commit ac3b432839 ("module: replace module_layout with module_memory")

Fixes: ac3b432839 ("module: replace module_layout with module_memory")
Reported-by: Frank Scheiner <frank.scheiner@web.de>
Closes: https://lists.debian.org/debian-ia64/2023/05/msg00010.html
Closes: https://marc.info/?l=linux-ia64&m=168509859125505
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Song Liu <song@kernel.org>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-30 09:29:40 -07:00
Sebastian Andrzej Siewior
cb0b50b813 module: Remove preempt_disable() from module reference counting.
The preempt_disable() section in module_put() was added in commit
   e1783a240f ("module: Use this_cpu_xx to dynamically allocate counters")

while the per-CPU counter were switched to another API. The API requires
that during the RMW operation the CPU remained the same.

This counting API was later replaced with atomic_t in commit
   2f35c41f58 ("module: Replace module_ref with atomic_t refcnt")

Since this atomic_t replacement there is no need to keep preemption
disabled while the reference counter is modified.

Remove preempt_disable() from module_put(), __module_get() and
try_module_get().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-23 22:23:18 -07:00
Luis Chamberlain
064f4536d1 module: avoid allocation if module is already present and ready
The finit_module() system call can create unnecessary virtual memory
pressure for duplicate modules. This is because load_module() can in
the worse case allocate more than twice the size of a module in virtual
memory. This saves at least a full size of the module in wasted vmalloc
space memory by trying to avoid duplicates as soon as we can validate
the module name in the read module structure.

This can only be an issue if a system is getting hammered with userspace
loading modules. There are two ways to load modules typically on systems,
one is the kernel moduile auto-loading (*request_module*() calls in-kernel)
and the other is things like udev. The auto-loading is in-kernel, but that
pings back to userspace to just call modprobe. We already have a way to
restrict the amount of concurrent kernel auto-loads in a given time, however
that still allows multiple requests for the same module to go through
and force two threads in userspace racing to call modprobe for the same
exact module. Even though libkmod which both modprobe and udev does check
if a module is already loaded prior calling finit_module() races are
still possible and this is clearly evident today when you have multiple
CPUs.

To avoid memory pressure for such stupid cases put a stop gap for them.
The *earliest* we can detect duplicates from the modules side of things
is once we have blessed the module name, sadly after the first vmalloc
allocation. We can check for the module being present *before* a secondary
vmalloc() allocation.

There is a linear relationship between wasted virtual memory bytes and
the number of CPU counts. The reason is that udev ends up racing to call
tons of the same modules for each of the CPUs.

We can see the different linear relationships between wasted virtual
memory and CPU count during after boot in the following graph:

         +----------------------------------------------------------------------------+
    14GB |-+          +            +            +           +           *+          +-|
         |                                                          ****              |
         |                                                       ***                  |
         |                                                     **                     |
    12GB |-+                                                 **                     +-|
         |                                                 **                         |
         |                                               **                           |
         |                                             **                             |
         |                                           **                               |
    10GB |-+                                       **                               +-|
         |                                       **                                   |
         |                                     **                                     |
         |                                   **                                       |
     8GB |-+                               **                                       +-|
waste    |                               **                             ###           |
         |                             **                           ####              |
         |                           **                      #######                  |
     6GB |-+                     ****                    ####                       +-|
         |                      *                    ####                             |
         |                     *                 ####                                 |
         |                *****              ####                                     |
     4GB |-+            **               ####                                       +-|
         |            **             ####                                             |
         |          **           ####                                                 |
         |        **         ####                                                     |
     2GB |-+    **      #####                                                       +-|
         |     *    ####                                                              |
         |    * ####                                                   Before ******* |
         |  **##      +            +            +           +           After ####### |
         +----------------------------------------------------------------------------+
         0            50          100          150         200          250          300
                                          CPUs count

On the y-axis we can see gigabytes of wasted virtual memory during boot
due to duplicate module requests which just end up failing. Trying to
infer the slope this ends up being about ~463 MiB per CPU lost prior
to this patch. After this patch we only loose about ~230 MiB per CPU, for
a total savings of about ~233 MiB per CPU. This is all *just on bootup*!

On a 8vcpu 8 GiB RAM system using kdevops and testing against selftests
kmod.sh -t 0008 I see a saving in the *highest* side of memory
consumption of up to ~ 84 MiB with the Linux kernel selftests kmod
test 0008. With the new stress-ng module test I see a 145 MiB difference
in max memory consumption with 100 ops. The stress-ng module ops tests can be
pretty pathalogical -- it is not realistic, however it was used to
finally successfully reproduce issues which are only reported to happen on
system with over 400 CPUs [0] by just usign 100 ops on a 8vcpu 8 GiB RAM
system. Running out of virtual memory space is no surprise given the
above graph, since at least on x86_64 we're capped at 128 MiB, eventually
we'd hit a series of errors and once can use the above graph to
guestimate when. This of course will vary depending on the features
you have enabled. So for instance, enabling KASAN seems to make this
much worse.

The results with kmod and stress-ng can be observed and visualized below.
The time it takes to run the test is also not affected.

The kmod tests 0008:

The gnuplot is set to a range from 400000 KiB (390 Mib) - 580000 (566 Mib)
given the tests peak around that range.

cat kmod.plot
set term dumb
set output fileout
set yrange [400000:580000]
plot filein with linespoints title "Memory usage (KiB)"

Before:
root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-before.txt ^C
root@kmod ~ # sort -n -r log-0008-before.txt | head -1
528732

So ~516.33 MiB

After:

root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-after.txt ^C

root@kmod ~ # sort -n -r log-0008-after.txt | head -1
442516

So ~432.14 MiB

That's about 84 ~MiB in savings in the worst case. The graphs:

root@kmod ~ # gnuplot -e "filein='log-0008-before.txt'; fileout='graph-0008-before.txt'" kmod.plot
root@kmod ~ # gnuplot -e "filein='log-0008-after.txt';  fileout='graph-0008-after.txt'"  kmod.plot

root@kmod ~ # cat graph-0008-before.txt

  580000 +-----------------------------------------------------------------+
         |       +        +       +       +       +        +       +       |
  560000 |-+                                    Memory usage (KiB) ***A***-|
         |                                                                 |
  540000 |-+                                                             +-|
         |                                                                 |
         |        *A     *AA*AA*A*AA          *A*AA    A*A*A *AA*A*AA*A  A |
  520000 |-+A*A*AA  *AA*A           *A*AA*A*AA     *A*A     A          *A+-|
         |*A                                                               |
  500000 |-+                                                             +-|
         |                                                                 |
  480000 |-+                                                             +-|
         |                                                                 |
  460000 |-+                                                             +-|
         |                                                                 |
         |                                                                 |
  440000 |-+                                                             +-|
         |                                                                 |
  420000 |-+                                                             +-|
         |       +        +       +       +       +        +       +       |
  400000 +-----------------------------------------------------------------+
         0       5        10      15      20      25       30      35      40

root@kmod ~ # cat graph-0008-after.txt

  580000 +-----------------------------------------------------------------+
         |       +        +       +       +       +        +       +       |
  560000 |-+                                    Memory usage (KiB) ***A***-|
         |                                                                 |
  540000 |-+                                                             +-|
         |                                                                 |
         |                                                                 |
  520000 |-+                                                             +-|
         |                                                                 |
  500000 |-+                                                             +-|
         |                                                                 |
  480000 |-+                                                             +-|
         |                                                                 |
  460000 |-+                                                             +-|
         |                                                                 |
         |          *A              *A*A                                   |
  440000 |-+A*A*AA*A  A       A*A*AA    A*A*AA*A*AA*A*AA*A*AA*AA*A*AA*A*AA-|
         |*A           *A*AA*A                                             |
  420000 |-+                                                             +-|
         |       +        +       +       +       +        +       +       |
  400000 +-----------------------------------------------------------------+
         0       5        10      15      20      25       30      35      40

The stress-ng module tests:

This is used to run the test to try to reproduce the vmap issues
reported by David:

  echo 0 > /proc/sys/vm/oom_dump_tasks
  ./stress-ng --module 100 --module-name xfs

Prior to this commit:
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > baseline-stress-ng.txt
root@kmod ~ # sort -n -r baseline-stress-ng.txt | head -1
5046456

After this commit:
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > after-stress-ng.txt
root@kmod ~ # sort -n -r after-stress-ng.txt | head -1
4896972

5046456 - 4896972
149484
149484/1024
145.98046875000000000000

So this commit using stress-ng reveals saving about 145 MiB in memory
using 100 ops from stress-ng which reproduced the vmap issue reported.

cat kmod.plot
set term dumb
set output fileout
set yrange [4700000:5070000]
plot filein with linespoints title "Memory usage (KiB)"

root@kmod ~ # gnuplot -e "filein='baseline-stress-ng.txt'; fileout='graph-stress-ng-before.txt'"  kmod-simple-stress-ng.plot
root@kmod ~ # gnuplot -e "filein='after-stress-ng.txt'; fileout='graph-stress-ng-after.txt'"  kmod-simple-stress-ng.plot

root@kmod ~ # cat graph-stress-ng-before.txt

           +---------------------------------------------------------------+
  5.05e+06 |-+     + A     +       +       +       +       +       +     +-|
           |         *                          Memory usage (KiB) ***A*** |
           |         *                             A                       |
     5e+06 |-+      **                            **                     +-|
           |        **                            * *    A                 |
  4.95e+06 |-+      * *                          A  *   A*               +-|
           |        * *      A       A           *  *  *  *             A  |
           |       *  *     * *     * *        *A   *  *  *      A      *  |
   4.9e+06 |-+     *  *     * A*A   * A*AA*A  A      *A    **A   **A*A  *+-|
           |       A  A*A  A    *  A       *  *      A     A *  A    * **  |
           |      *      **      **         * *              *  *    * * * |
  4.85e+06 |-+   A       A       A          **               *  *     ** *-|
           |     *                           *               * *      ** * |
           |     *                           A               * *      *  * |
   4.8e+06 |-+   *                                           * *      A  A-|
           |     *                                           * *           |
  4.75e+06 |-+  *                                            * *         +-|
           |    *                                            **            |
           |    *  +       +       +       +       +       + **    +       |
   4.7e+06 +---------------------------------------------------------------+
           0       5       10      15      20      25      30      35      40

root@kmod ~ # cat graph-stress-ng-after.txt

           +---------------------------------------------------------------+
  5.05e+06 |-+     +       +       +       +       +       +       +     +-|
           |                                    Memory usage (KiB) ***A*** |
           |                                                               |
     5e+06 |-+                                                           +-|
           |                                                               |
  4.95e+06 |-+                                                           +-|
           |                                                               |
           |                                                               |
   4.9e+06 |-+                                      *AA                  +-|
           |  A*AA*A*A  A  A*AA*AA*A*AA*A  A  A  A*A   *AA*A*A  A  A*AA*AA |
           |  *      * **  *            *  *  ** *            ***  *       |
  4.85e+06 |-+*       ***  *            * * * ***             A *  *     +-|
           |  *       A *  *             ** * * A               *  *       |
           |  *         *  *             *  **                  *  *       |
   4.8e+06 |-+*         *  *             A   *                  *  *     +-|
           | *          * *                  A                  * *        |
  4.75e+06 |-*          * *                                     * *      +-|
           | *          * *                                     * *        |
           | *     +    * *+       +       +       +       +    * *+       |
   4.7e+06 +---------------------------------------------------------------+
           0       5       10      15      20      25      30      35      40

[0] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-18 11:15:24 -07:00
Luis Chamberlain
df3e764d8e module: add debug stats to help identify memory pressure
Loading modules with finit_module() can end up using vmalloc(), vmap()
and vmalloc() again, for a total of up to 3 separate allocations in the
worst case for a single module. We always kernel_read*() the module,
that's a vmalloc(). Then vmap() is used for the module decompression,
and if so the last read buffer is freed as we use the now decompressed
module buffer to stuff data into our copy module. The last allocation is
specific to each architectures but pretty much that's generally a series
of vmalloc() calls or a variation of vmalloc to handle ELF sections with
special permissions.

Evaluation with new stress-ng module support [1] with just 100 ops
is proving that you can end up using GiBs of data easily even with all
care we have in the kernel and userspace today in trying to not load modules
which are already loaded. 100 ops seems to resemble the sort of pressure a
system with about 400 CPUs can create on module loading. Although issues
relating to duplicate module requests due to each CPU inucurring a new
module reuest is silly and some of these are being fixed, we currently lack
proper tooling to help diagnose easily what happened, when it happened
and who likely is to blame -- userspace or kernel module autoloading.

Provide an initial set of stats which use debugfs to let us easily scrape
post-boot information about failed loads. This sort of information can
be used on production worklaods to try to optimize *avoiding* redundant
memory pressure using finit_module().

There's a few examples that can be provided:

A 255 vCPU system without the next patch in this series applied:

Startup finished in 19.143s (kernel) + 7.078s (userspace) = 26.221s
graphical.target reached after 6.988s in userspace

And 13.58 GiB of virtual memory space lost due to failed module loading:

root@big ~ # cat /sys/kernel/debug/modules/stats
         Mods ever loaded       67
     Mods failed on kread       0
Mods failed on decompress       0
  Mods failed on becoming       0
      Mods failed on load       1411
        Total module size       11464704
      Total mod text size       4194304
       Failed kread bytes       0
  Failed decompress bytes       0
    Failed becoming bytes       0
        Failed kmod bytes       14588526272
 Virtual mem wasted bytes       14588526272
         Average mod size       171115
    Average mod text size       62602
  Average fail load bytes       10339140
Duplicate failed modules:
              module-name        How-many-times                    Reason
                kvm_intel                   249                      Load
                      kvm                   249                      Load
                irqbypass                     8                      Load
         crct10dif_pclmul                   128                      Load
      ghash_clmulni_intel                    27                      Load
             sha512_ssse3                    50                      Load
           sha512_generic                   200                      Load
              aesni_intel                   249                      Load
              crypto_simd                    41                      Load
                   cryptd                   131                      Load
                    evdev                     2                      Load
                serio_raw                     1                      Load
               virtio_pci                     3                      Load
                     nvme                     3                      Load
                nvme_core                     3                      Load
    virtio_pci_legacy_dev                     3                      Load
    virtio_pci_modern_dev                     3                      Load
                   t10_pi                     3                      Load
                   virtio                     3                      Load
             crc32_pclmul                     6                      Load
           crc64_rocksoft                     3                      Load
             crc32c_intel                    40                      Load
              virtio_ring                     3                      Load
                    crc64                     3                      Load

The following screen shot, of a simple 8vcpu 8 GiB KVM guest with the
next patch in this series applied, shows 226.53 MiB are wasted in virtual
memory allocations which due to duplicate module requests during boot.
It also shows an average module memory size of 167.10 KiB and an an
average module .text + .init.text size of 61.13 KiB. The end shows all
modules which were detected as duplicate requests and whether or not
they failed early after just the first kernel_read*() call or late after
we've already allocated the private space for the module in
layout_and_allocate(). A system with module decompression would reveal
more wasted virtual memory space.

We should put effort now into identifying the source of these duplicate
module requests and trimming these down as much possible. Larger systems
will obviously show much more wasted virtual memory allocations.

root@kmod ~ # cat /sys/kernel/debug/modules/stats
         Mods ever loaded       67
     Mods failed on kread       0
Mods failed on decompress       0
  Mods failed on becoming       83
      Mods failed on load       16
        Total module size       11464704
      Total mod text size       4194304
       Failed kread bytes       0
  Failed decompress bytes       0
    Failed becoming bytes       228959096
        Failed kmod bytes       8578080
 Virtual mem wasted bytes       237537176
         Average mod size       171115
    Average mod text size       62602
  Avg fail becoming bytes       2758544
  Average fail load bytes       536130
Duplicate failed modules:
              module-name        How-many-times                    Reason
                kvm_intel                     7                  Becoming
                      kvm                     7                  Becoming
                irqbypass                     6           Becoming & Load
         crct10dif_pclmul                     7           Becoming & Load
      ghash_clmulni_intel                     7           Becoming & Load
             sha512_ssse3                     6           Becoming & Load
           sha512_generic                     7           Becoming & Load
              aesni_intel                     7                  Becoming
              crypto_simd                     7           Becoming & Load
                   cryptd                     3           Becoming & Load
                    evdev                     1                  Becoming
                serio_raw                     1                  Becoming
                     nvme                     3                  Becoming
                nvme_core                     3                  Becoming
                   t10_pi                     3                  Becoming
               virtio_pci                     3                  Becoming
             crc32_pclmul                     6           Becoming & Load
           crc64_rocksoft                     3                  Becoming
             crc32c_intel                     3                  Becoming
    virtio_pci_modern_dev                     2                  Becoming
    virtio_pci_legacy_dev                     1                  Becoming
                    crc64                     2                  Becoming
                   virtio                     2                  Becoming
              virtio_ring                     2                  Becoming

[0] https://github.com/ColinIanKing/stress-ng.git
[1] echo 0 > /proc/sys/vm/oom_dump_tasks
    ./stress-ng --module 100 --module-name xfs

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-18 11:15:24 -07:00
Luis Chamberlain
f71afa6a42 module: extract patient module check into helper
The patient module check inside add_unformed_module() is large
enough as we need it. It is a bit hard to read too, so just
move it to a helper and do the inverse checks first to help
shift the code and make it easier to read. The new helper then
is module_patient_check_exists().

To make this work we need to mvoe the finished_loading() up,
we do that without making any functional changes to that routine.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-18 11:15:24 -07:00
Luis Chamberlain
430bb0d1c3 module: fix kmemleak annotations for non init ELF sections
Commit ac3b432839 ("module: replace module_layout with module_memory")
reworked the way to handle memory allocations to make it clearer. But it
lost in translation how we handled kmemleak_ignore() or kmemleak_not_leak()
for different ELF sections.

Fix this and clarify the comments a bit more. Contrary to the old way
of using kmemleak_ignore() for init.* ELF sections we stick now only to
kmemleak_not_leak() as per suggestion by Catalin Marinas so to avoid
any false positives and simplify the code.

Fixes: ac3b432839 ("module: replace module_layout with module_memory")
Reported-by: Jim Cromie <jim.cromie@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-14 09:36:22 -07:00
Jim Cromie
33c951f629 module: already_uses() - reduce pr_debug output volume
already_uses() is unnecessarily chatty.

`modprobe i915` yields 491 messages like:

  [   64.108744] i915 uses drm!

This is a normal situation, and isn't worth all the log entries.

NOTE: I've preserved the "does not use %s" messages, which happens
less often, but does happen.  Its not clear to me what it tells a
reader, or what info might improve the pr_debug's utility.

[ 6847.584999] main:already_uses:569: amdgpu does not use ttm!
[ 6847.585001] main:add_module_usage:584: Allocating new usage for amdgpu.
[ 6847.585014] main:already_uses:569: amdgpu does not use drm!
[ 6847.585016] main:add_module_usage:584: Allocating new usage for amdgpu.
[ 6847.585024] main:already_uses:569: amdgpu does not use drm_display_helper!
[ 6847.585025] main:add_module_usage:584: Allocating new usage for amdgpu.
[ 6847.585084] main:already_uses:569: amdgpu does not use drm_kms_helper!
[ 6847.585086] main:add_module_usage:584: Allocating new usage for amdgpu.
[ 6847.585175] main:already_uses:569: amdgpu does not use drm_buddy!
[ 6847.585176] main:add_module_usage:584: Allocating new usage for amdgpu.
[ 6847.585202] main:already_uses:569: amdgpu does not use i2c_algo_bit!
[ 6847.585204] main:add_module_usage:584: Allocating new usage for amdgpu.
[ 6847.585249] main:already_uses:569: amdgpu does not use gpu_sched!
[ 6847.585250] main:add_module_usage:584: Allocating new usage for amdgpu.
[ 6847.585314] main:already_uses:569: amdgpu does not use video!
[ 6847.585315] main:add_module_usage:584: Allocating new usage for amdgpu.
[ 6847.585409] main:already_uses:569: amdgpu does not use iommu_v2!
[ 6847.585410] main:add_module_usage:584: Allocating new usage for amdgpu.
[ 6847.585816] main:already_uses:569: amdgpu does not use drm_ttm_helper!
[ 6847.585818] main:add_module_usage:584: Allocating new usage for amdgpu.
[ 6848.762268] dyndbg: add-module: amdgpu.2533 sites

no functional changes.

Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-24 11:33:09 -07:00
Jim Cromie
66a2301edf module: add section-size to move_module pr_debug
move_module() pr_debug's "Final section addresses for $modname".
Add section addresses to the message, for anyone looking at these.

no functional changes.

Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-24 11:33:09 -07:00
Jim Cromie
b10addf37b module: add symbol-name to pr_debug Absolute symbol
The pr_debug("Absolute symbol" ..) reports value, (which is usually
0), but not the name, which is more informative.  So add it.

no functional changes

Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-24 11:33:09 -07:00
Jim Cromie
6ed81802d4 module: in layout_sections, move_module: add the modname
layout_sections() and move_module() each issue ~50 messages for each
module loaded.  Add mod-name into their 2 header lines, to help the
reader find his module.

no functional changes.

Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-24 11:33:09 -07:00
Luis Chamberlain
3d40bb903e module: merge remnants of setup_load_info() to elf validation
The setup_load_info() was actually had ELF validation checks of its
own. To later cache useful variables as an secondary step just means
looping again over the ELF sections we just validated. We can simply
keep tabs of the key sections of interest as we validate the module
ELF section in one swoop, so do that and merge the two routines
together.

Expand a bit on the documentation / intent / goals.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-24 11:33:08 -07:00
Luis Chamberlain
1bb49db991 module: move more elf validity checks to elf_validity_check()
The symbol and strings section validation currently happen in
setup_load_info() but since they are also doing validity checks
move this to elf_validity_check().

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-24 11:33:08 -07:00
Luis Chamberlain
c7ee8aebf6 module: add stop-grap sanity check on module memcpy()
The integrity of the struct module we load is important, and although
our ELF validator already checks that the module section must match
struct module, add a stop-gap check before we memcpy() the final minted
module. This also makes those inspecting the code what the goal is.

While at it, clarify the goal behind updating the sh_addr address.
The current comment is pretty misleading.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-24 11:33:08 -07:00
Luis Chamberlain
46752820f9 module: add sanity check for ELF module section
The ELF ".gnu.linkonce.this_module" section is special, it is what we
use to construct the struct module __this_module, which THIS_MODULE
points to. When userspace loads a module we always deal first with a
copy of the userspace buffer, and twiddle with the userspace copy's
version of the struct module. Eventually we allocate memory to do a
memcpy() of that struct module, under the assumption that the module
size is right. But we have no validity checks against the size or
the requirements for the section.

Add some validity checks for the special module section early and while
at it, cache the module section index early, so we don't have to do that
later.

While at it, just move over the assigment of the info->mod to make the
code clearer. The validity checker also adds an explicit size check to
ensure the module section size matches the kernel's run time size for
sizeof(struct module). This should prevent sloppy loads of modules
which are built today *without* actually increasing the size of
the struct module. A developer today can for example expand the size
of struct module, rebuild a directoroy 'make fs/xfs/' for example and
then try to insmode the driver there. That module would in effect have
an incorrect size. This new size check would put a stop gap against such
mistakes.

This also makes the entire goal of ".gnu.linkonce.this_module" pretty
clear. Before this patch verification of the goal / intent required some
Indian Jones whips, torches and cleaning up big old spider webs.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-24 11:33:08 -07:00
Luis Chamberlain
419e1a20f7 module: rename check_module_license_and_versions() to check_export_symbol_versions()
This makes the routine easier to understand what the check its checking for.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-24 11:33:08 -07:00
Luis Chamberlain
72f08b3cc6 module: converge taint work together
Converge on a compromise: so long as we have a module hit our linked
list of modules we taint. That is, the module was about to become live.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-03-24 11:33:08 -07:00