104 Commits

Author SHA1 Message Date
Thomas Gleixner
ff8d523cc4 debugobjects: Track object usage to avoid premature freeing of objects
The freelist is freed at a constant rate independent of the actual usage
requirements. That's bad in scenarios where usage comes in bursts. The end
of a burst puts the objects on the free list and freeing proceeds even when
the next burst which requires objects started again.

Keep track of the usage with a exponentially wheighted moving average and
take that into account in the worker function which frees objects from the
free list.

This further reduces the kmem_cache allocation/free rate for a full kernel
compile:

   	    kmem_cache_alloc()	kmem_cache_free()
Baseline:   225k		173k
Usage:	    170k		117k

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/87bjznhme2.ffs@tglx
2024-10-15 17:30:33 +02:00
Thomas Gleixner
13f9ca7239 debugobjects: Refill per CPU pool more agressively
Right now the per CPU pools are only refilled when they become
empty. That's suboptimal especially when there are still non-freed objects
in the to free list.

Check whether an allocation from the per CPU pool emptied a batch and try
to allocate from the free pool if that still has objects available.

   	    kmem_cache_alloc()	kmem_cache_free()
Baseline:   295k		245k
Refill:	    225k		173k

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164914.439053085@linutronix.de
2024-10-15 17:30:33 +02:00
Thomas Gleixner
a201a96b96 debugobjects: Double the per CPU slots
In situations where objects are rapidly allocated from the pool and handed
back, the size of the per CPU pool turns out to be too small.

Double the size of the per CPU pool.

This reduces the kmem cache allocation and free operations during a kernel compile:

     	     alloc    	    free
Baseline:    380k           330k
Double size: 295k	    245k

Especially the reduction of allocations is important because that happens
in the hot path when objects are initialized.

The maximum increase in per CPU pool memory consumption is about 2.5K per
online CPU, which is acceptable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164914.378676302@linutronix.de
2024-10-15 17:30:33 +02:00
Thomas Gleixner
2638345d22 debugobjects: Move pool statistics into global_pool struct
Keep it along with the pool as that's a hot cache line anyway and it makes
the code more comprehensible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164914.318776207@linutronix.de
2024-10-15 17:30:33 +02:00
Thomas Gleixner
f57ebb92ba debugobjects: Implement batch processing
Adding and removing single objects in a loop is bad in terms of lock
contention and cache line accesses.

To implement batching, record the last object in a batch in the object
itself. This is trivialy possible as hlists are strictly stacks. At a batch
boundary, when the first object is added to the list the object stores a
pointer to itself in debug_obj::batch_last. When the next object is added
to the list then the batch_last pointer is retrieved from the first object
in the list and stored in the to be added one.

That means for batch processing the first object always has a pointer to
the last object in a batch, which allows to move batches in a cache line
efficient way and reduces the lock held time.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164914.258995000@linutronix.de
2024-10-15 17:30:33 +02:00
Thomas Gleixner
aebbfe0779 debugobjects: Prepare kmem_cache allocations for batching
Allocate a batch and then push it into the pool. Utilize the
debug_obj::last_node pointer for keeping track of the batch boundary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241007164914.198647184@linutronix.de
2024-10-15 17:30:32 +02:00
Thomas Gleixner
74fe1ad413 debugobjects: Prepare for batching
Move the debug_obj::object pointer into a union and add a pointer to the
last node in a batch. That allows to implement batch processing efficiently
by utilizing the stack property of hlist:

When the first object of a batch is added to the list, then the batch
pointer is set to the hlist node of the object itself. Any subsequent add
retrieves the pointer to the last node from the first object in the list
and uses that for storing the last node pointer in the newly added object.

Add the pointer to the data structure and ensure that all relevant pool
sizes are strictly batch sized. The actual batching implementation follows
in subsequent changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164914.139204961@linutronix.de
2024-10-15 17:30:32 +02:00
Thomas Gleixner
14077b9e58 debugobjects: Use static key for boot pool selection
Get rid of the conditional in the hot path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164914.077247071@linutronix.de
2024-10-15 17:30:32 +02:00
Thomas Gleixner
9ce99c6d7b debugobjects: Rework free_object_work()
Convert it to batch processing with intermediate helper functions. This
reduces the final changes for batch processing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164914.015906394@linutronix.de
2024-10-15 17:30:32 +02:00
Thomas Gleixner
a3b9e191f5 debugobjects: Rework object freeing
__free_object() is uncomprehensibly complex. The same can be achieved by:

   1) Adding the object to the per CPU pool

   2) If that pool is full, move a batch of objects into the global pool
      or if the global pool is full into the to free pool

This also prepares for batch processing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.955542307@linutronix.de
2024-10-15 17:30:32 +02:00
Thomas Gleixner
fb60c004f3 debugobjects: Rework object allocation
The current allocation scheme tries to allocate from the per CPU pool
first. If that fails it allocates one object from the global pool and then
refills the per CPU pool from the global pool.

That is in the way of switching the pool management to batch mode as the
global pool needs to be a strict stack of batches, which does not allow
to allocate single objects.

Rework the code to refill the per CPU pool first and then allocate the
object from the refilled batch. Also try to allocate from the to free pool
first to avoid freeing and reallocating objects.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.893554162@linutronix.de
2024-10-15 17:30:32 +02:00
Thomas Gleixner
96a9a0421c debugobjects: Move min/max count into pool struct
Having the accounting in the datastructure is better in terms of cache
lines and allows more optimizations later on.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.831908427@linutronix.de
2024-10-15 17:30:32 +02:00
Thomas Gleixner
18b8afcb37 debugobjects: Rename and tidy up per CPU pools
No point in having a separate data structure. Reuse struct obj_pool and
tidy up the code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.770595795@linutronix.de
2024-10-15 17:30:31 +02:00
Thomas Gleixner
cb58d19084 debugobjects: Use separate list head for boot pool
There is no point to handle the statically allocated objects during early
boot in the actual pool list. This phase does not require accounting, so
all of the related complexity can be avoided.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.708939081@linutronix.de
2024-10-15 17:30:31 +02:00
Thomas Gleixner
e18328ff70 debugobjects: Move pools into a datastructure
The contention on the global pool lock can be reduced by strict batch
processing where batches of objects are moved from one list head to another
instead of moving them object by object. This also reduces the cache
footprint because it avoids the list walk and dirties at maximum three
cache lines instead of potentially up to eighteen.

To prepare for that, move the hlist head and related counters into a
struct.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.646171170@linutronix.de
2024-10-15 17:30:31 +02:00
Zhen Lei
d8c6cd3a5c debugobjects: Reduce parallel pool fill attempts
The contention on the global pool_lock can be massive when the global pool
needs to be refilled and many CPUs try to handle this.

Address this by:

  - splitting the refill from free list and allocation.

    Refill from free list has no constraints vs. the context on RT, so
    it can be tried outside of the RT specific preemptible() guard

  - Let only one CPU handle the free list

  - Let only one CPU do allocations unless the pool level is below
    half of the minimum fill level.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240911083521.2257-4-thunder.leizhen@huawei.com-
Link: https://lore.kernel.org/all/20241007164913.582118421@linutronix.de

--
 lib/debugobjects.c |   84 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 59 insertions(+), 25 deletions(-)
2024-10-15 17:30:31 +02:00
Thomas Gleixner
661cc28b52 debugobjects: Make debug_objects_enabled bool
Make it what it is.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.518175013@linutronix.de
2024-10-15 17:30:31 +02:00
Thomas Gleixner
49a5cb827d debugobjects: Provide and use free_object_list()
Move the loop to free a list of objects into a helper function so it can be
reused later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241007164913.453912357@linutronix.de
2024-10-15 17:30:31 +02:00
Thomas Gleixner
241463f4fd debugobjects: Remove pointless debug printk
It has zero value.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.390511021@linutronix.de
2024-10-15 17:30:31 +02:00
Thomas Gleixner
49968cf181 debugobjects: Reuse put_objects() on OOM
Reuse the helper function instead of having a open coded copy.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.326834268@linutronix.de
2024-10-15 17:30:31 +02:00
Thomas Gleixner
a2a702383e debugobjects: Dont free objects directly on CPU hotplug
Freeing the per CPU pool of the unplugged CPU directly is suboptimal as the
objects can be reused in the real pool if there is room. Aside of that this
gets the accounting wrong.

Use the regular free path, which allows reuse and has the accounting correct.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.263960570@linutronix.de
2024-10-15 17:30:30 +02:00
Thomas Gleixner
3f397bf955 debugobjects: Remove pointless hlist initialization
It's BSS zero initialized.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/all/20241007164913.200379308@linutronix.de
2024-10-15 17:30:30 +02:00
Thomas Gleixner
55fb412ef7 debugobjects: Dont destroy kmem cache in init()
debug_objects_mem_init() is invoked from mm_core_init() before work queues
are available. If debug_objects_mem_init() destroys the kmem cache in the
error path it causes an Oops in __queue_work():

 Oops: Oops: 0000 [#1] PREEMPT SMP PTI
 RIP: 0010:__queue_work+0x35/0x6a0
  queue_work_on+0x66/0x70
  flush_all_cpus_locked+0xdf/0x1a0
  __kmem_cache_shutdown+0x2f/0x340
  kmem_cache_destroy+0x4e/0x150
  mm_core_init+0x9e/0x120
  start_kernel+0x298/0x800
  x86_64_start_reservations+0x18/0x30
  x86_64_start_kernel+0xc5/0xe0
  common_startup_64+0x12c/0x138

Further the object cache pointer is used in various places to check for
early boot operation. It is exposed before the replacments for the static
boot time objects are allocated and the self test operates on it.

This can be avoided by:

     1) Running the self test with the static boot objects

     2) Exposing it only after the replacement objects have been added to
     	the pool.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241007164913.137021337@linutronix.de
2024-10-15 17:30:30 +02:00
Zhen Lei
813fd07858 debugobjects: Collect newly allocated objects in a list to reduce lock contention
Collect the newly allocated debug objects in a list outside the lock, so
that the lock held time and the potential lock contention is reduced.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240911083521.2257-3-thunder.leizhen@huawei.com
Link: https://lore.kernel.org/all/20241007164913.073653668@linutronix.de
2024-10-15 17:30:30 +02:00
Zhen Lei
a0ae950408 debugobjects: Delete a piece of redundant code
The statically allocated objects are all located in obj_static_pool[],
the whole memory of obj_static_pool[] will be reclaimed later. Therefore,
there is no need to split the remaining statically nodes in list obj_pool
into isolated ones, no one will use them anymore. Just write
INIT_HLIST_HEAD(&obj_pool) is enough. Since hlist_move_list() directly
discards the old list, even this can be omitted.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240911083521.2257-2-thunder.leizhen@huawei.com
Link: https://lore.kernel.org/all/20241007164913.009849239@linutronix.de
2024-10-15 17:30:30 +02:00