linux

mirror of https://github.com/Dasharo/linux.git synced 2026-03-06 15:25:10 -08:00

Author	SHA1	Message	Date
Carlos Llamas	95bc2d4a90	binder: use per-vma lock in page reclaiming Use per-vma locking in the shrinker's callback when reclaiming pages, similar to the page installation logic. This minimizes contention with unrelated vmas improving performance. The mmap_sem is still acquired if the per-vma lock cannot be obtained. Cc: Suren Baghdasaryan <surenb@google.com> Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20241210143114.661252-10-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-24 09:35:23 +01:00
Carlos Llamas	978ce3ed70	binder: propagate vm_insert_page() errors Instead of always overriding errors with -ENOMEM, propagate the specific error code returned by vm_insert_page(). This allows for more accurate error logs and handling. Cc: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20241210143114.661252-9-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-24 09:35:23 +01:00
Carlos Llamas	9e2aa76549	binder: use per-vma lock in page installation Use per-vma locking for concurrent page installations, this minimizes contention with unrelated vmas improving performance. The mmap_lock is still acquired when needed though, e.g. before get_user_pages_remote(). Many thanks to Barry Song who posted a similar approach [1]. Link: https://lore.kernel.org/all/20240902225009.34576-1-21cnbao@gmail.com/ [1] Cc: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20241210143114.661252-8-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-24 09:35:23 +01:00
Carlos Llamas	0a7bf6866d	binder: rename alloc->buffer to vm_start The alloc->buffer field in struct binder_alloc stores the starting address of the mapped vma, rename this field to alloc->vm_start to better reflect its purpose. It also avoids confusion with the binder buffer concept, e.g. transaction->buffer. No functional changes in this patch. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20241210143114.661252-7-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-24 09:35:23 +01:00
Carlos Llamas	072010abc3	binder: replace alloc->vma with alloc->mapped It is unsafe to use alloc->vma outside of the mmap_sem. Instead, add a new boolean alloc->mapped to save the vma state (mapped or unmmaped) and use this as a replacement for alloc->vma to validate several paths. Using the alloc->vma caused several performance and security issues in the past. Now that it has been replaced with either vm_lookup() or the alloc->mapped state, we can finally remove it. Cc: Minchan Kim <minchan@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20241210143114.661252-6-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-24 09:35:23 +01:00
Carlos Llamas	f909f03082	binder: store shrinker metadata under page->private Instead of pre-allocating an entire array of struct binder_lru_page in alloc->pages, install the shrinker metadata under page->private. This ensures the memory is allocated and released as needed alongside pages. By converting the alloc->pages[] into an array of struct page pointers, we can access these pages directly and only reference the shrinker metadata where it's being used (e.g. inside the shrinker's callback). Rename struct binder_lru_page to struct binder_shrinker_mdata to better reflect its purpose. Add convenience functions that wrap the allocation and freeing of pages along with their shrinker metadata. Note I've reworked this patch to avoid using page->lru and page->index directly, as Matthew pointed out that these are being removed [1]. Link: https://lore.kernel.org/all/ZzziucEm3np6e7a0@casper.infradead.org/ [1] Cc: Matthew Wilcox <willy@infradead.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20241210143114.661252-5-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-24 09:35:23 +01:00
Carlos Llamas	49d2562c80	binder: select correct nid for pages in LRU The numa node id for binder pages is currently being derived from the lru entry under struct binder_lru_page. However, this object doesn't reflect the node id of the struct page items allocated separately. Instead, select the correct node id from the page itself. This was made possible since commit `0a97c01cd2` ("list_lru: allow explicit memcg and NUMA node selection"). Cc: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20241210143114.661252-4-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-24 09:35:23 +01:00
Carlos Llamas	d1716b4b78	binder: concurrent page installation Allow multiple callers to install pages simultaneously by switching the mmap_sem from write-mode to read-mode. Races to the same PTE are handled using get_user_pages_remote() to retrieve the already installed page. This method significantly reduces contention in the mmap semaphore. To ensure safety, vma_lookup() is used (instead of alloc->vma) to avoid operating on an isolated VMA. In addition, zap_page_range_single() is called under the alloc->mutex to avoid racing with the shrinker. Many thanks to Barry Song who posted a similar approach [1]. Link: https://lore.kernel.org/all/20240902225009.34576-1-21cnbao@gmail.com/ [1] Cc: David Hildenbrand <david@redhat.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20241210143114.661252-3-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-24 09:35:23 +01:00
Carlos Llamas	8b52c7261e	Revert "binder: switch alloc->mutex to spinlock_t" This reverts commit `7710e2cca3`. In preparation for concurrent page installations, restore the original alloc->mutex which will serialize zap_page_range_single() against page installations in subsequent patches (instead of the mmap_sem). Resolved trivial conflicts with commit `2c10a20f5e` ("binder_alloc: Fix sleeping function called from invalid context") and commit `da0c02516c` ("mm/list_lru: simplify the list_lru walk callback function"). Cc: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20241210143114.661252-2-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-24 09:35:23 +01:00
Kairui Song	da0c02516c	mm/list_lru: simplify the list_lru walk callback function Now isolation no longer takes the list_lru global node lock, only use the per-cgroup lock instead. And this lock is inside the list_lru_one being walked, no longer needed to pass the lock explicitly. Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-11-11 17:22:26 -08:00
Kairui Song	fb56fdf8b9	mm/list_lru: split the lock to per-cgroup scope Currently, every list_lru has a per-node lock that protects adding, deletion, isolation, and reparenting of all list_lru_one instances belonging to this list_lru on this node. This lock contention is heavy when multiple cgroups modify the same list_lru. This lock can be split into per-cgroup scope to reduce contention. To achieve this, we need a stable list_lru_one for every cgroup. This commit adds a lock to each list_lru_one and introduced a helper function lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg. Then reworked the reparenting process. Reparenting will switch the list_lru_one instances one by one. By locking each instance and marking it dead using the nr_items counter, reparenting ensures that all items in the corresponding cgroup (on-list or not, because items have a stable cgroup, see below) will see the list_lru_one switch synchronously. Objcg reparent is also moved after list_lru reparent so items will have a stable mem cgroup until all list_lru_one instances are drained. The only caller that doesn't work the *_obj interfaces are direct calls to list_lru_{add,del}. But it's only used by zswap and that's also based on objcg, so it's fine. This also changes the bahaviour of the isolation function when LRU_RETRY or LRU_REMOVED_RETRY is returned, because now releasing the lock could unblock reparenting and free the list_lru_one, isolation function will have to return withoug re-lock the lru. prepare() { mkdir /tmp/test-fs modprobe brd rd_nr=1 rd_size=33554432 mkfs.xfs -f /dev/ram0 mount -t xfs /dev/ram0 /tmp/test-fs for i in $(seq 1 512); do mkdir "/tmp/test-fs/$i" for j in $(seq 1 10240); do echo TEST-CONTENT > "/tmp/test-fs/$i/$j" done & done; wait } do_test() { read_worker() { sleep 1 tar -cv "$1" &>/dev/null } read_in_all() { cd "/tmp/test-fs" && ls for i in $(seq 1 512); do (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs" read_worker "$i" & done; wait } for i in $(seq 1 512); do mkdir -p "/sys/fs/cgroup/benchmark/$i" done echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control echo 512M > /sys/fs/cgroup/benchmark/memory.max echo 3 > /proc/sys/vm/drop_caches time read_in_all } Above script simulates compression of small files in multiple cgroups with memory pressure. Run prepare() then do_test for 6 times: Before: real 0m7.762s user 0m11.340s sys 3m11.224s real 0m8.123s user 0m11.548s sys 3m2.549s real 0m7.736s user 0m11.515s sys 3m11.171s real 0m8.539s user 0m11.508s sys 3m7.618s real 0m7.928s user 0m11.349s sys 3m13.063s real 0m8.105s user 0m11.128s sys 3m14.313s After this commit (about ~15% faster): real 0m6.953s user 0m11.327s sys 2m42.912s real 0m7.453s user 0m11.343s sys 2m51.942s real 0m6.916s user 0m11.269s sys 2m43.957s real 0m6.894s user 0m11.528s sys 2m45.346s real 0m6.911s user 0m11.095s sys 2m43.168s real 0m6.773s user 0m11.518s sys 2m40.774s Link: https://lkml.kernel.org/r/20241104175257.60853-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-11-11 17:22:26 -08:00
Mukesh Ojha	2c10a20f5e	binder_alloc: Fix sleeping function called from invalid context `36c55ce870` ("binder_alloc: Replace kcalloc with kvcalloc to mitigate OOM issues") introduced schedule while atomic issue. [ 2689.152635][ T4275] BUG: sleeping function called from invalid context at mm/vmalloc.c:2847 [ 2689.161291][ T4275] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 4275, name: kworker/1:140 [ 2689.170708][ T4275] preempt_count: 1, expected: 0 [ 2689.175572][ T4275] RCU nest depth: 0, expected: 0 [ 2689.180521][ T4275] INFO: lockdep is turned off. [ 2689.180523][ T4275] Preemption disabled at: [ 2689.180525][ T4275] [<ffffffe031f2a2dc>] binder_alloc_deferred_release+0x2c/0x388 .. .. [ 2689.213419][ T4275] __might_resched+0x174/0x178 [ 2689.213423][ T4275] __might_sleep+0x48/0x7c [ 2689.213426][ T4275] vfree+0x4c/0x15c [ 2689.213430][ T4275] kvfree+0x24/0x44 [ 2689.213433][ T4275] binder_alloc_deferred_release+0x2c0/0x388 [ 2689.213436][ T4275] binder_proc_dec_tmpref+0x15c/0x2a8 [ 2689.213440][ T4275] binder_deferred_func+0xa8/0x8ec [ 2689.213442][ T4275] process_one_work+0x254/0x59c [ 2689.213447][ T4275] worker_thread+0x274/0x3ec [ 2689.213450][ T4275] kthread+0x110/0x134 [ 2689.213453][ T4275] ret_from_fork+0x10/0x20 Fix it by moving the place of kvfree outside of spinlock context. Fixes: `36c55ce870` ("binder_alloc: Replace kcalloc with kvcalloc to mitigate OOM issues") Acked-by: Carlos Llamas <cmllamas@google.com> Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com> Link: https://lore.kernel.org/r/20240725062510.2856662-1-quic_mojha@quicinc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-07-31 13:48:25 +02:00
Lei Liu	36c55ce870	binder_alloc: Replace kcalloc with kvcalloc to mitigate OOM issues In binder_alloc, there is a frequent need for order3 memory allocation, especially on small-memory mobile devices, which can lead to OOM and cause foreground applications to be killed, resulting in flashbacks. We use kvcalloc to allocate memory, which can reduce system OOM occurrences, as well as decrease the time and probability of failure for order3 memory allocations. Additionally, It has little impact on the throughput of the binder. (as verified by Google's binder_benchmark testing tool). We have conducted multiple tests on an 8GB memory phone, kvcalloc has little performance degradation and resolves frequent OOM issues, Below is a partial excerpt of the test data. throughput(TH_PUT) = (size * Iterations)/Time kcalloc->kvcalloc: Sample with kcalloc(): adb shell stop/ kcalloc /8+256G --------------------------------------------------------------------- Benchmark Time CPU Iterations TH-PUT TH-PUTCPU (ns) (ns) (GB/s) (GB/s) --------------------------------------------------------------------- BM_sendVec_binder4 39126 18550 38894 3.976282 8.38684 BM_sendVec_binder8 38924 18542 37786 7.766108 16.3028 BM_sendVec_binder16 38328 18228 36700 15.32039 32.2141 BM_sendVec_binder32 38154 18215 38240 32.07213 67.1798 BM_sendVec_binder64 39093 18809 36142 59.16885 122.977 BM_sendVec_binder128 40169 19188 36461 116.1843 243.2253 BM_sendVec_binder256 40695 19559 35951 226.1569 470.5484 BM_sendVec_binder512 41446 20211 34259 423.2159 867.8743 BM_sendVec_binder1024 44040 22939 28904 672.0639 1290.278 BM_sendVec_binder2048 47817 25821 26595 1139.063 2109.393 BM_sendVec_binder4096 54749 30905 22742 1701.423 3014.115 BM_sendVec_binder8192 68316 42017 16684 2000.634 3252.858 BM_sendVec_binder16384 95435 64081 10961 1881.752 2802.469 BM_sendVec_binder32768 148232 107504 6510 1439.093 1984.295 BM_sendVec_binder65536 326499 229874 3178 637.8991 906.0329 NORAML TEST SUM 10355.79 17188.15 stressapptest eat 2G SUM 10088.39 16625.97 Sample with kvcalloc(): adb shell stop/ kvcalloc /8+256G ---------------------------------------------------------------------- Benchmark Time CPU Iterations TH-PUT TH-PUTCPU (ns) (ns) (GB/s) (GB/s) ---------------------------------------------------------------------- BM_sendVec_binder4 39673 18832 36598 3.689965 7.773577 BM_sendVec_binder8 39869 18969 37188 7.462038 15.68369 BM_sendVec_binder16 39774 18896 36627 14.73405 31.01355 BM_sendVec_binder32 40225 19125 36995 29.43045 61.90013 BM_sendVec_binder64 40549 19529 35148 55.47544 115.1862 BM_sendVec_binder128 41580 19892 35384 108.9262 227.6871 BM_sendVec_binder256 41584 20059 34060 209.6806 434.6857 BM_sendVec_binder512 42829 20899 32493 388.4381 796.0389 BM_sendVec_binder1024 45037 23360 29251 665.0759 1282.236 BM_sendVec_binder2048 47853 25761 27091 1159.433 2153.735 BM_sendVec_binder4096 55574 31745 22405 1651.328 2890.877 BM_sendVec_binder8192 70706 43693 16400 1900.105 3074.836 BM_sendVec_binder16384 96161 64362 10793 1838.921 2747.468 BM_sendVec_binder32768 147875 107292 6296 1395.147 1922.858 BM_sendVec_binder65536 330324 232296 3053 605.7126 861.3209 NORAML TEST SUM 10033.56 16623.35 stressapptest eat 2G SUM 9958.43 16497.55 Signed-off-by: Lei Liu <liulei.rjpt@vivo.com> Acked-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20240619113841.3362-1-liulei.rjpt@vivo.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-07-03 16:22:36 +02:00
Colin Ian King	367b3560e1	binder: remove redundant variable page_addr Variable page_addr is being assigned a value that is never read. The variable is redundant and can be removed. Cleans up clang scan build warning: warning: Value stored to 'page_addr' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@intel.com> Fixes: `162c797314` ("binder: avoid user addresses in debug logs") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312060851.cudv98wG-lkp@intel.com/ Acked-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20240307221505.101431-1-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-03-07 22:22:32 +00:00
Linus Torvalds	296455ade1	Merge tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc and other driver updates from Greg KH: "Here is the big set of char/misc and other driver subsystem changes for 6.8-rc1. Other than lots of binder driver changes (as you can see by the merge conflicts) included in here are: - lots of iio driver updates and additions - spmi driver updates - eeprom driver updates - firmware driver updates - ocxl driver updates - mhi driver updates - w1 driver updates - nvmem driver updates - coresight driver updates - platform driver remove callback api changes - tags.sh script updates - bus_type constant marking cleanups - lots of other small driver updates All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (341 commits) android: removed duplicate linux/errno uio: Fix use-after-free in uio_open drivers: soc: xilinx: add check for platform firmware: xilinx: Export function to use in other module scripts/tags.sh: remove find_sources scripts/tags.sh: use -n to test archinclude scripts/tags.sh: add local annotation scripts/tags.sh: use more portable -path instead of -wholename scripts/tags.sh: Update comment (addition of gtags) firmware: zynqmp: Convert to platform remove callback returning void firmware: turris-mox-rwtm: Convert to platform remove callback returning void firmware: stratix10-svc: Convert to platform remove callback returning void firmware: stratix10-rsu: Convert to platform remove callback returning void firmware: raspberrypi: Convert to platform remove callback returning void firmware: qemu_fw_cfg: Convert to platform remove callback returning void firmware: mtk-adsp-ipc: Convert to platform remove callback returning void firmware: imx-dsp: Convert to platform remove callback returning void firmware: coreboot_table: Convert to platform remove callback returning void firmware: arm_scpi: Convert to platform remove callback returning void firmware: arm_scmi: Convert to platform remove callback returning void ...	2024-01-17 16:47:17 -08:00
Nhat Pham	0a97c01cd2	list_lru: allow explicit memcg and NUMA node selection Patch series "workload-specific and memory pressure-driven zswap writeback", v8. There are currently several issues with zswap writeback: 1. There is only a single global LRU for zswap, making it impossible to perform worload-specific shrinking - an memcg under memory pressure cannot determine which pages in the pool it owns, and often ends up writing pages from other memcgs. This issue has been previously observed in practice and mitigated by simply disabling memcg-initiated shrinking: https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u But this solution leaves a lot to be desired, as we still do not have an avenue for an memcg to free up its own memory locked up in the zswap pool. 2. We only shrink the zswap pool when the user-defined limit is hit. This means that if we set the limit too high, cold data that are unlikely to be used again will reside in the pool, wasting precious memory. It is hard to predict how much zswap space will be needed ahead of time, as this depends on the workload (specifically, on factors such as memory access patterns and compressibility of the memory pages). This patch series solves these issues by separating the global zswap LRU into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The new shrinker does not have any parameter that must be tuned by the user, and can be opted in or out on a per-memcg basis. As a proof of concept, we ran the following synthetic benchmark: build the linux kernel in a memory-limited cgroup, and allocate some cold data in tmpfs to see if the shrinker could write them out and improved the overall performance. Depending on the amount of cold data generated, we observe from 14% to 35% reduction in kernel CPU time used in the kernel builds. This patch (of 6): The interface of list_lru is based on the assumption that the list node and the data it represents belong to the same allocated on the correct node/memcg. While this assumption is valid for existing slab objects LRU such as dentries and inodes, it is undocumented, and rather inflexible for certain potential list_lru users (such as the upcoming zswap shrinker and the THP shrinker). It has caused us a lot of issues during our development. This patch changes list_lru interface so that the caller must explicitly specify numa node and memcg when adding and removing objects. The old list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and list_lru_del_obj(), respectively. It also extends the list_lru API with a new function, list_lru_putback, which undoes a previous list_lru_isolate call. Unlike list_lru_add, it does not increment the LRU node count (as list_lru_isolate does not decrement the node count). list_lru_putback also allows for explicit memcg and NUMA node selection. Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-12 10:57:01 -08:00
Carlos Llamas	7710e2cca3	binder: switch alloc->mutex to spinlock_t The alloc->mutex is a highly contended lock that causes performance issues on Android devices. When a low-priority task is given this lock and it sleeps, it becomes difficult for the task to wake up and complete its work. This delays other tasks that are also waiting on the mutex. The problem gets worse when there is memory pressure in the system, because this increases the contention on the alloc->mutex while the shrinker reclaims binder pages. Switching to a spinlock helps to keep the waiters running and avoids the overhead of waking up tasks. This significantly improves the transaction latency when the problematic scenario occurs. The performance impact of this patchset was measured by stress-testing the binder alloc contention. In this test, several clients of different priorities send thousands of transactions of different sizes to a single server. In parallel, pages get reclaimed using the shinker's debugfs. The test was run on a Pixel 8, Pixel 6 and qemu machine. The results were similar on all three devices: after: \| sched \| prio \| average \| max \| min \| \|--------+------+---------+-----------+---------\| \| fifo \| 99 \| 0.135ms \| 1.197ms \| 0.022ms \| \| fifo \| 01 \| 0.136ms \| 5.232ms \| 0.018ms \| \| other \| -20 \| 0.180ms \| 7.403ms \| 0.019ms \| \| other \| 19 \| 0.241ms \| 58.094ms \| 0.018ms \| before: \| sched \| prio \| average \| max \| min \| \|--------+------+---------+-----------+---------\| \| fifo \| 99 \| 0.350ms \| 248.730ms \| 0.020ms \| \| fifo \| 01 \| 0.357ms \| 248.817ms \| 0.024ms \| \| other \| -20 \| 0.399ms \| 249.906ms \| 0.020ms \| \| other \| 19 \| 0.477ms \| 297.756ms \| 0.022ms \| The key metrics above are the average and max latencies (wall time). These improvements should roughly translate to p95-p99 latencies on real workloads. The response time is up to 200x faster in these scenarios and there is no penalty in the regular path. Note that it is only possible to convert this lock after a series of changes made by previous patches. These mainly include refactoring the sections that might_sleep() and changing the locking order with the mmap_lock amongst others. Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20231201172212.1813387-29-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-05 09:23:41 +09:00
Carlos Llamas	e50f4e6cc9	binder: reverse locking order in shrinker callback The locking order currently requires the alloc->mutex to be acquired first followed by the mmap lock. However, the alloc->mutex is converted into a spinlock in subsequent commits so the order needs to be reversed to avoid nesting the sleeping mmap lock under the spinlock. The shrinker's callback binder_alloc_free_page() is the only place that needs to be reordered since other functions have been refactored and no longer nest these locks. Some minor cosmetic changes are also included in this patch. Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20231201172212.1813387-28-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-05 09:23:41 +09:00
Carlos Llamas	162c797314	binder: avoid user addresses in debug logs Prefer logging vma offsets instead of addresses or simply drop the debug log altogether if not useful. Note this covers the instances affected by the switch to store addresses as unsigned long. However, there are other sections in the driver that could do the same. Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/r/20231201172212.1813387-27-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-05 09:23:40 +09:00
Carlos Llamas	f07b83a48e	binder: refactor binder_delete_free_buffer() Skip the freelist call immediately as needed, instead of continuing the pointless checks. Also, drop the debug logs that we don't really need. Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/r/20231201172212.1813387-26-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-05 09:23:40 +09:00
Carlos Llamas	8e905217c4	binder: collapse print_binder_buffer() into caller The code in print_binder_buffer() is quite small so it can be collapsed into its single caller binder_alloc_print_allocated(). No functional change in this patch. Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/r/20231201172212.1813387-25-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-05 09:23:40 +09:00
Carlos Llamas	67dcc88078	binder: document the final page calculation The code to determine the page range for binder_lru_freelist_del() is quite obscure. It leverages the buffer_size calculated before doing an oversized buffer split. This is used to figure out if the last page is being shared with another active buffer. If so, the page gets trimmed out of the range as it has been previously removed from the freelist. This would be equivalent to getting the start page of the next in-use buffer explicitly. However, the code for this is much larger as we can see in binder_free_buf_locked() routine. Instead, lets settle on documenting the tricky step and using better names for now. I believe an ideal solution would be to count the binder_page->users to determine when a page should be added or removed from the freelist. However, this is a much bigger change than what I'm willing to risk at this time. Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20231201172212.1813387-24-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-05 09:23:40 +09:00
Carlos Llamas	ea9cdbf0c7	binder: rename lru shrinker utilities Now that the page allocation step is done separately we should rename the binder_free_page_range() and binder_allocate_page_range() functions to provide a more accurate description of what they do. Lets borrow the freelist concept used in other parts of the kernel for this. No functional change here. Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/r/20231201172212.1813387-23-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-05 09:23:40 +09:00
Carlos Llamas	de0e657312	binder: make oversized buffer code more readable The sections in binder_alloc_new_buf_locked() dealing with oversized buffers are scattered which makes them difficult to read. Instead, consolidate this code into a single block to improve readability. No functional change here. Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/r/20231201172212.1813387-22-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-05 09:23:40 +09:00
Carlos Llamas	258ce20ede	binder: remove redundant debug log The debug information in this statement is already logged earlier in the same function. We can get rid of this duplicate log. Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/r/20231201172212.1813387-21-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-05 09:23:40 +09:00

1 2 3 4 5

119 Commits