mirror of
https://github.com/linux-apfs/linux-apfs.git
synced 2026-05-01 15:00:59 -07:00
060e74173f292fb3e0398b3dca8765568d195ff1
1090 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
060e74173f |
mm, page_alloc: inline zone_statistics
zone_statistics has one call-site but it's a public function. Make it
static and inline.
The performance difference on a page allocator microbenchmark is;
4.6.0-rc2 4.6.0-rc2
statbranch-v1r20 statinline-v1r20
Min alloc-odr0-1 419.00 ( 0.00%) 412.00 ( 1.67%)
Min alloc-odr0-2 305.00 ( 0.00%) 301.00 ( 1.31%)
Min alloc-odr0-4 250.00 ( 0.00%) 247.00 ( 1.20%)
Min alloc-odr0-8 219.00 ( 0.00%) 215.00 ( 1.83%)
Min alloc-odr0-16 203.00 ( 0.00%) 199.00 ( 1.97%)
Min alloc-odr0-32 195.00 ( 0.00%) 191.00 ( 2.05%)
Min alloc-odr0-64 191.00 ( 0.00%) 187.00 ( 2.09%)
Min alloc-odr0-128 189.00 ( 0.00%) 185.00 ( 2.12%)
Min alloc-odr0-256 198.00 ( 0.00%) 193.00 ( 2.53%)
Min alloc-odr0-512 210.00 ( 0.00%) 207.00 ( 1.43%)
Min alloc-odr0-1024 216.00 ( 0.00%) 213.00 ( 1.39%)
Min alloc-odr0-2048 221.00 ( 0.00%) 220.00 ( 0.45%)
Min alloc-odr0-4096 227.00 ( 0.00%) 226.00 ( 0.44%)
Min alloc-odr0-8192 232.00 ( 0.00%) 229.00 ( 1.29%)
Min alloc-odr0-16384 232.00 ( 0.00%) 229.00 ( 1.29%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
175145748d |
mm, page_alloc: use new PageAnonHead helper in the free page fast path
The PageAnon check always checks for compound_head but this is a
relatively expensive check if the caller already knows the page is a
head page. This patch creates a helper and uses it in the page free
path which only operates on head pages.
With this patch and "Only check PageCompound for high-order pages", the
performance difference on a page allocator microbenchmark is;
4.6.0-rc2 4.6.0-rc2
vanilla nocompound-v1r20
Min alloc-odr0-1 425.00 ( 0.00%) 417.00 ( 1.88%)
Min alloc-odr0-2 313.00 ( 0.00%) 308.00 ( 1.60%)
Min alloc-odr0-4 257.00 ( 0.00%) 253.00 ( 1.56%)
Min alloc-odr0-8 224.00 ( 0.00%) 221.00 ( 1.34%)
Min alloc-odr0-16 208.00 ( 0.00%) 205.00 ( 1.44%)
Min alloc-odr0-32 199.00 ( 0.00%) 199.00 ( 0.00%)
Min alloc-odr0-64 195.00 ( 0.00%) 193.00 ( 1.03%)
Min alloc-odr0-128 192.00 ( 0.00%) 191.00 ( 0.52%)
Min alloc-odr0-256 204.00 ( 0.00%) 200.00 ( 1.96%)
Min alloc-odr0-512 213.00 ( 0.00%) 212.00 ( 0.47%)
Min alloc-odr0-1024 219.00 ( 0.00%) 219.00 ( 0.00%)
Min alloc-odr0-2048 225.00 ( 0.00%) 225.00 ( 0.00%)
Min alloc-odr0-4096 230.00 ( 0.00%) 231.00 ( -0.43%)
Min alloc-odr0-8192 235.00 ( 0.00%) 234.00 ( 0.43%)
Min alloc-odr0-16384 235.00 ( 0.00%) 234.00 ( 0.43%)
Min free-odr0-1 215.00 ( 0.00%) 191.00 ( 11.16%)
Min free-odr0-2 152.00 ( 0.00%) 136.00 ( 10.53%)
Min free-odr0-4 119.00 ( 0.00%) 107.00 ( 10.08%)
Min free-odr0-8 106.00 ( 0.00%) 96.00 ( 9.43%)
Min free-odr0-16 97.00 ( 0.00%) 87.00 ( 10.31%)
Min free-odr0-32 91.00 ( 0.00%) 83.00 ( 8.79%)
Min free-odr0-64 89.00 ( 0.00%) 81.00 ( 8.99%)
Min free-odr0-128 88.00 ( 0.00%) 80.00 ( 9.09%)
Min free-odr0-256 106.00 ( 0.00%) 95.00 ( 10.38%)
Min free-odr0-512 116.00 ( 0.00%) 111.00 ( 4.31%)
Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%)
Min free-odr0-2048 133.00 ( 0.00%) 126.00 ( 5.26%)
Min free-odr0-4096 136.00 ( 0.00%) 130.00 ( 4.41%)
Min free-odr0-8192 138.00 ( 0.00%) 130.00 ( 5.80%)
Min free-odr0-16384 137.00 ( 0.00%) 130.00 ( 5.11%)
There is a sizable boost to the free allocator performance. While there
is an apparent boost on the allocation side, it's likely a co-incidence
or due to the patches slightly reducing cache footprint.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
d61f859039 |
mm, page_alloc: only check PageCompound for high-order pages
Another year, another round of page allocator optimisations focusing
this time on the alloc and free fast paths. This should be of help to
workloads that are allocator-intensive from kernel space where the cost
of zeroing is not nceessraily incurred.
The series is motivated by the observation that page alloc
microbenchmarks on multiple machines regressed between 3.12.44 and 4.4.
Second, there is discussions before LSF/MM considering the possibility
of adding another page allocator which is potentially hazardous but a
patch series improving performance is better than whining.
After the series is applied, there are still hazards. In the free
paths, the debugging checking and page zone/pageblock lookups dominate
but there was not an obvious solution to that. In the alloc path, the
major contributers are dealing with zonelists, new page preperation, the
fair zone allocation and numerous statistic updates. The fair zone
allocator is removed by the per-node LRU series if that gets merged so
it's nor a major concern at the moment.
On normal userspace benchmarks, there is little impact as the zeroing
cost is significant but it's visible
aim9
4.6.0-rc3 4.6.0-rc3
vanilla deferalloc-v3
Min page_test 828693.33 ( 0.00%) 887060.00 ( 7.04%)
Min brk_test 4847266.67 ( 0.00%) 4966266.67 ( 2.45%)
Min exec_test 1271.00 ( 0.00%) 1275.67 ( 0.37%)
Min fork_test 12371.75 ( 0.00%) 12380.00 ( 0.07%)
The overall impact on a page allocator microbenchmark for a range of orders
and number of pages allocated in a batch is
4.6.0-rc3 4.6.0-rc3
vanilla deferalloc-v3r7
Min alloc-odr0-1 428.00 ( 0.00%) 316.00 ( 26.17%)
Min alloc-odr0-2 314.00 ( 0.00%) 231.00 ( 26.43%)
Min alloc-odr0-4 256.00 ( 0.00%) 192.00 ( 25.00%)
Min alloc-odr0-8 222.00 ( 0.00%) 166.00 ( 25.23%)
Min alloc-odr0-16 207.00 ( 0.00%) 154.00 ( 25.60%)
Min alloc-odr0-32 197.00 ( 0.00%) 148.00 ( 24.87%)
Min alloc-odr0-64 193.00 ( 0.00%) 144.00 ( 25.39%)
Min alloc-odr0-128 191.00 ( 0.00%) 143.00 ( 25.13%)
Min alloc-odr0-256 203.00 ( 0.00%) 153.00 ( 24.63%)
Min alloc-odr0-512 212.00 ( 0.00%) 165.00 ( 22.17%)
Min alloc-odr0-1024 221.00 ( 0.00%) 172.00 ( 22.17%)
Min alloc-odr0-2048 225.00 ( 0.00%) 179.00 ( 20.44%)
Min alloc-odr0-4096 232.00 ( 0.00%) 185.00 ( 20.26%)
Min alloc-odr0-8192 235.00 ( 0.00%) 187.00 ( 20.43%)
Min alloc-odr0-16384 236.00 ( 0.00%) 188.00 ( 20.34%)
Min alloc-odr1-1 519.00 ( 0.00%) 450.00 ( 13.29%)
Min alloc-odr1-2 391.00 ( 0.00%) 336.00 ( 14.07%)
Min alloc-odr1-4 313.00 ( 0.00%) 268.00 ( 14.38%)
Min alloc-odr1-8 277.00 ( 0.00%) 235.00 ( 15.16%)
Min alloc-odr1-16 256.00 ( 0.00%) 218.00 ( 14.84%)
Min alloc-odr1-32 252.00 ( 0.00%) 212.00 ( 15.87%)
Min alloc-odr1-64 244.00 ( 0.00%) 206.00 ( 15.57%)
Min alloc-odr1-128 244.00 ( 0.00%) 207.00 ( 15.16%)
Min alloc-odr1-256 243.00 ( 0.00%) 207.00 ( 14.81%)
Min alloc-odr1-512 245.00 ( 0.00%) 209.00 ( 14.69%)
Min alloc-odr1-1024 248.00 ( 0.00%) 214.00 ( 13.71%)
Min alloc-odr1-2048 253.00 ( 0.00%) 220.00 ( 13.04%)
Min alloc-odr1-4096 258.00 ( 0.00%) 224.00 ( 13.18%)
Min alloc-odr1-8192 261.00 ( 0.00%) 229.00 ( 12.26%)
Min alloc-odr2-1 560.00 ( 0.00%) 753.00 (-34.46%)
Min alloc-odr2-2 424.00 ( 0.00%) 351.00 ( 17.22%)
Min alloc-odr2-4 339.00 ( 0.00%) 393.00 (-15.93%)
Min alloc-odr2-8 298.00 ( 0.00%) 246.00 ( 17.45%)
Min alloc-odr2-16 276.00 ( 0.00%) 227.00 ( 17.75%)
Min alloc-odr2-32 271.00 ( 0.00%) 221.00 ( 18.45%)
Min alloc-odr2-64 264.00 ( 0.00%) 217.00 ( 17.80%)
Min alloc-odr2-128 264.00 ( 0.00%) 217.00 ( 17.80%)
Min alloc-odr2-256 264.00 ( 0.00%) 218.00 ( 17.42%)
Min alloc-odr2-512 269.00 ( 0.00%) 223.00 ( 17.10%)
Min alloc-odr2-1024 279.00 ( 0.00%) 230.00 ( 17.56%)
Min alloc-odr2-2048 283.00 ( 0.00%) 235.00 ( 16.96%)
Min alloc-odr2-4096 285.00 ( 0.00%) 239.00 ( 16.14%)
Min alloc-odr3-1 629.00 ( 0.00%) 505.00 ( 19.71%)
Min alloc-odr3-2 472.00 ( 0.00%) 374.00 ( 20.76%)
Min alloc-odr3-4 383.00 ( 0.00%) 301.00 ( 21.41%)
Min alloc-odr3-8 341.00 ( 0.00%) 266.00 ( 21.99%)
Min alloc-odr3-16 316.00 ( 0.00%) 248.00 ( 21.52%)
Min alloc-odr3-32 308.00 ( 0.00%) 241.00 ( 21.75%)
Min alloc-odr3-64 305.00 ( 0.00%) 241.00 ( 20.98%)
Min alloc-odr3-128 308.00 ( 0.00%) 244.00 ( 20.78%)
Min alloc-odr3-256 317.00 ( 0.00%) 249.00 ( 21.45%)
Min alloc-odr3-512 327.00 ( 0.00%) 256.00 ( 21.71%)
Min alloc-odr3-1024 331.00 ( 0.00%) 261.00 ( 21.15%)
Min alloc-odr3-2048 333.00 ( 0.00%) 266.00 ( 20.12%)
Min alloc-odr4-1 767.00 ( 0.00%) 572.00 ( 25.42%)
Min alloc-odr4-2 578.00 ( 0.00%) 429.00 ( 25.78%)
Min alloc-odr4-4 474.00 ( 0.00%) 346.00 ( 27.00%)
Min alloc-odr4-8 422.00 ( 0.00%) 310.00 ( 26.54%)
Min alloc-odr4-16 399.00 ( 0.00%) 295.00 ( 26.07%)
Min alloc-odr4-32 392.00 ( 0.00%) 293.00 ( 25.26%)
Min alloc-odr4-64 394.00 ( 0.00%) 293.00 ( 25.63%)
Min alloc-odr4-128 405.00 ( 0.00%) 305.00 ( 24.69%)
Min alloc-odr4-256 417.00 ( 0.00%) 319.00 ( 23.50%)
Min alloc-odr4-512 425.00 ( 0.00%) 326.00 ( 23.29%)
Min alloc-odr4-1024 426.00 ( 0.00%) 329.00 ( 22.77%)
Min free-odr0-1 216.00 ( 0.00%) 178.00 ( 17.59%)
Min free-odr0-2 152.00 ( 0.00%) 125.00 ( 17.76%)
Min free-odr0-4 120.00 ( 0.00%) 99.00 ( 17.50%)
Min free-odr0-8 106.00 ( 0.00%) 85.00 ( 19.81%)
Min free-odr0-16 97.00 ( 0.00%) 80.00 ( 17.53%)
Min free-odr0-32 92.00 ( 0.00%) 76.00 ( 17.39%)
Min free-odr0-64 89.00 ( 0.00%) 74.00 ( 16.85%)
Min free-odr0-128 89.00 ( 0.00%) 73.00 ( 17.98%)
Min free-odr0-256 107.00 ( 0.00%) 90.00 ( 15.89%)
Min free-odr0-512 117.00 ( 0.00%) 108.00 ( 7.69%)
Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%)
Min free-odr0-2048 132.00 ( 0.00%) 125.00 ( 5.30%)
Min free-odr0-4096 135.00 ( 0.00%) 130.00 ( 3.70%)
Min free-odr0-8192 137.00 ( 0.00%) 130.00 ( 5.11%)
Min free-odr0-16384 137.00 ( 0.00%) 131.00 ( 4.38%)
Min free-odr1-1 318.00 ( 0.00%) 289.00 ( 9.12%)
Min free-odr1-2 228.00 ( 0.00%) 207.00 ( 9.21%)
Min free-odr1-4 182.00 ( 0.00%) 165.00 ( 9.34%)
Min free-odr1-8 163.00 ( 0.00%) 146.00 ( 10.43%)
Min free-odr1-16 151.00 ( 0.00%) 135.00 ( 10.60%)
Min free-odr1-32 146.00 ( 0.00%) 129.00 ( 11.64%)
Min free-odr1-64 145.00 ( 0.00%) 130.00 ( 10.34%)
Min free-odr1-128 148.00 ( 0.00%) 134.00 ( 9.46%)
Min free-odr1-256 148.00 ( 0.00%) 137.00 ( 7.43%)
Min free-odr1-512 151.00 ( 0.00%) 140.00 ( 7.28%)
Min free-odr1-1024 154.00 ( 0.00%) 143.00 ( 7.14%)
Min free-odr1-2048 156.00 ( 0.00%) 144.00 ( 7.69%)
Min free-odr1-4096 156.00 ( 0.00%) 142.00 ( 8.97%)
Min free-odr1-8192 156.00 ( 0.00%) 140.00 ( 10.26%)
Min free-odr2-1 361.00 ( 0.00%) 457.00 (-26.59%)
Min free-odr2-2 258.00 ( 0.00%) 224.00 ( 13.18%)
Min free-odr2-4 208.00 ( 0.00%) 223.00 ( -7.21%)
Min free-odr2-8 185.00 ( 0.00%) 160.00 ( 13.51%)
Min free-odr2-16 173.00 ( 0.00%) 149.00 ( 13.87%)
Min free-odr2-32 166.00 ( 0.00%) 145.00 ( 12.65%)
Min free-odr2-64 166.00 ( 0.00%) 146.00 ( 12.05%)
Min free-odr2-128 169.00 ( 0.00%) 148.00 ( 12.43%)
Min free-odr2-256 170.00 ( 0.00%) 152.00 ( 10.59%)
Min free-odr2-512 177.00 ( 0.00%) 156.00 ( 11.86%)
Min free-odr2-1024 182.00 ( 0.00%) 162.00 ( 10.99%)
Min free-odr2-2048 181.00 ( 0.00%) 160.00 ( 11.60%)
Min free-odr2-4096 180.00 ( 0.00%) 159.00 ( 11.67%)
Min free-odr3-1 431.00 ( 0.00%) 367.00 ( 14.85%)
Min free-odr3-2 306.00 ( 0.00%) 259.00 ( 15.36%)
Min free-odr3-4 249.00 ( 0.00%) 208.00 ( 16.47%)
Min free-odr3-8 224.00 ( 0.00%) 186.00 ( 16.96%)
Min free-odr3-16 208.00 ( 0.00%) 176.00 ( 15.38%)
Min free-odr3-32 206.00 ( 0.00%) 174.00 ( 15.53%)
Min free-odr3-64 210.00 ( 0.00%) 178.00 ( 15.24%)
Min free-odr3-128 215.00 ( 0.00%) 182.00 ( 15.35%)
Min free-odr3-256 224.00 ( 0.00%) 189.00 ( 15.62%)
Min free-odr3-512 232.00 ( 0.00%) 195.00 ( 15.95%)
Min free-odr3-1024 230.00 ( 0.00%) 195.00 ( 15.22%)
Min free-odr3-2048 229.00 ( 0.00%) 193.00 ( 15.72%)
Min free-odr4-1 561.00 ( 0.00%) 439.00 ( 21.75%)
Min free-odr4-2 418.00 ( 0.00%) 318.00 ( 23.92%)
Min free-odr4-4 339.00 ( 0.00%) 269.00 ( 20.65%)
Min free-odr4-8 299.00 ( 0.00%) 239.00 ( 20.07%)
Min free-odr4-16 289.00 ( 0.00%) 234.00 ( 19.03%)
Min free-odr4-32 291.00 ( 0.00%) 235.00 ( 19.24%)
Min free-odr4-64 298.00 ( 0.00%) 238.00 ( 20.13%)
Min free-odr4-128 308.00 ( 0.00%) 251.00 ( 18.51%)
Min free-odr4-256 321.00 ( 0.00%) 267.00 ( 16.82%)
Min free-odr4-512 327.00 ( 0.00%) 269.00 ( 17.74%)
Min free-odr4-1024 326.00 ( 0.00%) 271.00 ( 16.87%)
Min total-odr0-1 644.00 ( 0.00%) 494.00 ( 23.29%)
Min total-odr0-2 466.00 ( 0.00%) 356.00 ( 23.61%)
Min total-odr0-4 376.00 ( 0.00%) 291.00 ( 22.61%)
Min total-odr0-8 328.00 ( 0.00%) 251.00 ( 23.48%)
Min total-odr0-16 304.00 ( 0.00%) 234.00 ( 23.03%)
Min total-odr0-32 289.00 ( 0.00%) 224.00 ( 22.49%)
Min total-odr0-64 282.00 ( 0.00%) 218.00 ( 22.70%)
Min total-odr0-128 280.00 ( 0.00%) 216.00 ( 22.86%)
Min total-odr0-256 310.00 ( 0.00%) 243.00 ( 21.61%)
Min total-odr0-512 329.00 ( 0.00%) 273.00 ( 17.02%)
Min total-odr0-1024 346.00 ( 0.00%) 290.00 ( 16.18%)
Min total-odr0-2048 357.00 ( 0.00%) 304.00 ( 14.85%)
Min total-odr0-4096 367.00 ( 0.00%) 315.00 ( 14.17%)
Min total-odr0-8192 372.00 ( 0.00%) 317.00 ( 14.78%)
Min total-odr0-16384 373.00 ( 0.00%) 319.00 ( 14.48%)
Min total-odr1-1 838.00 ( 0.00%) 739.00 ( 11.81%)
Min total-odr1-2 619.00 ( 0.00%) 543.00 ( 12.28%)
Min total-odr1-4 495.00 ( 0.00%) 433.00 ( 12.53%)
Min total-odr1-8 440.00 ( 0.00%) 382.00 ( 13.18%)
Min total-odr1-16 407.00 ( 0.00%) 353.00 ( 13.27%)
Min total-odr1-32 398.00 ( 0.00%) 341.00 ( 14.32%)
Min total-odr1-64 389.00 ( 0.00%) 336.00 ( 13.62%)
Min total-odr1-128 392.00 ( 0.00%) 341.00 ( 13.01%)
Min total-odr1-256 391.00 ( 0.00%) 344.00 ( 12.02%)
Min total-odr1-512 396.00 ( 0.00%) 349.00 ( 11.87%)
Min total-odr1-1024 402.00 ( 0.00%) 357.00 ( 11.19%)
Min total-odr1-2048 409.00 ( 0.00%) 364.00 ( 11.00%)
Min total-odr1-4096 414.00 ( 0.00%) 366.00 ( 11.59%)
Min total-odr1-8192 417.00 ( 0.00%) 369.00 ( 11.51%)
Min total-odr2-1 921.00 ( 0.00%) 1210.00 (-31.38%)
Min total-odr2-2 682.00 ( 0.00%) 576.00 ( 15.54%)
Min total-odr2-4 547.00 ( 0.00%) 616.00 (-12.61%)
Min total-odr2-8 483.00 ( 0.00%) 406.00 ( 15.94%)
Min total-odr2-16 449.00 ( 0.00%) 376.00 ( 16.26%)
Min total-odr2-32 437.00 ( 0.00%) 366.00 ( 16.25%)
Min total-odr2-64 431.00 ( 0.00%) 363.00 ( 15.78%)
Min total-odr2-128 433.00 ( 0.00%) 365.00 ( 15.70%)
Min total-odr2-256 434.00 ( 0.00%) 371.00 ( 14.52%)
Min total-odr2-512 446.00 ( 0.00%) 379.00 ( 15.02%)
Min total-odr2-1024 461.00 ( 0.00%) 392.00 ( 14.97%)
Min total-odr2-2048 464.00 ( 0.00%) 395.00 ( 14.87%)
Min total-odr2-4096 465.00 ( 0.00%) 398.00 ( 14.41%)
Min total-odr3-1 1060.00 ( 0.00%) 872.00 ( 17.74%)
Min total-odr3-2 778.00 ( 0.00%) 633.00 ( 18.64%)
Min total-odr3-4 632.00 ( 0.00%) 510.00 ( 19.30%)
Min total-odr3-8 565.00 ( 0.00%) 452.00 ( 20.00%)
Min total-odr3-16 524.00 ( 0.00%) 424.00 ( 19.08%)
Min total-odr3-32 514.00 ( 0.00%) 415.00 ( 19.26%)
Min total-odr3-64 515.00 ( 0.00%) 419.00 ( 18.64%)
Min total-odr3-128 523.00 ( 0.00%) 426.00 ( 18.55%)
Min total-odr3-256 541.00 ( 0.00%) 438.00 ( 19.04%)
Min total-odr3-512 559.00 ( 0.00%) 451.00 ( 19.32%)
Min total-odr3-1024 561.00 ( 0.00%) 456.00 ( 18.72%)
Min total-odr3-2048 562.00 ( 0.00%) 459.00 ( 18.33%)
Min total-odr4-1 1328.00 ( 0.00%) 1011.00 ( 23.87%)
Min total-odr4-2 997.00 ( 0.00%) 747.00 ( 25.08%)
Min total-odr4-4 813.00 ( 0.00%) 615.00 ( 24.35%)
Min total-odr4-8 721.00 ( 0.00%) 550.00 ( 23.72%)
Min total-odr4-16 689.00 ( 0.00%) 529.00 ( 23.22%)
Min total-odr4-32 683.00 ( 0.00%) 528.00 ( 22.69%)
Min total-odr4-64 692.00 ( 0.00%) 531.00 ( 23.27%)
Min total-odr4-128 713.00 ( 0.00%) 556.00 ( 22.02%)
Min total-odr4-256 738.00 ( 0.00%) 586.00 ( 20.60%)
Min total-odr4-512 753.00 ( 0.00%) 595.00 ( 20.98%)
Min total-odr4-1024 752.00 ( 0.00%) 600.00 ( 20.21%)
This patch (of 27):
order-0 pages by definition cannot be compound so avoid the check in the
fast path for those pages.
[akpm@linux-foundation.org: use unlikely(order) in free_pages_prepare(), per Vlastimil]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
3da88fb3ba |
mm, oom: move GFP_NOFS check to out_of_memory
__alloc_pages_may_oom is the central place to decide when the
out_of_memory should be invoked. This is a good approach for most
checks there because they are page allocator specific and the allocation
fails right after for all of them.
The notable exception is GFP_NOFS context which is faking
did_some_progress and keep the page allocator looping even though there
couldn't have been any progress from the OOM killer. This patch doesn't
change this behavior because we are not ready to allow those allocation
requests to fail yet (and maybe we will face the reality that we will
never manage to safely fail these request). Instead __GFP_FS check is
moved down to out_of_memory and prevent from OOM victim selection there.
There are two reasons for that
- OOM notifiers might release some memory even from this context
as none of the registered notifier seems to be FS related
- this might help a dying thread to get an access to memory
reserves and move on which will make the behavior more
consistent with the case when the task gets killed from a
different context.
Keep a comment in __alloc_pages_may_oom to make sure we do not forget
how GFP_NOFS is special and that we really want to do something about
it.
Note to the current oom_notifier users:
The observable difference for you is that oom notifiers cannot depend on
any fs locks because we could deadlock. Not that this would be allowed
today because that would just lockup machine in most of the cases and
ruling out the OOM killer along the way. Another difference is that
callbacks might be invoked sooner now because GFP_NOFS is a weaker
reclaim context and so there could be reclaimable memory which is just
not reachable now. That would require GFP_NOFS only loads which are
really rare and more importantly the observable result would be dropping
of reconstructible object and potential performance drop which is not
such a big deal when we are struggling to fulfill other important
allocation requests.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
fc2bd799c7 |
mm/page_alloc: correct highmem memory statistics
ZONE_MOVABLE could be treated as highmem so we need to consider it for accurate statistics. And, in following patches, ZONE_CMA will be introduced and it can be treated as highmem, too. So, instead of manually adding stat of ZONE_MOVABLE, looping all zones and check whether the zone is highmem or not and add stat of the zone which can be treated as highmem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ba6b0979e3 |
power: add zone range overlapping check
There is a system thats node's pfns are overlapped as follows: -----pfn--------> N0 N1 N2 N0 N1 N2 Therefore, we need to care this overlapping when iterating pfn range. mark_free_pages() iterates requested zone's pfn range and unset all range's bitmap first. And then it marks freepages in a zone to the bitmap. If there is an overlapping zone, above unset could clear previous marked bit and reference to this bitmap in the future will cause the problem. To prevent it, this patch adds a zone check in mark_free_pages(). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b9eb63191a |
mm/memory_hotplug: add comment to some functions related to memory hotplug
__offline_isolated_pages() and test_pages_isolated() are used by memory hotplug. These functions require that range is in a single zone but there is no code to do this because memory hotplug checks it before calling these functions. To avoid confusing future user of these functions, this patch adds comments to them. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
949698a31a |
mm/page_alloc: Remove useless parameter of __free_pages_boot_core
__free_pages_boot_core has parameter pfn which is not used at all. Remove it. Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com> Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
0139aa7b7f |
mm: rename _count, field of the struct page, to _refcount
Many developers already know that field for reference count of the struct page is _count and atomic type. They would try to handle it directly and this could break the purpose of page reference count tracepoint. To prevent direct _count modification, this patch rename it to _refcount and add warning message on the code. After that, developer who need to handle reference count will find that field should not be accessed directly. [akpm@linux-foundation.org: fix comments, per Vlastimil] [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too] [sfr@canb.auug.org.au: sync ethernet driver changes] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Manish Chopra <manish.chopra@qlogic.com> Cc: Yuval Mintz <yuval.mintz@qlogic.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bc22af74f2 |
mm: update min_free_kbytes from khugepaged after core initialization
Khugepaged attempts to raise min_free_kbytes if its set too low.
However, on boot khugepaged sets min_free_kbytes first from
subsys_initcall(), and then the mm 'core' over-rides min_free_kbytes
after from init_per_zone_wmark_min(), via a module_init() call.
Khugepaged used to use a late_initcall() to set min_free_kbytes (such
that it occurred after the core initialization), however this was
removed when the initialization of min_free_kbytes was integrated into
the starting of the khugepaged thread.
The fix here is simply to invoke the core initialization using a
core_initcall() instead of module_init(), such that the previous
initialization ordering is restored. I didn't restore the
late_initcall() since start_stop_khugepaged() already sets
min_free_kbytes via set_recommended_min_free_kbytes().
This was noticed when we had a number of page allocation failures when
moving a workload to a kernel with this new initialization ordering. On
an 8GB system this restores min_free_kbytes back to 67584 from 11365
when CONFIG_TRANSPARENT_HUGEPAGE=y is set and either
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y or
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y.
Fixes:
|
||
|
|
d9dddbf556 |
mm/page_alloc: prevent merging between isolated and other pageblocks
Hanjun Guo has reported that a CMA stress test causes broken accounting of CMA and free pages: > Before the test, I got: > -bash-4.3# cat /proc/meminfo | grep Cma > CmaTotal: 204800 kB > CmaFree: 195044 kB > > > After running the test: > -bash-4.3# cat /proc/meminfo | grep Cma > CmaTotal: 204800 kB > CmaFree: 6602584 kB > > So the freed CMA memory is more than total.. > > Also the the MemFree is more than mem total: > > -bash-4.3# cat /proc/meminfo > MemTotal: 16342016 kB > MemFree: 22367268 kB > MemAvailable: 22370528 kB Laura Abbott has confirmed the issue and suspected the freepage accounting rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA pageblocks: > CMA isolates MAX_ORDER aligned blocks, but, during the process, > partialy isolated block exists. If MAX_ORDER is 11 and > pageblock_order is 9, two pageblocks make up MAX_ORDER > aligned block and I can think following scenario because pageblock > (un)isolation would be done one by one. > > (each character means one pageblock. 'C', 'I' means MIGRATE_CMA, > MIGRATE_ISOLATE, respectively. > > CC -> IC -> II (Isolation) > II -> CI -> CC (Un-isolation) > > If some pages are freed at this intermediate state such as IC or CI, > that page could be merged to the other page that is resident on > different type of pageblock and it will cause wrong freepage count. This was supposed to be prevented by CMA operating on MAX_ORDER blocks, but since it doesn't hold the zone->lock between pageblocks, a race window does exist. It's also likely that unexpected merging can occur between MIGRATE_ISOLATE and non-CMA pageblocks. This should be prevented in __free_one_page() since commit |
||
|
|
0a687aace3 |
mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled
After the OOM killer is disabled during suspend operation, any !__GFP_NOFAIL && __GFP_FS allocations are forced to fail. Thus, any !__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
987b3095c2 |
mm: meminit: initialise more memory for inode/dentry hash tables in early boot
Upstream has supported page parallel initialisation for X86 and the boot
time is improved greately. Some tests have been done for Power.
Here is the result I have done with different memory size.
* 4GB memory:
boot time is as the following:
with patch vs without patch: 10.4s vs 24.5s
boot time is improved 57%
* 200GB memory:
boot time looks the same with and without patches.
boot time is about 38s
* 32TB memory:
boot time looks the same with and without patches
boot time is about 160s.
The boot time is much shorter than X86 with 24TB memory.
From community discussion, it costs about 694s for X86 24T system.
Parallel initialisation improves the performance by deferring memory
initilisation to kswap with N kthreads, it should improve the performance
therotically.
In testing on X86, performance is improved greatly with huge memory. But
on Power platform, it is improved greatly with less than 100GB memory.
For huge memory, it is not improved greatly. But it saves the time with
several threads at least, as the following information shows(32TB system
log):
[ 22.648169] node 9 initialised, 16607461 pages in 280ms
[ 22.783772] node 3 initialised, 23937243 pages in 410ms
[ 22.858877] node 6 initialised, 29179347 pages in 490ms
[ 22.863252] node 2 initialised, 29179347 pages in 490ms
[ 22.907545] node 0 initialised, 32049614 pages in 540ms
[ 22.920891] node 15 initialised, 32212280 pages in 550ms
[ 22.923236] node 4 initialised, 32306127 pages in 550ms
[ 22.923384] node 12 initialised, 32314319 pages in 550ms
[ 22.924754] node 8 initialised, 32314319 pages in 550ms
[ 22.940780] node 13 initialised, 33353677 pages in 570ms
[ 22.940796] node 11 initialised, 33353677 pages in 570ms
[ 22.941700] node 5 initialised, 33353677 pages in 570ms
[ 22.941721] node 10 initialised, 33353677 pages in 570ms
[ 22.941876] node 7 initialised, 33353677 pages in 570ms
[ 22.944946] node 14 initialised, 33353677 pages in 570ms
[ 22.946063] node 1 initialised, 33345485 pages in 580ms
It saves the time about 550*16 ms at least, although it can be ignore to
compare the boot time about 160 seconds. What's more, the boot time is
much shorter on Power even without patches than x86 for huge memory
machine.
So this patchset is still necessary to be enabled for Power.
This patch (of 2):
This patch is based on Mel Gorman's old patch in the mailing list,
https://lkml.org/lkml/2015/5/5/280 which is discussed but it is fixed with
a completion to wait for all memory initialised in page_alloc_init_late().
It is to fix the OOM problem on X86 with 24TB memory which allocates
memory in late initialisation. But for Power platform with 32TB memory,
it causes a call trace in vfs_caches_init->inode_init() and inode hash
table needs more memory. So this patch allocates 1GB for 0.25TB/node for
large system as it is mentioned in https://lkml.org/lkml/2015/5/1/627
This call trace is found on Power with 32TB memory, 1024CPUs, 16nodes.
Currently, it only allocates 2GB*16=32GB for early initialisation. But
Dentry cache hash table needes 16GB and Inode cache hash table needs 16GB.
So the system have no enough memory for it. The log from dmesg as the
following:
Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes)
vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes
swapper/0: page allocation failure: order:0,mode:0x2080020
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64
Call Trace:
.dump_stack+0xb4/0xb664 (unreliable)
.warn_alloc_failed+0x114/0x160
.__vmalloc_area_node+0x1a4/0x2b0
.__vmalloc_node_range+0xe4/0x110
.__vmalloc_node+0x40/0x50
.alloc_large_system_hash+0x134/0x2a4
.inode_init+0xa4/0xf0
.vfs_caches_init+0x80/0x144
.start_kernel+0x40c/0x4e0
start_here_common+0x20/0x4a4
Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
1170532bb4 |
mm: convert printk(KERN_<LEVEL> to pr_<level>
Most of the mm subsystem uses pr_<level> so make it consistent. Miscellanea: - Realign arguments - Add missing newline to format - kmemleak-test.c has a "kmemleak: " prefix added to the "Kmemleak testing" logging message via pr_fmt Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
756a025f00 |
mm: coalesce split strings
Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
0f352e5392 |
mm: remove __GFP_NOFAIL is deprecated comment
Commit
|
||
|
|
fe896d1878 |
mm: introduce page reference manipulation functions
The success of CMA allocation largely depends on the success of migration and key factor of it is page reference count. Until now, page reference is manipulated by direct calling atomic functions so we cannot follow up who and where manipulate it. Then, it is hard to find actual reason of CMA allocation failure. CMA allocation should be guaranteed to succeed so finding offending place is really important. In this patch, call sites where page reference is manipulated are converted to introduced wrapper function. This is preparation step to add tracepoint to each page reference manipulation function. With this facility, we can easily find reason of CMA allocation failure. There is no functional change in this patch. In addition, this patch also converts reference read sites. It will help a second step that renames page._count to something else and prevents later attempt to direct access to it (Suggested by Andrew). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
444eb2a449 |
mm: thp: set THP defrag by default to madvise and add a stall-free defrag option
THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure. The problem is that
THP allocation requests potentially enter reclaim/compaction. This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses. While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so. Specifically, a simple
anon/file streaming workload will enter direct reclaim on NUMA at least
even though the working set size is 80% of RAM. It's been years and
it's time to throw in the towel.
First, this patch defines THP defrag as follows;
madvise: A failed allocation will direct reclaim/compact if the application requests it
never: Neither reclaim/compact nor wake kswapd
defer: A failed allocation will wake kswapd/kcompactd
always: A failed allocation will direct reclaim/compact (historical behaviour)
khugepaged defrag will enter direct/reclaim but not wake kswapd.
Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested
it.
Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work. The callers that
really cares are slub/slab and they are updated accordingly. The slab
one may be surprising because it also corrects a comment as kswapd was
never woken up by that path.
This means that a THP fault will no longer stall for most applications
by default and the ideal for most users that get THP if they are
immediately available. There are still options for users that prefer a
stall at startup of a new application by either restoring historical
behaviour with "always" or pick a half-way point with "defer" where
kswapd does some of the work in the background and wakes kcompactd if
necessary. THP defrag for khugepaged remains enabled and will enter
direct/reclaim but no wakeup kswapd or kcompactd.
After this patch a THP allocation failure will quickly fallback and rely
on khugepaged to recover the situation at some time in the future. In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win where as a stall to reclaim/compaction
is definitely measurable and can be painful.
The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times. The
total size of the mappings is 80% of RAM and the benchmark simply
measures how long it takes to complete. It uses multiple threads to see
if that is a factor. On UMA, the performance is almost identical so is
not reported but on NUMA, we see this
usemem
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)
For a single thread, the benchmark completes 43.23% faster with this
patch applied with smaller benefits as the thread increases. Similar,
notice the large reduction in most cases in system CPU usage. The
overall CPU time is
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
User 10357.65 10438.33
System 3988.88 3543.94
Elapsed 2203.01 1634.41
Which is substantial. Now, the reclaim figures
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 128458477 278352931
Major Faults 2174976 225
Swap Ins 16904701 0
Swap Outs 17359627 0
Allocation stalls 43611 0
DMA allocs 0 0
DMA32 allocs 19832646 19448017
Normal allocs 614488453 580941839
Movable allocs 0 0
Direct pages scanned 24163800 0
Kswapd pages scanned 0 0
Kswapd pages reclaimed 0 0
Direct pages reclaimed 20691346 0
Compaction stalls 42263 0
Compaction success 938 0
Compaction failures 41325 0
This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.
I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the latency of whether base
or huge pages were used
thpscale Fault Latencies
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)
The average time to fault pages is substantially reduced in the majority
of caseds but with the obvious caveat that fewer THPs are actually used
in this adverse workload
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 37429143 47564000
Major Faults 1916 1558
Swap Ins 1466 1079
Swap Outs 2936863 149626
Allocation stalls 62510 3
DMA allocs 0 0
DMA32 allocs 6566458 6401314
Normal allocs 216361697 216538171
Movable allocs 0 0
Direct pages scanned 25977580 17998
Kswapd pages scanned 0 3638931
Kswapd pages reclaimed 0 207236
Direct pages reclaimed 8833714 88
Compaction stalls 103349 5
Compaction success 270 4
Compaction failures 103079 1
Note again that while this does swap as it's an aggressive workload, the
direct relcim activity and allocation stalls is substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.
I also tried the stutter benchmark. For this, I do not have figures for
NUMA but it's something that does impact UMA so I'll report what is
available
stutter
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)
This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently. This shows a mix because the ideal case of mapping with THP
is not hit as often. However, note that 99% of the mappings complete
13.79% faster. The CPU usage here is particularly interesting
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
User 67.50 0.99
System 1327.88 91.30
Elapsed 2079.00 2128.98
And once again we look at the reclaim figures
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 335241922 1314582827
Major Faults 715 819
Swap Ins 0 0
Swap Outs 0 0
Allocation stalls 532723 0
DMA allocs 0 0
DMA32 allocs 1822364341 1177950222
Normal allocs 1815640808 1517844854
Movable allocs 0 0
Direct pages scanned 21892772 0
Kswapd pages scanned 20015890 41879484
Kswapd pages reclaimed 19961986 41822072
Direct pages reclaimed 21892741 0
Compaction stalls 1065755 0
Compaction success 514 0
Compaction failures 1065241 0
Allocation stalls and all direct reclaim activity is eliminated as well
as compaction-related stalls.
THP gives impressive gains in some cases but only if they are quickly
available. We're not going to reach the point where they are completely
free so lets take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
795ae7a0de |
mm: scale kswapd watermarks in proportion to memory
In machines with 140G of memory and enterprise flash storage, we have seen read and write bursts routinely exceed the kswapd watermarks and cause thundering herds in direct reclaim. Unfortunately, the only way to tune kswapd aggressiveness is through adjusting min_free_kbytes - the system's emergency reserves - which is entirely unrelated to the system's latency requirements. In order to get kswapd to maintain a 250M buffer of free memory, the emergency reserves need to be set to 1G. That is a lot of memory wasted for no good reason. On the other hand, it's reasonable to assume that allocation bursts and overall allocation concurrency scale with memory capacity, so it makes sense to make kswapd aggressiveness a function of that as well. Change the kswapd watermark scale factor from the currently fixed 25% of the tunable emergency reserve to a tunable 0.1% of memory. Beyond 1G of memory, this will produce bigger watermark steps than the current formula in default settings. Ensure that the new formula never chooses steps smaller than that, i.e. 25% of the emergency reserve. On a 140G machine, this raises the default watermark steps - the distance between min and low, and low and high - from 16M to 143M. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d02bd27bd3 |
mm/page_alloc.c: calculate 'available' memory in a separate function
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. This metric would be very useful in VM orchestration software to improve memory management of different VMs under overcommit. This patch (of 2): Factor out calculation of the available memory counter into a separate exportable function, in order to be able to use it in other parts of the kernel. In particular, it appears a relevant metric to report to the hypervisor via virtio-balloon statistics interface (in a followup patch). Signed-off-by: Igor Redko <redkoi@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
698b1b3064 |
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts: - kswapd balancing a zone after a high-order allocation failure - direct compaction to satisfy a high-order allocation, including THP page fault attemps - khugepaged trying to collapse a hugepage - manually from /proc The purpose of compaction is two-fold. The obvious purpose is to satisfy a (pending or future) high-order allocation, and is easy to evaluate. The other purpose is to keep overal memory fragmentation low and help the anti-fragmentation mechanism. The success wrt the latter purpose is more The current situation wrt the purposes has a few drawbacks: - compaction is invoked only when a high-order page or hugepage is not available (or manually). This might be too late for the purposes of keeping memory fragmentation low. - direct compaction increases latency of allocations. Again, it would be better if compaction was performed asynchronously to keep fragmentation low, before the allocation itself comes. - (a special case of the previous) the cost of compaction during THP page faults can easily offset the benefits of THP. - kswapd compaction appears to be complex, fragile and not working in some scenarios. It could also end up compacting for a high-order allocation request when it should be reclaiming memory for a later order-0 request. To improve the situation, we should be able to benefit from an equivalent of kswapd, but for compaction - i.e. a background thread which responds to fragmentation and the need for high-order allocations (including hugepages) somewhat proactively. One possibility is to extend the responsibilities of kswapd, which could however complicate its design too much. It should be better to let kswapd handle reclaim, as order-0 allocations are often more critical than high-order ones. Another possibility is to extend khugepaged, but this kthread is a single instance and tied to THP configs. This patch goes with the option of a new set of per-node kthreads called kcompactd, and lays the foundations, without introducing any new tunables. The lifecycle mimics kswapd kthreads, including the memory hotplug hooks. For compaction, kcompactd uses the standard compaction_suitable() and ompact_finished() criteria and the deferred compaction functionality. Unlike direct compaction, it uses only sync compaction, as there's no allocation latency to minimize. This patch doesn't yet add a call to wakeup_kcompactd. The kswapd compact/reclaim loop for high-order pages will be replaced by waking up kcompactd in the next patch with the description of what's wrong with the old approach. Waking up of the kcompactd threads is also tied to kswapd activity and follows these rules: - we don't want to affect any fastpaths, so wake up kcompactd only from the slowpath, as it's done for kswapd - if kswapd is doing reclaim, it's more important than compaction, so don't invoke kcompactd until kswapd goes to sleep - the target order used for kswapd is passed to kcompactd Future possible future uses for kcompactd include the ability to wake up kcompactd on demand in special situations, such as when hugepages are not available (currently not done due to __GFP_NO_KSWAPD) or when a fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also possible to perform periodic compaction with kcompactd. [arnd@arndb.de: fix build errors with kcompactd] [paul.gortmaker@windriver.com: don't use modular references for non modular code] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
505f6d22db |
sound: query dynamic DEBUG_PAGEALLOC setting
We can disable debug_pagealloc processing even if the code is compiled with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query whether it is enabled or not in runtime. [akpm@linux-foundation.org: export _debug_pagealloc_enabled to modules] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Takashi Iwai <tiwai@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
832fc1de01 |
/proc/kpageflags: return KPF_BUDDY for "tail" buddy pages
Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
is inconvenient when grasping how free pages are distributed. This
patch sets KPF_BUDDY for such pages.
With this patch:
$ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
MemFree: 3134992 kB
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000400 779272 3044 __________B_______________________________ buddy
0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap
total 783657 3061
783657 pages is 3134628 kB (roughly consistent with the global counter,)
so it's OK.
[akpm@linux-foundation.org: update comment, per Naoya]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
7cf91a98e6 |
mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
There is a performance drop report due to hugepage allocation and in there half of cpu time are spent on pageblock_pfn_to_page() in compaction [1]. In that workload, compaction is triggered to make hugepage but most of pageblocks are un-available for compaction due to pageblock type and skip bit so compaction usually fails. Most costly operations in this case is to find valid pageblock while scanning whole zone range. To check if pageblock is valid to compact, valid pfn within pageblock is required and we can obtain it by calling pageblock_pfn_to_page(). This function checks whether pageblock is in a single zone and return valid pfn if possible. Problem is that we need to check it every time before scanning pageblock even if we re-visit it and this turns out to be very expensive in this workload. Although we have no way to skip this pageblock check in the system where hole exists at arbitrary position, we can use cached value for zone continuity and just do pfn_to_page() in the system where hole doesn't exist. This optimization considerably speeds up in above workload. Before vs After Max: 1096 MB/s vs 1325 MB/s Min: 635 MB/s 1015 MB/s Avg: 899 MB/s 1194 MB/s Avg is improved by roughly 30% [2]. [1]: http://www.spinics.net/lists/linux-mm/msg97378.html [2]: https://lkml.org/lkml/2015/12/9/23 [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Aaron Lu <aaron.lu@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
1414c7f4f7 |
mm/page_poisoning.c: allow for zero poisoning
By default, page poisoning uses a poison value (0xaa) on free. If this is changed to 0, the page is not only sanitized but zeroing on alloc with __GFP_ZERO can be skipped as well. The tradeoff is that detecting corruption from the poisoning is harder to detect. This feature also cannot be used with hibernation since pages are not guaranteed to be zeroed after hibernation. Credit to Grsecurity/PaX team for inspiring this work Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mathias Krause <minipli@googlemail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |