mirror of
https://github.com/ukui/kernel.git
synced 2026-03-09 10:07:04 -07:00
5feeca3c1e39c01f9ef5abc94dea94021ccf94fc
3149 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
78bbf153fa |
Merge tag 'regmap-fix-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fix from Mark Brown: "A fix for an issue with double locking that was introduced earlier this release. I'd missed in review that we were already in a locked region when trying to drop part of the cache" * tag 'regmap-fix-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: regmap: fix deadlock on _regmap_raw_write() error path |
||
|
|
f0aa1ce625 |
regmap: fix deadlock on _regmap_raw_write() error path
Commit
|
||
|
|
778935778c |
PM / runtime: Use _rcuidle for runtime suspend tracepoints
Further testing with false negatives suppressed by commit
|
||
|
|
ec9a03d47e |
Merge tag 'regmap-fix-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fixes from Mark Brown:
"Several fixes here, the main one being the change from Lars-Peter
which I'd been letting soak in -next since the merge window in case it
uncovered further issues as it's a minimal fix rather than a change
addressing the root cause of the problems (which would've been too
invasive for -rc):
- The biggest change is a fix from Lars-Peter to ensure that we don't
create overlapping rbtree nodes which in turn avoids returning
corrupt cache values to users, fixing some issues that were exposed
by some recent optimisations with certain access patterns but had
been present for a long time.
- A fix from Elaine Zhang to stop us updating the cache if we get an
I/O error when writing to the hardware.
- A fix fromm Maarten ter Huurne to avoid uninitialized defaults in
cases where we have non-readable registers but are initializing the
cache by reading from the device"
* tag 'regmap-fix-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: drop cache if the bus transfer error
regmap: rbtree: Avoid overlapping nodes
regmap: cache: Fix num_reg_defaults computation from reg_defaults_raw
|
||
|
|
787ad90332 | Merge remote-tracking branch 'regmap/fix/rbtree' into regmap-linus | ||
|
|
f735aa2790 | Merge remote-tracking branch 'regmap/fix/cache' into regmap-linus | ||
|
|
d7737ce964 |
PM / runtime: Add _rcuidle suffix to allow rpm_idle() use from idle
This commit appends a few _rcuidle suffixes to fix the following RCU-used-from-idle bug: > =============================== > [ INFO: suspicious RCU usage. ] > 4.6.0-rc5-next-20160426+ #1116 Not tainted > ------------------------------- > include/trace/events/rpm.h:95 suspicious rcu_dereference_check() usage! > > other info that might help us debug this: > > > RCU used illegally from idle CPU! > rcu_scheduler_active = 1, debug_locks = 0 > RCU used illegally from extended quiescent state! > 1 lock held by swapper/0/0: > #0: (&(&dev->power.lock)->rlock){-.-...}, at: [<c052cc2c>] __rpm_callback+0x58/0x60 > > stack backtrace: > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0-rc5-next-20160426+ #1116 > Hardware name: Generic OMAP36xx (Flattened Device Tree) > [<c0110290>] (unwind_backtrace) from [<c010c3a8>] (show_stack+0x10/0x14) > [<c010c3a8>] (show_stack) from [<c047fd68>] (dump_stack+0xb0/0xe4) > [<c047fd68>] (dump_stack) from [<c052d5d0>] (rpm_suspend+0x580/0x768) > [<c052d5d0>] (rpm_suspend) from [<c052ec58>] (__pm_runtime_suspend+0x64/0x84) > [<c052ec58>] (__pm_runtime_suspend) from [<c04bf25c>] (omap2_gpio_prepare_for_idle+0x5c/0x70) > [<c04bf25c>] (omap2_gpio_prepare_for_idle) from [<c0125568>] (omap_sram_idle+0x140/0x244) > [<c0125568>] (omap_sram_idle) from [<c01269dc>] (omap3_enter_idle_bm+0xfc/0x1ec) > [<c01269dc>] (omap3_enter_idle_bm) from [<c0601bdc>] (cpuidle_enter_state+0x80/0x3d4) > [<c0601bdc>] (cpuidle_enter_state) from [<c0183b08>] (cpu_startup_entry+0x198/0x3a0) > [<c0183b08>] (cpu_startup_entry) from [<c0b00c0c>] (start_kernel+0x354/0x3c8) > [<c0b00c0c>] (start_kernel) from [<8000807c>] (0x8000807c) In the immortal words of Steven Rostedt, "*Whack* *Whack* *Whack*!!!" Reported-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Guenter Roeck <linux@roeck-us.net> WhACKED-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
d44c950e93 |
PM / runtime: Add _rcuidle suffix to allow rpm_resume() to be called from idle
This commit applies another _rcuidle suffix to fix an RCU use from idle. > =============================== > [ INFO: suspicious RCU usage. ] > 4.6.0-rc5-next-20160426+ #1122 Not tainted > ------------------------------- > include/trace/events/rpm.h:69 suspicious rcu_dereference_check() usage! > > other info that might help us debug this: > > > RCU used illegally from idle CPU! > rcu_scheduler_active = 1, debug_locks = 0 > RCU used illegally from extended quiescent state! > 1 lock held by swapper/0/0: > #0: (&(&dev->power.lock)->rlock){-.-...}, at: [<c052e3dc>] __pm_runtime_resume+0x3c/0x64 > > stack backtrace: > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0-rc5-next-20160426+ #1122 > Hardware name: Generic OMAP36xx (Flattened Device Tree) > [<c0110290>] (unwind_backtrace) from [<c010c3a8>] (show_stack+0x10/0x14) > [<c010c3a8>] (show_stack) from [<c047fd68>] (dump_stack+0xb0/0xe4) > [<c047fd68>] (dump_stack) from [<c052e178>] (rpm_resume+0x5cc/0x7f4) > [<c052e178>] (rpm_resume) from [<c052e3ec>] (__pm_runtime_resume+0x4c/0x64) > [<c052e3ec>] (__pm_runtime_resume) from [<c04bf2c4>] (omap2_gpio_resume_after_idle+0x54/0x68) > [<c04bf2c4>] (omap2_gpio_resume_after_idle) from [<c01269dc>] (omap3_enter_idle_bm+0xfc/0x1ec) > [<c01269dc>] (omap3_enter_idle_bm) from [<c060198c>] (cpuidle_enter_state+0x80/0x3d4) > [<c060198c>] (cpuidle_enter_state) from [<c0183b08>] (cpu_startup_entry+0x198/0x3a0) > [<c0183b08>] (cpu_startup_entry) from [<c0b00c0c>] (start_kernel+0x354/0x3c8) > [<c0b00c0c>] (start_kernel) from [<8000807c>] (0x8000807c) Reported-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
815806e39b |
regmap: drop cache if the bus transfer error
regmap_write ->_regmap_raw_write -->regcache_write first and than use map->bus->write to wirte i2c or spi But if the i2c or spi transfer failed, But the cache is updated, So if I use regmap_read will get the cache data which is not the real register value. Signed-off-by: Elaine Zhang <zhangqing@rock-chips.com> Signed-off-by: Mark Brown <broonie@kernel.org> |
||
|
|
11d8ec408d |
Merge tag 'pm-extra-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"A few more fixes and cleanups in the x86-64 low-level hibernation
code, PM core, cpufreq (Kconfig and intel_pstate), and the operating
points framework.
Specifics:
- Prevent the low-level assembly hibernate code on x86-64 from
referring to __PAGE_OFFSET directly as a symbol which doesn't work
when the kernel identity mapping base is randomized, in which case
__PAGE_OFFSET is a variable (Rafael Wysocki).
- Avoid selecting CPU_FREQ_STAT by default as the statistics are not
required for proper cpufreq operation (Borislav Petkov).
- Add Skylake-X and Broadwell-X IDs to the intel_pstate's list of
processors where out-of-band (OBB) control of P-states is possible
and if that is in use, intel_pstate should not attempt to manage
P-states (Srinivas Pandruvada).
- Drop some unnecessary checks from the wakeup IRQ handling code in
the PM core (Markus Elfring).
- Reduce the number operating performance point (OPP) lookups in one
of the OPP framework's helper functions (Jisheng Zhang)"
* tag 'pm-extra-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
x86/power/64: Do not refer to __PAGE_OFFSET from assembly code
cpufreq: Do not default-yes CPU_FREQ_STAT
cpufreq: intel_pstate: Add more out-of-band IDs
PM / OPP: optimize dev_pm_opp_set_rate() performance a bit
PM-wakeup: Delete unnecessary checks before three function calls
|
||
|
|
6c84239d59 |
Merge tag 'rtc-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"RTC for 4.8
Cleanups:
- huge cleanup of rtc-generic and char/genrtc this allowed to cleanup
rtc-cmos, rtc-sh, rtc-m68k, rtc-powerpc and rtc-parisc
- move mn10300 to rtc-cmos
Subsystem:
- fix wakealarms after hibernate
- multiples fixes for rctest
- simplify implementations of .read_alarm
New drivers:
- Maxim MAX6916
Drivers:
- ds1307: fix weekday
- m41t80: add wakeup support
- pcf85063: add support for PCF85063A variant
- rv8803: extend i2c fix and other fixes
- s35390a: fix alarm reading, this fixes instant reboot after
shutdown for QNAP TS-41x
- s3c: clock fixes"
* tag 'rtc-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (65 commits)
rtc: rv8803: Clear V1F when setting the time
rtc: rv8803: Stop the clock while setting the time
rtc: rv8803: Always apply the I²C workaround
rtc: rv8803: Fix read day of week
rtc: rv8803: Remove the check for valid time
rtc: rv8803: Kconfig: Indicate rx8900 support
rtc: asm9260: remove .owner field for driver
rtc: at91sam9: Fix missing spin_lock_init()
rtc: m41t80: add suspend handlers for alarm IRQ
rtc: m41t80: make it a real error message
rtc: pcf85063: Add support for the PCF85063A device
rtc: pcf85063: fix year range
rtc: hym8563: in .read_alarm set .tm_sec to 0 to signal minute accuracy
rtc: explicitly set tm_sec = 0 for drivers with minute accurancy
rtc: s3c: Add s3c_rtc_{enable/disable}_clk in s3c_rtc_setfreq()
rtc: s3c: Remove unnecessary call to disable already disabled clock
rtc: abx80x: use devm_add_action_or_reset()
rtc: m41t80: use devm_add_action_or_reset()
rtc: fix a typo and reduce three empty lines to one
rtc: s35390a: improve two comments in .set_alarm
...
|
||
|
|
e2b3b80de5 |
Merge branches 'pm-sleep', 'pm-cpufreq', 'pm-core' and 'pm-opp'
* pm-sleep: x86/power/64: Do not refer to __PAGE_OFFSET from assembly code * pm-cpufreq: cpufreq: Do not default-yes CPU_FREQ_STAT cpufreq: intel_pstate: Add more out-of-band IDs * pm-core: PM-wakeup: Delete unnecessary checks before three function calls * pm-opp: PM / OPP: optimize dev_pm_opp_set_rate() performance a bit |
||
|
|
1bc8da4e14 |
regmap: rbtree: Avoid overlapping nodes
When searching for a suitable node that should be used for inserting a new register, which does not fall within the range of any existing node, we not only looks for nodes which are directly adjacent to the new register, but for nodes within a certain proximity. This is done to avoid creating lots of small nodes with just a few registers spacing in between, which would increase memory usage as well as tree traversal time. This means there might be multiple node candidates which fall within the proximity range of the new register. If we choose the first node we encounter, under certain register insertion patterns it is possible to end up with overlapping ranges. This will break order in the rbtree and can cause the cached register value to become corrupted. E.g. take the simplified example where the proximity range is 2 and the register insertion sequence is 1, 4, 2, 3, 5. * Insert of register 1 creates a new node, this is the root of the rbtree * Insert of register 4 creates a new node, which is inserted to the right of the root. * Insert of register 2 gets inserted to the first node * Insert of register 3 gets inserted to the first node * Insert of register 5 also gets inserted into the first node since this is the first node encountered and it is within the proximity range. Now there are two overlapping nodes. To avoid this always choose the node that is closest to the new register. This will ensure that nodes will not overlap. The tree traversal is still done as a binary search, we just don't stop at the first node found. So the complexity of the algorithm stays within the same order. Ideally if a new register is in the range of two adjacent blocks those blocks should be merged, but that is a much more invasive change and left for later. The issue was initially introduced in commit |
||
|
|
a098ecd2fa |
firmware: support loading into a pre-allocated buffer
Some systems are memory constrained but they need to load very large firmwares. The firmware subsystem allows drivers to request this firmware be loaded from the filesystem, but this requires that the entire firmware be loaded into kernel memory first before it's provided to the driver. This can lead to a situation where we map the firmware twice, once to load the firmware into kernel memory and once to copy the firmware into the final resting place. This creates needless memory pressure and delays loading because we have to copy from kernel memory to somewhere else. Let's add a request_firmware_into_buf() API that allows drivers to request firmware be loaded directly into a pre-allocated buffer. This skips the intermediate step of allocating a buffer in kernel memory to hold the firmware image while it's read from the filesystem. It also requires that drivers know how much memory they'll require before requesting the firmware and negates any benefits of firmware caching because the firmware layer doesn't manage the buffer lifetime. For a 16MB buffer, about half the time is spent performing a memcpy from the buffer to the final resting place. I see loading times go from 0.081171 seconds to 0.047696 seconds after applying this patch. Plus the vmalloc pressure is reduced. This is based on a patch from Vikram Mulukutla on codeaurora.org: https://www.codeaurora.org/cgit/quic/la/kernel/msm-3.18/commit/drivers/base/firmware_class.c?h=rel/msm-3.18&id=0a328c5f6cd999f5c591f172216835636f39bcb5 Link: http://lkml.kernel.org/r/20160607164741.31849-4-stephen.boyd@linaro.org Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Mark Brown <broonie@kernel.org> Cc: Ming Lei <ming.lei@canonical.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
0e742e9275 |
firmware: provide infrastructure to make fw caching optional
Some low memory systems with complex peripherals cannot afford to have the relatively large firmware images taking up valuable memory during suspend and resume. Change the internal implementation of firmware_class to disallow caching based on a configurable option. In the near future, variants of request_firmware will take advantage of this feature. Link: http://lkml.kernel.org/r/20160607164741.31849-3-stephen.boyd@linaro.org [stephen.boyd@linaro.org: Drop firmware_desc design and use flags] Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org> Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Ming Lei <ming.lei@canonical.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9ccf981198 |
firmware: consolidate kmap/read/write logic
Some systems are memory constrained but they need to load very large firmwares. The firmware subsystem allows drivers to request this firmware be loaded from the filesystem, but this requires that the entire firmware be loaded into kernel memory first before it's provided to the driver. This can lead to a situation where we map the firmware twice, once to load the firmware into kernel memory and once to copy the firmware into the final resting place. This design creates needless memory pressure and delays loading because we have to copy from kernel memory to somewhere else. This patch sets adds support to the request firmware API to load the firmware directly into a pre-allocated buffer, skipping the intermediate copying step and alleviating memory pressure during firmware loading. The drawback is that we can't use the firmware caching feature because the memory for the firmware cache is not managed by the firmware layer. This patch (of 3): We use similar structured code to read and write the kmapped firmware pages. The only difference is read copies from the kmap region and write copies to it. Consolidate this into one function to reduce duplication. Link: http://lkml.kernel.org/r/20160607164741.31849-2-stephen.boyd@linaro.org Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Ming Lei <ming.lei@canonical.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bd721ea73e |
treewide: replace obsolete _refok by __ref
There was only one use of __initdata_refok and __exit_refok
__init_refok was used 46 times against 82 for __ref.
Those definitions are obsolete since commit
|
||
|
|
c9b95e5961 |
Merge tag 'sound-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"The majority of this update is about ASoC, including a few new
drivers, and the rest are mostly minor changes. The only substantial
change in ALSA core is about the additional error handling in the
compress-offload API. Below are highlights:
- Add the error propagating support in compress-offload API
- HD-audio: a usual Dell headset fixup, an Intel HDMI/DP fix, and the
default mixer setup change ot turn off the loopback
- Lots of updates for ASoC Intel drivers, mostly board support and
bug fixing, and to the NAU8825 driver
- Work on generalizing bits of simple-card to allow more code sharing
with the Renesas rsrc-card (which can't use simple-card due to DPCM)
- Removal of the Odroid X2 driver due to replacement with simple-card
- Support for several new Mediatek platforms and associated boards
- New ASoC drivers for Allwinner A10, Analog Devices ADAU7002,
Broadcom Cygnus, Cirrus Logic CS35L33 and CS53L30, Maxim MAX8960
and MAX98504, Realtek RT5514 and Wolfson WM8758"
* tag 'sound-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (278 commits)
sound: oss: Use kernel_read_file_from_path() for mod_firmware_load()
ASoC: Intel: Skylake: Delete an unnecessary check before the function call "release_firmware"
ASoC: Intel: Skylake: Fix NULL Pointer exception in dynamic_debug.
ASoC: samsung: Specify DMA channels through struct snd_dmaengine_pcm_config
ASoC: samsung: Fix error paths in the I2S driver's probe()
ASoC: cs53l30: Fix bit shift issue of TDM mode
ASoC: cs53l30: Fix a bug for TDM slot location validation
ASoC: rockchip: correct the spdif clk
ALSA: echoaudio: purge contradictions between dimension matrix members and total number of members
ASoC: rsrc-card: use asoc_simple_card_parse_card_name()
ASoC: rsrc-card: use asoc_simple_dai instead of rsrc_card_dai
ASoC: rsrc-card: use asoc_simple_card_parse_dailink_name()
ASoC: simple-card: use asoc_simple_card_parse_card_name()
ASoC: simple-card-utils: add asoc_simple_card_parse_card_name()
ASoC: simple-card: use asoc_simple_card_parse_dailink_name()
ASoC: simple-card-utils: add asoc_simple_card_set_dailink_name()
ASoC: nau8825: drop redundant idiom when converting integer to boolean
ASoC: nau8825: jack connection decision with different insertion logic
ASoC: mediatek: Add HDMI dai-links to the mt8173-rt5650 machine driver
ASoC: mediatek: mt2701: fix non static symbol warning
...
|
||
|
|
b2c7f5d9c9 |
regmap: cache: Fix num_reg_defaults computation from reg_defaults_raw
In
|
||
|
|
d30dd8be06 |
mm: track NR_KERNEL_STACK in KiB instead of number of stacks
Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone. This only makes sense if each kernel stack exists entirely in one zone, and allowing vmapped stacks could break this assumption. Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all architectures. Keep it simple and use KiB. Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
11fb998986 |
mm: move most file-based accounting to the node
There are now a number of accounting oddities such as mapped file pages being accounted for on the node while the total number of file pages are accounted on the zone. This can be coped with to some extent but it's confusing so this patch moves the relevant file-based accounted. Due to throttling logic in the page allocator for reliable OOM detection, it is still necessary to track dirty and writeback pages on a per-zone basis. [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting] Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4b9d0fab71 |
mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
NR_FILE_PAGES is the number of file pages. NR_FILE_MAPPED is the number of mapped file pages. NR_ANON_PAGES is the number of mapped anon pages. This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and NR_ANON_PAGES for mapped pages. This patch renames NR_ANON_PAGES so we have NR_FILE_PAGES is the number of file pages. NR_FILE_MAPPED is the number of mapped file pages. NR_ANON_MAPPED is the number of mapped anon pages. Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
50658e2e04 |
mm: move page mapped accounting to the node
Reclaim makes decisions based on the number of pages that are mapped but it's mixing node and zone information. Account NR_FILE_MAPPED and NR_ANON_PAGES pages on the node. Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
599d0c954f |
mm, vmscan: move LRU lists to node
This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
75ef718405 |
mm, vmstat: add infrastructure for per-node vmstats
Patchset: "Move LRU page reclaim from zones to nodes v9"
This series moves LRUs from the zones to the node. While this is a
current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details.
Some of the broad motivations for this are;
1. The residency of a page partially depends on what zone the page was
allocated from. This is partially combatted by the fair zone allocation
policy but that is a partial solution that introduces overhead in the
page allocator paths.
2. Currently, reclaim on node 0 behaves slightly different to node 1. For
example, direct reclaim scans in zonelist order and reclaims even if
the zone is over the high watermark regardless of the age of pages
in that LRU. Kswapd on the other hand starts reclaim on the highest
unbalanced zone. A difference in distribution of file/anon pages due
to when they were allocated results can result in a difference in
again. While the fair zone allocation policy mitigates some of the
problems here, the page reclaim results on a multi-zone node will
always be different to a single-zone node.
it was scheduled on as a result.
3. kswapd and the page allocator scan zones in the opposite order to
avoid interfering with each other but it's sensitive to timing. This
mitigates the page allocator using pages that were allocated very recently
in the ideal case but it's sensitive to timing. When kswapd is allocating
from lower zones then it's great but during the rebalancing of the highest
zone, the page allocator and kswapd interfere with each other. It's worse
if the highest zone is small and difficult to balance.
4. slab shrinkers are node-based which makes it harder to identify the exact
relationship between slab reclaim and LRU reclaim.
The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have relatively low highmem:lowmem
ratios than we worried about in the past.
Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.
The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.
pagealloc
---------
This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
Min total-odr0-1 490.00 ( 0.00%) 457.00 ( 6.73%)
Min total-odr0-2 347.00 ( 0.00%) 329.00 ( 5.19%)
Min total-odr0-4 288.00 ( 0.00%) 273.00 ( 5.21%)
Min total-odr0-8 251.00 ( 0.00%) 239.00 ( 4.78%)
Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%)
Min total-odr0-32 223.00 ( 0.00%) 211.00 ( 5.38%)
Min total-odr0-64 217.00 ( 0.00%) 208.00 ( 4.15%)
Min total-odr0-128 214.00 ( 0.00%) 204.00 ( 4.67%)
Min total-odr0-256 250.00 ( 0.00%) 230.00 ( 8.00%)
Min total-odr0-512 271.00 ( 0.00%) 269.00 ( 0.74%)
Min total-odr0-1024 291.00 ( 0.00%) 282.00 ( 3.09%)
Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%)
Min total-odr0-4096 311.00 ( 0.00%) 309.00 ( 0.64%)
Min total-odr0-8192 316.00 ( 0.00%) 314.00 ( 0.63%)
Min total-odr0-16384 317.00 ( 0.00%) 315.00 ( 0.63%)
Min total-odr1-1 742.00 ( 0.00%) 712.00 ( 4.04%)
Min total-odr1-2 562.00 ( 0.00%) 530.00 ( 5.69%)
Min total-odr1-4 457.00 ( 0.00%) 433.00 ( 5.25%)
Min total-odr1-8 411.00 ( 0.00%) 381.00 ( 7.30%)
Min total-odr1-16 381.00 ( 0.00%) 356.00 ( 6.56%)
Min total-odr1-32 372.00 ( 0.00%) 346.00 ( 6.99%)
Min total-odr1-64 372.00 ( 0.00%) 343.00 ( 7.80%)
Min total-odr1-128 375.00 ( 0.00%) 351.00 ( 6.40%)
Min total-odr1-256 379.00 ( 0.00%) 351.00 ( 7.39%)
Min total-odr1-512 385.00 ( 0.00%) 355.00 ( 7.79%)
Min total-odr1-1024 386.00 ( 0.00%) 358.00 ( 7.25%)
Min total-odr1-2048 390.00 ( 0.00%) 362.00 ( 7.18%)
Min total-odr1-4096 390.00 ( 0.00%) 362.00 ( 7.18%)
Min total-odr1-8192 388.00 ( 0.00%) 363.00 ( 6.44%)
This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;
4.7.0-rc4 4.7.0-rc4
mmotm-20160623nodelru-v8
User 189.19 191.80
System 2604.45 2533.56
Elapsed 2855.30 2786.39
The vmstats also showed that the fair zone allocation policy was definitely
removed as can be seen here;
4.7.0-rc3 4.7.0-rc3
mmotm-20160623 nodelru-v8
DMA32 allocs 28794729769 0
Normal allocs 48432501431 77227309877
Movable allocs 0 0
tiobench on ext4
----------------
tiobench is a benchmark that artifically benefits if old pages remain resident
while new pages get reclaimed. The fair zone allocation policy mitigates this
problem so pages age fairly. While the benchmark has problems, it is important
that tiobench performance remains constant as it implies that page aging
problems that the fair zone allocation policy fixes are not re-introduced.
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
Min PotentialReadSpeed 89.65 ( 0.00%) 90.21 ( 0.62%)
Min SeqRead-MB/sec-1 82.68 ( 0.00%) 82.01 ( -0.81%)
Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.07 ( -0.95%)
Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.92 ( -0.28%)
Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.19 ( 0.43%)
Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.22 ( -0.03%)
Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.88 ( 0.00%)
Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.92 ( -3.16%)
Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.34 ( -6.29%)
Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.60 ( -0.62%)
Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.90 ( 5.56%)
Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 76.85 ( 0.58%)
Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.54 ( -0.77%)
Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 80.13 ( 0.10%)
Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 73.20 ( 0.44%)
Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 76.44 ( 0.70%)
Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.14 ( -3.39%)
Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.03 ( 0.98%)
Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.98 ( -6.67%)
Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%)
Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.93 ( 1.09%)
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 approx-v9
User 645.72 525.90
System 403.85 331.75
Elapsed 6795.36 6783.67
This shows that the series has little or not impact on tiobench which is
desirable and a reduction in system CPU usage. It indicates that the fair
zone allocation policy was removed in a manner that didn't reintroduce
one class of page aging bug. There were only minor differences in overall
reclaim activity
4.7.0-rc4 4.7.0-rc4
mmotm-20160623nodelru-v8
Minor Faults 645838 647465
Major Faults 573 640
Swap Ins 0 0
Swap Outs 0 0
DMA allocs 0 0
DMA32 allocs 46041453 44190646
Normal allocs 78053072 79887245
Movable allocs 0 0
Allocation stalls 24 67
Stall zone DMA 0 0
Stall zone DMA32 0 0
Stall zone Normal 0 2
Stall zone HighMem 0 0
Stall zone Movable 0 65
Direct pages scanned 10969 30609
Kswapd pages scanned 93375144 93492094
Kswapd pages reclaimed 93372243 93489370
Direct pages reclaimed 10969 30609
Kswapd efficiency 99% 99%
Kswapd velocity 13741.015 13781.934
Direct efficiency 100% 100%
Direct velocity 1.614 4.512
Percentage direct scans 0% 0%
kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 4 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).
pgbench read-only large configuration on ext4
---------------------------------------------
pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe
pgbench Transactions
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v8
Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%)
Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%)
Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%)
Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%)
Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%)
Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%)
Negligible differences again. As with tiobench, overall reclaim activity
was comparable.
bonnie++ on ext4
----------------
No interesting performance difference, negligible differences on reclaim
stats.
paralleldd on ext4
------------------
This workload uses varying numbers of dd instances to read large amounts of
data from disk.
4.7.0-rc3 4.7.0-rc3
mmotm-20160623 nodelru-v9
Amean Elapsd-1 186.04 ( 0.00%) 189.41 ( -1.82%)
Amean Elapsd-3 192.27 ( 0.00%) 191.38 ( 0.46%)
Amean Elapsd-5 185.21 ( 0.00%) 182.75 ( 1.33%)
Amean Elapsd-7 183.71 ( 0.00%) 182.11 ( 0.87%)
Amean Elapsd-12 180.96 ( 0.00%) 181.58 ( -0.35%)
Amean Elapsd-16 181.36 ( 0.00%) 183.72 ( -1.30%)
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
User 1548.01 1552.44
System 8609.71 8515.08
Elapsed 3587.10 3594.54
There is little or no change in performance but some drop in system CPU usage.
4.7.0-rc3 4.7.0-rc3
mmotm-20160623 nodelru-v9
Minor Faults 362662 367360
Major Faults 1204 1143
Swap Ins 22 0
Swap Outs 2855 1029
DMA allocs 0 0
DMA32 allocs 31409797 28837521
Normal allocs 46611853 49231282
Movable allocs 0 0
Direct pages scanned 0 0
Kswapd pages scanned 40845270 40869088
Kswapd pages reclaimed 40830976 40855294
Direct pages reclaimed 0 0
Kswapd efficiency 99% 99%
Kswapd velocity 11386.711 11369.769
Direct efficiency 100% 100%
Direct velocity 0.000 0.000
Percentage direct scans 0% 0%
Page writes by reclaim 2855 1029
Page writes file 0 0
Page writes anon 2855 1029
Page reclaim immediate 771 1628
Sector Reads 293312636 293536360
Sector Writes 18213568 18186480
Page rescued immediate 0 0
Slabs scanned 128257 132747
Direct inode steals 181 56
Kswapd inode steals 59 1131
It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in numbers of inodes reclaimed but the workload has few active inodes and
is likely a timing artifact.
stutter
-------
stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.
stutter
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v8
Min mmap 16.6283 ( 0.00%) 13.4258 ( 19.26%)
1st-qrtle mmap 54.7570 ( 0.00%) 34.9121 ( 36.24%)
2nd-qrtle mmap 57.3163 ( 0.00%) 46.1147 ( 19.54%)
3rd-qrtle mmap 58.9976 ( 0.00%) 47.1882 ( 20.02%)
Max-90% mmap 59.7433 ( 0.00%) 47.4453 ( 20.58%)
Max-93% mmap 60.1298 ( 0.00%) 47.6037 ( 20.83%)
Max-95% mmap 73.4112 ( 0.00%) 82.8719 (-12.89%)
Max-99% mmap 92.8542 ( 0.00%) 88.8870 ( 4.27%)
Max mmap 1440.6569 ( 0.00%) 121.4201 ( 91.57%)
Mean mmap 59.3493 ( 0.00%) 42.2991 ( 28.73%)
Best99%Mean mmap 57.2121 ( 0.00%) 41.8207 ( 26.90%)
Best95%Mean mmap 55.9113 ( 0.00%) 39.9620 ( 28.53%)
Best90%Mean mmap 55.6199 ( 0.00%) 39.3124 ( 29.32%)
Best50%Mean mmap 53.2183 ( 0.00%) 33.1307 ( 37.75%)
Best10%Mean mmap 45.9842 ( 0.00%) 20.4040 ( 55.63%)
Best5%Mean mmap 43.2256 ( 0.00%) 17.9654 ( 58.44%)
Best1%Mean mmap 32.9388 ( 0.00%) 16.6875 ( 49.34%)
This shows a number of improvements with the worst-case outlier greatly
improved.
Some of the vmstats are interesting
4.7.0-rc4 4.7.0-rc4
mmotm-20160623nodelru-v8
Swap Ins 163 502
Swap Outs 0 0
DMA allocs 0 0
DMA32 allocs 618719206 1381662383
Normal allocs 891235743 564138421
Movable allocs 0 0
Allocation stalls 2603 1
Direct pages scanned 216787 2
Kswapd pages scanned 50719775 41778378
Kswapd pages reclaimed 41541765 41777639
Direct pages reclaimed 209159 0
Kswapd efficiency 81% 99%
Kswapd velocity 16859.554 14329.059
Direct efficiency 96% 0%
Direct velocity 72.061 0.001
Percentage direct scans 0% 0%
Page writes by reclaim 6215049 0
Page writes file 6215049 0
Page writes anon 0 0
Page reclaim immediate 70673 90
Sector Reads 81940800 81680456
Sector Writes 100158984 98816036
Page rescued immediate 0 0
Slabs scanned 1366954 22683
While this is not guaranteed in all cases, this particular test showed
a large reduction in direct reclaim activity. It's also worth noting
that no page writes were issued from reclaim context.
This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
that area.
1. Reclaim/compaction is going to be affected because the amount of reclaim is
no longer targetted at a specific zone. Compaction works on a per-zone basis
so there is no guarantee that reclaiming a few THP's worth page pages will
have a positive impact on compaction success rates.
2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
are called is now different. This may or may not be a problem but if it
is, it'll be because shrinkers are not called enough and some balancing
is required.
3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
distributed between zones and the fair zone allocation policy used to do
something very similar for anon. The distribution is now different but not
necessarily in any way that matters but it's still worth bearing in mind.
VM statistic counters for reclaim decisions are zone-based. If the kernel
is to reclaim on a per-node basis then we need to track per-node
statistics but there is no infrastructure for that. The most notable
change is that the old node_page_state is renamed to
sum_zone_node_page_state. The new node_page_state takes a pglist_data and
uses per-node stats but none exist yet. There is some renaming such as
vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical
patch with no functional change. There is a lot of similarity between the
node and zone helpers which is unfortunate but there was no obvious way of
reusing the code and maintaining type safety.
Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|