Commit Graph

3149 Commits

Author SHA1 Message Date
Linus Torvalds
78bbf153fa Merge tag 'regmap-fix-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fix from Mark Brown:
 "A fix for an issue with double locking that was introduced earlier
  this release.  I'd missed in review that we were already in a locked
  region when trying to drop part of the cache"

* tag 'regmap-fix-v4.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: fix deadlock on _regmap_raw_write() error path
2016-09-23 11:50:49 -07:00
Nikita Yushchenko
f0aa1ce625 regmap: fix deadlock on _regmap_raw_write() error path
Commit 815806e39b ("regmap: drop cache if the bus transfer error")
added a call to regcache_drop_region() to error path in
_regmap_raw_write(). However that path runs with regmap lock taken,
and regcache_drop_region() tries to re-take it, causing a deadlock.
Fix that by calling map->cache_ops->drop() directly.

Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
2016-09-22 11:24:22 +01:00
Paul E. McKenney
778935778c PM / runtime: Use _rcuidle for runtime suspend tracepoints
Further testing with false negatives suppressed by commit 293e2421fe
("rcu: Remove superfluous versions of rcu_read_lock_sched_held()")
identified a few more unprotected uses of RCU from the idle loop.
Because RCU actively ignores idle-loop code (for energy-efficiency
reasons, among other things), using RCU from the idle loop can result
in too-short grace periods, in turn resulting in arbitrary misbehavior.

The affected function is rpm_suspend().

The resulting lockdep-RCU splat is as follows:

------------------------------------------------------------------------

Warning from omap3

===============================
[ INFO: suspicious RCU usage. ]
4.6.0-rc5-next-20160426+ #1112 Not tainted
-------------------------------
include/trace/events/rpm.h:63 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 0
RCU used illegally from extended quiescent state!
1 lock held by swapper/0/0:
 #0:  (&(&dev->power.lock)->rlock){-.-...}, at: [<c052ee24>] __pm_runtime_suspend+0x54/0x84

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0-rc5-next-20160426+ #1112
Hardware name: Generic OMAP36xx (Flattened Device Tree)
[<c0110308>] (unwind_backtrace) from [<c010c3a8>] (show_stack+0x10/0x14)
[<c010c3a8>] (show_stack) from [<c047fec8>] (dump_stack+0xb0/0xe4)
[<c047fec8>] (dump_stack) from [<c052d7b4>] (rpm_suspend+0x604/0x7e4)
[<c052d7b4>] (rpm_suspend) from [<c052ee34>] (__pm_runtime_suspend+0x64/0x84)
[<c052ee34>] (__pm_runtime_suspend) from [<c04bf3bc>] (omap2_gpio_prepare_for_idle+0x5c/0x70)
[<c04bf3bc>] (omap2_gpio_prepare_for_idle) from [<c01255e8>] (omap_sram_idle+0x140/0x244)
[<c01255e8>] (omap_sram_idle) from [<c0126b48>] (omap3_enter_idle_bm+0xfc/0x1ec)
[<c0126b48>] (omap3_enter_idle_bm) from [<c0601db8>] (cpuidle_enter_state+0x80/0x3d4)
[<c0601db8>] (cpuidle_enter_state) from [<c0183c74>] (cpu_startup_entry+0x198/0x3a0)
[<c0183c74>] (cpu_startup_entry) from [<c0b00c0c>] (start_kernel+0x354/0x3c8)
[<c0b00c0c>] (start_kernel) from [<8000807c>] (0x8000807c)

------------------------------------------------------------------------

Reported-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-09-16 02:59:58 +02:00
Linus Torvalds
ec9a03d47e Merge tag 'regmap-fix-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fixes from Mark Brown:
 "Several fixes here, the main one being the change from Lars-Peter
  which I'd been letting soak in -next since the merge window in case it
  uncovered further issues as it's a minimal fix rather than a change
  addressing the root cause of the problems (which would've been too
  invasive for -rc):

   - The biggest change is a fix from Lars-Peter to ensure that we don't
     create overlapping rbtree nodes which in turn avoids returning
     corrupt cache values to users, fixing some issues that were exposed
     by some recent optimisations with certain access patterns but had
     been present for a long time.

   - A fix from Elaine Zhang to stop us updating the cache if we get an
     I/O error when writing to the hardware.

   - A fix fromm Maarten ter Huurne to avoid uninitialized defaults in
     cases where we have non-readable registers but are initializing the
     cache by reading from the device"

* tag 'regmap-fix-v4.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: drop cache if the bus transfer error
  regmap: rbtree: Avoid overlapping nodes
  regmap: cache: Fix num_reg_defaults computation from reg_defaults_raw
2016-09-06 11:02:36 -07:00
Mark Brown
787ad90332 Merge remote-tracking branch 'regmap/fix/rbtree' into regmap-linus 2016-09-03 12:10:09 +01:00
Mark Brown
f735aa2790 Merge remote-tracking branch 'regmap/fix/cache' into regmap-linus 2016-09-03 12:10:08 +01:00
Paul E. McKenney
d7737ce964 PM / runtime: Add _rcuidle suffix to allow rpm_idle() use from idle
This commit appends a few _rcuidle suffixes to fix the following
RCU-used-from-idle bug:

> ===============================
> [ INFO: suspicious RCU usage. ]
> 4.6.0-rc5-next-20160426+ #1116 Not tainted
> -------------------------------
> include/trace/events/rpm.h:95 suspicious rcu_dereference_check() usage!
>
> other info that might help us debug this:
>
>
> RCU used illegally from idle CPU!
> rcu_scheduler_active = 1, debug_locks = 0
> RCU used illegally from extended quiescent state!
> 1 lock held by swapper/0/0:
>  #0:  (&(&dev->power.lock)->rlock){-.-...}, at: [<c052cc2c>] __rpm_callback+0x58/0x60
>
> stack backtrace:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0-rc5-next-20160426+ #1116
> Hardware name: Generic OMAP36xx (Flattened Device Tree)
> [<c0110290>] (unwind_backtrace) from [<c010c3a8>] (show_stack+0x10/0x14)
> [<c010c3a8>] (show_stack) from [<c047fd68>] (dump_stack+0xb0/0xe4)
> [<c047fd68>] (dump_stack) from [<c052d5d0>] (rpm_suspend+0x580/0x768)
> [<c052d5d0>] (rpm_suspend) from [<c052ec58>] (__pm_runtime_suspend+0x64/0x84)
> [<c052ec58>] (__pm_runtime_suspend) from [<c04bf25c>] (omap2_gpio_prepare_for_idle+0x5c/0x70)
> [<c04bf25c>] (omap2_gpio_prepare_for_idle) from [<c0125568>] (omap_sram_idle+0x140/0x244)
> [<c0125568>] (omap_sram_idle) from [<c01269dc>] (omap3_enter_idle_bm+0xfc/0x1ec)
> [<c01269dc>] (omap3_enter_idle_bm) from [<c0601bdc>] (cpuidle_enter_state+0x80/0x3d4)
> [<c0601bdc>] (cpuidle_enter_state) from [<c0183b08>] (cpu_startup_entry+0x198/0x3a0)
> [<c0183b08>] (cpu_startup_entry) from [<c0b00c0c>] (start_kernel+0x354/0x3c8)
> [<c0b00c0c>] (start_kernel) from [<8000807c>] (0x8000807c)

In the immortal words of Steven Rostedt, "*Whack* *Whack* *Whack*!!!"

Reported-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
WhACKED-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-08-31 03:00:59 +02:00
Paul E. McKenney
d44c950e93 PM / runtime: Add _rcuidle suffix to allow rpm_resume() to be called from idle
This commit applies another _rcuidle suffix to fix an RCU use from
idle.

> ===============================
> [ INFO: suspicious RCU usage. ]
> 4.6.0-rc5-next-20160426+ #1122 Not tainted
> -------------------------------
> include/trace/events/rpm.h:69 suspicious rcu_dereference_check() usage!
>
> other info that might help us debug this:
>
>
> RCU used illegally from idle CPU!
> rcu_scheduler_active = 1, debug_locks = 0
> RCU used illegally from extended quiescent state!
> 1 lock held by swapper/0/0:
>  #0:  (&(&dev->power.lock)->rlock){-.-...}, at: [<c052e3dc>] __pm_runtime_resume+0x3c/0x64
>
> stack backtrace:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0-rc5-next-20160426+ #1122
> Hardware name: Generic OMAP36xx (Flattened Device Tree)
> [<c0110290>] (unwind_backtrace) from [<c010c3a8>] (show_stack+0x10/0x14)
> [<c010c3a8>] (show_stack) from [<c047fd68>] (dump_stack+0xb0/0xe4)
> [<c047fd68>] (dump_stack) from [<c052e178>] (rpm_resume+0x5cc/0x7f4)
> [<c052e178>] (rpm_resume) from [<c052e3ec>] (__pm_runtime_resume+0x4c/0x64)
> [<c052e3ec>] (__pm_runtime_resume) from [<c04bf2c4>] (omap2_gpio_resume_after_idle+0x54/0x68)
> [<c04bf2c4>] (omap2_gpio_resume_after_idle) from [<c01269dc>] (omap3_enter_idle_bm+0xfc/0x1ec)
> [<c01269dc>] (omap3_enter_idle_bm) from [<c060198c>] (cpuidle_enter_state+0x80/0x3d4)
> [<c060198c>] (cpuidle_enter_state) from [<c0183b08>] (cpu_startup_entry+0x198/0x3a0)
> [<c0183b08>] (cpu_startup_entry) from [<c0b00c0c>] (start_kernel+0x354/0x3c8)
> [<c0b00c0c>] (start_kernel) from [<8000807c>] (0x8000807c)

Reported-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-08-31 02:59:20 +02:00
Elaine Zhang
815806e39b regmap: drop cache if the bus transfer error
regmap_write
->_regmap_raw_write
-->regcache_write first and than use map->bus->write to wirte i2c or spi
But if the i2c or spi transfer failed, But the cache is updated, So if I use
regmap_read will get the cache data which is not the real register value.

Signed-off-by: Elaine Zhang <zhangqing@rock-chips.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
2016-08-18 11:09:12 +01:00
Linus Torvalds
11d8ec408d Merge tag 'pm-extra-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
 "A few more fixes and cleanups in the x86-64 low-level hibernation
  code, PM core, cpufreq (Kconfig and intel_pstate), and the operating
  points framework.

  Specifics:

   - Prevent the low-level assembly hibernate code on x86-64 from
     referring to __PAGE_OFFSET directly as a symbol which doesn't work
     when the kernel identity mapping base is randomized, in which case
     __PAGE_OFFSET is a variable (Rafael Wysocki).

   - Avoid selecting CPU_FREQ_STAT by default as the statistics are not
     required for proper cpufreq operation (Borislav Petkov).

   - Add Skylake-X and Broadwell-X IDs to the intel_pstate's list of
     processors where out-of-band (OBB) control of P-states is possible
     and if that is in use, intel_pstate should not attempt to manage
     P-states (Srinivas Pandruvada).

   - Drop some unnecessary checks from the wakeup IRQ handling code in
     the PM core (Markus Elfring).

   - Reduce the number operating performance point (OPP) lookups in one
     of the OPP framework's helper functions (Jisheng Zhang)"

* tag 'pm-extra-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  x86/power/64: Do not refer to __PAGE_OFFSET from assembly code
  cpufreq: Do not default-yes CPU_FREQ_STAT
  cpufreq: intel_pstate: Add more out-of-band IDs
  PM / OPP: optimize dev_pm_opp_set_rate() performance a bit
  PM-wakeup: Delete unnecessary checks before three function calls
2016-08-05 23:26:16 -04:00
Linus Torvalds
6c84239d59 Merge tag 'rtc-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
 "RTC for 4.8

  Cleanups:
   - huge cleanup of rtc-generic and char/genrtc this allowed to cleanup
     rtc-cmos, rtc-sh, rtc-m68k, rtc-powerpc and rtc-parisc
   - move mn10300 to rtc-cmos

  Subsystem:
   - fix wakealarms after hibernate
   - multiples fixes for rctest
   - simplify implementations of .read_alarm

  New drivers:
   - Maxim MAX6916

  Drivers:
   - ds1307: fix weekday
   - m41t80: add wakeup support
   - pcf85063: add support for PCF85063A variant
   - rv8803: extend i2c fix and other fixes
   - s35390a: fix alarm reading, this fixes instant reboot after
     shutdown for QNAP TS-41x
   - s3c: clock fixes"

* tag 'rtc-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (65 commits)
  rtc: rv8803: Clear V1F when setting the time
  rtc: rv8803: Stop the clock while setting the time
  rtc: rv8803: Always apply the I²C workaround
  rtc: rv8803: Fix read day of week
  rtc: rv8803: Remove the check for valid time
  rtc: rv8803: Kconfig: Indicate rx8900 support
  rtc: asm9260: remove .owner field for driver
  rtc: at91sam9: Fix missing spin_lock_init()
  rtc: m41t80: add suspend handlers for alarm IRQ
  rtc: m41t80: make it a real error message
  rtc: pcf85063: Add support for the PCF85063A device
  rtc: pcf85063: fix year range
  rtc: hym8563: in .read_alarm set .tm_sec to 0 to signal minute accuracy
  rtc: explicitly set tm_sec = 0 for drivers with minute accurancy
  rtc: s3c: Add s3c_rtc_{enable/disable}_clk in s3c_rtc_setfreq()
  rtc: s3c: Remove unnecessary call to disable already disabled clock
  rtc: abx80x: use devm_add_action_or_reset()
  rtc: m41t80: use devm_add_action_or_reset()
  rtc: fix a typo and reduce three empty lines to one
  rtc: s35390a: improve two comments in .set_alarm
  ...
2016-08-05 09:48:22 -04:00
Rafael J. Wysocki
e2b3b80de5 Merge branches 'pm-sleep', 'pm-cpufreq', 'pm-core' and 'pm-opp'
* pm-sleep:
  x86/power/64: Do not refer to __PAGE_OFFSET from assembly code

* pm-cpufreq:
  cpufreq: Do not default-yes CPU_FREQ_STAT
  cpufreq: intel_pstate: Add more out-of-band IDs

* pm-core:
  PM-wakeup: Delete unnecessary checks before three function calls

* pm-opp:
  PM / OPP: optimize dev_pm_opp_set_rate() performance a bit
2016-08-05 15:46:55 +02:00
Lars-Peter Clausen
1bc8da4e14 regmap: rbtree: Avoid overlapping nodes
When searching for a suitable node that should be used for inserting a new
register, which does not fall within the range of any existing node, we not
only looks for nodes which are directly adjacent to the new register, but
for nodes within a certain proximity. This is done to avoid creating lots
of small nodes with just a few registers spacing in between, which would
increase memory usage as well as tree traversal time.

This means there might be multiple node candidates which fall within the
proximity range of the new register. If we choose the first node we
encounter, under certain register insertion patterns it is possible to end
up with overlapping ranges. This will break order in the rbtree and can
cause the cached register value to become corrupted.

E.g. take the simplified example where the proximity range is 2 and the
register insertion sequence is 1, 4, 2, 3, 5.
 * Insert of register 1 creates a new node, this is the root of the rbtree
 * Insert of register 4 creates a new node, which is inserted to the right
   of the root.
 * Insert of register 2 gets inserted to the first node
 * Insert of register 3 gets inserted to the first node
 * Insert of register 5 also gets inserted into the first node since
   this is the first node encountered and it is within the proximity range.
   Now there are two overlapping nodes.

To avoid this always choose the node that is closest to the new register.
This will ensure that nodes will not overlap. The tree traversal is still
done as a binary search, we just don't stop at the first node found. So the
complexity of the algorithm stays within the same order.

Ideally if a new register is in the range of two adjacent blocks those
blocks should be merged, but that is a much more invasive change and left
for later.

The issue was initially introduced in commit 472fdec738 ("regmap: rbtree:
Reduce number of nodes, take 2"), but became much more exposed by commit
6399aea629 ("regmap: rbtree: When adding a reg do a bsearch for target
node") which changed the order in which nodes are looked-up.

Fixes: 6399aea629 ("regmap: rbtree: When adding a reg do a bsearch for target node")
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Mark Brown <broonie@kernel.org>
2016-08-04 16:55:26 +01:00
Stephen Boyd
a098ecd2fa firmware: support loading into a pre-allocated buffer
Some systems are memory constrained but they need to load very large
firmwares.  The firmware subsystem allows drivers to request this
firmware be loaded from the filesystem, but this requires that the
entire firmware be loaded into kernel memory first before it's provided
to the driver.  This can lead to a situation where we map the firmware
twice, once to load the firmware into kernel memory and once to copy the
firmware into the final resting place.

This creates needless memory pressure and delays loading because we have
to copy from kernel memory to somewhere else.  Let's add a
request_firmware_into_buf() API that allows drivers to request firmware
be loaded directly into a pre-allocated buffer.  This skips the
intermediate step of allocating a buffer in kernel memory to hold the
firmware image while it's read from the filesystem.  It also requires
that drivers know how much memory they'll require before requesting the
firmware and negates any benefits of firmware caching because the
firmware layer doesn't manage the buffer lifetime.

For a 16MB buffer, about half the time is spent performing a memcpy from
the buffer to the final resting place.  I see loading times go from
0.081171 seconds to 0.047696 seconds after applying this patch.  Plus
the vmalloc pressure is reduced.

This is based on a patch from Vikram Mulukutla on codeaurora.org:
  https://www.codeaurora.org/cgit/quic/la/kernel/msm-3.18/commit/drivers/base/firmware_class.c?h=rel/msm-3.18&id=0a328c5f6cd999f5c591f172216835636f39bcb5

Link: http://lkml.kernel.org/r/20160607164741.31849-4-stephen.boyd@linaro.org
Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:10 -04:00
Vikram Mulukutla
0e742e9275 firmware: provide infrastructure to make fw caching optional
Some low memory systems with complex peripherals cannot afford to have
the relatively large firmware images taking up valuable memory during
suspend and resume.  Change the internal implementation of
firmware_class to disallow caching based on a configurable option.  In
the near future, variants of request_firmware will take advantage of
this feature.

Link: http://lkml.kernel.org/r/20160607164741.31849-3-stephen.boyd@linaro.org
[stephen.boyd@linaro.org: Drop firmware_desc design and use flags]
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:09 -04:00
Stephen Boyd
9ccf981198 firmware: consolidate kmap/read/write logic
Some systems are memory constrained but they need to load very large
firmwares.  The firmware subsystem allows drivers to request this
firmware be loaded from the filesystem, but this requires that the
entire firmware be loaded into kernel memory first before it's provided
to the driver.  This can lead to a situation where we map the firmware
twice, once to load the firmware into kernel memory and once to copy the
firmware into the final resting place.

This design creates needless memory pressure and delays loading because
we have to copy from kernel memory to somewhere else.  This patch sets
adds support to the request firmware API to load the firmware directly
into a pre-allocated buffer, skipping the intermediate copying step and
alleviating memory pressure during firmware loading.  The drawback is
that we can't use the firmware caching feature because the memory for
the firmware cache is not managed by the firmware layer.

This patch (of 3):

We use similar structured code to read and write the kmapped firmware
pages.  The only difference is read copies from the kmap region and
write copies to it.  Consolidate this into one function to reduce
duplication.

Link: http://lkml.kernel.org/r/20160607164741.31849-2-stephen.boyd@linaro.org
Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:09 -04:00
Fabian Frederick
bd721ea73e treewide: replace obsolete _refok by __ref
There was only one use of __initdata_refok and __exit_refok

__init_refok was used 46 times against 82 for __ref.

Those definitions are obsolete since commit 312b1485fb ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")

This patch removes the following compatibility definitions and replaces
them treewide.

/* compatibility defines */
#define __init_refok     __ref
#define __initdata_refok __refdata
#define __exit_refok     __ref

I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Linus Torvalds
c9b95e5961 Merge tag 'sound-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
 "The majority of this update is about ASoC, including a few new
  drivers, and the rest are mostly minor changes.  The only substantial
  change in ALSA core is about the additional error handling in the
  compress-offload API.  Below are highlights:

   - Add the error propagating support in compress-offload API

   - HD-audio: a usual Dell headset fixup, an Intel HDMI/DP fix, and the
     default mixer setup change ot turn off the loopback

   - Lots of updates for ASoC Intel drivers, mostly board support and
     bug fixing, and to the NAU8825 driver

   - Work on generalizing bits of simple-card to allow more code sharing
     with the Renesas rsrc-card (which can't use simple-card due to DPCM)

   - Removal of the Odroid X2 driver due to replacement with simple-card

   - Support for several new Mediatek platforms and associated boards

   - New ASoC drivers for Allwinner A10, Analog Devices ADAU7002,
     Broadcom Cygnus, Cirrus Logic CS35L33 and CS53L30, Maxim MAX8960
     and MAX98504, Realtek RT5514 and Wolfson WM8758"

* tag 'sound-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (278 commits)
  sound: oss: Use kernel_read_file_from_path() for mod_firmware_load()
  ASoC: Intel: Skylake: Delete an unnecessary check before the function call "release_firmware"
  ASoC: Intel: Skylake: Fix NULL Pointer exception in dynamic_debug.
  ASoC: samsung: Specify DMA channels through struct snd_dmaengine_pcm_config
  ASoC: samsung: Fix error paths in the I2S driver's probe()
  ASoC: cs53l30: Fix bit shift issue of TDM mode
  ASoC: cs53l30: Fix a bug for TDM slot location validation
  ASoC: rockchip: correct the spdif clk
  ALSA: echoaudio: purge contradictions between dimension matrix members and total number of members
  ASoC: rsrc-card: use asoc_simple_card_parse_card_name()
  ASoC: rsrc-card: use asoc_simple_dai instead of rsrc_card_dai
  ASoC: rsrc-card: use asoc_simple_card_parse_dailink_name()
  ASoC: simple-card: use asoc_simple_card_parse_card_name()
  ASoC: simple-card-utils: add asoc_simple_card_parse_card_name()
  ASoC: simple-card: use asoc_simple_card_parse_dailink_name()
  ASoC: simple-card-utils: add asoc_simple_card_set_dailink_name()
  ASoC: nau8825: drop redundant idiom when converting integer to boolean
  ASoC: nau8825: jack connection decision with different insertion logic
  ASoC: mediatek: Add HDMI dai-links to the mt8173-rt5650 machine driver
  ASoC: mediatek: mt2701: fix non static symbol warning
  ...
2016-07-31 02:25:02 -07:00
Maarten ter Huurne
b2c7f5d9c9 regmap: cache: Fix num_reg_defaults computation from reg_defaults_raw
In 3245d460 (regmap: cache: Fall back to register by register read for
cache defaults) non-readable registers are skipped when initializing
reg_defaults, but are still included in num_reg_defaults. So there can
be uninitialized entries at the end of reg_defaults, which can cause
problems when the register cache initializes from the full array.

Fixed it by excluding non-readable registers from the count as well.

Signed-off-by: Maarten ter Huurne <maarten@treewalker.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
2016-07-29 23:02:19 +01:00
Andy Lutomirski
d30dd8be06 mm: track NR_KERNEL_STACK in KiB instead of number of stacks
Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
This only makes sense if each kernel stack exists entirely in one zone,
and allowing vmapped stacks could break this assumption.

Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
architectures.  Keep it simple and use KiB.

Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
11fb998986 mm: move most file-based accounting to the node
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone.  This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted.  Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
  Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
4b9d0fab71 mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_PAGES  is the number of mapped anon pages.

This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
have

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_MAPPED is the number of mapped anon pages.

Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
50658e2e04 mm: move page mapped accounting to the node
Reclaim makes decisions based on the number of pages that are mapped but
it's mixing node and zone information.  Account NR_FILE_MAPPED and
NR_ANON_PAGES pages on the node.

Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
599d0c954f mm, vmscan: move LRU lists to node
This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic.  Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes.  It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies.  For example, the scans are
per-zone but using per-node counters.  We also mark a node as congested
when a zone is congested.  This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU
   list. If lowmem reclaim persisted for long periods of time, the same
   highmem pages get continually scanned. The idea would be that lowmem
   keeps those pages on a separate list until a reclaim for highmem pages
   arrives that splices the highmem pages back onto the LRU. It potentially
   could be implemented similar to the UNEVICTABLE list.

   That would reduce the skip rate with the potential corner case is that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during
   memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
75ef718405 mm, vmstat: add infrastructure for per-node vmstats
Patchset: "Move LRU page reclaim from zones to nodes v9"

This series moves LRUs from the zones to the node.  While this is a
current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details.
Some of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from.  This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages
   in that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in distribution of file/anon pages due
   to when they were allocated results can result in a difference in
   again. While the fair zone allocation policy mitigates some of the
   problems here, the page reclaim results on a multi-zone node will
   always be different to a single-zone node.
   it was scheduled on as a result.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other but it's sensitive to timing.  This
   mitigates the page allocator using pages that were allocated very recently
   in the ideal case but it's sensitive to timing. When kswapd is allocating
   from lower zones then it's great but during the rebalancing of the highest
   zone, the page allocator and kswapd interfere with each other. It's worse
   if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
   relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have relatively low highmem:lowmem
ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

                                           4.7.0-rc4                  4.7.0-rc4
                                      mmotm-20160623                 nodelru-v9
Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)

This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623nodelru-v8
User          189.19      191.80
System       2604.45     2533.56
Elapsed      2855.30     2786.39

The vmstats also showed that the fair zone allocation policy was definitely
removed as can be seen here;

                             4.7.0-rc3   4.7.0-rc3
                         mmotm-20160623 nodelru-v8
DMA32 allocs               28794729769           0
Normal allocs              48432501431 77227309877
Movable allocs                       0           0

tiobench on ext4
----------------

tiobench is a benchmark that artifically benefits if old pages remain resident
while new pages get reclaimed. The fair zone allocation policy mitigates this
problem so pages age fairly. While the benchmark has problems, it is important
that tiobench performance remains constant as it implies that page aging
problems that the fair zone allocation policy fixes are not re-introduced.

                                         4.7.0-rc4             4.7.0-rc4
                                    mmotm-20160623            nodelru-v9
Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623 approx-v9
User          645.72      525.90
System        403.85      331.75
Elapsed      6795.36     6783.67

This shows that the series has little or not impact on tiobench which is
desirable and a reduction in system CPU usage. It indicates that the fair
zone allocation policy was removed in a manner that didn't reintroduce
one class of page aging bug. There were only minor differences in overall
reclaim activity

                             4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623nodelru-v8
Minor Faults                    645838      647465
Major Faults                       573         640
Swap Ins                             0           0
Swap Outs                            0           0
DMA allocs                           0           0
DMA32 allocs                  46041453    44190646
Normal allocs                 78053072    79887245
Movable allocs                       0           0
Allocation stalls                   24          67
Stall zone DMA                       0           0
Stall zone DMA32                     0           0
Stall zone Normal                    0           2
Stall zone HighMem                   0           0
Stall zone Movable                   0          65
Direct pages scanned             10969       30609
Kswapd pages scanned          93375144    93492094
Kswapd pages reclaimed        93372243    93489370
Direct pages reclaimed           10969       30609
Kswapd efficiency                  99%         99%
Kswapd velocity              13741.015   13781.934
Direct efficiency                 100%        100%
Direct velocity                  1.614       4.512
Percentage direct scans             0%          0%

kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 4 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe

pgbench Transactions
                        4.7.0-rc4             4.7.0-rc4
                   mmotm-20160623            nodelru-v8
Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)

Negligible differences again. As with tiobench, overall reclaim activity
was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim
stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts of
data from disk.

                               4.7.0-rc3             4.7.0-rc3
                          mmotm-20160623            nodelru-v9
Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623 nodelru-v9
User         1548.01     1552.44
System       8609.71     8515.08
Elapsed      3587.10     3594.54

There is little or no change in performance but some drop in system CPU usage.

                             4.7.0-rc3   4.7.0-rc3
                        mmotm-20160623  nodelru-v9
Minor Faults                    362662      367360
Major Faults                      1204        1143
Swap Ins                            22           0
Swap Outs                         2855        1029
DMA allocs                           0           0
DMA32 allocs                  31409797    28837521
Normal allocs                 46611853    49231282
Movable allocs                       0           0
Direct pages scanned                 0           0
Kswapd pages scanned          40845270    40869088
Kswapd pages reclaimed        40830976    40855294
Direct pages reclaimed               0           0
Kswapd efficiency                  99%         99%
Kswapd velocity              11386.711   11369.769
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Page writes by reclaim            2855        1029
Page writes file                     0           0
Page writes anon                  2855        1029
Page reclaim immediate             771        1628
Sector Reads                 293312636   293536360
Sector Writes                 18213568    18186480
Page rescued immediate               0           0
Slabs scanned                   128257      132747
Direct inode steals                181          56
Kswapd inode steals                 59        1131

It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in numbers of inodes reclaimed but the workload has few active inodes and
is likely a timing artifact.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
                             4.7.0-rc4             4.7.0-rc4
                        mmotm-20160623            nodelru-v8
Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)

This shows a number of improvements with the worst-case outlier greatly
improved.

Some of the vmstats are interesting

                             4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623nodelru-v8
Swap Ins                           163         502
Swap Outs                            0           0
DMA allocs                           0           0
DMA32 allocs                 618719206  1381662383
Normal allocs                891235743   564138421
Movable allocs                       0           0
Allocation stalls                 2603           1
Direct pages scanned            216787           2
Kswapd pages scanned          50719775    41778378
Kswapd pages reclaimed        41541765    41777639
Direct pages reclaimed          209159           0
Kswapd efficiency                  81%         99%
Kswapd velocity              16859.554   14329.059
Direct efficiency                  96%          0%
Direct velocity                 72.061       0.001
Percentage direct scans             0%          0%
Page writes by reclaim         6215049           0
Page writes file               6215049           0
Page writes anon                     0           0
Page reclaim immediate           70673          90
Sector Reads                  81940800    81680456
Sector Writes                100158984    98816036
Page rescued immediate               0           0
Slabs scanned                  1366954       22683

While this is not guaranteed in all cases, this particular test showed
a large reduction in direct reclaim activity. It's also worth noting
that no page writes were issued from reclaim context.

This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
that area.

1. Reclaim/compaction is going to be affected because the amount of reclaim is
   no longer targetted at a specific zone. Compaction works on a per-zone basis
   so there is no guarantee that reclaiming a few THP's worth page pages will
   have a positive impact on compaction success rates.

2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
   are called is now different. This may or may not be a problem but if it
   is, it'll be because shrinkers are not called enough and some balancing
   is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
   distributed between zones and the fair zone allocation policy used to do
   something very similar for anon. The distribution is now different but not
   necessarily in any way that matters but it's still worth bearing in mind.

VM statistic counters for reclaim decisions are zone-based.  If the kernel
is to reclaim on a per-node basis then we need to track per-node
statistics but there is no infrastructure for that.  The most notable
change is that the old node_page_state is renamed to
sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
uses per-node stats but none exist yet.  There is some renaming such as
vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
patch with no functional change.  There is a lot of similarity between the
node and zone helpers which is unfortunate but there was no obvious way of
reusing the code and maintaining type safety.

Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00