Pull xen updates from Konrad Rzeszutek Wilk:
"which has three neat features:
- PV multiconsole support, so that there can be hvc1, hvc2, etc; This
can be used in HVM and in PV mode.
- P-state and C-state power management driver that uploads said power
management data to the hypervisor. It also inhibits cpufreq
scaling drivers to load so that only the hypervisor can make power
management decisions - fixing a weird perf bug.
There is one thing in the Kconfig that you won't like: "default y
if (X86_ACPI_CPUFREQ = y || X86_POWERNOW_K8 = y)" (note, that it
all depends on CONFIG_XEN which depends on CONFIG_PARAVIRT which by
default is off). I've a fix to convert that boolean expression
into "default m" which I am going to post after the cpufreq git
pull - as the two patches to make this work depend on a fix in Dave
Jones's tree.
- Function Level Reset (FLR) support in the Xen PCI backend.
Fixes:
- Kconfig dependencies for Xen PV keyboard and video
- Compile warnings and constify fixes
- Change over to use percpu_xxx instead of this_cpu_xxx"
Fix up trivial conflicts in drivers/tty/hvc/hvc_xen.c due to changes to
a removed commit.
* tag 'stable/for-linus-3.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen kconfig: relax INPUT_XEN_KBDDEV_FRONTEND deps
xen/acpi-processor: C and P-state driver that uploads said data to hypervisor.
xen: constify all instances of "struct attribute_group"
xen/xenbus: ignore console/0
hvc_xen: introduce HVC_XEN_FRONTEND
hvc_xen: implement multiconsole support
hvc_xen: support PV on HVM consoles
xenbus: don't free other end details too early
xen/enlighten: Expose MWAIT and MWAIT_LEAF if hypervisor OKs it.
xen/setup/pm/acpi: Remove the call to boot_option_idle_override.
xenbus: address compiler warnings
xen: use this_cpu_xxx replace percpu_xxx funcs
xen/pciback: Support pci_reset_function, aka FLR or D3 support.
pci: Introduce __pci_reset_function_locked to be used when holding device_lock.
xen: Utilize the restore_msi_irqs hook.
Pull cleancache changes from Konrad Rzeszutek Wilk:
"This has some patches for the cleancache API that should have been
submitted a _long_ time ago. They are basically cleanups:
- rename of flush to invalidate
- moving reporting of statistics into debugfs
- use __read_mostly as necessary.
Oh, and also the MAINTAINERS file change. The files (except the
MAINTAINERS file) have been in #linux-next for months now. The late
addition of MAINTAINERS file is a brain-fart on my side - didn't
realize I needed that just until I was typing this up - and I based
that patch on v3.3 - so the tree is on top of v3.3."
* tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
MAINTAINERS: Adding cleancache API to the list.
mm: cleancache: Use __read_mostly as appropiate.
mm: cleancache: report statistics via debugfs instead of sysfs.
mm: zcache/tmem/cleancache: s/flush/invalidate/
mm: cleancache: s/flush/invalidate/
Pull drm main changes from Dave Airlie:
"This is the main drm pull request, I'm probably going to send two more
smaller ones, will explain below.
This contains a patch that is also in the fbdev tree, but it should be
the same patch, it added an API for hot unplugging framebuffer
devices, and I need that API for a new driver.
It also contains some changes to the i2c tree which Jean has acked,
and one change to moorestown platform stuff in x86.
Highlights:
- new drivers: UDL driver for USB displaylink devices, kms only,
should support correct hotplug operations.
- core: i2c speedups + better hotplug support, EDID overriding via
firmware interface - allows user to load a firmware for a broken
monitor/kvm from userspace, it even has documentation for it.
- exynos: new HDMI audio + hdmi 1.4 + virtual output driver
- gma500: code cleanup
- radeon: cleanups, CS optimisations, streamout support and pageflip
fix
- nouveau: NVD9 displayport support + more reclocking work
- i915: re-enabling GMBUS, finish gpu patch (might help hibernation
who knows), missed irq fixes, stencil tiling fixes, interlaced
support, aliasesd PPGTT support for SNB/IVB, swizzling for SNB/IVB,
semaphore fixes
As well as the usual bunch of cleanups and fixes all over the place.
I've got two things I'd like to merge a bit later:
a) AMD support for all their new radeonhd 7000 series GPU and APUs.
AMD dropped this a bit late due to insane internal review
processes, (please AMD just follow Intel and let open source guys
ship stuff early) however I don't want to penalise people who own
this hardware (since its been on sale for 3-4 months and GPU hw
doesn't exactly have a lifetime in years) and consign them to
using closed drivers for longer than necessary. The changes are
well contained and just plug into the driver new gpu functionality
so they should be fairly regression proof. I just want to give
them a bit of a run on the hw AMD kindly sent me.
b) drm prime/dma-buf interface code. This is just infrastructure
code to expose the dma-buf stuff to drm drivers and to userspace.
I'm not planning on pushing any driver support in this cycle
(except maybe exynos), but I'd like to get the infrastructure code
in so for the next cycle I can start getting the driver support
into the individual drivers. We have started driver support for
i915, nouveau and udl along with I think exynos and omap in
staging. However this code relies on the dma-buf tree being
pulled into your tree first since it needs the latest interfaces
from that tree. I'll push to get that tree sent asap.
(oh and any warnings you see in i915 are gcc's fault from what anyone
can see)."
Fix up trivial conflicts in arch/x86/platform/mrst/mrst.c due to the new
msic_thermal_platform_data() thermal function being added next to the
tc35876x_platform_data() i2c device function..
* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (326 commits)
drm/i915: use DDC_ADDR instead of hard-coding it
drm/radeon: use DDC_ADDR instead of hard-coding it
drm: remove unneeded redefinition of DDC_ADDR
drm/exynos: added virtual display driver.
drm: allow loading an EDID as firmware to override broken monitor
drm/exynos: enable hdmi audio feature
drm/exynos: add default pixel format for plane
drm/exynos: cleanup exynos_hdmi.h
drm/exynos: add is_local member in exynos_drm_subdrv struct
drm/exynos: add subdrv open/close functions
drm/exynos: remove module of exynos drm subdrv
drm/exynos: release pending pageflip events when closed
drm/exynos: added new funtion to get/put dma address.
drm/exynos: update gem and buffer framework.
drm/exynos: added mode_fixup feature and code clean.
drm/exynos: add HDMI version 1.4 support
drm/exynos: remove exynos_mixer.h
gma500: Fix mmap frambuffer
drm/radeon: Drop radeon_gem_object_(un)pin.
drm/radeon: Restrict offset for legacy display engine.
...
Pull updates of sound stuff from Takashi Iwai:
"Here is the first big update chunk of sound stuff for 3.4-rc1.
In the common sound infrastructure, there are a few changes for
dynamic PCM support (used in ASoC) and a few clean-ups. Majority of
changes are found, as usual, in HD-audio and ASoC.
Some highlights of HD-audio changes:
- All the long-standing static quirk codes for Realtek codec were
finally removed by fixing and extending the Realtek auto-parser.
- The mute-LED control is standardized over all HD-audio codec
drivers using the extended vmaster hook.
- The vmaster slave mixer elements are initialized to 0dB as default
so that the user won't be annoyed by the silent output after
updates, e.g. due to the additions of new elements.
- Other many fix-ups for the misc HD-audio devices.
In the ASoC side, this is a very active release, including a quite a
few framework enhancements. Some highlights:
- Support for widgets not associated with a CODEC, an important part
of the dynamic PCM framework.
- A library factoring out the common code shared by dmaengine based
DMA drivers contributed by Lars-Peter Clausen. This will save a
lot of code and make it much easier to deploy enhancements to
dmaengine.
- Support for binary controls, used for providing runtime
configuration of algorithm coefficients.
- A new DAPM widget type for regulator supplies allowing drivers for
devices that can power down unused supplies while active to do
without any per-driver code.
- DAPM widgets for DAIs, initially giving a speed boost for playback
startup and shutdown and also the basis for CODEC<->CODEC DAI link
support.
- Support for specifying the number of significant bits on audio
interfaces, useful for allowing applications to know how much
effort to put into generating data for a larger sample format.
- Conversion of the FSI driver used on some SH processors to
DMAEngine.
- Conversion of EP93xx drivers to DMAEngine.
- New CODEC drivers for Maxim MAX9768 and Wolfson Microelectronics
WM2200.
- Move audmux driver from arc/arm to sound/soc
- McBSP move from arch/ to sound/ and updates
Also, a few small updates and fixes for other drivers like au88x0,
ymfpci, USB 6fire, USB usx2yaudio are included."
* tag 'sound-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (446 commits)
ASoC: wm8994: Provide VMID mode control and fix default sequence
ASoC: wm8994: Add missing break in resume
ASoC: wm_hubs: Don't actively manage LINEOUT_VMID_BUF
ASoC: pxa-ssp: atomically set stream active masks
ASoC: fsl: p1022ds: tell the WM8776 codec driver that it's the master
ASoC: Samsung: Added to support mono recording
ALSA: hda - Fix build with CONFIG_PM=n
ALSA: au88x0 - Avoid possible Oops at unbinding
ALSA: usb-audio - Fix build error by consitification of rate list
ASoC: core: Fix obscure leak of runtime array
ALSA: pcm - Avoid GFP_ATOMIC in snd_pcm_link()
ALSA: pcm: Constify the list in snd_pcm_hw_constraint_list
ASoC: wm8996: Add 44.1kHz support
ALSA: hda - Fix build of patch_sigmatel.c without CONFIG_SND_HDA_POWER_SAVE
ASoC: mx27vis-aic32x4: Convert it to platform driver
ALSA: hda - fix printing of high HDMI sample rates
ALSA: ymfpci - Fix legacy registers on S3/S4 resume
ALSA: control - Fixe a trailing white space error
ALSA: hda - Add expose_enum_ctl flag to snd_hda_add_vmaster_hook()
ALSA: hda - Add "Mute-LED Mode" enum control
...
SCSI updates from James Bottomley:
"The update includes the usual assortment of driver updates (lpfc,
qla2xxx, qla4xxx, bfa, bnx2fc, bnx2i, isci, fcoe, hpsa) plus a huge
amount of infrastructure work in the SAS library and transport class
as well as an iSCSI update. There's also a new SCSI based virtio
driver."
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (177 commits)
[SCSI] qla4xxx: Update driver version to 5.02.00-k15
[SCSI] qla4xxx: trivial cleanup
[SCSI] qla4xxx: Fix sparse warning
[SCSI] qla4xxx: Add support for multiple session per host.
[SCSI] qla4xxx: Export CHAP index as sysfs attribute
[SCSI] scsi_transport: Export CHAP index as sysfs attribute
[SCSI] qla4xxx: Add support to display CHAP list and delete CHAP entry
[SCSI] iscsi_transport: Add support to display CHAP list and delete CHAP entry
[SCSI] pm8001: fix endian issue with code optimization.
[SCSI] pm8001: Fix possible racing condition.
[SCSI] pm8001: Fix bogus interrupt state flag issue.
[SCSI] ipr: update PCI ID definitions for new adapters
[SCSI] qla2xxx: handle default case in qla2x00_request_firmware()
[SCSI] isci: improvements in driver unloading routine
[SCSI] isci: improve phy event warnings
[SCSI] isci: debug, provide state-enum-to-string conversions
[SCSI] scsi_transport_sas: 'enable' phys on reset
[SCSI] libsas: don't recover end devices attached to disabled phys
[SCSI] libsas: fixup target_port_protocols for expanders that don't report sata
[SCSI] libsas: set attached device type and target protocols for local phys
...
Pull md updates for 3.4 from Neil Brown:
"Mostly tidying up code in preparation for some bigger changes next
time.
A few bug fixes tagged for -stable.
Main functionality change is that some RAID10 arrays can now grow to
use extra space that may have been made available on the individual
devices."
Fixed up trivial conflicts with the k[un]map_atomic() cleanups in
drivers/md/bitmap.c.
* tag 'md-3.4' of git://neil.brown.name/md: (22 commits)
md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().
md: fix clearing of the 'changed' flags for the bad blocks list.
md/bitmap: discard CHUNK_BLOCK_SHIFT macro
md/bitmap: remove unnecessary indirection when allocating.
md/bitmap: remove some pointless locking.
md/bitmap: change a 'goto' to a normal 'if' construct.
md/bitmap: move printing of bitmap status to bitmap.c
md/bitmap: remove some unused noise from bitmap.h
md/raid10 - support resizing some RAID10 arrays.
md/raid1: handle merge_bvec_fn in member devices.
md/raid10: handle merge_bvec_fn in member devices.
md: add proper merge_bvec handling to RAID0 and Linear.
md: tidy up rdev_for_each usage.
md/raid1,raid10: avoid deadlock during resync/recovery.
md/bitmap: ensure to load bitmap when creating via sysfs.
md: don't set md arrays to readonly on shutdown.
md: allow re-add to failed arrays.
md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read().
md: Use existed macros instead of numbers
md/raid5: removed unused 'added_devices' variable.
...
Pull x86 platform changes from Ingo Molnar.
Removes the Moorestown platform that nobody ever used.
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform: Move APIC ID validity check into platform APIC code
x86/olpc/xo15/sci: Enable lid close wakeup control
x86/geode/net5501: Add platform driver for Soekris Engineering net5501
x86/geode/alix2: Supplement driver to include GPIO button support
x86/mid/powerbtn: Use MSIC read/write instead of ipc_scu
x86/mid/thermal: Turn off thermistor
x86/mid/thermal: Add msic_thermal alias
x86/mid/thermal: Convert to use Intel MSIC API
x86/mid/scu_ipc: Remove Moorestown support
x86/mid: Kill off Moorestown
x86/mrst: Add msic_thermal platform support
x86/config: Select MSIC MFD driver on Intel Medfield platform
x86/mid: Remove Intel Moorestown
x86/mrst: Set ISA bus type for fake MP IRQs
x86/ioapic: Use legacy_pic to set correct gsi-irq mapping
Pull MCE changes from Ingo Molnar.
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix return value of mce_chrdev_read() when erst is disabled
x86/mce: Convert static array of pointers to per-cpu variables
x86/mce: Replace hard coded hex constants with symbolic defines
x86/mce: Recognise machine check bank signature for data path error
x86/mce: Handle "action required" errors
x86/mce: Add mechanism to safely save information in MCE handler
x86/mce: Create helper function to save addr/misc when needed
HWPOISON: Add code to handle "action required" errors.
HWPOISON: Clean up memory_failure() vs. __memory_failure()
Merge first batch of patches from Andrew Morton:
"A few misc things and all the MM queue"
* emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits)
memcg: avoid THP split in task migration
thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
memcg: clean up existing move charge code
mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
mm/memcontrol.c: s/stealed/stolen/
memcg: fix performance of mem_cgroup_begin_update_page_stat()
memcg: remove PCG_FILE_MAPPED
memcg: use new logic for page stat accounting
memcg: remove PCG_MOVE_LOCK flag from page_cgroup
memcg: simplify move_account() check
memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
memcg: kill dead prev_priority stubs
memcg: remove PCG_CACHE page_cgroup flag
memcg: let css_get_next() rely upon rcu_read_lock()
cgroup: revert ss_id_lock to spinlock
idr: make idr_get_next() good for rcu_read_lock()
memcg: remove unnecessary thp check in page stat accounting
memcg: remove redundant returns
memcg: enum lru_list lru
...
Pull powerpc merge from Benjamin Herrenschmidt:
"Here's the powerpc batch for this merge window. It is going to be a
bit more nasty than usual as in touching things outside of
arch/powerpc mostly due to the big iSeriesectomy :-) We finally got
rid of the bugger (legacy iSeries support) which was a PITA to
maintain and that nobody really used anymore.
Here are some of the highlights:
- Legacy iSeries is gone. Thanks Stephen ! There's still some bits
and pieces remaining if you do a grep -ir series arch/powerpc but
they are harmless and will be removed in the next few weeks
hopefully.
- The 'fadump' functionality (Firmware Assisted Dump) replaces the
previous (equivalent) "pHyp assisted dump"... it's a rewrite of a
mechanism to get the hypervisor to do crash dumps on pSeries, the
new implementation hopefully being much more reliable. Thanks
Mahesh Salgaonkar.
- The "EEH" code (pSeries PCI error handling & recovery) got a big
spring cleaning, motivated by the need to be able to implement a
new backend for it on top of some new different type of firwmare.
The work isn't complete yet, but a good chunk of the cleanups is
there. Note that this adds a field to struct device_node which is
not very nice and which Grant objects to. I will have a patch soon
that moves that to a powerpc private data structure (hopefully
before rc1) and we'll improve things further later on (hopefully
getting rid of the need for that pointer completely). Thanks Gavin
Shan.
- I dug into our exception & interrupt handling code to improve the
way we do lazy interrupt handling (and make it work properly with
"edge" triggered interrupt sources), and while at it found & fixed
a wagon of issues in those areas, including adding support for page
fault retry & fatal signals on page faults.
- Your usual random batch of small fixes & updates, including a bunch
of new embedded boards, both Freescale and APM based ones, etc..."
I fixed up some conflicts with the generalized irq-domain changes from
Grant Likely, hopefully correctly.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (141 commits)
powerpc/ps3: Do not adjust the wrapper load address
powerpc: Remove the rest of the legacy iSeries include files
powerpc: Remove the remaining CONFIG_PPC_ISERIES pieces
init: Remove CONFIG_PPC_ISERIES
powerpc: Remove FW_FEATURE ISERIES from arch code
tty/hvc_vio: FW_FEATURE_ISERIES is no longer selectable
powerpc/spufs: Fix double unlocks
powerpc/5200: convert mpc5200 to use of_platform_populate()
powerpc/mpc5200: add options to mpc5200_defconfig
powerpc/mpc52xx: add a4m072 board support
powerpc/mpc5200: update mpc5200_defconfig to fit for charon board
Documentation/powerpc/mpc52xx.txt: Checkpatch cleanup
powerpc/44x: Add additional device support for APM821xx SoC and Bluestone board
powerpc/44x: Add support PCI-E for APM821xx SoC and Bluestone board
MAINTAINERS: Update PowerPC 4xx tree
powerpc/44x: The bug fixed support for APM821xx SoC and Bluestone board
powerpc: document the FSL MPIC message register binding
powerpc: add support for MPIC message register API
powerpc/fsl: Added aliased MSIIR register address to MSI node in dts
powerpc/85xx: mpc8548cds - add 36-bit dts
...
Pull gfs2 changes from Steven Whitehouse.
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Change truncate page allocation to be GFP_NOFS
GFS2: call gfs2_write_alloc_required for each chunk
GFS2: Clean up log flush header writing
GFS2: Remove a __GFP_NOFAIL allocation
GFS2: Flush pending glock work when evicting an inode
GFS2: make sure rgrps are up to date in func gfs2_blk2rgrpd
GFS2: Eliminate sd_rindex_mutex
GFS2: Unlock rindex mutex on glock error
GFS2: Make bd_cmp() static
GFS2: Sort the ordered write list
GFS2: FITRIM ioctl support
GFS2: Move two functions from log.c to lops.c
GFS2: glock statistics gathering
These macros will be used in a later patch, where all usages are expected
to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to
detect unexpected usages, we convert the existing BUG() to BUILD_BUG().
[akpm@linux-foundation.org: fix build in mm/pgtable-generic.c]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_begin_update_page_stat() should be very fast because it's
called very frequently. Now, it needs to look up page_cgroup and its
memcg....this is slow.
This patch adds a global variable to check "any memcg is moving or not".
With this, the caller doesn't need to visit page_cgroup and memcg.
Here is a test result. A test program makes page faults onto a file,
MAP_SHARED and makes each page's page_mapcount(page) > 1, and free the
range by madvise() and page fault again. This program causes 26214400
times of page fault onto a file(size was 1G.) and shows shows the cost of
mem_cgroup_begin_update_page_stat().
Before this patch for mem_cgroup_begin_update_page_stat()
[kamezawa@bluextal test]$ time ./mmap 1G
real 0m21.765s
user 0m5.999s
sys 0m15.434s
27.46% mmap mmap [.] reader
21.15% mmap [kernel.kallsyms] [k] page_fault
9.17% mmap [kernel.kallsyms] [k] filemap_fault
2.96% mmap [kernel.kallsyms] [k] __do_fault
2.83% mmap [kernel.kallsyms] [k] __mem_cgroup_begin_update_page_stat
After this patch
[root@bluextal test]# time ./mmap 1G
real 0m21.373s
user 0m6.113s
sys 0m15.016s
In usual path, calls to __mem_cgroup_begin_update_page_stat() goes away.
Note: we may be able to remove this optimization in future if
we can get pointer to memcg directly from struct page.
[akpm@linux-foundation.org: don't return a void]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the new lock scheme for updating memcg's page stat, we don't need a
flag PCG_FILE_MAPPED which was duplicated information of page_mapped().
[hughd@google.com: cosmetic fix]
[hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, page-stat-per-memcg is recorded into per page_cgroup flag by
duplicating page's status into the flag. The reason is that memcg has a
feature to move a page from a group to another group and we have race
between "move" and "page stat accounting",
Under current logic, assume CPU-A and CPU-B. CPU-A does "move" and CPU-B
does "page stat accounting".
When CPU-A goes 1st,
CPU-A CPU-B
update "struct page" info.
move_lock_mem_cgroup(memcg)
see pc->flags
copy page stat to new group
overwrite pc->mem_cgroup.
move_unlock_mem_cgroup(memcg)
move_lock_mem_cgroup(mem)
set pc->flags
update page stat accounting
move_unlock_mem_cgroup(mem)
stat accounting is guarded by move_lock_mem_cgroup() and "move" logic
(CPU-A) doesn't see changes in "struct page" information.
But it's costly to have the same information both in 'struct page' and
'struct page_cgroup'. And, there is a potential problem.
For example, assume we have PG_dirty accounting in memcg.
PG_..is a flag for struct page.
PCG_ is a flag for struct page_cgroup.
(This is just an example. The same problem can be found in any
kind of page stat accounting.)
CPU-A CPU-B
TestSet PG_dirty
(delay) TestClear PG_dirty
if (TestClear(PCG_dirty))
memcg->nr_dirty--
if (TestSet(PCG_dirty))
memcg->nr_dirty++
Here, memcg->nr_dirty = +1, this is wrong. This race was reported by Greg
Thelen <gthelen@google.com>. Now, only FILE_MAPPED is supported but
fortunately, it's serialized by page table lock and this is not real bug,
_now_,
If this potential problem is caused by having duplicated information in
struct page and struct page_cgroup, we may be able to fix this by using
original 'struct page' information. But we'll have a problem in "move
account"
Assume we use only PG_dirty.
CPU-A CPU-B
TestSet PG_dirty
(delay) move_lock_mem_cgroup()
if (PageDirty(page))
new_memcg->nr_dirty++
pc->mem_cgroup = new_memcg;
move_unlock_mem_cgroup()
move_lock_mem_cgroup()
memcg = pc->mem_cgroup
new_memcg->nr_dirty++
accounting information may be double-counted. This was original reason to
have PCG_xxx flags but it seems PCG_xxx has another problem.
I think we need a bigger lock as
move_lock_mem_cgroup(page)
TestSetPageDirty(page)
update page stats (without any checks)
move_unlock_mem_cgroup(page)
This fixes both of problems and we don't have to duplicate page flag into
page_cgroup. Please note: move_lock_mem_cgroup() is held only when there
are possibility of "account move" under the system. So, in most path,
status update will go without atomic locks.
This patch introduces mem_cgroup_begin_update_page_stat() and
mem_cgroup_end_update_page_stat() both should be called at modifying
'struct page' information if memcg takes care of it. as
mem_cgroup_begin_update_page_stat()
modify page information
mem_cgroup_update_page_stat()
=> never check any 'struct page' info, just update counters.
mem_cgroup_end_update_page_stat().
This patch is slow because we need to call begin_update_page_stat()/
end_update_page_stat() regardless of accounted will be changed or not. A
following patch adds an easy optimization and reduces the cost.
[akpm@linux-foundation.org: s/lock/locked/]
[hughd@google.com: fix deadlock by avoiding stat lock when anon]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PCG_MOVE_LOCK is used for bit spinlock to avoid race between overwriting
pc->mem_cgroup and page statistics accounting per memcg. This lock helps
to avoid the race but the race is very rare because moving tasks between
cgroup is not a usual job. So, it seems using 1bit per page is too
costly.
This patch changes this lock as per-memcg spinlock and removes
PCG_MOVE_LOCK.
If smaller lock is required, we'll be able to add some hashes but I'd like
to start from this.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We record 'the page is cache' with the PCG_CACHE bit in page_cgroup.
Here, "CACHE" means anonymous user pages (and SwapCache). This doesn't
include shmem.
Considering callers, at charge/uncharge, the caller should know what the
page is and we don't need to record it by using one bit per page.
This patch removes PCG_CACHE bit and make callers of
mem_cgroup_charge_statistics() to specify what the page is.
About page migration: Mapping of the used page is not touched during migra
tion (see page_remove_rmap) so we can rely on it and push the correct
charge type down to __mem_cgroup_uncharge_common from end_migration for
unused page. The force flag was misleading was abused for skipping the
needless page_mapped() / PageCgroupMigration() check, as we know the
unused page is no longer mapped and cleared the migration flag just a few
lines up. But doing the checks is no biggie and it's not worth adding
another flag just to skip them.
[akpm@linux-foundation.org: checkpatch fixes]
[hughd@google.com: fix PageAnon uncharging]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ying Han <yinghan@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c1e2ee2dc4 ("memcg: replace ss->id_lock with a rwlock") has now
been seen to cause the unfair behavior we should have expected from
converting a spinlock to an rwlock: softlockup in cgroup_mkdir(), whose
get_new_cssid() is waiting for the wlock, while there are 19 tasks using
the rlock in css_get_next() to get on with their memcg workload (in an
artificial test, admittedly). Yet lib/idr.c was made suitable for RCU
way back: revert that commit, restoring ss->id_lock to a spinlock.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
PAGE_SIZE, but this is not sufficient.
Modify hugetlb_file_setup() to align requests to the huge page size, and
to accept an address argument so that all alignment checks can be
performed in hugetlb_file_setup(), rather than in its callers. Change
newseg() and mmap_pgoff() to match the new prototype and eliminate a now
redundant alignment check.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Steven Truelove <steven.truelove@utoronto.ca>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sync_mm_rss() can only be used for current to avoid race conditions in
iterating and clearing its per-task counters. Remove the task argument
for it and its helper function, __sync_task_rss_stat(), to avoid thinking
it can be used safely for anything other than current.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
general quota handling code, and they don't much resemble its behaviour.
Rather than being about maintaining limits on on-disk block usage by
particular users, they are instead about maintaining limits on in-memory
page usage (including anonymous MAP_PRIVATE copied-on-write pages)
associated with a particular hugetlbfs filesystem instance.
Worse, they work by having callbacks to the hugetlbfs filesystem code from
the low-level page handling code, in particular from free_huge_page().
This is a layering violation of itself, but more importantly, if the
kernel does a get_user_pages() on hugepages (which can happen from KVM
amongst others), then the free_huge_page() can be delayed until after the
associated inode has already been freed. If an unmount occurs at the
wrong time, even the hugetlbfs superblock where the "quota" limits are
stored may have been freed.
Andrew Barry proposed a patch to fix this by having hugepages, instead of
storing a pointer to their address_space and reaching the superblock from
there, had the hugepages store pointers directly to the superblock,
bumping the reference count as appropriate to avoid it being freed.
Andrew Morton rejected that version, however, on the grounds that it made
the existing layering violation worse.
This is a reworked version of Andrew's patch, which removes the extra, and
some of the existing, layering violation. It works by introducing the
concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
finite logical pool of hugepages to allocate from. hugetlbfs now creates
a subpool for each filesystem instance with a page limit set, and a
pointer to the subpool gets added to each allocated hugepage, instead of
the address_space pointer used now. The subpool has its own lifetime and
is only freed once all pages in it _and_ all other references to it (i.e.
superblocks) are gone.
subpools are optional - a NULL subpool pointer is taken by the code to
mean that no subpool limits are in effect.
Previous discussion of this bug found in: "Fix refcounting in hugetlbfs
quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
http://marc.info/?l=linux-mm&m=126928970510627&w=1
v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
alloc_huge_page() - since it already takes the vma, it is not necessary.
Signed-off-by: Andrew Barry <abarry@cray.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>