Davem says:
1) Fix JIT code generation on x86-64 for divide by zero, from Eric Dumazet.
2) tg3 header length computation correction from Eric Dumazet.
3) More build and reference counting fixes for socket memory cgroup
code from Glauber Costa.
4) module.h snuck back into a core header after all the hard work we
did to remove that, from Paul Gortmaker and Jesper Dangaard Brouer.
5) Fix PHY naming regression and add some new PCI IDs in stmmac, from
Alessandro Rubini.
6) Netlink message generation fix in new team driver, should only advertise
the entries that changed during events, from Jiri Pirko.
7) SRIOV VF registration and unregistration fixes, and also add a
missing PCI ID, from Roopa Prabhu.
8) Fix infinite loop in tx queue flush code of brcmsmac, from Stanislaw Gruszka.
9) ftgmac100/ftmac100 build fix, missing interrupt.h include.
10) Memory leak fix in net/hyperv do_set_mutlicast() handling, from Wei Yongjun.
11) Off by one fix in netem packet scheduler, from Vijay Subramanian.
12) TCP loss detection fix from Yuchung Cheng.
13) TCP reset packet MD5 calculation uses wrong address, fix from Shawn Lu.
14) skge carrier assertion and DMA mapping fixes from Stephen Hemminger.
15) Congestion recovery undo performed at the wrong spot in BIC and CUBIC
congestion control modules, fix from Neal Cardwell.
16) Ethtool ETHTOOL_GSSET_INFO is unnecessarily restrictive, from Michał Mirosław.
17) Fix triggerable race in ipv6 sysctl handling, from Francesco Ruggeri.
18) Statistics bug fixes in mlx4 from Eugenia Emantayev.
19) rds locking bug fix during info dumps, from your's truly.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
rds: Make rds_sock_lock BH rather than IRQ safe.
netprio_cgroup.h: dont include module.h from other includes
net: flow_dissector.c missing include linux/export.h
team: send only changed options/ports via netlink
net/hyperv: fix possible memory leak in do_set_multicast()
drivers/net: dsa/mv88e6xxx.c files need linux/module.h
stmmac: added PCI identifiers
llc: Fix race condition in llc_ui_recvmsg
stmmac: fix phy naming inconsistency
dsa: Add reporting of silicon revision for Marvell 88E6123/88E6161/88E6165 switches.
tg3: fix ipv6 header length computation
skge: add byte queue limit support
mv643xx_eth: Add Rx Discard and Rx Overrun statistics
bnx2x: fix compilation error with SOE in fw_dump
bnx2x: handle CHIP_REVISION during init_one
bnx2x: allow user to change ring size in ISCSI SD mode
bnx2x: fix Big-Endianess in ethtool -t
bnx2x: fixed ethtool statistics for MF modes
bnx2x: credit-leakage fixup on vlan_mac_del_all
macvlan: fix a possible use after free
...
end_migration() passes the old page instead of the new page to commit
the charge. This page descriptor is not used for committing itself,
though, since we also pass the (correct) page_cgroup descriptor. But
it's used to find the soft limit tree through the page's zone, so the
soft limit tree of the old page's zone is updated instead of that of the
new page's, which might get slightly out of date until the next charge
reaches the ratelimit point.
This glitch has been present since 5564e88 ("memcg: condense
page_cgroup-to-page lookup points").
This fixes a bug that I introduced in 2.6.38. It's benign enough (to my
knowledge) that we probably don't want this for stable.
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Although only used currently for tcp sockets, this function
is now used in common sock code (for sock_clone())
Commit 475f1b5264 moved the
declaration of sock_update_clone() to inside sock.c, but
this only fixes the problem when CONFIG_CGROUP_MEM_RES_CTLR_KMEM
is also not defined.
This patch here is verified to fix both problems, although
reverting the previous one is not necessary.
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If DEBUG_VM, mem_cgroup_print_bad_page() is called whenever bad_page()
shows a "Bad page state" message, removes page from circulation, adds a
taint and continues. This is at a very low level, often when a spinlock
is held (sometimes when page table lock is held, for example).
We want to recover from this badness, not make it worse: we must not
kmalloc memory here, we must not do a cgroup path lookup via dubious
pointers. No doubt that code was useful to debug a particular case at one
time, and may be again, but take it out of the mainline kernel.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch started off as a cleanup: __split_huge_page_refcounts() has to
cope with two scenarios, when the hugepage being split is already on LRU,
and when it is not; but why does it have to split that accounting across
three different sites? Consolidate it in lru_add_page_tail(), handling
evictable and unevictable alike, and use standard add_page_to_lru_list()
when accounting is needed (when the head is not yet on LRU).
But a recent regression in -next, I guess the removal of PageCgroupAcctLRU
test from mem_cgroup_split_huge_fixup(), makes this now a necessary fix:
under load, the MEM_CGROUP_ZSTAT count was wrapping to a huge number,
messing up reclaim calculations and causing a freeze at rmdir of cgroup.
Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that
count - this has not been the only such incident. Document that
lru_add_page_tail() is for Transparent HugePages by #ifdef around it.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, at LRU handling, memory cgroup needs to do complicated works to see
valid pc->mem_cgroup, which may be overwritten.
This patch is for relaxing the protocol. This patch guarantees
- when pc->mem_cgroup is overwritten, page must not be on LRU.
By this, LRU routine can believe pc->mem_cgroup and don't need to check
bits on pc->flags. This new rule may adds small overheads to swapin. But
in most case, lru handling gets faster.
After this patch, PCG_ACCT_LRU bit is obsolete and removed.
[akpm@linux-foundation.org: remove unneeded VM_BUG_ON(), restore hannes's christmas tree]
[akpm@linux-foundation.org: clean up code comment]
[hughd@google.com: fix NULL mem_cgroup_try_charge]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
and reducing atomic ops/complexity in memcg LRU handling.
In some cases, pages are added to lru before charge to memcg and pages
are not classfied to memory cgroup at lru addtion. Now, the lru where
the page should be added is determined a bit in page_cgroup->flags and
pc->mem_cgroup. I'd like to remove the check of flag.
To handle the case pc->mem_cgroup may contain stale pointers if pages
are added to LRU before classification. This patch resets
pc->mem_cgroup to root_mem_cgroup before lru additions.
[akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
[hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
[akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
[hughd@google.com: stop oops in mem_cgroup_reset_owner()]
[hughd@google.com: fix page migration to reset_owner]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch simplifies LRU handling of racy case (memcg+SwapCache). At
charging, SwapCache tend to be on LRU already. So, before overwriting
pc->mem_cgroup, the page must be removed from LRU and added to LRU
later.
This patch does
spin_lock(zone->lru_lock);
if (PageLRU(page))
remove from LRU
overwrite pc->mem_cgroup
if (PageLRU(page))
add to new LRU.
spin_unlock(zone->lru_lock);
And guarantee all pages are not on LRU at modifying pc->mem_cgroup.
This patch also unfies lru handling of replace_page_cache() and
swapin.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a clean up. No functional/logical changes.
Because of commit ef6a3c6311 ("mm: add replace_page_cache_page()
function") , FUSE uses replace_page_cache() instead of
add_to_page_cache(). Then, mem_cgroup_cache_charge() is not called
against FUSE's pages from splice.
So now, mem_cgroup_cache_charge() gets pages that are not on the LRU
with the exception of PageSwapCache pages. For checking,
WARN_ON_ONCE(PageLRU(page)) is added.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom killer relies on logic that identifies threads that have already
been oom killed when scanning the tasklist and, if found, deferring
until such threads have exited. This is done by checking for any
candidate threads that have the TIF_MEMDIE bit set.
For memcg ooms, candidate threads are first found by calling
task_in_mem_cgroup() since the oom killer should not defer if there's an
oom killed thread in another memcg.
Unfortunately, task_in_mem_cgroup() excludes threads if they have
detached their mm in the process of exiting so TIF_MEMDIE is never
detected for such conditions. This is different for global, mempolicy,
and cpuset oom conditions where a detached mm is only excluded after
checking for TIF_MEMDIE and deferring, if necessary, in
select_bad_process().
The fix is to return true if a task has a detached mm but is still in
the memcg or its hierarchy that is currently oom. This will allow the
oom killer to appropriately defer rather than kill unnecessarily or, in
the worst case, panic the machine if nothing else is available to kill.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pages have their corresponding page_cgroup descriptors set up before
they are used in userspace, and thus managed by a memory cgroup.
The only time where lookup_page_cgroup() can return NULL is in the
CONFIG_DEBUG_VM-only page sanity checking code that executes while
feeding pages into the page allocator for the first time.
Remove the NULL checks against lookup_page_cgroup() results from all
callsites where we know that corresponding page_cgroup descriptors must
be allocated, and add a comment to the callsite that actually does have
to check the return value.
[hughd@google.com: stop oops in mem_cgroup_update_page_stat()]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
page_cgroup modifcations. It takes move_lock_page_cgroup() and modifies
page_cgroup and LRU accounting jobs and called HPAGE_PMD_SIZE - 1 times.
But thinking again,
- compound_lock() is held at move_accout...then, it's not necessary
to take move_lock_page_cgroup().
- LRU is locked and all tail pages will go into the same LRU as
head is now on.
- page_cgroup is contiguous in huge page range.
This patch fixes mem_cgroup_split_huge_fixup() as to be called once per
hugepage and reduce costs for spliting.
[akpm@linux-foundation.org: fix typo, per Michal]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
root_mem_cgroup, lacking a configurable limit, was never subject to
limit reclaim, so the pages charged to it could be kept off its LRU
lists. They would be found on the global per-zone LRU lists upon
physical memory pressure and it made sense to avoid uselessly linking
them to both lists.
The global per-zone LRU lists are about to go away on memcg-enabled
kernels, with all pages being exclusively linked to their respective
per-memcg LRU lists. As a result, pages of the root_mem_cgroup must
also be linked to its LRU lists again. This is purely about the LRU
list, root_mem_cgroup is still not charged.
The overhead is temporary until the double-LRU scheme is going away
completely.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>