Commit Graph

4744 Commits

Author SHA1 Message Date
Glauber Costa 3f13461939 memcg: decrement static keys at real destroy time
We call the destroy function when a cgroup starts to be removed, such as
by a rmdir event.

However, because of our reference counters, some objects are still
inflight.  Right now, we are decrementing the static_keys at destroy()
time, meaning that if we get rid of the last static_key reference, some
objects will still have charges, but the code to properly uncharge them
won't be run.

This becomes a problem specially if it is ever enabled again, because now
new charges will be added to the staled charges making keeping it pretty
much impossible.

We just need to be careful with the static branch activation: since there
is no particular preferred order of their activation, we need to make sure
that we only start using it after all call sites are active.  This is
achieved by having a per-memcg flag that is only updated after
static_key_slow_inc() returns.  At this time, we are sure all sites are
active.

This is made per-memcg, not global, for a reason: it also has the effect
of making socket accounting more consistent.  The first memcg to be
limited will trigger static_key() activation, therefore, accounting.  But
all the others will then be accounted no matter what.  After this patch,
only limited memcgs will have its sockets accounted.

[akpm@linux-foundation.org: move enum sock_flag_bits into sock.h,
                            document enum sock_flag_bits,
                            convert memcg_proto_active() and memcg_proto_activated() to test_bit(),
                            redo tcp_update_limit() comment to 80 cols]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:28 -07:00
Linus Torvalds 28f3d71761 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull more networking updates from David Miller:
 "Ok, everything from here on out will be bug fixes."

1) One final sync of wireless and bluetooth stuff from John Linville.
   These changes have all been in his tree for more than a week, and
   therefore have had the necessary -next exposure.  John was just away
   on a trip and didn't have a change to send the pull request until a
   day or two ago.

2) Put back some defines in user exposed header file areas that were
   removed during the tokenring purge.  From Stephen Hemminger and Paul
   Gortmaker.

3) A bug fix for UDP hash table allocation got lost in the pile due to
   one of those "you got it..  no I've got it.." situations.  :-)

   From Tim Bird.

4) SKB coalescing in TCP needs to have stricter checks, otherwise we'll
   try to coalesce overlapping frags and crash.  Fix from Eric Dumazet.

5) RCU routing table lookups can race with free_fib_info(), causing
   crashes when we deref the device pointers in the route.  Fix by
   releasing the net device in the RCU callback.  From Yanmin Zhang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (293 commits)
  tcp: take care of overlaps in tcp_try_coalesce()
  ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  mm: add a low limit to alloc_large_system_hash
  ipx: restore token ring define to include/linux/ipx.h
  if: restore token ring ARP type to header
  xen: do not disable netfront in dom0
  phy/micrel: Fix ID of KSZ9021
  mISDN: Add X-Tensions USB ISDN TA XC-525
  gianfar:don't add FCB length to hard_header_len
  Bluetooth: Report proper error number in disconnection
  Bluetooth: Create flags for bt_sk()
  Bluetooth: report the right security level in getsockopt
  Bluetooth: Lock the L2CAP channel when sending
  Bluetooth: Restore locking semantics when looking up L2CAP channels
  Bluetooth: Fix a redundant and problematic incoming MTU check
  Bluetooth: Add support for Foxconn/Hon Hai AR5BBU22 0489:E03C
  Bluetooth: Fix EIR data generation for mgmt_device_found
  Bluetooth: Fix Inquiry with RSSI event mask
  Bluetooth: improve readability of l2cap_seq_list code
  Bluetooth: Fix skb length calculation
  ...
2012-05-24 11:54:29 -07:00
Eric Dumazet 1ca7ee3063 tcp: take care of overlaps in tcp_try_coalesce()
Sergio Correia reported following warning :

WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()

WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
     "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
     tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);

It appears TCP coalescing, and more specifically commit b081f85c29
(net: implement tcp coalescing in tcp_queue_rcv()) should take care of
possible segment overlaps in receive queue. This was properly done in
the case of out_or_order_queue by the caller.

For example, segment at tail of queue have sequence 1000-2000, and we
add a segment with sequence 1500-2500.
This can happen in case of retransmits.

In this case, just don't do the coalescing.

Reported-by: Sergio Correia <lists@uece.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Sergio Correia <lists@uece.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-24 00:28:21 -04:00
Yanmin Zhang e49cc0da72 ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
We hit a kernel OOPS.

<3>[23898.789643] BUG: sleeping function called from invalid context at
/data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G        W
3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.000416]  [<c1238c2a>] __might_sleep+0x10a/0x110
<4>[23900.007357]  [<c1228021>] do_page_fault+0xd1/0x3c0
<4>[23900.013764]  [<c18e9ba9>] ? restore_all+0xf/0xf
<4>[23900.024024]  [<c17c007b>] ? napi_complete+0x8b/0x690
<4>[23900.029297]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.123739]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.128955]  [<c18ea0c3>] error_code+0x5f/0x64
<4>[23900.133466]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.138450]  [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312]  [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730]  [<c17f63df>] ip_route_output_flow+0x1f/0x60
<4>[23900.156261]  [<c181de58>] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960]  [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834]  [<c18298d6>] inet_dgram_connect+0x36/0x80
<4>[23900.173224]  [<c14f9e88>] ? _copy_from_user+0x48/0x140
<4>[23900.178817]  [<c17ab9da>] sys_connect+0x9a/0xd0
<4>[23900.183538]  [<c132e93c>] ? alloc_file+0xdc/0x240
<4>[23900.189111]  [<c123925d>] ? sub_preempt_count+0x3d/0x50

Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing
fi. Other cpu might be accessing fi. Fixing it by delaying the releasing.

With the patch, we ran MTBF testing on Android mobile for 12 hours
and didn't trigger the issue.

Thank Eric for very detailed review/checking the issue.

Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-24 00:28:21 -04:00
Tim Bird 31fe62b958 mm: add a low limit to alloc_large_system_hash
UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

On some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocs a bigger memory area.

As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.

Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-24 00:28:21 -04:00
Linus Torvalds 644473e9c6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace enhancements from Eric Biederman:
 "This is a course correction for the user namespace, so that we can
  reach an inexpensive, maintainable, and reasonably complete
  implementation.

  Highlights:
   - Config guards make it impossible to enable the user namespace and
     code that has not been converted to be user namespace safe.

   - Use of the new kuid_t type ensures the if you somehow get past the
     config guards the kernel will encounter type errors if you enable
     user namespaces and attempt to compile in code whose permission
     checks have not been updated to be user namespace safe.

   - All uids from child user namespaces are mapped into the initial
     user namespace before they are processed.  Removing the need to add
     an additional check to see if the user namespace of the compared
     uids remains the same.

   - With the user namespaces compiled out the performance is as good or
     better than it is today.

   - For most operations absolutely nothing changes performance or
     operationally with the user namespace enabled.

   - The worst case performance I could come up with was timing 1
     billion cache cold stat operations with the user namespace code
     enabled.  This went from 156s to 164s on my laptop (or 156ns to
     164ns per stat operation).

   - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
     Most uid/gid setting system calls treat these value specially
     anyway so attempting to use -1 as a uid would likely cause
     entertaining failures in userspace.

   - If setuid is called with a uid that can not be mapped setuid fails.
     I have looked at sendmail, login, ssh and every other program I
     could think of that would call setuid and they all check for and
     handle the case where setuid fails.

   - If stat or a similar system call is called from a context in which
     we can not map a uid we lie and return overflowuid.  The LFS
     experience suggests not lying and returning an error code might be
     better, but the historical precedent with uids is different and I
     can not think of anything that would break by lying about a uid we
     can't map.

   - Capabilities are localized to the current user namespace making it
     safe to give the initial user in a user namespace all capabilities.

  My git tree covers all of the modifications needed to convert the core
  kernel and enough changes to make a system bootable to runlevel 1."

Fix up trivial conflicts due to nearby independent changes in fs/stat.c

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
  userns:  Silence silly gcc warning.
  cred: use correct cred accessor with regards to rcu read lock
  userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
  userns: Convert cgroup permission checks to use uid_eq
  userns: Convert tmpfs to use kuid and kgid where appropriate
  userns: Convert sysfs to use kgid/kuid where appropriate
  userns: Convert sysctl permission checks to use kuid and kgids.
  userns: Convert proc to use kuid/kgid where appropriate
  userns: Convert ext4 to user kuid/kgid where appropriate
  userns: Convert ext3 to use kuid/kgid where appropriate
  userns: Convert ext2 to use kuid/kgid where appropriate.
  userns: Convert devpts to use kuid/kgid where appropriate
  userns: Convert binary formats to use kuid/kgid where appropriate
  userns: Add negative depends on entries to avoid building code that is userns unsafe
  userns: signal remove unnecessary map_cred_ns
  userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
  userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
  userns: Convert stat to return values mapped from kuids and kgids
  userns: Convert user specfied uids and gids in chown into kuids and kgid
  userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
  ...
2012-05-23 17:42:39 -07:00
Linus Torvalds 88d6ae8dc3 Merge branch 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "cgroup file type addition / removal is updated so that file types are
  added and removed instead of individual files so that dynamic file
  type addition / removal can be implemented by cgroup and used by
  controllers.  blkio controller changes which will come through block
  tree are dependent on this.  Other changes include res_counter cleanup
  and disallowing kthread / PF_THREAD_BOUND threads to be attached to
  non-root cgroups.

  There's a reported bug with the file type addition / removal handling
  which can lead to oops on cgroup umount.  The issue is being looked
  into.  It shouldn't cause problems for most setups and isn't a
  security concern."

Fix up trivial conflict in Documentation/feature-removal-schedule.txt

* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  res_counter: Account max_usage when calling res_counter_charge_nofail()
  res_counter: Merge res_counter_charge and res_counter_charge_nofail
  cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
  cgroup: remove cgroup_subsys->populate()
  cgroup: get rid of populate for memcg
  cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
  cgroup: make css->refcnt clearing on cgroup removal optional
  cgroup: use negative bias on css->refcnt to block css_tryget()
  cgroup: implement cgroup_rm_cftypes()
  cgroup: introduce struct cfent
  cgroup: relocate __d_cgrp() and __d_cft()
  cgroup: remove cgroup_add_file[s]()
  cgroup: convert memcg controller to the new cftype interface
  memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
  cgroup: convert all non-memcg controllers to the new cftype interface
  cgroup: relocate cftype and cgroup_subsys definitions in controllers
  cgroup: merge cft_release_agent cftype array into the base files array
  cgroup: implement cgroup_add_cftypes() and friends
  cgroup: build list of all cgroups under a given cgroupfs_root
  cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
  ...
2012-05-22 17:40:19 -07:00
David S. Miller 17eea0df5f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-20 21:53:04 -04:00
Eldad Zack 413c27d869 net/ipv4: replace simple_strtoul with kstrtoul
Replace simple_strtoul with kstrtoul in three similar occurrences, all setup
handlers:
* route.c: set_rhash_entries
* tcp.c: set_thash_entries
* udp.c: set_uhash_entries

Also check if the conversion failed.

Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-20 04:06:17 -04:00
Eldad Zack b37f4d7b01 net/ipv4/ipconfig: neaten __setup placement
The __setup macro should follow the corresponding setup handler.

Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-20 04:06:16 -04:00
Eric Dumazet 3cc4949269 ipv4: use skb coalescing in defragmentation
ip_frag_reasm() can use skb_try_coalesce() to build optimized skb,
reducing memory used by them (truesize), and reducing number of cache
line misses and overhead for the consumer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-19 18:34:57 -04:00
Eric Dumazet bad43ca832 net: introduce skb_try_coalesce()
Move tcp_try_coalesce() protocol independent part to
skb_try_coalesce().

skb_try_coalesce() can be used in IPv4 defrag and IPv6 reassembly,
to build optimized skbs (less sk_buff, and possibly less 'headers')

skb_try_coalesce() is zero copy, unless the copy can fit in destination
header (its a rare case)

kfree_skb_partial() is also moved to net/core/skbuff.c and exported,
because IPv6 will need it in patch (ipv6: use skb coalescing in
reassembly).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-19 18:34:57 -04:00
Eric Dumazet cbc264cacd ip_frag: struct inet_frags match() method returns a bool
- match() method returns a boolean
- return (A && B && C && D) -> return A && B && C && D
- fix indentation

Signed-off-by: Eric Dumazet <edumazet@google.com>
2012-05-18 01:40:27 -04:00
Willy Tarreau bad115cfe5 tcp: do_tcp_sendpages() must try to push data out on oom conditions
Since recent changes on TCP splicing (starting with commits 2f533844
"tcp: allow splice() to build full TSO packets" and 35f9c09f "tcp:
tcp_sendpages() should call tcp_push() once"), I started seeing
massive stalls when forwarding traffic between two sockets using
splice() when pipe buffers were larger than socket buffers.

Latest changes (net: netdev_alloc_skb() use build_skb()) made the
problem even more apparent.

The reason seems to be that if do_tcp_sendpages() fails on out of memory
condition without being able to send at least one byte, tcp_push() is not
called and the buffers cannot be flushed.

After applying the attached patch, I cannot reproduce the stalls at all
and the data rate it perfectly stable and steady under any condition
which previously caused the problem to be permanent.

The issue seems to have been there since before the kernel migrated to
git, which makes me think that the stalls I occasionally experienced
with tux during stress-tests years ago were probably related to the
same issue.

This issue was first encountered on 3.0.31 and 3.2.17, so please backport
to -stable.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Acked-by: Eric Dumazet <edumazet@google.com>
Cc: <stable@vger.kernel.org>
2012-05-17 18:31:43 -04:00
Eric Dumazet a2a385d627 tcp: bool conversions
bool conversions where possible.

__inline__ -> inline

space cleanups

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-17 14:59:59 -04:00
Eric Dumazet dc6b9b7823 net: include/net/sock.h cleanup
bool/const conversions where possible

__inline__ -> inline

space cleanups

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-17 04:50:21 -04:00
David S. Miller 028940342a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-16 22:17:37 -04:00
David S. Miller c727e7f007 Merge branch 'delete-tokenring' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2012-05-16 01:02:40 -04:00
Joe Perches 91df42bedc net: ipv4 and ipv6: Convert printk(KERN_DEBUG to pr_debug
Use the current debugging style and enable dynamic_debug.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-16 01:01:03 -04:00
Paul Gortmaker 211ed86510 net: delete all instances of special processing for token ring
We are going to delete the Token ring support.  This removes any
special processing in the core networking for token ring, (aside
from net/tr.c itself), leaving the drivers and remaining tokenring
support present but inert.

The mass removal of the drivers and net/tr.c will be in a separate
commit, so that the history of these files that we still care
about won't have the giant deletion tied into their history.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-05-15 20:14:35 -04:00
Joe Perches e87cc4728f net: Convert net_ratelimit uses to net_<level>_ratelimited
Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-15 13:45:03 -04:00
Jan Beulich 7e15252498 xfrm: make xfrm_algo.c a module
By making this a standalone config option (auto-selected as needed),
selecting CRYPTO from here rather than from XFRM (which is boolean)
allows the core crypto code to become a module again even when XFRM=y.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-15 13:13:34 -04:00
Pavel Emelyanov 1070b1b831 tcp: Out-line tcp_try_rmem_schedule
As proposed by Eric, make the tcp_input.o thinner.

add/remove: 1/1 grow/shrink: 1/4 up/down: 868/-1329 (-461)
function                                     old     new   delta
tcp_try_rmem_schedule                          -     864    +864
tcp_ack                                     4811    4815      +4
tcp_validate_incoming                        817     815      -2
tcp_collapse                                 860     858      -2
tcp_send_rcvq                                555     353    -202
tcp_data_queue                              3435    3033    -402
tcp_prune_queue                              721       -    -721

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 23:24:36 -04:00
Pavel Emelyanov 3c961afed4 tcp: Schedule rmem for rcvq repair send
As noted by Eric, no checks are performed on the data size we're
putting in the read queue during repair. Thus, validate the given
data size with the common rmem management routine.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 23:24:35 -04:00
Pavel Emelyanov 292e8d8c85 tcp: Move rcvq sending to tcp_input.c
It actually works on the input queue and will use its read mem
routines, thus it's better to have in in the tcp_input.c file.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 23:24:35 -04:00