Commit Graph

13562 Commits

Author SHA1 Message Date
Linus Torvalds
a11637194a Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ftrace: Make all inline tags also include notrace
  perf: Use css_tryget() to avoid propping up css refcount
  perf tools: Fix synthesizing tracepoint names from the perf.data headers
  perf stat: Fix default output file
  perf tools: Fix endianity swapping for adds_features bitmask
2012-06-22 10:58:57 -07:00
Linus Torvalds
2ce5682947 Merge branch 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull two cgroup fixes from Tejun Heo:
 "This containes two patches fixing a refcnt race bug during css_put().
  Decrementing and checking the value weren't atomic and two tasks could
  think that they both pushed the counter to zero."

* 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroups: Account for CSS_DEACT_BIAS in __css_put
  cgroup: make sure that decisions in __css_put are atomic
2012-06-20 22:11:04 -07:00
Linus Torvalds
fe80352460 Merge tag 'driver-core-3.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and printk fixes from Greg Kroah-Hartman:
 "Here are some fixes for 3.5-rc4 that resolve the kmsg problems that
  people have reported showing up after the printk and kmsg changes went
  into 3.5-rc1.  There are also a smattering of other tiny fixes for the
  extcon and hyper-v drivers that people have reported.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  extcon: max8997: Add missing kfree for info->edev in max8997_muic_remove()
  extcon: Set platform drvdata in gpio_extcon_probe() and fix irq leak
  extcon: Fix wrong index in max8997_extcon_cable[]
  kmsg - kmsg_dump() fix CONFIG_PRINTK=n compilation
  printk: return -EINVAL if the message len is bigger than the buf size
  printk: use mutex lock to stop syslog_seq from going wild
  kmsg - kmsg_dump() use iterator to receive log buffer content
  vme: change maintainer e-mail address
  Extcon: Don't try to create duplicate link names
  driver core: fixup reversed deferred probe order
  printk: Fix alignment of buf causing crash on ARM EABI
  Tools: hv: verify origin of netlink connector message
2012-06-20 15:14:28 -07:00
Cyrill Gorcunov
5702c5eeab c/r: prctl: Move PR_GET_TID_ADDRESS to a proper place
During merging of PR_GET_TID_ADDRESS patch the code has been misplaced (it
happened to appear under PR_MCE_KILL) in result noone can use this option.

Fix it by moving code snippet to a proper place.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20 14:39:36 -07:00
Oleg Nesterov
50d75f8dae pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper
find_new_reaper() changes pid_ns->child_reaper, see add0d4df ("pid_ns:
zap_pid_ns_processes: fix the ->child_reaper changing").

The original reason has gone away after the previous patch, ->children
list must be empty after zap_pid_ns_processes().

However now we can not switch to init_pid_ns.child_reaper.
__unhash_process() relies on the "->child_reaper == parent" check, but
this check does not work if the last exiting task is also the child
reaper.

As Eric sugested, we can change __unhash_process() to use the parent's
pid_ns and remove this code.

Also, with this change we can move detach_pid(PIDTYPE_PID) back, where it
was before the previous fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Tested-by: Andrew Wagin <avagin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20 14:39:36 -07:00
Eric W. Biederman
6347e90091 pidns: guarantee that the pidns init will be the last pidns process reaped
Today we have a twofold bug.  Sometimes release_task on pid == 1 in a pid
namespace can run before other processes in a pid namespace have had
release task called.  With the result that pid_ns_release_proc can be
called before the last proc_flus_task() is done using upid->ns->proc_mnt,
resulting in the use of a stale pointer.  This same set of circumstances
can lead to waitpid(...) returning for a processes started with
clone(CLONE_NEWPID) before the every process in the pid namespace has
actually exited.

To fix this modify zap_pid_ns_processess wait until all other processes in
the pid namespace have exited, even EXIT_DEAD zombies.

The delay_group_leader and related tests ensure that the thread gruop
leader will be the last thread of a process group to be reaped, or to
become EXIT_DEAD and self reap.  With the change to zap_pid_ns_processes
we get the guarantee that pid == 1 in a pid namespace will be the last
task that release_task is called on.

With pid == 1 being the last task to pass through release_task
pid_ns_release_proc can no longer be called too early nor can wait return
before all of the EXIT_DEAD tasks in a pid namespace have exited.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Tested-by: Andrew Wagin <avagin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20 14:39:36 -07:00
Konstantin Khlebnikov
4fe7efdbdf mm: correctly synchronize rss-counters at exit/exec
do_exit() and exec_mmap() call sync_mm_rss() before mm_release() does
put_user(clear_child_tid) which can update task->rss_stat and thus make
mm->rss_stat inconsistent.  This triggers the "BUG:" printk in check_mm().

Let's fix this bug in the safest way, and optimize/cleanup this later.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20 14:39:36 -07:00
Salman Qazi
8e3bbf42c6 cgroups: Account for CSS_DEACT_BIAS in __css_put
When we fixed the race between atomic_dec and css_refcnt, we missed
the fact that css_refcnt internally subtracts CSS_DEACT_BIAS to get
the actual reference count.  This can potentially cause a refcount leak
if __css_put races with cgroup_clear_css_refs.

Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-06-18 15:38:02 -07:00
Salman Qazi
9c5da09d26 perf: Use css_tryget() to avoid propping up css refcount
An rmdir pushes css's ref count to zero.  However, if the associated
directory is open at the time, the dentry ref count is non-zero.  If
the fd for this directory is then passed into perf_event_open, it
does a css_get().  This bounces the ref count back up from zero.  This
is a problem by itself.  But what makes it turn into a crash is the
fact that we end up doing an extra dput, since we perform a dput
when css_put sees the ref count go down to zero.

css_tryget() does not fall into that trap. So, we use that instead.

Reproduction test-case for the bug:

 #include <unistd.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <linux/unistd.h>
 #include <linux/perf_event.h>
 #include <string.h>
 #include <errno.h>
 #include <stdio.h>

 #define PERF_FLAG_PID_CGROUP    (1U << 2)

 int perf_event_open(struct perf_event_attr *hw_event_uptr,
                     pid_t pid, int cpu, int group_fd, unsigned long flags) {
         return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu,
                 group_fd, flags);
 }

 /*
  * Directly poke at the perf_event bug, since it's proving hard to repro
  * depending on where in the kernel tree.  what moved?
  */
 int main(int argc, char **argv)
 {
        int fd;
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.exclude_kernel = 1;
        attr.size = sizeof(attr);
        mkdir("/dev/cgroup/perf_event/blah", 0777);
        fd = open("/dev/cgroup/perf_event/blah", O_RDONLY);
        perror("open");
        rmdir("/dev/cgroup/perf_event/blah");
        sleep(2);
        perf_event_open(&attr, fd, 0, -1,  PERF_FLAG_PID_CGROUP);
        perror("perf_event_open");
        close(fd);
        return 0;
 }

Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20120614223108.1025.2503.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-18 11:45:57 +02:00
Yuanhan Liu
b56a39ac26 printk: return -EINVAL if the message len is bigger than the buf size
Just like what devkmsg_read() does, return -EINVAL if the message len is
bigger than the buf size, or it will trigger a segfault error.

Acked-by: Kay Sievers <kay@vrfy.org>
Acked-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-16 08:36:03 -07:00
Yuanhan Liu
4a77a5a06e printk: use mutex lock to stop syslog_seq from going wild
Although syslog_seq and log_next_seq stuff are protected by logbuf_lock
spin log, it's not enough. Say we have two processes A and B, and let
syslog_seq = N, while log_next_seq = N + 1, and the two processes both
come to syslog_print at almost the same time. And No matter which
process get the spin lock first, it will increase syslog_seq by one,
then release spin lock; thus later, another process increase syslog_seq
by one again. In this case, syslog_seq is bigger than syslog_next_seq.
And latter, it would make:
   wait_event_interruptiable(log_wait, syslog != log_next_seq)
don't wait any more even there is no new write comes. Thus it introduce
a infinite loop reading.

I can easily see this kind of issue by the following steps:
  # cat /proc/kmsg # at meantime, I don't kill rsyslog
                   # So they are the two processes.
  # xinit          # I added drm.debug=6 in the kernel parameter line,
                   # so that it will produce lots of message and let that
                   # issue happen

It's 100% reproducable on my side. And my disk will be filled up by
/var/log/messages in a quite short time.

So, introduce a mutex_lock to stop syslog_seq from going wild just like
what devkmsg_read() does. It does fix this issue as expected.

v2: use mutex_lock_interruptiable() instead (comments from Kay)

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-By: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-16 08:36:02 -07:00
Linus Torvalds
ed21a66c18 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  watchdog: Quiet down the boot messages
  perf/x86: Fix broken LBR fixup code
  tracing: Have tracing_off() actually turn tracing off
2012-06-15 16:58:10 -07:00
Linus Torvalds
a95f9b6e09 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core updates (RCU and locking) from Ingo Molnar:
 "Most of the diffstat comes from the RCU slow boot regression fixes,
  but there's also a debuggability improvements/fixes."

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  memblock: Document memblock_is_region_{memory,reserved}()
  rcu: Precompute RCU_FAST_NO_HZ timer offsets
  rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure
  rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks
  rcu: RCU_FAST_NO_HZ detection of callback adoption
  spinlock: Indicate that a lockup is only suspected
  kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop()
  panic: Make panic_on_oops configurable
2012-06-15 16:52:35 -07:00
Kay Sievers
e2ae715d66 kmsg - kmsg_dump() use iterator to receive log buffer content
Provide an iterator to receive the log buffer content, and convert all
kmsg_dump() users to it.

The structured data in the kmsg buffer now contains binary data, which
should no longer be copied verbatim to the kmsg_dump() users.

The iterator should provide reliable access to the buffer data, and also
supports proper log line-aware chunking of data while iterating.

Signed-off-by: Kay Sievers <kay@vrfy.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Reported-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Tested-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-15 14:53:59 -07:00
Don Zickus
a702704682 watchdog: Quiet down the boot messages
A bunch of bugzillas have complained how noisy the nmi_watchdog
is during boot-up especially with its expected failure cases
(like virt and bios resource contention).

This is my attempt to quiet them down and keep it less confusing
for the end user.  What I did is print the message for cpu0 and
save it for future comparisons.  If future cpus have an
identical message as cpu0, then don't print the redundant info.
However, if a future cpu has a different message, happily print
that loudly.

Before the change, you would see something like:

    ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
    CPU0: Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz stepping 0a
    Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
    ... version:                2
    ... bit width:              40
    ... generic registers:      2
    ... value mask:             000000ffffffffff
    ... max period:             000000007fffffff
    ... fixed-purpose events:   3
    ... event mask:             0000000700000003
    NMI watchdog enabled, takes one hw-pmu counter.
    Booting Node   0, Processors  #1
    NMI watchdog enabled, takes one hw-pmu counter.
     #2
    NMI watchdog enabled, takes one hw-pmu counter.
     #3 Ok.
    NMI watchdog enabled, takes one hw-pmu counter.
    Brought up 4 CPUs
    Total of 4 processors activated (22607.24 BogoMIPS).

After the change, it is simplified to:

    ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
    CPU0: Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz stepping 0a
    Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
    ... version:                2
    ... bit width:              40
    ... generic registers:      2
    ... value mask:             000000ffffffffff
    ... max period:             000000007fffffff
    ... fixed-purpose events:   3
    ... event mask:             0000000700000003
    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    Booting Node   0, Processors  #1 #2 #3 Ok.
    Brought up 4 CPUs

V2: little changes based on Joe Perches' feedback
V3: printk cleanup based on Ingo's feedback; checkpatch fix
V4: keep printk as one long line
V5: Ingo fix ups

Reported-and-tested-by: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: nzimmer@sgi.com
Cc: joe@perches.com
Link: http://lkml.kernel.org/r/1339594548-17227-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-14 12:20:50 +02:00
Andrew Lunn
6ebb017de9 printk: Fix alignment of buf causing crash on ARM EABI
Commit 7ff9554bb5, printk: convert
byte-buffer to variable-length record buffer, causes systems using
EABI to crash very early in the boot cycle. The first entry in struct
log is a u64, which for EABI must be 8 byte aligned.

Make use of __alignof__() so the compiler to decide the alignment, but
allow it to be overridden using CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS,
for systems which can perform unaligned access and want to save
a few bytes of space.

Tested on Orion5x and Kirkwood.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Stephen Warren <swarren@wwwdotorg.org>
Acked-by: Stephen Warren <swarren@wwwdotorg.org>
Acked-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-12 16:20:17 -07:00
Ingo Molnar
4a1e001d2b Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Merge RCU fixes from Paul E. McKenney:

 " This series has four patches, the major point of which is to eliminate
   some slowdowns (including boot-time slowdowns) resulting from some
   RCU_FAST_NO_HZ changes.  The issue with the changes is that posting timers
   from the idle loop has no effect if the CPU has entered dyntick-idle
   mode because the CPU has already computed its wakeup time, and posting
   a timer does not cause it to be recomputed.  The short-term fix is for
   RCU to precompute the timeout value so that the CPU's calculation is
   correct. "

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-11 10:30:23 +02:00
Linus Torvalds
7249450449 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix the relax_domain_level boot parameter
  sched: Validate assumptions in sched_init_numa()
  sched: Always initialize cpu-power
  sched: Fix domain iteration
  sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
  sched/numa: Load balance between remote nodes
  sched/x86: Calculate booted cores after construction of sibling_mask
2012-06-08 14:59:29 -07:00
Randy Dunlap
cd96891d48 sched/fair: fix lots of kernel-doc warnings
Fix lots of new kernel-doc warnings in kernel/sched/fair.c:

  Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
  Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
  Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
  Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
  Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
  .. more warnings

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-08 14:59:10 -07:00
Linus Torvalds
106544d81d Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "A bit larger than what I'd wish for - half of it is due to hw driver
  updates to Intel Ivy-Bridge which info got recently released,
  cycles:pp should work there now too, amongst other things.  (but we
  are generally making exceptions for hardware enablement of this type.)

  There are also callchain fixes in it - responding to mostly
  theoretical (but valid) concerns.  The tooling side sports perf.data
  endianness/portability fixes which did not make it for the merge
  window - and various other fixes as well."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  perf/x86: Check user address explicitly in copy_from_user_nmi()
  perf/x86: Check if user fp is valid
  perf: Limit callchains to 127
  perf/x86: Allow multiple stacks
  perf/x86: Update SNB PEBS constraints
  perf/x86: Enable/Add IvyBridge hardware support
  perf/x86: Implement cycles:p for SNB/IVB
  perf/x86: Fix Intel shared extra MSR allocation
  x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
  perf: Remove duplicate invocation on perf_event_for_each
  perf uprobes: Remove unnecessary check before strlist__delete
  perf symbols: Check for valid dso before creating map
  perf evsel: Fix 32 bit values endianity swap for sample_id_all header
  perf session: Handle endianity swap on sample_id_all header data
  perf symbols: Handle different endians properly during symbol load
  perf evlist: Pass third argument to ioctl explicitly
  perf tools: Update ioctl documentation for PERF_IOC_FLAG_GROUP
  perf tools: Make --version show kernel version instead of pull req tag
  perf tools: Check if callchain is corrupted
  perf callchain: Make callchain cursors TLS
  ...
2012-06-08 09:14:46 -07:00
Linus Torvalds
b1e25f4109 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull leap second timer fix from Thomas Gleixner.

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond
2012-06-08 09:11:33 -07:00
Ingo Molnar
9ee6ddc9da Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull brown paper bag fix from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-08 12:23:22 +02:00
Linus Torvalds
48d212a2ee Revert "mm: correctly synchronize rss-counters at exit/exec"
This reverts commit 40af1bbdca.

It's horribly and utterly broken for at least the following reasons:

 - calling sync_mm_rss() from mmput() is fundamentally wrong, because
   there's absolutely no reason to believe that the task that does the
   mmput() always does it on its own VM.  Example: fork, ptrace, /proc -
   you name it.

 - calling it *after* having done mmdrop() on it is doubly insane, since
   the mm struct may well be gone now.

 - testing mm against NULL before you call it is insane too, since a
NULL mm there would have caused oopses long before.

.. and those are just the three bugs I found before I decided to give up
looking for me and revert it asap.  I should have caught it before I
even took it, but I trusted Andrew too much.

Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07 17:54:07 -07:00
Konstantin Khlebnikov
40af1bbdca mm: correctly synchronize rss-counters at exit/exec
mm->rss_stat counters have per-task delta: task->rss_stat.  Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().

do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats,
audit and other stuff.  Unfortunately the kernel does this before
calling mm_release(), which can call put_user() for processing
task->clear_child_tid.  So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again.  As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:

| BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1

This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier.  After mm_release() there should
be no pagefaults.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>		[3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07 14:43:55 -07:00
Cyrill Gorcunov
736f24d5e5 c/r: prctl: drop VMA flags test on PR_SET_MM_ stack data assignment
In commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") the stack allocated via clone() is marked in
/proc/<pid>/maps as [stack:%d] thus it might be out of the former
mm->start_stack/end_stack values (and even has some custom VMA flags
set).

So to be able to restore mm->start_stack/end_stack drop vma flags test,
but still require the underlying VMA to exist.

As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
requires CAP_SYS_RESOURCE to be granted.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07 14:43:55 -07:00