__orderly_poweroff() does argv_free() if call_usermodehelper_fns()
returns -ENOMEM. As Lucas pointed out, this can be wrong if -ENOMEM was
not triggered by the failing call_usermodehelper_setup(), in this case
both __orderly_poweroff() and argv_cleanup() can do kfree().
Kill argv_cleanup() and change __orderly_poweroff() to call argv_free()
unconditionally like do_coredump() does. This info->cleanup() is not
needed (and wrong) since 6c0c0d4d "fix bug in orderly_poweroff() which
did the UMH_NO_WAIT => UMH_WAIT_EXEC change, we can rely on the fact
that CLONE_VFORK can't return until do_execve() succeeds/fails.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: hongfeng <hongfeng@marvell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhnacements to the user
namespace. reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.
I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.
There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.
The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.
XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before it they are ready."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...
In a container with its own pid namespace and user namespace, rebooting
the system won't reboot the host, but terminate all the processes in
it and thus have the container shutdown, so it's safe.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
We have thread_group_cputime() and thread_group_times(). The naming
doesn't provide enough information about the difference between
these two APIs.
To lower the confusion, rename thread_group_times() to
thread_group_cputime_adjusted(). This name better suggests that
it's a version of thread_group_cputime() that does some stabilization
on the raw cputime values. ie here: scale on top of CFS runtime
stats and bound lower value for monotonicity.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
The min/max call needed to have explicit types on some architectures
(e.g. mn10300). Use clamp_t instead to avoid the warning:
kernel/sys.c: In function 'override_release':
kernel/sys.c:1287:10: warning: comparison of distinct pointer types lacks a cast [enabled by default]
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
orderly_poweroff is trying to poweroff platform in two steps:
step 1: Call user space application to poweroff
step 2: If user space poweroff fail, then do a force power off if force param
is set.
The bug here is, step 1 is always successful with param UMH_NO_WAIT, which obey
the design goal of orderly_poweroff.
We have two choices here:
UMH_WAIT_EXEC which means wait for the exec, but not the process;
UMH_WAIT_PROC which means wait for the process to complete.
we need to trade off the two choices:
If using UMH_WAIT_EXEC, there is potential issue comments by Serge E.
Hallyn: The exec will have started, but may for whatever (very unlikely)
reason fail.
If using UMH_WAIT_PROC, there is potential issue comments by Eric W.
Biederman: If the caller is not running in a kernel thread then we can
easily get into a case where the user space caller will block waiting for
us when we are waiting for the user space caller.
Thanks for their excellent ideas, based on the above discussion, we
finally choose UMH_WAIT_EXEC, which is much more safe, if the user
application really fails, we just complain the application itself, it
seems a better choice here.
Signed-off-by: Feng Hong <hongfeng@marvell.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As kernel_power_off() calls disable_nonboot_cpus(), we may also want to
have kernel_restart() call disable_nonboot_cpus(). Doing so can help
machines that require boot cpu be the last alive cpu during reboot to
survive with kernel restart.
This fixes one reboot issue seen on imx6q (Cortex-A9 Quad). The machine
requires that the restart routine be run on the primary cpu rather than
secondary ones. Otherwise, the secondary core running the restart
routine will fail to come to online after reboot.
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just setting the "error" to error number is enough on failure and It
doesn't require to set "error" variable to zero in each switch case,
since it was already initialized with zero. And also removed return 0
in switch case with break statement
Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"no other files mapped" requirement from my previous patch (c/r: prctl:
update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal) is too
paranoid, it forbids operation even if there mapped one shared-anon vma.
Let's check that current mm->exe_file already unmapped, in this case
exe_file symlink already outdated and its changing is reasonable.
Plus, this patch fixes exit code in case operation success.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") the stack allocated via clone() is marked in
/proc/<pid>/maps as [stack:%d] thus it might be out of the former
mm->start_stack/end_stack values (and even has some custom VMA flags
set).
So to be able to restore mm->start_stack/end_stack drop vma flags test,
but still require the underlying VMA to exist.
As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
requires CAP_SYS_RESOURCE to be granted.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zero is written at clear_tid_address when the process exits. This
functionality is used by pthread_join().
We already have sys_set_tid_address() to change this address for the
current task but there is no way to obtain it from user space.
Without the ability to find this address and dump it we can't restore
pthread'ed apps which call pthread_join() once they have been restored.
This patch introduces the PR_GET_TID_ADDRESS prctl option which allows
the current process to obtain own clear_tid_address.
This feature is available iif CONFIG_CHECKPOINT_RESTORE is set.
[akpm@linux-foundation.org: fix prctl numbering]
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pedro Alves <palves@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we do restore we would like to have a way to setup a former
mm_struct::exe_file so that /proc/pid/exe would point to the original
executable file a process had at checkpoint time.
For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a
file descriptor which will be set as a source for new /proc/$pid/exe
symlink.
Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE
vmas present for current process, simply because this feature is a special
to C/R and mm::num_exe_file_vmas become meaningless after that.
To minimize the amount of transition the /proc/pid/exe symlink might have,
this feature is implemented in one-shot manner. Thus once changed the
symlink can't be changed again. This should help sysadmins to monitor the
symlinks over all process running in a system.
In particular one could make a snapshot of processes and ring alarm if
there unexpected changes of /proc/pid/exe's in a system.
Note -- this feature is available iif CONFIG_CHECKPOINT_RESTORE is set and
the caller must have CAP_SYS_RESOURCE capability granted, otherwise the
request to change symlink will be rejected.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During checkpoint we dump whole process memory to a file and the dump
includes process stack memory. But among stack data itself, the stack
carries additional parameters such as command line arguments, environment
data and auxiliary vector.
So when we do restore procedure and once we've restored stack data itself
we need to setup mm_struct::arg_start/end, env_start/end, so restored
process would be able to find command line arguments and environment data
it had at checkpoint time. The same applies to auxiliary vector.
For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
ENV_END | AUXV) codes are introduced.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>