Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.
This set of changes is split into several parts:
- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.
- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...
Merge updates from Andrew Morton:
- a few misc things
- a few Y2038 fixes
- ntfs fixes
- arch/sh tweaks
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
mm/hmm.c: remove unused variables align_start and align_end
fs/userfaultfd.c: remove redundant pointer uwq
mm, vmacache: hash addresses based on pmd
mm/list_lru: introduce list_lru_shrink_walk_irq()
mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
mm/sparse: delete old sparse_init and enable new one
mm/sparse: add new sparse_init_nid() and sparse_init()
mm/sparse: move buffer init/fini to the common place
mm/sparse: use the new sparse buffer functions in non-vmemmap
mm/sparse: abstract sparse buffer allocations
mm/hugetlb.c: don't zero 1GiB bootmem pages
mm, page_alloc: double zone's batchsize
mm/oom_kill.c: document oom_lock
mm/hugetlb: remove gigantic page support for HIGHMEM
mm, oom: remove sleep from under oom_lock
kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
...
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When f_setown is called a pid and a pid type are stored. Replace the use
of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread
group. Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now
is only for a thread.
Update the users of __f_setown to use PIDTYPE_TGID instead of
PIDTYPE_PID.
For now the code continues to capture task_pid (when task_tgid would
really be appropriate), and iterate on PIDTYPE_PID (even when type ==
PIDTYPE_TGID) out of an abundance of caution to preserve existing
behavior.
Oleg Nesterov suggested using the test to ensure we use PIDTYPE_PID
for tgid lookup also be used to avoid taking the tasklist lock.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The cost is the the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.
__task_pid_nr_ns has been updated to take advantage of this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The flag IN_MASK_CREATE is introduced as a flag for inotiy_add_watch()
which prevents inotify from modifying any existing watches when invoked.
If the pathname specified in the call has a watched inode associated
with it and IN_MASK_CREATE is specified, fail with an errno of EEXIST.
Use of IN_MASK_CREATE with IN_MASK_ADD is reserved for future use and
will return EINVAL.
RATIONALE
In the current implementation, there is no way to prevent
inotify_add_watch() from modifying existing watch descriptors. Even if
the caller keeps a record of all watch descriptors collected, this is
only sufficient to detect that an existing watch descriptor may have
been modified.
The assumption that a particular path will map to the same inode over
multiple calls to inotify_add_watch() cannot be made as files can be
renamed or deleted. It is also not possible to assume that two distinct
paths do no map to the same inode, due to hard-links or a dereferenced
symbolic link. Further uses of inotify_add_watch() to revert the change
may cause other watch descriptors to be modified or created, merely
compunding the problem. There is currently no system call such as
inotify_modify_watch() to explicity modify a watch descriptor, which
would be able to revert unwanted changes. Thus the caller cannot
guarantee to be able to revert any changes to existing watch decriptors.
Additionally the caller cannot assume that the events that are
associated with a watch descriptor are within the set requested, as any
future calls to inotify_add_watch() may unintentionally modify a watch
descriptor's mask. Thus it cannot currently be guaranteed that a watch
descriptor will only generate events which have been requested. The
program must filter events which come through its watch descriptor to
within its expected range.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Henry Wilson <henry.wilson@acentic.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Factor out helpers fanotify_add_mark() and fanotify_remove_mark()
to reduce duplicated code.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Use a helper to get the mask from the object (i.e. i_fsnotify_mask)
to generalize code of add/remove inode/vfsmount mark.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Make the code to attach/detach a connector to object more generic
by letting the fsnotify connector point to an abstract fsnotify_connp_t.
Code that needs to dereference an inode or mount object now uses the
helpers fsnotify_conn_{inode,mount}.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Instead of passing inode and vfsmount arguments to fsnotify_add_mark()
and its _locked variant, pass an abstract object pointer and the object
type.
The helpers fsnotify_obj_{inode,mount} are added to get the concrete
object pointer from abstract object pointer.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The object marks manipulation functions fsnotify_destroy_marks()
fsnotify_find_mark() and their helpers take an argument of type
struct fsnotify_mark_connector __rcu ** to dereference the connector
pointer. use a typedef to describe this type for brevity.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Before changing the arguments of the functions fsnotify_add_mark()
and fsnotify_add_mark_locked(), convert most callers to use a wrapper.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Use fsnotify_foreach_obj_type macros to generalize the code that filters
events by marks mask and ignored_mask.
This is going to be used for adding mark of super block object type.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Use fsnotify_foreach_obj_type macros to generalize the code that filters
events by marks mask and ignored_mask.
This is going to be used for adding mark of super block object type.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Make some code that handles marks of object types inode and vfsmount
generic, so it can handle other object types.
Introduce fsnotify_foreach_obj_type macro to iterate marks by object type
and fsnotify_iter_{should|set}_report_type macros to set/test report_mask.
This is going to be used for adding mark of another object type
(super block mark).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Introduce helpers fsnotify_iter_select_report_types() and
fsnotify_iter_next() to abstract the inode/vfsmount marks merged
list iteration.
This is a preparation patch before generalizing mark list
iteration to more mark object types (i.e. super block marks).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
inode_mark and vfsmount_mark arguments are passed to handle_event()
operation as function arguments as well as on iter_info struct.
The difference is that iter_info struct may contain marks that should
not be handled and are represented as NULL arguments to inode_mark or
vfsmount_mark.
Instead of passing the inode_mark and vfsmount_mark arguments, add
a report_mask member to iter_info struct to indicate which marks should
be handled, versus marks that should only be kept alive during user
wait.
This change is going to be used for passing more mark types
with handle_event() (i.e. super block marks).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
An fsnotify_mark_connector is referencing a single type of object
(either inode or vfsmount). Instead of storing a type mask in
connector->flags, store a single type id in connector->type to
identify the type of object.
When a connector object is detached from the object, its type is set
to FSNOTIFY_OBJ_TYPE_DETACHED and this object is not going to be
reused.
The function fsnotify_clear_marks_by_group() is the only place where
type mask was used, so use type flags instead of type id to this
function.
This change is going to be more convenient when adding a new object
type (super block).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The ignore mask logic in send_to_group() does not match the logic
in fanotify_should_send_event(). In the latter, a vfsmount mark ignore
mask precedes an inode mark mask and in the former, it does not.
That difference may cause events to be sent to fanotify backend for no
reason. Fix the logic in send_to_group() to match that of
fanotify_should_send_event().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When event on child inodes are sent to the parent inode mark and
parent inode mark was not marked with FAN_EVENT_ON_CHILD, the event
will not be delivered to the listener process. However, if the same
process also has a mount mark, the event to the parent inode will be
delivered regadless of the mount mark mask.
This behavior is incorrect in the case where the mount mark mask does
not contain the specific event type. For example, the process adds
a mark on a directory with mask FAN_MODIFY (without FAN_EVENT_ON_CHILD)
and a mount mark with mask FAN_CLOSE_NOWRITE (without FAN_ONDIR).
A modify event on a file inside that directory (and inside that mount)
should not create a FAN_MODIFY event, because neither of the marks
requested to get that event on the file.
Fixes: 1968f5eed5 ("fanotify: use both marks when possible")
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Pull misc filesystem updates from Jan Kara:
"udf, ext2, quota, fsnotify fixes & cleanups:
- udf fixes for handling of media without uid/gid
- udf fixes for some corner cases in parsing of volume recognition
sequence
- improvements of fsnotify handling of ENOMEM
- new ioctl to allow setting of watch descriptor id for inotify (for
checkpoint - restart)
- small ext2, reiserfs, quota cleanups"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Kill an unused extern entry form quota.h
reiserfs: Remove VLA from fs/reiserfs/reiserfs.h
udf: fix potential refcnt problem of nls module
ext2: change return code to -ENOMEM when failing memory allocation
udf: Do not mark possibly inconsistent filesystems as closed
fsnotify: Let userspace know about lost events due to ENOMEM
fanotify: Avoid lost events due to ENOMEM for unlimited queues
udf: Remove never implemented mount options
udf: Update mount option documentation
udf: Provide saner default for invalid uid / gid
udf: Clean up handling of invalid uid/gid
udf: Apply uid/gid mount options also to new inodes & chown
udf: Ignore [ug]id=ignore mount options
udf: Fix handling of Partition Descriptors
udf: Unify common handling of descriptors
udf: Convert descriptor index definitions to enum
udf: Allow volume descriptor sequence to be terminated by unrecorded block
udf: Simplify handling of Volume Descriptor Pointers
udf: Fix off-by-one in volume descriptor sequence length
inotify: Extend ioctl to allow to request id of new watch descriptor
Currently if notification event is lost due to event allocation failing
we ENOMEM, we just silently continue (except for fanotify permission
events where we deny the access). This is undesirable as userspace has
no way of knowing whether the notifications it got are complete or not.
Treat lost events due to ENOMEM the same way as lost events due to queue
overflow so that userspace knows something bad happened and it likely
needs to rescan the filesystem.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fanotify queues of unlimited length do not expect events can be lost.
Since these queues are used for system auditing and other security
related tasks, loosing events can even have security implications.
Currently, since the allocation is small (32-bytes), it cannot fail
however when we start accounting events in memcgs, allocation can start
failing. So avoid loosing events due to failure to allocate memory by
making event allocation use __GFP_NOFAIL.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>