linux

mirror of https://github.com/armbian/linux.git synced 2026-01-06 10:13:00 -08:00

Author	SHA1	Message	Date
Seiji Aguchi	755d4fe465	efi_pstore: Add a sequence counter to a variable name [Issue] Currently, a variable name, which identifies each entry, consists of type, id and ctime. But if multiple events happens in a short time, a second/third event may fail to log because efi_pstore can't distinguish each event with current variable name. [Solution] A reasonable way to identify all events precisely is introducing a sequence counter to the variable name. The sequence counter has already supported in a pstore layer with "oopscount". So, this patch adds it to a variable name. Also, it is passed to read/erase callbacks of platform drivers in accordance with the modification of the variable name. <before applying this patch> a variable name of first event: dump-type0-1-12345678 a variable name of second event: dump-type0-1-12345678 type:0 id:1 ctime:12345678 If multiple events happen in a short time, efi_pstore can't distinguish them because variable names are same among them. <after applying this patch> it can be distinguishable by adding a sequence counter as follows. a variable name of first event: dump-type0-1-1-12345678 a variable name of Second event: dump-type0-1-2-12345678 type:0 id:1 sequence counter: 1(first event), 2(second event) ctime:12345678 In case of a write callback executed in pstore_console_write(), "0" is added to an argument of the write callback because it just logs all kernel messages and doesn't need to care about multiple events. Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Mike Waychison <mikew@google.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2012-11-26 16:07:44 -08:00
Seiji Aguchi	a9efd39cd5	efi_pstore: Add ctime to argument of erase callback [Issue] Currently, a variable name, which is used to identify each log entry, consists of type, id and ctime. But an erase callback does not use ctime. If efi_pstore supported just one log, type and id were enough. However, in case of supporting multiple logs, it doesn't work because it can't distinguish each entry without ctime at erasing time. <Example> As you can see below, efi_pstore can't differentiate first event from second one without ctime. a variable name of first event: dump-type0-1-12345678 a variable name of second event: dump-type0-1-23456789 type:0 id:1 ctime:12345678, 23456789 [Solution] This patch adds ctime to an argument of an erase callback. It works across reboots because ctime of pstore means the date that the record was originally stored. To do this, efi_pstore saves the ctime to variable name at writing time and passes it to pstore at reading time. Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Acked-by: Mike Waychison <mikew@google.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2012-11-26 16:02:12 -08:00
Linus Torvalds	35f95d228e	Merge tag 'for-linus-20121123' of git://git.infradead.org/mtd-2.6 Pull MTD fixes from David Woodhouse: "The most important part of this is that it fixes a regression in Samsung NAND chip detection, introduced by some rework which went into 3.7. The initial fix wasn't quite complete, so it's in two parts. In fact the first part is committed twice (Artem committed his own copy of the same patch) and I've merged Artem's tree into mine which already had that fix. I'd have recommitted that to make it somewhat cleaner, but figured by this point in the release cycle it was better to merge exactly the commits which have been in linux-next. If I'd recommitted, I'd also omit the sparse warning fix. But it's there, and it's harmless — just marking one function as 'static' in onenand code. This also includes a couple more fixes for stable: an AB-BA deadlock in JFFS2, and an invalid range check in slram." * tag 'for-linus-20121123' of git://git.infradead.org/mtd-2.6: mtd: nand: fix Samsung SLC detection regression mtd: nand: fix Samsung SLC NAND identification regression jffs2: Fix lock acquisition order bug in jffs2_write_begin mtd: onenand: Make flexonenand_set_boundary static mtd: slram: invalid checking of absolute end address mtd: ofpart: Fix incorrect NULL check in parse_ofoldpart_partitions() mtd: nand: fix Samsung SLC NAND identification regression	2012-11-23 15:12:17 -10:00
Linus Torvalds	ca6215dfc7	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull reiserfs and ext3 fixes from Jan Kara: "Fixes of reiserfs deadlocks when quotas are enabled (locking there was completely busted by BKL conversion) and also one small ext3 fix in the trim interface." * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext3: Avoid underflow of in ext3_trim_fs() reiserfs: Move quota calls out of write lock reiserfs: Protect reiserfs_quota_write() with write lock reiserfs: Protect reiserfs_quota_on() with write lock reiserfs: Fix lock ordering during remount	2012-11-20 18:48:25 -10:00
Lukas Czerner	ae49eeec78	ext3: Avoid underflow of in ext3_trim_fs() Currently if len argument in ext3_trim_fs() is smaller than one block, the 'end' variable underflow. Avoid that by returning EINVAL if len is smaller than file system block. Also remove useless unlikely(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>	2012-11-19 21:36:12 +01:00
Jan Kara	7af1168693	reiserfs: Move quota calls out of write lock Calls into highlevel quota code cannot happen under the write lock. These calls take dqio_mutex which ranks above write lock. So drop write lock before calling back into quota code. CC: stable@vger.kernel.org # >= 3.0 Signed-off-by: Jan Kara <jack@suse.cz>	2012-11-19 21:34:33 +01:00
Jan Kara	361d94a338	reiserfs: Protect reiserfs_quota_write() with write lock Calls into reiserfs journalling code and reiserfs_get_block() need to be protected with write lock. We remove write lock around calls to high level quota code in the next patch so these paths would suddently become unprotected. CC: stable@vger.kernel.org # >= 3.0 Signed-off-by: Jan Kara <jack@suse.cz>	2012-11-19 21:34:33 +01:00
Jan Kara	b9e06ef2e8	reiserfs: Protect reiserfs_quota_on() with write lock In reiserfs_quota_on() we do quite some work - for example unpacking tail of a quota file. Thus we have to hold write lock until a moment we call back into the quota code. CC: stable@vger.kernel.org # >= 3.0 Signed-off-by: Jan Kara <jack@suse.cz>	2012-11-19 21:34:32 +01:00
Jan Kara	3bb3e1fc47	reiserfs: Fix lock ordering during remount When remounting reiserfs dquot_suspend() or dquot_resume() can be called. These functions take dqonoff_mutex which ranks above write lock so we have to drop it before calling into quota code. CC: stable@vger.kernel.org # >= 3.0 Signed-off-by: Jan Kara <jack@suse.cz>	2012-11-19 21:34:32 +01:00
Al Viro	3587b1b097	fanotify: fix FAN_Q_OVERFLOW case of fanotify_read() If the FAN_Q_OVERFLOW bit set in event->mask, the fanotify event metadata will not contain a valid file descriptor, but copy_event_to_user() didn't check for that, and unconditionally does a fd_install() on the file descriptor. Which in turn will cause a BUG_ON() in __fd_install(). Introduced by commit `352e3b2492` ("fanotify: sanitize failure exits in copy_event_to_user()") Mea culpa - missed that path ;-/ Reported-by: Alex Shi <lkml.alex@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-11-18 09:30:00 -10:00
Linus Torvalds	8d938105e4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc VFS fixes from Al Viro: "Remove a bogus BUG_ON() that can trigger spuriously + alpha bits of do_mount() constification I'd missed during the merge window." This pull request came in a week ago, I missed it for some reason. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill bogus BUG_ON() in do_close_on_exec() missing const in alpha callers of do_mount()	2012-11-18 09:13:48 -10:00
Linus Torvalds	d28d3730fd	Merge tag 'for-linus-v3.7-rc7' of git://oss.sgi.com/xfs/xfs Pull xfs bugfixes from Ben Myers: - fix attr tree double split corruption - fix broken error handling in xfs_vm_writepage - drop buffer io reference when a bad bio is built * tag 'for-linus-v3.7-rc7' of git://oss.sgi.com/xfs/xfs: xfs: drop buffer io reference when a bad bio is built xfs: fix broken error handling in xfs_vm_writepage xfs: fix attr tree double split corruption	2012-11-18 08:29:34 -10:00
Dave Chinner	d69043c42d	xfs: drop buffer io reference when a bad bio is built Error handling in xfs_buf_ioapply_map() does not handle IO reference counts correctly. We increment the b_io_remaining count before building the bio, but then fail to decrement it in the failure case. This leads to the buffer never running IO completion and releasing the reference that the IO holds, so at unmount we can leak the buffer. This leak is captured by this assert failure during unmount: XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273 This is not a new bug - the b_io_remaining accounting has had this problem for a long, long time - it's just very hard to get a zero length bio being built by this code... Further, the buffer IO error can be overwritten on a multi-segment buffer by subsequent bio completions for partial sections of the buffer. Hence we should only set the buffer error status if the buffer is not already carrying an error status. This ensures that a partial IO error on a multi-segment buffer will not be lost. This part of the problem is a regression, however. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-17 09:36:57 -06:00
Dave Chinner	3daed8bc3e	xfs: fix broken error handling in xfs_vm_writepage When we shut down the filesystem, it might first be detected in writeback when we are allocating a inode size transaction. This happens after we have moved all the pages into the writeback state and unlocked them. Unfortunately, if we fail to set up the transaction we then abort writeback and try to invalidate the current page. This then triggers are BUG() in block_invalidatepage() because we are trying to invalidate an unlocked page. Fixing this is a bit of a chicken and egg problem - we can't allocate the transaction until we've clustered all the pages into the IO and we know the size of it (i.e. whether the last block of the IO is beyond the current EOF or not). However, we don't want to hold pages locked for long periods of time, especially while we lock other pages to cluster them into the write. To fix this, we need to make a clear delineation in writeback where errors can only be handled by IO completion processing. That is, once we have marked a page for writeback and unlocked it, we have to report errors via IO completion because we've already started the IO. We may not have submitted any IO, but we've changed the page state to indicate that it is under IO so we must now use the IO completion path to report errors. To do this, add an error field to xfs_submit_ioend() to pass it the error that occurred during the building on the ioend chain. When this is non-zero, mark each ioend with the error and call xfs_finish_ioend() directly rather than building bios. This will immediately push the ioends through completion processing with the error that has occurred. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-17 09:35:42 -06:00
Dave Chinner	42e2976f13	xfs: fix attr tree double split corruption In certain circumstances, a double split of an attribute tree is needed to insert or replace an attribute. In rare situations, this can go wrong, leaving the attribute tree corrupted. In this case, the attr being replaced is the last attr in a leaf node, and the replacement is larger so doesn't fit in the same leaf node. When we have the initial condition of a node format attribute btree with two leaves at index 1 and 2. Call them L1 and L2. The leaf L1 is completely full, there is not a single byte of free space in it. L2 is mostly empty. The attribute being replaced - call it X - is the last attribute in L1. The way an attribute replace is executed is that the replacement attribute - call it Y - is first inserted into the tree, but has an INCOMPLETE flag set on it so that list traversals ignore it. Once this transaction is committed, a second transaction it run to atomically mark Y as COMPLETE and X as INCOMPLETE, so that a traversal will now find Y and skip X. Once that transaction is committed, attribute X is then removed. So, the initial condition is: +--------+ +--------+ \| L1 \| \| L2 \| \| fwd: 2 \|---->\| fwd: 0 \| \| bwd: 0 \|<----\| bwd: 1 \| \| fsp: 0 \| \| fsp: N \| \|--------\| \|--------\| \| attr A \| \| attr 1 \| \|--------\| \|--------\| \| attr B \| \| attr 2 \| \|--------\| \|--------\| .......... .......... \|--------\| \|--------\| \| attr X \| \| attr n \| +--------+ +--------+ So now we go to replace X, and see that L1:fsp = 0 - it is full so we can't insert Y in the same leaf. So we record the the location of attribute X so we can track it for later use, then we split L1 into L1 and L3 and reblance across the two leafs. We end with: +--------+ +--------+ +--------+ \| L1 \| \| L3 \| \| L2 \| \| fwd: 3 \|---->\| fwd: 2 \|---->\| fwd: 0 \| \| bwd: 0 \|<----\| bwd: 1 \|<----\| bwd: 3 \| \| fsp: M \| \| fsp: J \| \| fsp: N \| \|--------\| \|--------\| \|--------\| \| attr A \| \| attr X \| \| attr 1 \| \|--------\| +--------+ \|--------\| \| attr B \| \| attr 2 \| \|--------\| \|--------\| .......... .......... \|--------\| \|--------\| \| attr W \| \| attr n \| +--------+ +--------+ And we track that the original attribute is now at L3:0. We then try to insert Y into L1 again, and find that there isn't enough room because the new attribute is larger than the old one. Hence we have to split again to make room for Y. We end up with this: +--------+ +--------+ +--------+ +--------+ \| L1 \| \| L4 \| \| L3 \| \| L2 \| \| fwd: 4 \|---->\| fwd: 3 \|---->\| fwd: 2 \|---->\| fwd: 0 \| \| bwd: 0 \|<----\| bwd: 1 \|<----\| bwd: 4 \|<----\| bwd: 3 \| \| fsp: M \| \| fsp: J \| \| fsp: J \| \| fsp: N \| \|--------\| \|--------\| \|--------\| \|--------\| \| attr A \| \| attr Y \| \| attr X \| \| attr 1 \| \|--------\| + INCOMP + +--------+ \|--------\| \| attr B \| +--------+ \| attr 2 \| \|--------\| \|--------\| .......... .......... \|--------\| \|--------\| \| attr W \| \| attr n \| +--------+ +--------+ And now we have the new (incomplete) attribute @ L4:0, and the original attribute at L3:0. At this point, the first transaction is committed, and we move to the flipping of the flags. This is where we are supposed to end up with this: +--------+ +--------+ +--------+ +--------+ \| L1 \| \| L4 \| \| L3 \| \| L2 \| \| fwd: 4 \|---->\| fwd: 3 \|---->\| fwd: 2 \|---->\| fwd: 0 \| \| bwd: 0 \|<----\| bwd: 1 \|<----\| bwd: 4 \|<----\| bwd: 3 \| \| fsp: M \| \| fsp: J \| \| fsp: J \| \| fsp: N \| \|--------\| \|--------\| \|--------\| \|--------\| \| attr A \| \| attr Y \| \| attr X \| \| attr 1 \| \|--------\| +--------+ + INCOMP + \|--------\| \| attr B \| +--------+ \| attr 2 \| \|--------\| \|--------\| .......... .......... \|--------\| \|--------\| \| attr W \| \| attr n \| +--------+ +--------+ But that doesn't happen properly - the attribute tracking indexes are not pointing to the right locations. What we end up with is both the old attribute to be removed pointing at L4:0 and the new attribute at L4:1. On a debug kernel, this assert fails like so: XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725 because the new attribute location does not exist. On a production kernel, this goes unnoticed and the code proceeds ahead merrily and removes L4 because it thinks that is the block that is no longer needed. This leaves the hash index node pointing to entries L1, L4 and L2, but only blocks L1, L3 and L2 to exist. Further, the leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free space, and so everything is busted. This corruption is caused by the removal of the old attribute triggering a join - it joins everything correctly but then frees the wrong block. xfs_repair will report something like: bad sibling back pointer for block 4 in attribute fork for inode 131 problem with attribute contents in inode 131 would clear attr fork bad nblocks 8 for inode 131, would reset to 3 bad anextents 4 for inode 131, would reset to 0 The problem lies in the assignment of the old/new blocks for tracking purposes when the double leaf split occurs. The first split tries to place the new attribute inside the current leaf (i.e. "inleaf == true") and moves the old attribute (X) to the new block. This sets up the old block/index to L1:X, and newly allocated block to L3:0. It then moves attr X to the new block and tries to insert attr Y at the old index. That fails, so it splits again. With the second split, the rebalance ends up placing the new attr in the second new block - L4:0 - and this is where the code goes wrong. What is does is it sets both the new and old block index to the second new block. Hence it inserts attr Y at the right place (L4:0) but overwrites the current location of the attr to replace that is held in the new block index (currently L3:0). It over writes it with L4:1 - the index we later assert fail on. Hopefully this table will show this in a foramt that is a bit easier to understand: Split old attr index new attr index vanilla patched vanilla patched before 1st L1:26 L1:26 N/A N/A after 1st L3:0 L3:0 L1:26 L1:26 after 2nd L4:0 L3:0 L4:1 L4:0 ^^^^ ^^^^ wrong wrong The fix is surprisingly simple, for all this analysis - just stop the rebalance on the out-of leaf case from overwriting the new attr index - it's already correct for the double split case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-11-17 09:34:13 -06:00
David Rientjes	fa0cbbf145	mm, oom: reintroduce /proc/pid/oom_adj This is mostly a revert of `01dc52ebdf` ("oom: remove deprecated oom_adj") from Davidlohr Bueso. It reintroduces /proc/pid/oom_adj for backwards compatibility with earlier kernels. It simply scales the value linearly when /proc/pid/oom_score_adj is written. The major difference is that its scheduled removal is no longer included in Documentation/feature-removal-schedule.txt. We do warn users with a single printk, though, to suggest the more powerful and supported /proc/pid/oom_score_adj interface. Reported-by: Artem S. Tashkinov <t.artem@lycos.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-11-16 10:15:35 -08:00
Linus Torvalds	ce95a36bb9	Merge tag 'upstream-3.7-rc6' of git://git.infradead.org/linux-ubifs Pull UBIFS fixes from Artem Bityutskiy: "Two patches which fix a problem reported by several people in the past, but only fixed now because no one gave enough material for debugging. Anyway, these fix the problem that sometimes after a power cut the file-system is not mountable with the following symptom: grab_empty_leb: could not find an empty LEB The fixes make the file-system mountable again." * tag 'upstream-3.7-rc6' of git://git.infradead.org/linux-ubifs: UBIFS: fix mounting problems after power cuts UBIFS: introduce categorized lprops counter	2012-11-15 11:28:43 -08:00
Colin Ian King	70a6f46d7b	pstore: Fix NULL pointer dereference in console writes Passing a NULL id causes a NULL pointer deference in writers such as erst_writer and efi_pstore_write because they expect to update this id. Pass a dummy id instead. This avoids a cascade of oopses caused when the initial pstore_console_write passes a null which in turn causes writes to the console causing further oopses in subsequent pstore_console_write calls. Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>	2012-11-14 18:30:21 -08:00
Al Viro	5a8477660d	kill bogus BUG_ON() in do_close_on_exec() It can be legitimately triggered via procfs access. Now, at least 2 of 3 of get_files_struct() callers in procfs are useless, but when and if we get rid of those we can always add WARN_ON() here. BUG_ON() at that spot is simply wrong. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-11-12 01:19:02 -05:00
Linus Torvalds	affd9a8dbc	Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Jeff Layton. * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: cifs: Do not lookup hashed negative dentry in cifs_atomic_open cifs: fix potential buffer overrun in cifs.idmap handling code	2012-11-10 06:59:35 +01:00
Thomas Betker	5ffd3412ae	jffs2: Fix lock acquisition order bug in jffs2_write_begin jffs2_write_begin() first acquires the page lock, then f->sem. This causes an AB-BA deadlock with jffs2_garbage_collect_live(), which first acquires f->sem, then the page lock: jffs2_garbage_collect_live mutex_lock(&f->sem) (A) jffs2_garbage_collect_dnode jffs2_gc_fetch_page read_cache_page_async do_read_cache_page lock_page(page) (B) jffs2_write_begin grab_cache_page_write_begin find_lock_page lock_page(page) (B) mutex_lock(&f->sem) (A) We fix this by restructuring jffs2_write_begin() to take f->sem before the page lock. However, we make sure that f->sem is not held when calling jffs2_reserve_space(), as this is not permitted by the locking rules. The deadlock above was observed multiple times on an SoC with a dual ARMv7 (Cortex-A9), running the long-term 3.4.11 kernel; it occurred when using scp to copy files from a host system to the ARM target system. The fix was heavily tested on the same target system. Cc: stable@vger.kernel.org Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com> Acked-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2012-11-09 17:02:50 +02:00
Linus Torvalds	ce6d841e9c	Merge branch 'akpm' (Fixes from Andrew) Merge misc fixes from Andrew Morton: "Five fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (5 patches) h8300: add missing L1_CACHE_SHIFT mm: bugfix: set current->reclaim_state to NULL while returning from kswapd() fanotify: fix missing break revert "epoll: support for disabling items, and a self-test app" checkpatch: improve network block comment style checking	2012-11-09 06:53:02 +01:00
Linus Torvalds	a601e63717	Merge tag 'for-linus-v3.7-rc5' of git://oss.sgi.com/xfs/xfs Pull xfs bugfixes from Ben Myers: - fix for large transactions spanning multiple iclog buffers - zero the allocation_args structure on the stack before using it to determine whether to use a worker for allocation - move allocation stack switch to xfs_bmapi_allocate in order to prevent deadlock on AGF buffers - growfs no longer reads in garbage for new secondary superblocks - silence a build warning - ensure that invalid buffers never get written to disk while on free list - don't vmap inode cluster buffers during free - fix buffer shutdown reference count mismatch - fix reading of wrapped log data * tag 'for-linus-v3.7-rc5' of git://oss.sgi.com/xfs/xfs: xfs: fix reading of wrapped log data xfs: fix buffer shudown reference count mismatch xfs: don't vmap inode cluster buffers during free xfs: invalidate allocbt blocks moved to the free list xfs: silence uninitialised f.file warning. xfs: growfs: don't read garbage for new secondary superblocks xfs: move allocation stack switch up to xfs_bmapi_allocate xfs: introduce XFS_BMAPI_STACK_SWITCH xfs: zero allocation_args on the kernel stack xfs: only update the last_sync_lsn when a transaction completes	2012-11-09 06:42:51 +01:00
Eric Paris	848561d368	fanotify: fix missing break Anders Blomdell noted in 2010 that Fanotify lost events and provided a test case. Eric Paris confirmed it was a bug and posted a fix to the list https://groups.google.com/forum/?fromgroups=#!topic/linux.kernel/RrJfTfyW2BE but never applied it. Repeated attempts over time to actually get him to apply it have never had a reply from anyone who has raised it So apply it anyway Signed-off-by: Alan Cox <alan@linux.intel.com> Reported-by: Anders Blomdell <anders.blomdell@control.lth.se> Cc: Eric Paris <eparis@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-11-09 06:41:47 +01:00
Andrew Morton	a80a6b85b4	revert "epoll: support for disabling items, and a self-test app" Revert commit `03a7beb55b` ("epoll: support for disabling items, and a self-test app") pending resolution of the issues identified by Michael Kerrisk, copied below. We'll revisit this for 3.8. : I've taken a look at this patch as it currently stands in 3.7-rc1, and : done a bit of testing. (By the way, the test program : tools/testing/selftests/epoll/test_epoll.c does not compile...) : : There are one or two places where the behavior seems a little strange, : so I have a question or two at the end of this mail. But other than : that, I want to check my understanding so that the interface can be : correctly documented. : : Just to go though my understanding, the problem is the following : scenario in a multithreaded application: : : 1. Multiple threads are performing epoll_wait() operations, : and maintaining a user-space cache that contains information : corresponding to each file descriptor being monitored by : epoll_wait(). : : 2. At some point, a thread wants to delete (EPOLL_CTL_DEL) : a file descriptor from the epoll interest list, and : delete the corresponding record from the user-space cache. : : 3. The problem with (2) is that some other thread may have : previously done an epoll_wait() that retrieved information : about the fd in question, and may be in the middle of using : information in the cache that relates to that fd. Thus, : there is a potential race. : : 4. The race can't solved purely in user space, because doing : so would require applying a mutex across the epoll_wait() : call, which would of course blow thread concurrency. : : Right? : : Your solution is the EPOLL_CTL_DISABLE operation. I want to : confirm my understanding about how to use this flag, since : the description that has accompanied the patches so far : has been a bit sparse : : 0. In the scenario you're concerned about, deleting a file : descriptor means (safely) doing the following: : (a) Deleting the file descriptor from the epoll interest list : using EPOLL_CTL_DEL : (b) Deleting the corresponding record in the user-space cache : : 1. It's only meaningful to use this EPOLL_CTL_DISABLE in : conjunction with EPOLLONESHOT. : : 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in : conjunction is a logical error. : : 3. The correct way to code multithreaded applications using : EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows: : : a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should : should EPOLLONESHOT. : : b. When a thread wants to delete a file descriptor, it : should do the following: : : [1] Call epoll_ctl(EPOLL_CTL_DISABLE) : [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE) : was zero, then the file descriptor can be safely : deleted by the thread that made this call. : [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY, : then the descriptor is in use. In this case, the calling : thread should set a flag in the user-space cache to : indicate that the thread that is using the descriptor : should perform the deletion operation. : : Is all of the above correct? : : The implementation depends on checking on whether : (events & ~EP_PRIVATE_BITS) == 0 : This replies on the fact that EPOLL_CTL_AD and EPOLL_CTL_MOD always : set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT : causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be : cleared. : : A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE : is only useful in conjunction with EPOLLONESHOT. However, as things : stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does : not have EPOLLONESHOT set in 'events' This results in the following : (slightly surprising) behavior: : : (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0 : (the indicator that the file descriptor can be safely deleted). : (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY. : : This doesn't seem particularly useful, and in fact is probably an : indication that the user made a logic error: they should only be using : epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which : EPOLLONESHOT was set in 'events'. If that is correct, then would it : not make sense to return an error to user space for this case? Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Paton J. Lewis" <palewis@adobe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-11-09 06:41:46 +01:00

1 2 3 4 5 ...

29199 Commits