Use the events hint now sent by some devices to avoid unnecessary wakeups
for events that are of no interest to the caller. This code handles
devices that send keyed events, devices that do not, and even devices
that send them only some of the time.
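The filtering idea can be sketched in plain C. This is a hypothetical
userspace model, not the kernel code: the function name and the convention
that a zero key means "no hint was sent" are assumptions made for
illustration only.

```c
#include <sys/epoll.h>

/*
 * Hypothetical model of the wakeup filter (not the kernel code).
 * key:      event mask delivered with the wakeup, 0 if the device sent no hint
 * interest: event mask the caller registered through epoll_ctl()
 * Returns 1 if the waiter should be woken, 0 if the wakeup can be skipped.
 */
static int wakeup_is_relevant(unsigned long key, unsigned long interest)
{
	if (key == 0)
		return 1;	/* no hint: wake up and let f_op->poll() decide */
	return (key & interest) != 0;
}
```

A device that never sends a key always takes the first branch, so it keeps
the old behavior; only keyed wakeups can be skipped.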
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: William Lee Irwin III <wli@movementarian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ep_modify() doesn't need to set event.data from within the ep->lock
spinlock as the comment suggests. The only place event.data is used is
ep_send_events_proc(), and this is protected by ep->mtx instead of
ep->lock. Also update the comment for mutex_lock() at the top of
ep_scan_ready_list(), which mentions epoll_ctl(EPOLL_CTL_DEL) but not
epoll_ctl(EPOLL_CTL_MOD).
ep_modify() can also use spin_lock_irq() instead of spin_lock_irqsave().
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xchg in ep_unregister_pollwait() is unnecessary because it is protected by
either epmutex or ep->mtx (the same protection as ep_remove()).
If xchg was necessary, it would be insufficient to protect against
problems: if multiple concurrent calls to ep_unregister_pollwait() were
possible then a second caller that returns without doing anything because
nwait == 0 could return before the waitqueues are removed by the first
caller, which looks like it could lead to problematic races with
ep_poll_callback().
So remove xchg and add comments about the locking.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If epoll_wait returns -EFAULT, the event that was being returned when the
fault was encountered will be forgotten. This is not a big deal since
EFAULT will happen only if a buggy userspace program passes in a bad
address, in which case what happens later usually doesn't matter.
However, it is easy to remember the event for later, and this patch makes
a simple change to do that.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ep_call_nested() (formerly ep_poll_safewake()) uses "current" (without
dereferencing it) to detect callback recursion, but it may be called from
irq context where the use of current is generally discouraged. It would
be better to use get_cpu() and put_cpu() to detect the callback recursion.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a bug inside the epoll's f_op->poll() code, that returns POLLIN even
though there are no actual ready monitored fds. The bug shows up if you
add an epoll fd inside another fd container (poll, select, epoll).
The problem is that the callback-based wakeups used by epoll do not carry
any information about the events that actually happened (patches will
follow to fix this). So the callback code, since it can't call the file*
->poll() inside the callback, chains the file* into a ready-list.
Suppose you added an fd with EPOLLOUT only, and some data shows up on
the fd: the file* mapped by the fd will be added into the ready-list (via
wakeup callback). During normal epoll_wait() use, this condition is
sorted out at the time we're actually able to call the file*'s
f_op->poll().
Inside the old epoll's f_op->poll() though, only a quick check
!list_empty(ready-list) was performed, and this could have led to
reporting POLLIN even though no ready fds would show up at a following
epoll_wait(). In order to correctly report the ready status for an epoll
fd, the ready-list must be checked to see if any really available fd+event
would be ready in a following epoll_wait().
This operation (calling f_op->poll() from inside f_op->poll()) must, like
wakeups, be handled with care because epoll fds can be added to other
epoll fds.
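The EPOLLOUT scenario described above can be reduced to a few lines. This
is only a sketch with a made-up helper name, separate from the full test
program that follows. It watches a pipe read side for EPOLLOUT, makes data
arrive, and then asks poll(2) whether the epoll fd looks readable. On a
fixed kernel no POLLIN is expected.

```c
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns the POLLIN bit reported by poll(2) on the epoll fd,
 * or -1 on setup failure. 0 is the expected result on a fixed kernel.
 */
static int epoll_fd_spurious_pollin(void)
{
	struct epoll_event evt = { .events = EPOLLOUT };
	struct pollfd pfd;
	int pfds[2], epfd, res = -1;

	if (pipe(pfds))
		return -1;
	epfd = epoll_create(1);
	if (epfd < 0)
		goto out;
	/* Watch the read side for EPOLLOUT only, so input is not of interest. */
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfds[0], &evt) < 0)
		goto out;
	/* Data arrives: a wakeup may fire, but no watched event is ready. */
	if (write(pfds[1], "w", 1) != 1)
		goto out;

	pfd.fd = epfd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	if (poll(&pfd, 1, 0) < 0)
		goto out;
	res = pfd.revents & POLLIN;
out:
	if (epfd >= 0)
		close(epfd);
	close(pfds[0]);
	close(pfds[1]);
	return res;
}
```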
Test code:
/*
* epoll_test by Davide Libenzi (Simple code to test epoll internals)
* Copyright (C) 2008 Davide Libenzi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*
* Davide Libenzi <davidel@xmailserver.org>
*
*/
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <limits.h>
#include <poll.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#define EPWAIT_TIMEO (1 * 1000)
#ifndef POLLRDHUP
#define POLLRDHUP 0x2000
#endif
#define EPOLL_MAX_CHAIN 100L
#define EPOLL_TF_LOOP (1 << 0)
struct epoll_test_cfg {
	long size;
	long flags;
};

static int xepoll_create(int n) {
	int epfd;
	if ((epfd = epoll_create(n)) == -1) {
		perror("epoll_create");
		exit(2);
	}
	return epfd;
}

static void xepoll_ctl(int epfd, int cmd, int fd, struct epoll_event *evt) {
	if (epoll_ctl(epfd, cmd, fd, evt) < 0) {
		perror("epoll_ctl");
		exit(3);
	}
}

static void xpipe(int *fds) {
	if (pipe(fds)) {
		perror("pipe");
		exit(4);
	}
}

static pid_t xfork(void) {
	pid_t pid;
	if ((pid = fork()) == (pid_t) -1) {
		perror("fork");
		exit(5);
	}
	return pid;
}

static int run_forked_proc(int (*proc)(void *), void *data) {
	int status;
	pid_t pid;
	if ((pid = xfork()) == 0)
		exit((*proc)(data));
	if (waitpid(pid, &status, 0) != pid) {
		perror("waitpid");
		return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status): -2;
}

static int check_events(int fd, int timeo) {
	struct pollfd pfd;
	fprintf(stdout, "Checking events for fd %d\n", fd);
	memset(&pfd, 0, sizeof(pfd));
	pfd.fd = fd;
	pfd.events = POLLIN | POLLOUT;
	if (poll(&pfd, 1, timeo) < 0) {
		perror("poll()");
		return 0;
	}
	if (pfd.revents & POLLIN)
		fprintf(stdout, "\tPOLLIN\n");
	if (pfd.revents & POLLOUT)
		fprintf(stdout, "\tPOLLOUT\n");
	if (pfd.revents & POLLERR)
		fprintf(stdout, "\tPOLLERR\n");
	if (pfd.revents & POLLHUP)
		fprintf(stdout, "\tPOLLHUP\n");
	if (pfd.revents & POLLRDHUP)
		fprintf(stdout, "\tPOLLRDHUP\n");
	return pfd.revents;
}

static int epoll_test_tty(void *data) {
	int epfd, ifd = fileno(stdin), res;
	struct epoll_event evt;
	if (check_events(ifd, 0) != POLLOUT) {
		fprintf(stderr, "Something is cooking on STDIN (%d)\n", ifd);
		return 1;
	}
	epfd = xepoll_create(1);
	fprintf(stdout, "Created epoll fd (%d)\n", epfd);
	memset(&evt, 0, sizeof(evt));
	evt.events = EPOLLIN;
	xepoll_ctl(epfd, EPOLL_CTL_ADD, ifd, &evt);
	if (check_events(epfd, 0) & POLLIN) {
		res = epoll_wait(epfd, &evt, 1, 0);
		if (res == 0) {
			fprintf(stderr, "Epoll fd (%d) is ready when it shouldn't!\n",
				epfd);
			return 2;
		}
	}
	return 0;
}

static int epoll_wakeup_chain(void *data) {
	struct epoll_test_cfg *tcfg = data;
	int i, res, epfd, bfd, nfd, pfds[2];
	pid_t pid;
	struct epoll_event evt;
	memset(&evt, 0, sizeof(evt));
	evt.events = EPOLLIN;
	epfd = bfd = xepoll_create(1);
	for (i = 0; i < tcfg->size; i++) {
		nfd = xepoll_create(1);
		xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
		bfd = nfd;
	}
	xpipe(pfds);
	if (tcfg->flags & EPOLL_TF_LOOP)
	{
		xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
		/*
		 * If we're testing for a loop, we want the wakeup
		 * triggered by the write to the pipe done in the child
		 * process to trigger a fake event. So we add the pipe
		 * read side with EPOLLOUT events. This will trigger
		 * an addition to the ready-list, but no real events
		 * will be there. The epoll kernel code will then proceed
		 * to call f_op->poll() of the epfd, triggering the
		 * loop we want to test.
		 */
		evt.events = EPOLLOUT;
	}
	xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);
	/*
	 * The pipe write must come after the poll(2) call inside
	 * check_events(). This tests the nested wakeup code in
	 * fs/eventpoll.c:ep_poll_safewake().
	 * By having check_events() (hence poll(2)) happen first,
	 * we have the poll wait queue filled up, and the write(2) in the
	 * child will trigger the wakeup chain.
	 */
	if ((pid = xfork()) == 0) {
		sleep(1);
		write(pfds[1], "w", 1);
		exit(0);
	}
	res = check_events(epfd, 2000) & POLLIN;
	if (waitpid(pid, NULL, 0) != pid) {
		perror("waitpid");
		return -1;
	}
	return res;
}

static int epoll_poll_chain(void *data) {
	struct epoll_test_cfg *tcfg = data;
	int i, res, epfd, bfd, nfd, pfds[2];
	pid_t pid;
	struct epoll_event evt;
	memset(&evt, 0, sizeof(evt));
	evt.events = EPOLLIN;
	epfd = bfd = xepoll_create(1);
	for (i = 0; i < tcfg->size; i++) {
		nfd = xepoll_create(1);
		xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
		bfd = nfd;
	}
	xpipe(pfds);
	if (tcfg->flags & EPOLL_TF_LOOP)
	{
		xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
		/*
		 * If we're testing for a loop, we want the wakeup
		 * triggered by the write to the pipe done in the child
		 * process to trigger a fake event. So we add the pipe
		 * read side with EPOLLOUT events. This will trigger
		 * an addition to the ready-list, but no real events
		 * will be there. The epoll kernel code will then proceed
		 * to call f_op->poll() of the epfd, triggering the
		 * loop we want to test.
		 */
		evt.events = EPOLLOUT;
	}
	xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);
	/*
	 * The pipe write must come before the poll(2) call inside
	 * check_events(). This tests the nested f_op->poll() calls code in
	 * fs/eventpoll.c:ep_eventpoll_poll().
	 * By having the pipe write(2) happen first, we make the kernel
	 * epoll code load the ready lists, and the following poll(2)
	 * done inside check_events() will test the nested poll code in
	 * ep_eventpoll_poll().
	 */
	if ((pid = xfork()) == 0) {
		write(pfds[1], "w", 1);
		exit(0);
	}
	sleep(1);
	res = check_events(epfd, 1000) & POLLIN;
	if (waitpid(pid, NULL, 0) != pid) {
		perror("waitpid");
		return -1;
	}
	return res;
}

int main(int ac, char **av) {
	int error;
	struct epoll_test_cfg tcfg;
	fprintf(stdout, "\n********** Testing TTY events\n");
	error = run_forked_proc(epoll_test_tty, NULL);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);
	tcfg.size = 3;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing short wakeup chain\n");
	error = run_forked_proc(epoll_wakeup_chain, &tcfg);
	fprintf(stdout, error == POLLIN ?
		"********** OK\n": "********** FAIL (%d)\n", error);
	tcfg.size = EPOLL_MAX_CHAIN;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing long wakeup chain (HOLD ON)\n");
	error = run_forked_proc(epoll_wakeup_chain, &tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);
	tcfg.size = 3;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing short poll chain\n");
	error = run_forked_proc(epoll_poll_chain, &tcfg);
	fprintf(stdout, error == POLLIN ?
		"********** OK\n": "********** FAIL (%d)\n", error);
	tcfg.size = EPOLL_MAX_CHAIN;
	tcfg.flags = 0;
	fprintf(stdout, "\n********** Testing long poll chain (HOLD ON)\n");
	error = run_forked_proc(epoll_poll_chain, &tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);
	tcfg.size = 3;
	tcfg.flags = EPOLL_TF_LOOP;
	fprintf(stdout, "\n********** Testing loopy wakeup chain (HOLD ON)\n");
	error = run_forked_proc(epoll_wakeup_chain, &tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);
	tcfg.size = 3;
	tcfg.flags = EPOLL_TF_LOOP;
	fprintf(stdout, "\n********** Testing loopy poll chain (HOLD ON)\n");
	error = run_forked_proc(epoll_poll_chain, &tcfg);
	fprintf(stdout, error == 0 ?
		"********** OK\n": "********** FAIL (%d)\n", error);
	return 0;
}
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock. For now,
epoll remains the only user, but a future patch will use it to protect
f_flags as well.
Cc: Davide Libenzi <davidel@xmailserver.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Linus suggested to put limits where the money is, and max_user_watches
already does that w/out the need of max_user_instances. That has the
advantage to mitigate the potential DoS while allowing pretty generous
default behavior.
Allowing up to 4% of low memory (per user) to be allocated in epoll watches,
we have:
LOMEM    MAX_WATCHES (per user)
512MB    ~178000
1GB      ~356000
2GB      ~712000
A box with 512MB of lomem will have a hard time hitting 180K watches, as
socket buffer math teaches us. No more max_user_instances limits then.
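The arithmetic behind the table can be approximated as below. This is only
a sketch: the function name is made up, and the per-watch cost of 120
bytes is an assumption back-derived from the table, not the size the
kernel actually uses.

```c
/*
 * Assumed per-watch memory cost in bytes. Back-derived from the table
 * in the text above; NOT the kernel's real structure size.
 */
#define WATCH_COST_BYTES 120ULL

/* 4% of low memory is 1/25 of it; divide that budget by the cost per watch. */
static unsigned long long max_user_watches_estimate(unsigned long long lowmem_bytes)
{
	return lowmem_bytes / 25 / WATCH_COST_BYTES;
}
```

With this assumed cost, 512MB of low memory lands in the same range as the
first table row, and the estimate scales linearly with low memory.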
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Bron Gondwana <brong@fastmail.fm>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It had been thought that the per-user file descriptor limit would also
limit the resources that a normal user can request via the epoll
interface. Vegard Nossum reported a very simple program (a modified
version attached) that lets a normal user request a pretty large
amount of kernel memory while staying well within its maximum number of
fds. To solve this problem, default limits are now imposed, and /proc based
configuration has been introduced. A new directory has been created,
named /proc/sys/fs/epoll/, and inside it there are two configuration
points:
max_user_instances = Maximum number of devices - per user
max_user_watches = Maximum number of "watched" fds - per user
The current default for "max_user_watches" limits the memory used by epoll
to store "watches" to 1/32 of the amount of low RAM. As an example, a
256MB 32bit machine will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users. The
default value for "max_user_instances" is set to 128, which should be
enough too.
This also changes the userspace, because a new error code can now come out
from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
listed, so that should be ok.
[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit f337b9c583 ("epoll: drop unnecessary test") Thomas found that
there is an unnecessary (always true) test in ep_send_events(). The
callback never inserts into
->rdllink while the send loop is performed, and also does the
~EP_PRIVATE_BITS test. Given we're holding the mutex during this time,
the conditions tested inside the loop are always true.
HOWEVER.
The test "!ep_is_linked(&epi->rdllink)" wasn't there because we insert
into ->rdllink, but because the send-events loop might terminate before
the whole list is scanned (-EFAULT).
In such cases, when the loop terminates early, and when a (leftover)
file received an event while we're performing the lockless loop, we need
this test to avoid double-inserting the epoll items. The list_splice()
done a few steps below will correctly re-insert the ones that were left
on "txlist".
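Why the "is it already linked" test matters can be shown with a tiny
userspace list. This is a sketch only: the list type and helper names
below are made up for illustration and merely resemble the kernel's list
primitives.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modelled on list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void node_init(struct list_node *n)
{
	n->prev = n->next = n;
}

/* A node that points to itself is not on any list. */
static int node_is_linked(const struct list_node *n)
{
	return n->next != n;
}

static void node_add_tail(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Requeue an item on the ready list unless it is already queued. */
static int requeue_ready(struct list_node *item, struct list_node *ready)
{
	if (node_is_linked(item))
		return 0;	/* a second insert would corrupt the list */
	node_add_tail(item, ready);
	return 1;
}

static size_t list_length(const struct list_node *head)
{
	const struct list_node *p;
	size_t n = 0;

	for (p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Try to queue the same item twice and report the resulting list length. */
static size_t requeue_twice_length(void)
{
	struct list_node ready, item;

	node_init(&ready);
	node_init(&item);
	requeue_ready(&item, &ready);
	requeue_ready(&item, &ready);
	return list_length(&ready);
}
```

The second requeue_ready() call is a no-op because the item is already
linked, which is the role the restored test plays for leftover items.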
This should fix the kernel.org bugzilla entry 11831.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas found that there is an unnecessary (always true) test in
ep_send_events(). The callback never inserts into ->rdllink while the
send loop is performed, and also does the ~EP_PRIVATE_BITS test. Given
we're holding the mutex during this time, the conditions tested inside the
loop are always true. This patch drops the test done inside the
re-insertion loop.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds tests that ensure the boundary conditions for the various
constants introduced in the previous patches are met. No code is generated.
[akpm@linux-foundation.org: fix alpha]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the new epoll_create2 syscall. It extends the old epoll_create
syscall by one parameter which is meant to hold a flag value. In this
patch the only flag supported is EPOLL_CLOEXEC, which causes the close-on-exec
flag for the returned file descriptor to be set.
A new name EPOLL_CLOEXEC is introduced, which in this implementation must
have the same value as O_CLOEXEC.
The following test must be adjusted for architectures other than x86 and
x86-64 and in case the syscall numbers changed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#ifndef __NR_epoll_create2
# ifdef __x86_64__
# define __NR_epoll_create2 291
# elif defined __i386__
# define __NR_epoll_create2 329
# else
# error "need __NR_epoll_create2"
# endif
#endif
#define EPOLL_CLOEXEC O_CLOEXEC
int
main (void)
{
  int fd = syscall (__NR_epoll_create2, 1, 0);
  if (fd == -1)
    {
      puts ("epoll_create2(0) failed");
      return 1;
    }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if (coe & FD_CLOEXEC)
    {
      puts ("epoll_create2(0) set close-on-exec flag");
      return 1;
    }
  close (fd);
  fd = syscall (__NR_epoll_create2, 1, EPOLL_CLOEXEC);
  if (fd == -1)
    {
      puts ("epoll_create2(EPOLL_CLOEXEC) failed");
      return 1;
    }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1)
    {
      puts ("fcntl failed");
      return 1;
    }
  if ((coe & FD_CLOEXEC) == 0)
    {
      puts ("epoll_create2(EPOLL_CLOEXEC) does not set close-on-exec flag");
      return 1;
    }
  close (fd);
  puts ("OK");
  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
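For readers on a current system, the same check can be sketched without
raw syscall numbers. This assumes a C library that exposes the flag-taking
call as epoll_create1(), which is not the name used in this patch. The
helper name is made up for illustration.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Creates an epoll fd with the given flags and reports whether
 * close-on-exec is set on it: 1 if set, 0 if not, -1 on error.
 */
static int epoll_fd_has_cloexec(int flags)
{
	int fd = epoll_create1(flags);
	int coe;

	if (fd == -1)
		return -1;
	coe = fcntl(fd, F_GETFD);
	close(fd);
	if (coe == -1)
		return -1;
	return (coe & FD_CLOEXEC) != 0;
}
```

Calling it with 0 is expected to report the flag clear, and calling it
with EPOLL_CLOEXEC is expected to report the flag set.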
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch just extends the anon_inode_getfd interface to take an additional
parameter with a flag value. The flag value is passed on to
get_unused_fd_flags in anticipation of its use with the O_CLOEXEC flag.
No actual semantic changes here; the changed callers all pass 0 for now.
[akpm@linux-foundation.org: KVM fix]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a) none of the callers even looks at inode or file returned by anon_inode_getfd()
b) any caller that would try to look at those would be racy, since by the time
it returns we might have raced with close() from another thread and that
file would be pining for fjords.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but
rb_set_parent() accesses node's pointer in its code. This creates a
warning in kmemcheck (reported by Vegard Nossum) about an uninitialized
memory access. The warning is harmless since the following rb-tree node
insert is going to overwrite the node data. In any case I think it's
better to not have that happening at all, and fix it by simplifying the
code to get rid of a few lines that became superfluous after the previous
epoll changes.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote:
> I remember I talked with Arjan about this time ago. Basically, since 1)
> you can drop an epoll fd inside another epoll fd 2) callback-based wakeups
> are used, you can see a wake_up() from inside another wake_up(), but they
> will never refer to the same lock instance.
> Think about:
>
> dfd = socket(...);
> efd1 = epoll_create();
> efd2 = epoll_create();
> epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
> epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
>
> When a packet arrives to the device underneath "dfd", the net code will
> issue a wake_up() on its poll wake list. Epoll (efd1) has installed a
> callback wakeup entry on that queue, and the wake_up() performed by the
> "dfd" net code will end up in ep_poll_callback(). At this point epoll
> (efd1) notices that it may have some event ready, so it needs to wake up
> the waiters on its poll wait list (efd2). So it calls ep_poll_safewake()
> that ends up in another wake_up(), after having checked about the
> recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to
> avoid stack blasting. Never hit the same queue, to avoid loops like:
>
> epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
> epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
> epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
> epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
>
> The code "if (tncur->wq == wq || ..." prevents re-entering the same
> queue/lock.
Since the epoll code is very careful to not nest same-instance locks,
allow the recursion.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>