This patch fixes a compile warning in lowcomms-tcp.c indicating that
kmem_cache_t is deprecated.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* master.kernel.org:/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (73 commits)
[DLM] Clean up lowcomms
[GFS2] Change gfs2_fsync() to use write_inode_now()
[GFS2] Fix indent in recovery.c
[GFS2] Don't flush everything on fdatasync
[GFS2] Add a comment about reading the super block
[GFS2] Mount problem with the GFS2 code
[GFS2] Remove gfs2_check_acl()
[DLM] fix format warnings in rcom.c and recoverd.c
[GFS2] lock function parameter
[DLM] don't accept replies to old recovery messages
[DLM] fix size of STATUS_REPLY message
[GFS2] fs/gfs2/log.c:log_bmap() fix printk format warning
[DLM] fix add_requestqueue checking nodes list
[GFS2] Fix recursive locking in gfs2_getattr
[GFS2] Fix recursive locking in gfs2_permission
[GFS2] Reduce number of arguments to meta_io.c:getbuf()
[GFS2] Move gfs2_meta_syncfs() into log.c
[GFS2] Fix journal flush problem
[GFS2] mark_inode_dirty after write to stuffed file
[GFS2] Fix glock ordering on inode creation
...
Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:
#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#
set -e
for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done
The script was run like this
sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes up most of the things pointed out by akpm and Pavel Machek
with comments below indicating why some things have been left:
Andrew Morton wrote:
>
>> +static struct nodeinfo *nodeid2nodeinfo(int nodeid, gfp_t alloc)
>> +{
>> + struct nodeinfo *ni;
>> + int r;
>> + int n;
>> +
>> + down_read(&nodeinfo_lock);
>
> Given that this function can sleep, I wonder if `alloc' is useful.
>
> I see lots of callers passing in a literal "0" for `alloc'. That's in fact
> a secret (GFP_ATOMIC & ~__GFP_HIGH). I doubt if that's what you really
> meant. Particularly as the code could at least have used __GFP_WAIT (aka
> GFP_NOIO) which is much, much more reliable than "0". In fact "0" is the
> least reliable mode possible.
>
> IOW, this is all bollixed up.
When 0 is passed into nodeid2nodeinfo the function does not try to allocate a
new structure at all. it's an indication that the caller only wants the nodeinfo
struct for that nodeid if there actually is one in existance.
I've tidied the function itself so it's more obvious, (and tidier!)
>> +/* Data received from remote end */
>> +static int receive_from_sock(void)
>> +{
>> + int ret = 0;
>> + struct msghdr msg;
>> + struct kvec iov[2];
>> + unsigned len;
>> + int r;
>> + struct sctp_sndrcvinfo *sinfo;
>> + struct cmsghdr *cmsg;
>> + struct nodeinfo *ni;
>> +
>> + /* These two are marginally too big for stack allocation, but this
>> + * function is (currently) only called by dlm_recvd so static should be
>> + * OK.
>> + */
>> + static struct sockaddr_storage msgname;
>> + static char incmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
>
> whoa. This is globally singly-threaded code??
Yes. it is only ever run in the context of dlm_recvd.
>>
>> +static void initiate_association(int nodeid)
>> +{
>> + struct sockaddr_storage rem_addr;
>> + static char outcmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
>
> Another static buffer to worry about. Globally singly-threaded code?
Yes. Only ever called by dlm_sendd.
>> +
>> +/* Send a message */
>> +static int send_to_sock(struct nodeinfo *ni)
>> +{
>> + int ret = 0;
>> + struct writequeue_entry *e;
>> + int len, offset;
>> + struct msghdr outmsg;
>> + static char outcmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
>
> Singly-threaded?
Yep.
>>
>> +static void dealloc_nodeinfo(void)
>> +{
>> + int i;
>> +
>> + for (i=1; i<=max_nodeid; i++) {
>> + struct nodeinfo *ni = nodeid2nodeinfo(i, 0);
>> + if (ni) {
>> + idr_remove(&nodeinfo_idr, i);
>
> Didn't that need locking?
Not. it's only ever called at DLM shutdown after all the other threads
have been stopped.
>>
>> +static int write_list_empty(void)
>> +{
>> + int status;
>> +
>> + spin_lock_bh(&write_nodes_lock);
>> + status = list_empty(&write_nodes);
>> + spin_unlock_bh(&write_nodes_lock);
>> +
>> + return status;
>> +}
>
> This function's return value is meaningless. As soon as the lock gets
> dropped, the return value can get out of sync with reality.
>
> Looking at the caller, this _might_ happen to be OK, but it's a nasty and
> dangerous thing. Really the locking should be moved into the caller.
It's just an optimisation to allow the caller to schedule if there is no work
to do. if something arrives immediately afterwards then it will get picked up
when the process re-awakes (and it will be woken by that arrival).
The 'accepting' atomic has gone completely. as Andrew pointed out it didn't
really achieve much anyway. I suspect it was a plaster over some other
startup or shutdown bug to be honest.
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Pavel Machek <pavel@ucw.cz>
This fixes the following gcc warnings generated on
the architectures where uint64_t != unsigned long long (e.g. ppc64).
fs/dlm/rcom.c:154: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'uint64_t'
fs/dlm/rcom.c:154: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'uint64_t'
fs/dlm/recoverd.c:48: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t'
fs/dlm/recoverd.c:202: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t'
fs/dlm/recoverd.c:210: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t'
Signed-off-by: Ryusuke Konishi <ryusuke@osrg.net>
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We often abort a recovery after sending a status request to a remote node.
We want to ignore any potential status reply we get from the remote node.
If we get one of these unwanted replies, we've often moved on to the next
recovery message and incremented the message sequence counter, so the
reply will be ignored due to the seq number. In some cases, we've not
moved on to the next message so the seq number of the reply we want to
ignore is still correct, causing the reply to be accepted. The next
recovery message will then mistake this old reply as a new one.
To fix this, we add the flag RCOM_WAIT to indicate when we can accept a
new reply. We clear this flag if we abort recovery while waiting for a
reply. Before the flag is set again (to allow new replies) we know that
any old replies will be rejected due to their sequence number. We also
initialize the recovery-message sequence number to a random value when a
lockspace is first created. This makes it clear when messages are being
rejected from an old instance of a lockspace that has since been
recreated.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When the not_ready routine sends a "fake" status reply with blank status
flags, it needs to use the correct size for a normal STATUS_REPLY by
including the size of the would-be config parameters. We also fill in the
non-existant config parameters with an invalid lvblen value so it's easier
to notice if these invalid paratmers are ever being used.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Requests that arrive after recovery has started are saved in the
requestqueue and processed after recovery is done. Some of these requests
are purged during recovery if they are from nodes that have been removed.
We move the purging of the requests (dlm_purge_requestqueue) to later in
the recovery sequence which allows the routine saving requests
(dlm_add_requestqueue) to avoid filtering out requests by nodeid since the
same will be done by the purge. The current code has add_requestqueue
filtering by nodeid but doesn't hold any locks when accessing the list of
current nodes. This also means that we need to call the purge routine
when the lockspace is being shut down since the add routine will not be
rejecting requests itself any more.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The attached patch fixes the DLM config so that it selects the chosen network
transport. It should fix the bug where DLM can be left selected when NET gets
unselected. This incorporates all the comments received about this patch.
Cc: Adrian Bunk <bunk@stusta.de>
Cc: Andrew Morton <akpm@osdl.org>
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
RH BZ 211622
The ALTMODE flag can be set in the lock master's copy of the lock but
never cleared, so ALTMODE will also be returned in a subsequent conversion
of the lock when it shouldn't be. This results in lock_dlm incorrectly
switching to the alternate lock mode when returning the result to gfs
which then asserts when it sees the wrong lock state. The fix is to
propagate the cleared sbflags value to the master node when the lock is
requested. QA's d_rwrandirectlarge test triggers this bug very quickly.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 211914
The previous patch "[DLM] fix aborted recovery during
node removal" was incomplete as discovered with further testing. It set
the bit for the RS_LOCKS barrier but did not then wait for the barrier.
This is often ok, but sometimes it will cause yet another recovery hang.
If it's a new node that also has the lowest nodeid that skips the barrier
wait, then it misses the important step of collecting and reporting the
barrier status from the other nodes (which is the job of the low nodeid in
the barrier wait routine).
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 211914
When many nodes are joining a lockspace simultaneously, the dlm gets a
quick sequence of stop/start events, a pair for adding each node.
dlm_controld in user space sends dlm_recoverd in the kernel each stop and
start event. dlm_controld will sometimes send the stop before
dlm_recoverd has had a chance to take up the previously queued start. The
stop aborts the processing of the previous start by setting the
RECOVERY_STOP flag. dlm_recoverd is erroneously clearing this flag and
ignoring the stop/abort if it happens to take up the start after the stop
meant to abort it. The fix is to check the sequence number that's
incremented for each stop/start before clearing the flag.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 211914
With the new cluster infrastructure, dlm recovery for a node removal can
be aborted and restarted for a node addition. When this happens, the
restarted recovery isn't aware that it's doing recovery for the earlier
removal as well as the addition. So, it then skips the recovery steps
only required when nodes are removed. This can result in locks not being
purged for failed/removed nodes. The fix is to check for removed nodes
for which recovery has not been completed at the start of a new recovery
sequence.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 211914
There's a race between dlm_recoverd (1) enabling locking and (2) clearing
out the requestqueue, and dlm_recvd (1) checking if locking is enabled and
(2) adding a message to the requestqueue. An order of recoverd(1),
recvd(1), recvd(2), recoverd(2) will result in a message being left on the
requestqueue. The fix is to have dlm_recvd check if dlm_recoverd has
enabled locking after taking the mutex for the requestqueue and if it has
processing the message instead of queueing it.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 213682
If two nodes leave the lockspace (while unmounting the fs in the case of
gfs) after one has sent a STATUS message to the other, STATUS/STATUS_REPLY
messages will then ping-pong between the nodes when neither of them can
find the lockspace in question any longer. We kill this by not sending
another STATUS message when we get a STATUS_REPLY for an unknown
lockspace.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 213684
If a node sends an lkb to the new master (RCOM_LOCK message) during
recovery and recovery is then aborted on both nodes before it gets a
reply, the res_recover_locks_count needs to be reset to 0 so that when the
subsequent recovery comes along and sends the lkb to the new master again
the assertion doesn't trigger that checks that counter is zero.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The following patch adds a TCP based communications layer
to the DLM which is compile time selectable. The existing SCTP
layer gives the advantage of allowing multihoming, whereas
the TCP layer has been heavily tested in previous versions of
the DLM and is known to be robust and therefore can be used as
a baseline for performance testing.
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Now that the lockspace struct is freed when the last sysfs object is released
this patch prevents use of that lockspace by sysfs. We attempt to re-get the
lockspace from the lockspace list and fail the request if it has been removed.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes the recounting on the lockspace kobject. Previously the lockspace was freed while userspace could have had a
reference to one of its sysfs files, causing an oops in kref_put.
Now the lockspace kfree is moved into the kobject release() function
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I didn't spot that the msg_iovlen was set to 2 if there
were two elements in the iovec but left at zero if not :(
I think this might be why bob was still seeing trouble.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The DLM always passes the iovec length as 1, this is wrong when the circular
buffer wraps round.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Don't show an empty "Distributed Lock Manager" menu if IP_SCTP=n.
Reported by Dmytro Bagrii in kernel Bugzilla #7268.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The following patches reduce the size of the VFS inode structure by 28 bytes
on a UP x86. (It would be more on an x86_64 system). This is a 10% reduction
in the inode size on a UP kernel that is configured in a production mode
(i.e., with no spinlock or other debugging functions enabled; if you want to
save memory taken up by in-core inodes, the first thing you should do is
disable the debugging options; they are responsible for a huge amount of bloat
in the VFS inode structure).
This patch:
The filesystem or device-specific pointer in the inode is inside a union,
which is pretty pointless given that all 30+ users of this field have been
using the void pointer. Get rid of the union and rename it to i_private, with
a comment to explain who is allowed to use the void pointer. This is just a
cleanup, but it allows us to reuse the union 'u' for something something where
the union will actually be used.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>