Majianpeng reported a lockdep splat for f2fs. It turns out mutex_lock_all()
acquires an array of locks (in global/local lock style).
Any such operation is always serialized using cp_mutex, therefore there is no
fs_lock[] lock-order issue; tell lockdep about this using the
mutex_lock_nest_lock() primitive.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
I found a bug when testing power-off-recovery as follows.
[Bug Scenario]
1. create a file
2. fsync the file
3. reboot w/o any sync
4. try to recover the file
- found its fsync mark
- found its dentry mark
: try to recover its dentry
- get its file name
- get its parent inode number
: here we got zero value
The reason why we get the wrong parent inode number is that we didn't
synchronize the inode page with its newly created inode information perfectly.
Especially, previous f2fs stores fi->i_pino and writes it to the cached
node page in a wrong order, which incurs the zero-valued i_pino during the
recovery.
So, this patch modifies the creation flow to fix the synchronization order of
inode page with its inode.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If get_dnode_of_data gets a locked node page, let's skip redundant
get_node_page calls.
This is for the futher enhancement.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
During the dentry recovery routine, recover_inode() triggers __f2fs_add_link
with its directory inode.
In the following scenario, a bug is captured.
1. dir = f2fs_iget(pino)
2. __f2fs_add_link(dir, name)
3. iput(dir)
-> f2fs_evict_inode() faces with BUG_ON(atomic_read(fi->dirty_dents))
Kernel BUG at ffffffffa01c0676 [verbose debug info unavailable]
[<ffffffffa01c0676>] f2fs_evict_inode+0x276/0x300 [f2fs]
Call Trace:
[<ffffffff8118ea00>] evict+0xb0/0x1b0
[<ffffffff8118f1c5>] iput+0x105/0x190
[<ffffffffa01d2dac>] recover_fsync_data+0x3bc/0x1070 [f2fs]
[<ffffffff81692e8a>] ? io_schedule+0xaa/0xd0
[<ffffffff81690acb>] ? __wait_on_bit_lock+0x7b/0xc0
[<ffffffff8111a0e7>] ? __lock_page+0x67/0x70
[<ffffffff81165e21>] ? kmem_cache_alloc+0x31/0x140
[<ffffffff8118a502>] ? __d_instantiate+0x92/0xf0
[<ffffffff812a949b>] ? security_d_instantiate+0x1b/0x30
[<ffffffff8118a5b4>] ? d_instantiate+0x54/0x70
This means that we should flush all the dentry pages between iget and iput().
But, during the recovery routine, it is unallowed due to consistency, so we
have to wait the whole recovery process.
And then, write_checkpoint flushes all the dirty dentry blocks, and nicely we
can put the stale dir inodes from the dirty_dir_inode_list.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The reason of using sbi->por_doing is to alleviate data writes during the
recovery.
The find_fsync_dnodes() produces some dirty dentry pages, so we should
cover it too with sbi->por_doing.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In get_lock_data_page, if there is a data race between get_dnode_of_data for
node and grab_cache_page for data, f2fs is able to face with the following
BUG_ON(dn.data_blkaddr == NEW_ADDR).
kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/data.c:251!
[<ffffffffa044966c>] get_lock_data_page+0x1ec/0x210 [f2fs]
Call Trace:
[<ffffffffa043b089>] f2fs_readdir+0x89/0x210 [f2fs]
[<ffffffff811a0920>] ? fillonedir+0x100/0x100
[<ffffffff811a0920>] ? fillonedir+0x100/0x100
[<ffffffff811a07f8>] vfs_readdir+0xb8/0xe0
[<ffffffff811a0b4f>] sys_getdents+0x8f/0x110
[<ffffffff816d7999>] system_call_fastpath+0x16/0x1b
This bug is able to be occurred when the block address of the data block is
changed after f2fs_put_dnode().
In order to avoid that, this patch fixes the lock order of node and data
blocks in which the node block lock is covered by the data block lock.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Currently f2fs recovers the dentry of fsynced files.
When power-off-recovery is conducted, this newly recovered inode should increase
node block count as well as inode block count.
This patch resolves this inconsistency that results in:
1. create a file
2. write data
3. fsync
4. reboot without sync
5. mount and recover the file
6. node block count is 1 and inode block count is 2
: fall into the inconsistent state
7. unlink the file
: trigger the following BUG_ON
------------[ cut here ]------------
kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/f2fs.h:716!
Call Trace:
[<ffffffffa0344100>] ? get_node_page+0x50/0x1a0 [f2fs]
[<ffffffffa0344bfc>] remove_inode_page+0x8c/0x100 [f2fs]
[<ffffffffa03380f0>] ? f2fs_evict_inode+0x180/0x2d0 [f2fs]
[<ffffffffa033812e>] f2fs_evict_inode+0x1be/0x2d0 [f2fs]
[<ffffffff811c7a67>] evict+0xa7/0x1a0
[<ffffffff811c82b5>] iput+0x105/0x190
[<ffffffff811c2b30>] d_kill+0xe0/0x120
[<ffffffff811c2c57>] dput+0xe7/0x1e0
[<ffffffff811acc3d>] __fput+0x19d/0x2d0
[<ffffffff811acd7e>] ____fput+0xe/0x10
[<ffffffff81070645>] task_work_run+0xb5/0xe0
[<ffffffff81002941>] do_notify_resume+0x71/0xb0
[<ffffffff8175f14a>] int_signal+0x12/0x17
Reported-and-Tested-by: Chris Fries <C.Fries@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
do_smart_update_queue() is called when an operation (semop,
semctl(SETVAL), semctl(SETALL), ...) modified the array. It must check
which of the sleeping tasks can proceed.
do_smart_update_queue() missed a few wakeups:
- if a sleeping complex op was completed, then all per-semaphore queues
must be scanned - not only those that were modified by *sops
- if a sleeping simple op proceeded, then the global queue must be
scanned again
And:
- the test for "|sops == NULL) before scanning the global queue is not
required: If the global queue is empty, then it doesn't need to be
scanned - regardless of the reason for calling do_smart_update_queue()
The patch is not optimized, i.e. even completing a wait-for-zero
operation causes a rescan. This is done to keep the patch as simple as
possible.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull NFS client bugfixes from Trond Myklebust:
- Stable fix to prevent an rpc_task wakeup race
- Fix a NFSv4.1 session drain deadlock
- Fix a NFSv4/v4.1 mount regression when not running rpc.gssd
- Ensure auth_gss pipe detection works in namespaces
- Fix SETCLIENTID fallback if rpcsec_gss is not available
* tag 'nfs-for-3.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix SETCLIENTID fallback if GSS is not available
SUNRPC: Prevent an rpc_task wakeup race
NFSv4.1 Fix a pNFS session draining deadlock
SUNRPC: Convert auth_gss pipe detection to work in namespaces
SUNRPC: Faster detection if gssd is actually running
SUNRPC: Fix a bug in gss_create_upcall
Pull parisc fixes from Helge Deller:
"This time we made the kernel- and interruption stack allocation
reentrant which fixed some strange kernel crashes (specifically
protection ID traps).
Furthemore this patchset fixes the interrupt stack in UP and SMP
configurations by using native locking instructions. And finally
usage of floating point calculations on parisc were disabled in the
MPILIB."
* 'parisc-for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: fix irq stack on UP and SMP
parisc/superio: Use module_pci_driver to register driver
parisc: make interrupt and interruption stack allocation reentrant
parisc: show number of FPE and unaligned access handler calls in /proc/interrupts
parisc: add additional parisc git tree to MAINTAINERS file
parisc: use PAGE_SHIFT instead of hardcoded value 12 in pacache.S
parisc: add rp5470 entry to machine database
MPILIB: disable usage of floating point registers on parisc
Pull xfs fixes from Ben Myers:
"Here are fixes for corruption on 512 byte filesystems, a rounding
error, a use-after-free, some flags to fix lockdep reports, and
several fixes related to CRCs. We have a somewhat larger post -rc1
queue than usual due to fixes related to the CRC feature we merged for
3.10:
- Fix for corruption with FSX on 512 byte blocksize filesystems
- Fix rounding error in xfs_free_file_space
- Fix use-after-free with extent free intents
- Add several missing KM_NOFS flags to fix lockdep reports
- Several fixes for CRC related code"
* tag 'for-linus-v3.10-rc3' of git://oss.sgi.com/xfs/xfs:
xfs: remote attribute lookups require the value length
xfs: xfs_attr_shortform_allfit() does not handle attr3 format.
xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC
xfs: fix missing KM_NOFS tags to keep lockdep happy
xfs: Don't reference the EFI after it is freed
xfs: fix rounding in xfs_free_file_space
xfs: fix sub-page blocksize data integrity writes
Pull power management and ACPI fixes from Rafael Wysocki:
- Additional CPU ID for the intel_pstate driver from Dirk Brandewie.
- More cpufreq fixes related to ARM big.LITTLE support and locking from
Viresh Kumar.
- VIA C7 cpufreq build fix from Rafał Bilski.
- ACPI power management fix making it possible to use device power
states regardless of the CONFIG_PM setting from Rafael J Wysocki.
- New ACPI video blacklist item from Bastian Triller.
* tag 'pm+acpi-3.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / video: Add "Asus UL30A" to ACPI video detect blacklist
cpufreq: arm_big_little_dt: Instantiate as platform_driver
cpufreq: arm_big_little_dt: Register driver only if DT has valid data
cpufreq / e_powersaver: Fix linker error when ACPI processor is a module
cpufreq / intel_pstate: Add additional supported CPU ID
cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT
ACPI / PM: Allow device power states to be used for CONFIG_PM unset
Pull slave-dma fixes from Vinod Koul:
"We have two patches from Andy & Rafael fixing the Lynxpoint dma"
* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
ACPI / LPSS: register clock device for Lynxpoint DMA properly
dma: acpi-dma: parse CSRT to extract additional resources
kcore_vmalloc is in fs/proc/kcore.c and kcore_mem is unused across
the tree. Noticed while grepping the tree for some other kcore stuff.
(score looks pretty unmaintained to me.)
Signed-off-by: Kyle McMartin <kyle@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull ARC fixes from Vineet Gupta:
- Fallouts/wreckage of Cache Flush optimizations / aliasing dcache
support
- Fix for an interesting bug where piped input to grep was getting
mysteriously clobbered
* tag 'arc-v3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: lazy dcache flush broke gdb in non-aliasing configs
ARC: Use enough bits for determining page's cache color
ARC: Brown paper bag bug in macro for checking cache color
ARC: copy_(to|from)_user() to honor usermode-access permissions
ARC: [mm] Prevent stray dcache lines after__sync_icache_dcach()
ARC: [TB10x] Remove redundant abilis,simple-pinctrl mechanism
Pull ARM fixes from Russell King:
"Just three this time, all really quite small"
* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
ARM: 7729/1: vfp: ensure VFP_arch is non-zero when VFP is not supported
ARM: 7727/1: remove the .vm_mm value from gate_vma
ARM: 7723/1: crypto: sha1-armv4-large.S: fix SP handling
gdbserver inserting a breakpoint ends up calling copy_user_page() for a
code page. The generic version of which (non-aliasing config) didn't set
the PG_arch_1 bit hence update_mmu_cache() didn't sync dcache/icache for
corresponding dynamic loader code page - causing garbade to be executed.
So now aliasing versions of copy_user_highpage()/clear_page() are made
default. There is no significant overhead since all of special alias
handling code is compiled out for non-aliasing build
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Merge fixes from Andrew Morton:
"A bunch of fixes and one simple fbdev driver which missed the merge
window because people will still talking about it (to no great
effect)."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (30 commits)
aio: fix kioctx not being freed after cancellation at exit time
mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas
drivers/rtc/rtc-max8998.c: check for pdata presence before dereferencing
ocfs2: goto out_unlock if ocfs2_get_clusters_nocache() failed in ocfs2_fiemap()
random: fix accounting race condition with lockless irq entropy_count update
drivers/char/random.c: fix priming of last_data
mm/memory_hotplug.c: fix printk format warnings
nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary
drivers/block/brd.c: fix brd_lookup_page() race
fbdev: FB_GOLDFISH should depend on HAS_DMA
drivers/rtc/rtc-pl031.c: pass correct pointer to free_irq()
auditfilter.c: fix kernel-doc warnings
aio: fix io_getevents documentation
revert "selftest: add simple test for soft-dirty bit"
drivers/leds/leds-ot200.c: fix error caused by shifted mask
mm/THP: use pmd_populate() to update the pmd with pgtable_t pointer
linux/kernel.h: fix kernel-doc warning
mm compaction: fix of improper cache flush in migration code
rapidio/tsi721: fix bug in MSI interrupt handling
hfs: avoid crash in hfs_bnode_create
...