The nr_dentry stat is a globally touched cacheline and atomic operation
twice over the lifetime of a dentry. It is used for the benfit of userspace
only. Turn it into a per-cpu counter and always decrement it in d_free instead
of doing various batching operations to reduce lock hold times in the callers.
Based on an earlier patch from Nick Piggin <npiggin@suse.de>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split up inode_add_to_list/__inode_add_to_list. Locking for the two
lists will be split soon so these helpers really don't buy us much
anymore.
The __ prefixes for the sb list helpers will go away soon, but until
inode_lock is gone we'll need them to distinguish between the locked
and unlocked variants.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic. We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.
This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.
Based on a patch originally from Eric Dumazet.
[AV: cleaned up a bit, fixed build breakage on weird configs
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make node look as if it was on hlist, with hlist_del()
working correctly. Usable without any locking...
Convert a couple of places where we want to do that to
inode->i_hash.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now, rw_verify_area() checsk f_pos is negative or not. And if negative,
returns -EINVAL.
But, some special files as /dev/(k)mem and /proc/<pid>/mem etc.. has
negative offsets. And we can't do any access via read/write to the
file(device).
So introduce FMODE_UNSIGNED_OFFSET to allow negative file offsets.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Andrew,
Could you please review this patch, you probably are the right guy to
take it, because it crosses fs and net trees.
Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
depend on previous patch (sysctl: fix min/max handling in
__do_proc_doulongvec_minmax())
Thanks !
[PATCH V4] fs: allow for more than 2^31 files
Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :
<quote>
We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:
atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;
The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().
n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;
In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.
</quote>
Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of
atomic_t.
get_max_files() is changed to return an unsigned long.
get_nr_files() is changed to return a long.
unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.
Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968
After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648
Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__block_write_begin and block_prepare_write are identical except for slightly
different calling conventions. Convert all callers to the __block_write_begin
calling conventions and drop block_prepare_write.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Hugetlbfs used to need it, but after the destroy_inode and evict_inode
changes it's not required anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new helper to write out the inode using the writeback code,
that is including the correct dirty bit and list manipulation. A few
of filesystems already opencode this, and a lot of others should be
using it instead of using write_inode_now which also writes out the
data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'davinci-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-davinci: (50 commits)
davinci: fix remaining board support after io_pgoffst removal
davinci: mityomapl138: make file local data static
arm/davinci: remove duplicated include
davinci: Initial support for Omapl138-Hawkboard
davinci: MityDSP-L138/MityARM-1808 read MAC address from I2C Prom
davinci: add tnetv107x touchscreen platform device
input: add driver for tnetv107x touchscreen controller
davinci: add keypad config for tnetv107x evm board
davinci: add tnetv107x keypad platform device
input: add driver for tnetv107x on-chip keypad controller
net: davinci_emac: cleanup unused cpdma code
net: davinci_emac: switch to new cpdma layer
net: davinci_emac: separate out cpdma code
net: davinci_emac: cleanup unused mdio emac code
omap: cleanup unused davinci mdio arch code
davinci: cleanup mdio arch code and switch to phy_id
net: davinci_emac: switch to new mdio
omap: add mdio platform devices
davinci: add mdio platform devices
net: davinci_emac: separate out davinci mdio
...
Fix up trivial conflict in drivers/input/keyboard/Kconfig (two entries
added next to each other - one from the davinci merge, one from the
input merge)
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6: (365 commits)
ALSA: hda - Disable sticky PCM stream assignment for AD codecs
ALSA: usb - Creative USB X-Fi volume knob support
ALSA: ca0106: Use card specific dac id for mute controls.
ALSA: ca0106: Allow different sound cards to use different SPI channel mappings.
ALSA: ca0106: Create a nice spot for mapping channels to dacs.
ALSA: ca0106: Move enabling of front dac out of hardcoded setup sequence.
ALSA: ca0106: Pull out dac powering routine into separate function.
ALSA: ca0106 - add Sound Blaster 5.1vx info.
ASoC: tlv320dac33: Use usleep_range for delays
ALSA: usb-audio: add Novation Launchpad support
ALSA: hda - Add workarounds for CT-IBG controllers
ALSA: hda - Fix wrong TLV mute bit for STAC/IDT codecs
ASoC: tpa6130a2: Error handling for broken chip
ASoC: max98088: Staticise m98088_eq_band
ASoC: soc-core: Fix codec->name memory leak
ALSA: hda - Apply ideapad quirk to Acer laptops with Cxt5066
ALSA: hda - Add some workarounds for Creative IBG
ALSA: hda - Fix wrong SPDIF NID assignment for CA0110
ALSA: hda - Fix codec rename rules for ALC662-compatible codecs
ALSA: hda - Add alc_init_jacks() call to other codecs
...
* 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/dvrabel/uwb:
uwb: Orphan the UWB and WUSB subsystems
uwb: Remove the WLP subsystem and drivers
* 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6:
mtd/m25p80: add support to parse the partitions by OF node
of/irq: of_irq.c needs to include linux/irq.h
of/mips: Cleanup some include directives/files.
of/mips: Add device tree support to MIPS
of/flattree: Eliminate need to provide early_init_dt_scan_chosen_arch
of/device: Rework to use common platform_device_alloc() for allocating devices
of/xsysace: Fix OF probing on little-endian systems
of: use __be32 types for big-endian device tree data
of/irq: remove references to NO_IRQ in drivers/of/platform.c
of/promtree: add package-to-path support to pdt
of/promtree: add of_pdt namespace to pdt code
of/promtree: no longer call prom_ functions directly; use an ops structure
of/promtree: make drivers/of/pdt.c no longer sparc-only
sparc: break out some PROM device-tree building code out into drivers/of
of/sparc: convert various prom_* functions to use phandle
sparc: stop exporting openprom.h header
powerpc, of_serial: Endianness issues setting up the serial ports
of: MTD: Fix OF probing on little-endian systems
of: GPIO: Fix OF probing on little-endian systems
Replace the BKL with a mutex to protect the venus_comm structure which
binds the mountpoint with the character device and holds the upcall
queues.
Signed-off-by: Yoshihisa Abe <yoshiabe@cs.cmu.edu>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that shared inode state is locked using the cii->c_lock, the BKL is
only used to protect the upcall queues used to communicate with the
userspace cache manager. The remaining state is all local and we can
push the lock further down into coda_upcall().
Signed-off-by: Yoshihisa Abe <yoshiabe@cs.cmu.edu>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We mostly need it to protect cached user permissions. The c_flags field
is advisory, reading the wrong value is harmless and in the worst case
we hit a slow path where we have to make an extra upcall to the
userspace cache manager when revalidating a dentry or inode.
Signed-off-by: Yoshihisa Abe <yoshiabe@cs.cmu.edu>
Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (110 commits)
sh: i2c-sh7760: Replase from ctrl_* to __raw_*
sh: clkfwk: Shuffle around to match the intc split up.
sh: clkfwk: modify for_each_frequency end condition
sh: fix clk_get() error handling
sh: clkfwk: Fix fault in frequency iterator.
sh: clkfwk: Add a helper for rate rounding by divisor ranges.
sh: clkfwk: Abstract rate rounding helper.
sh: clkfwk: support clock remapping.
sh: pci: Convert to upper/lower_32_bits() helpers.
sh: mach-sdk7786: Add support for the FPGA SRAM.
sh: Provide a generic SRAM pool for tiny memories.
sh: pci: Support secondary FPGA-driven PCIe clocks on SDK7786.
sh: pci: Support slot 4 routing on SDK7786.
sh: Fix up PMB locking.
sh: mach-sdk7786: Add support for fpga gpios.
sh: use pr_fmt for clock framework, too.
sh: remove name and id from struct clk
sh: free-without-alloc fix for sh_mobile_lcdcfb
sh: perf: Set up perf_max_events.
sh: perf: Support SH-X3 hardware counters.
...
Fix up trivial conflicts (perf_max_events got removed) in arch/sh/kernel/perf_event.c