Commit a0c516cbfc ("zram: don't grab mutex in zram_slot_free_noity")
introduced free request pending code to avoid scheduling by mutex under
spinlock and it was a mess which made code lenghty and increased
overhead.
Now, we don't need zram->lock any more to free slot so this patch
reverts it and then, tb_lock should protect it.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the zram table is protected by zram->lock but it's rather
coarse-grained lock and it makes hard for scalibility.
Let's use own rwlock instead of depending on zram->lock. This patch
adds new locking so obviously, it would make slow but this patch is just
prepartion for removing coarse-grained rw_semaphore(ie, zram->lock)
which is hurdle about zram scalability.
Final patch in this patchset series will remove the lock from read-path
and change rw_semaphore with mutex in write path. With bonus, we could
drop pending slot free mess in next patch.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a0c516cbfc ("zram: don't grab mutex in zram_slot_free_noity")
introduced pending zram slot free in zram's write path in case of
missing slot free by memory allocation failure in zram_slot_free_notify
but it is not necessary because we have already freed the slot right
before overwriting.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sergey reported we don't need to handle pending free request every I/O
so that this patch removes it in read path while we remain it in write
path.
Let's consider below example.
Swap subsystem ask to zram "A" block free by swap_slot_free_notify but
zram had been pended it without real freeing. Swap subsystem allocates
"A" block for new data but request pended for a long time just handled
and zram blindly free new data on the "A" block. :(
That's why we couldn't remove handle pending free request right before
zram-write.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zram has lived in staging for a LONG LONG time and have been
fixed/improved by many contributors so code is clean and stable now. Of
course, there are lots of product using zram in real practice.
The major TV companys have used zram as swap since two years ago and
recently our production team released android smart phone with zram
which is used as swap, too and recently Android Kitkat start to use zram
for small memory smart phone. And there was a report Google released
their ChromeOS with zram, too and cyanogenmod have been used zram long
time ago. And I heard some disto have used zram block device for tmpfs.
In addition, I saw many report from many other peoples. For example,
Lubuntu start to use it.
The benefit of zram is very clear. With my experience, one of the
benefit was to remove jitter of video application with backgroud memory
pressure. It would be effect of efficient memory usage by compression
but more issue is whether swap is there or not in the system. Recent
mobile platforms have used JAVA so there are many anonymous pages. But
embedded system normally are reluctant to use eMMC or SDCard as swap
because there is wear-leveling and latency issues so if we do not use
swap, it means we can't reclaim anoymous pages and at last, we could
encounter OOM kill. :(
Although we have real storage as swap, it was a problem, too. Because
it sometime ends up making system very unresponsible caused by slow swap
storage performance.
Quote from Luigi on Google
"Since Chrome OS was mentioned: the main reason why we don't use swap
to a disk (rotating or SSD) is because it doesn't degrade gracefully
and leads to a bad interactive experience. Generally we prefer to
manage RAM at a higher level, by transparently killing and restarting
processes. But we noticed that zram is fast enough to be competitive
with the latter, and it lets us make more efficient use of the
available RAM. " and he announced.
http://www.spinics.net/lists/linux-mm/msg57717.html
Other uses case is to use zram for block device. Zram is block device
so anyone can format the block device and mount on it so some guys on
the internet start zram as /var/tmp.
http://forums.gentoo.org/viewtopic-t-838198-start-0.html
Let's promote zram and enhance/maintain it instead of removing.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Nitin Gupta <ngupta@vflare.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block IO driver changes from Jens Axboe:
- bcache update from Kent Overstreet.
- two bcache fixes from Nicholas Swenson.
- cciss pci init error fix from Andrew.
- underflow fix in the parallel IDE pg_write code from Dan Carpenter.
I'm sure the 1 (or 0) users of that are now happy.
- two PCI related fixes for sx8 from Jingoo Han.
- floppy init fix for first block read from Jiri Kosina.
- pktcdvd error return miss fix from Julia Lawall.
- removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from
Michael Opdenacker.
- comment typo fix for the loop driver from Olaf Hering.
- potential oops fix for null_blk from Raghavendra K T.
- two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing
an OOM problem and a problem with handling security locked conditions
* 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits)
mg_disk: Spelling s/finised/finished/
null_blk: Null pointer deference problem in alloc_page_buffers
mtip32xx: Correctly handle security locked condition
mtip32xx: Make SGL container per-command to eliminate high order dma allocation
drivers/block/loop.c: fix comment typo in loop_config_discard
drivers/block/cciss.c:cciss_init_one(): use proper errnos
drivers/block/paride/pg.c: underflow bug in pg_write()
drivers/block/sx8.c: remove unnecessary pci_set_drvdata()
drivers/block/sx8.c: use module_pci_driver()
floppy: bail out in open() if drive is not responding to block0 read
bcache: Fix auxiliary search trees for key size > cacheline size
bcache: Don't return -EINTR when insert finished
bcache: Improve bucket_prio() calculation
bcache: Add bch_bkey_equal_header()
bcache: update bch_bkey_try_merge
bcache: Move insert_fixup() to btree_keys_ops
bcache: Convert sorting to btree_keys
bcache: Convert debug code to btree_keys
bcache: Convert btree_iter to struct btree_keys
bcache: Refactor bset_tree sysfs stats
...
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
We need to initialise the work_struct when we initialise the rest of the
struct nvme_dev, otherwise we'll hit a lockdep warning when we remove
the device. Use PREPARE_WORK to change the function pointer instead
of INIT_WORK.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Pull ceph updates from Sage Weil:
"This is a big batch. From Ilya we have:
- rbd support for more than ~250 mapped devices (now uses same scheme
that SCSI does for device major/minor numbering)
- crush updates for new mapping behaviors (will be needed for coming
erasure coding support, among other things)
- preliminary support for tiered storage pools
There is also a big series fixing a pile cephfs bugs with clustered
MDSs from Yan Zheng, ACL support for cephfs from Guangliang Zhao, ceph
fscache improvements from Li Wang, improved behavior when we get
ENOSPC from Josh Durgin, some readv/writev improvements from
Majianpeng, and the usual mix of small cleanups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (76 commits)
ceph: cast PAGE_SIZE to size_t in ceph_sync_write()
ceph: fix dout() compile warnings in ceph_filemap_fault()
libceph: support CEPH_FEATURE_OSD_CACHEPOOL feature
libceph: follow redirect replies from osds
libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
libceph: follow {read,write}_tier fields on osd request submission
libceph: add ceph_pg_pool_by_id()
libceph: CEPH_OSD_FLAG_* enum update
libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
libceph: introduce and start using oid abstraction
libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
libceph: move ceph_file_layout helpers to ceph_fs.h
libceph: start using oloc abstraction
libceph: dout() is missing a newline
libceph: add ceph_kv{malloc,free}() and switch to them
libceph: support CEPH_FEATURE_EXPORT_PEER
ceph: add imported caps when handling cap export message
ceph: add open export target session helper
ceph: remove exported caps when handling cap import message
ceph: handle session flush message
...
On larger systems with many drives, it may help debugging to know which
queue is tied to which interrupt, just by looking at /proc/interrupts.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
We need to shut down the device cleanly when the system is being shut down.
This was in an earlier patch but was inadvertently lost during a rewrite.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Disable the admin queue if device fails during initialization so the
queue's irq is freed.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[rewritten to use nvme_free_queues]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Some users need more than 64 partitions per device. Rather than simply
increasing the number of partitions, switch to the dynamic partition
allocation scheme.
This means that minor numbers are not stable across boots, but since major
numbers aren't either, I cannot see this being a significant problem.
Tested-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This attempts to delete all IO queues at the same time asynchronously on
shutdown. This is necessary for a present device that is not responding;
a shutdown operation previously would take 2 minutes per queue-pair
to timeout before moving on to the next queue, making a device removal
appear to take a very long time or "hung" as reported by users.
In the previous worst case, a removal may be stuck forever until a kill
signal is given if there are more than 32 queue pairs since it would run
out of admin command IDs after over an hour of timed out sync commands
(admin queue depth is 64).
This patch will wait for the admin command timeout for all commands to
complete, so the worst case now for an unresponsive controller is 60
seconds, though that still seems like a long time.
Since this adds another way to take queues offline, some duplicate code
resulted so I moved these into more convienient functions.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[make functions static, correct line length and whitespace issues]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This adds checks to see if the nvme pci device was removed. The check
reads the status register for the value of -1, which it should never be
unless the device is no longer present.
If a user performs a surprise removal on an nvme device, the driver will
be notified either by the pci driver remove callback if the platform's
slot is capable of this event, or via reading the device BAR status
register, which will indicate controller failure and trigger a reset.
Either way, the device is not present so all outstanding commands would
timeout. This will not send queue deletion commands to a drive that
isn't present and fail after ioremap, significantly speeding up surprise
removal; previously this took over 2 minutes per IO queue pair created,
but this will complete removing the device within a few seconds.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Send nvme abort command to io requests that have timed out on an
initialized device. If the command is not returned after another timeout,
schedule the controller for reset.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[fix endianness issues]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Schedules a controller reset when it indicates it has a failed status. If
the device does not become ready after a reset, the pci device will be
scheduled for removal.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[fixed checkpatch issue]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before
introducing r_target_{oloc,oid} needed for redirects.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In preparation for tiering support, which would require having two
(base and target) object names for each osd request and also copying
those names around, introduce struct ceph_object_id (oid) and a couple
helpers to facilitate those copies and encapsulate the fact that object
name is not necessarily a NUL-terminated string.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to
CEPH_MAX_OID_NAME_LEN.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Instead of relying on pool fields in ceph_file_layout (for mapping) and
ceph_pg (for enconding), start using ceph_object_locator (oloc)
abstraction. Note that userspace oloc currently consists of pool, key,
nspace and hash fields, while this one contains only a pool. This is
OK, because at this point we only send (i.e. encode) olocs and never
have to receive (i.e. decode) them.
This makes keeping a copy of ceph_file_layout in every osd request
unnecessary, so ceph_osd_request::r_file_layout field is nuked.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>