null_blk is partitionable, but it doesn't store any of the info. When
it is loaded, you would normally see:
[1226739.343608] nullb0: unknown partition table
[1226739.343746] nullb1: unknown partition table
which can confuse some people. Add the appropriate gendisk flag
to suppress this info.
Signed-off-by: Jens Axboe <axboe@fb.com>
This was inadvertently dropped from an earlier commit, otherwise
the check against cq_vector == -1 to prevent double free doesn't
make any sense.
Fixes: 2b25d981790b
Signed-off-by: Jens Axboe <axboe@fb.com>
Because of the direct_access() API which returns a PFN. partitions
better start on 4K boundary, else offset ZERO of a partition will
not be aligned and blk_direct_access() will fail the call.
By setting blk_queue_physical_block_size(PAGE_SIZE) we can communicate
this to fdisk and friends.
The call to blk_queue_physical_block_size() is harmless and will
not affect the Kernel behavior in any way. It is only for
communication to user-mode.
before this patch running fdisk on a default size brd of 4M
the first sector offered is 34 (BAD), but after this patch it
will be 40, ie 8 sectors aligned. Also when entering some random
partition sizes the next partition-start sector is offered 8 sectors
aligned after this patch. (Please note that with fdisk the user
can still enter bad values, only the offered default values will
be correct)
Note that with bdev-size > 4M fdisk will try to align on a 1M
boundary (above first-sector will be 2048), in any case.
CC: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch fixes up brd's partitions scheme, now enjoying all worlds.
The MAIN fix here is that currently, if one fdisks some partitions,
a BAD bug will make all partitions point to the same start-end sector
ie: 0 - brd_size And an mkfs of any partition would trash the partition
table and the other partition.
Another fix is that "mount -U uuid" will not work if show_part was not
specified, because of the GENHD_FL_SUPPRESS_PARTITION_INFO flag.
We now always load without it and remove the show_part parameter.
[We remove Dmitry's new module-param part_show it is now always
show]
So NOW the logic goes like this:
* max_part - Just says how many minors to reserve between ramX
devices. In any way, there can be as many partition as requested.
If minors between devices ends, then dynamic 259-major ids will
be allocated on the fly.
The default is now max_part=1, which means all partitions devt(s)
will be from the dynamic (259) major-range.
(If persistent partition minors is needed use max_part=X)
For example with /dev/sdX max_part is hard coded 16.
* Creation of new devices on the fly still/always work:
mknod /path/devnod b 1 X
fdisk -l /path/devnod
Will create a new device if [X / max_part] was not already
created before. (Just as before)
partitions on the dynamically created device will work as well
Same logic applies with minors as with the pre-created ones.
TODO: dynamic grow of device size. So each device can have it's
own size.
CC: Dmitry Monakhov <dmonakhov@openvz.org>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.
Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The queues and device need to be locked when messing with them.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This freezes and stops all the queues on device shutdown and restarts
them on resume. This fixes hotplug and reset issues when the controller
is actively being used.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Aborts all requeued commands prior to killing the request_queue. For
commands that time out on a dying request queue, set the "Do Not Retry"
bit on the command status so the command cannot be requeued. Finanally, if
the driver is requested to abort a command it did not start, do nothing.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This protects admin queue access on shutdown. When the controller is
disabled, the queue is frozen to prevent new entry, and unfrozen on
resume, and fixes cq_vector signedness to not suspend a queue twice.
Since unfreezing the queue makes it available for commands, it requires
the queue be initialized, so this moves this part after that.
Special handling is done when the device is unresponsive during
shutdown. This can be optimized to not require subsequent commands to
timeout, but saving that fix for later.
This patch also removes the kill signals in this path that were left-over
artifacts from the blk-mq conversion and no longer necessary.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since there is no gendisk associated with the admin queue, the driver
needs to hold a reference to it until all open references to the
controller are closed.
This also combines queue cleanup with freeing the tag set since these
should not be separate.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Once the nvme callback is set for a request, the driver can start it
and make it available for timeout handling. For timed out commands on a
device that is not initialized, this fixes potential deadlocks that can
occur on startup and shutdown when a device is unresponsive since they
can now be cancelled.
Asynchronous requests do not have any expected timeout, so these are
using the new "REQ_NO_TIMEOUT" request flags.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Looks like we pull it in through other ways on x86, but we fail
on sparc:
In file included from drivers/block/cryptoloop.c:30:0:
drivers/block/loop.h:63:24: error: field 'tag_set' has incomplete type
struct blk_mq_tag_set tag_set;
Add the include to loop.h, kill it from loop.c.
Signed-off-by: Jens Axboe <axboe@fb.com>
block core handles REQ_FUA by its flush state machine, so
won't do it in loop explicitly.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
No behaviour change, just move the handling for REQ_DISCARD
and REQ_FLUSH in these two functions.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The conversion is a bit straightforward, and use work queue to
dispatch requests of loop block, and one big change is that requests
is submitted to backend file/device concurrently with work queue,
so throughput may get improved much. Given write requests over same
file are often run exclusively, so don't handle them concurrently for
avoiding extra context switch cost, possible lock contention and work
schedule cost. Also with blk-mq, there is opportunity to get loop I/O
merged before submitting to backend file/device.
In the following test:
- base: v3.19-rc2-2041231
- loop over file in ext4 file system on SSD disk
- bs: 4k, libaio, io depth: 64, O_DIRECT, num of jobs: 1
- throughput: IOPS
------------------------------------------------------
| | base | base with loop-mq | delta |
------------------------------------------------------
| randread | 1740 | 25318 | +1355%|
------------------------------------------------------
| read | 42196 | 51771 | +22.6%|
-----------------------------------------------------
| randwrite | 35709 | 34624 | -3% |
-----------------------------------------------------
| write | 39137 | 40326 | +3% |
-----------------------------------------------------
So loop-mq can improve throughput for both read and randread, meantime,
performance of write and randwrite isn't hurted basically.
Another benefit is that loop driver code gets simplified
much after blk-mq conversion, and the patch can be thought as
cleanup too.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Check IS_ERR_OR_NULL(return value) instead of just return value.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reduced to IS_ERR() by me, we never return NULL.
Signed-off-by: Jens Axboe <axboe@fb.com>
Sets the vector to an invalid value after it's freed so we don't free
it twice.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull ceph updates from Sage Weil:
"The big item here is support for inline data for CephFS and for
message signatures from Zheng. There are also several bug fixes,
including interrupted flock request handling, 0-length xattrs, mksnap,
cached readdir results, and a message version compat field. Finally
there are several cleanups from Ilya, Dan, and Markus.
Note that there is another series coming soon that fixes some bugs in
the RBD 'lingering' requests, but it isn't quite ready yet"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
ceph: fix setting empty extended attribute
ceph: fix mksnap crash
ceph: do_sync is never initialized
libceph: fixup includes in pagelist.h
ceph: support inline data feature
ceph: flush inline version
ceph: convert inline data to normal data before data write
ceph: sync read inline data
ceph: fetch inline data when getting Fcr cap refs
ceph: use getattr request to fetch inline data
ceph: add inline data to pagecache
ceph: parse inline data in MClientReply and MClientCaps
libceph: specify position of extent operation
libceph: add CREATE osd operation support
libceph: add SETXATTR/CMPXATTR osd operations support
rbd: don't treat CEPH_OSD_OP_DELETE as extent op
ceph: remove unused stringification macros
libceph: require cephx message signature by default
ceph: introduce global empty snap context
ceph: message versioning fixes
...
CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such. This
sneaked in with discard patches - it's one of the three osd ops (the
other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard
is implemented with.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The functions ceph_put_snap_context() and iput() test whether their
argument is NULL and then return immediately. Thus the test around the
call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[idryomov@redhat.com: squashed rbd.c hunk, changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Here's the set of driver core patches for 3.19-rc1.
They are dominated by the removal of the .owner field in platform
drivers. They touch a lot of files, but they are "simple" changes, just
removing a line in a structure.
Other than that, a few minor driver core and debugfs changes. There are
some ath9k patches coming in through this tree that have been acked by
the wireless maintainers as they relied on the debugfs changes.
Everything has been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlSOD20ACgkQMUfUDdst+ylLPACg2QrW1oHhdTMT9WI8jihlHVRM
53kAoLeteByQ3iVwWurwwseRPiWa8+MI
=OVRS
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core update from Greg KH:
"Here's the set of driver core patches for 3.19-rc1.
They are dominated by the removal of the .owner field in platform
drivers. They touch a lot of files, but they are "simple" changes,
just removing a line in a structure.
Other than that, a few minor driver core and debugfs changes. There
are some ath9k patches coming in through this tree that have been
acked by the wireless maintainers as they relied on the debugfs
changes.
Everything has been in linux-next for a while"
* tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits)
Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries"
fs: debugfs: add forward declaration for struct device type
firmware class: Deletion of an unnecessary check before the function call "vunmap"
firmware loader: fix hung task warning dump
devcoredump: provide a one-way disable function
device: Add dev_<level>_once variants
ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries
ath: use seq_file api for ath9k debugfs files
debugfs: add helper function to create device related seq_file
drivers/base: cacheinfo: remove noisy error boot message
Revert "core: platform: add warning if driver has no owner"
drivers: base: support cpu cache information interface to userspace via sysfs
drivers: base: add cpu_device_create to support per-cpu devices
topology: replace custom attribute macros with standard DEVICE_ATTR*
cpumask: factor out show_cpumap into separate helper function
driver core: Fix unbalanced device reference in drivers_probe
driver core: fix race with userland in device_add()
sysfs/kernfs: make read requests on pre-alloc files use the buffer.
sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
fs: sysfs: return EGBIG on write if offset is larger than file size
...
Pull block layer driver updates from Jens Axboe:
- NVMe updates:
- The blk-mq conversion from Matias (and others)
- A stack of NVMe bug fixes from the nvme tree, mostly from Keith.
- Various bug fixes from me, fixing issues in both the blk-mq
conversion and generic bugs.
- Abort and CPU online fix from Sam.
- Hot add/remove fix from Indraneel.
- A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp)
- With the generic IO stat accounting from 3.19/core, converting md,
bcache, and rsxx to use those. From Gu Zheng.
- Boundary check for queue/irq mode for null_blk from Matias. Fixes
cases where invalid values could be given, causing the device to hang.
- The xen blkfront pull request, with two bug fixes from Vitaly.
* 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits)
NVMe: fix race condition in nvme_submit_sync_cmd()
NVMe: fix retry/error logic in nvme_queue_rq()
NVMe: Fix FS mount issue (hot-remove followed by hot-add)
NVMe: fix error return checking from blk_mq_alloc_request()
NVMe: fix freeing of wrong request in abort path
xen/blkfront: remove redundant flush_op
xen/blkfront: improve protection against issuing unsupported REQ_FUA
NVMe: Fix command setup on IO retry
null_blk: boundary check queue_mode and irqmode
block/rsxx: use generic io stats accounting functions to simplify io stat accounting
md: use generic io stats accounting functions to simplify io stat accounting
drbd: use generic io stats accounting functions to simplify io stat accounting
md/bcache: use generic io stats accounting functions to simplify io stat accounting
NVMe: Update module version major number
NVMe: fail pci initialization if the device doesn't have any BARs
NVMe: add ->exit_hctx() hook
NVMe: make setup work for devices that don't do INTx
NVMe: enable IO stats by default
NVMe: nvme_submit_async_admin_req() must use atomic rq allocation
NVMe: replace blk_put_request() with blk_mq_free_request()
...
Pull block driver core update from Jens Axboe:
"This is the pull request for the core block IO changes for 3.19. Not
a huge round this time, mostly lots of little good fixes:
- Fix a bug in sysfs blktrace interface causing a NULL pointer
dereference, when enabled/disabled through that API. From Arianna
Avanzini.
- Various updates/fixes/improvements for blk-mq:
- A set of updates from Bart, mostly fixing buts in the tag
handling.
- Cleanup/code consolidation from Christoph.
- Extend queue_rq API to be able to handle batching issues of IO
requests. NVMe will utilize this shortly. From me.
- A few tag and request handling updates from me.
- Cleanup of the preempt handling for running queues from Paolo.
- Prevent running of unmapped hardware queues from Ming Lei.
- Move the kdump memory limiting check to be in the correct
location, from Shaohua.
- Initialize all software queues at init time from Takashi. This
prevents a kobject warning when CPUs are brought online that
weren't online when a queue was registered.
- Single writeback fix for I_DIRTY clearing from Tejun. Queued with
the core IO changes, since it's just a single fix.
- Version X of the __bio_add_page() segment addition retry from
Maurizio. Hope the Xth time is the charm.
- Documentation fixup for IO scheduler merging from Jan.
- Introduce (and use) generic IO stat accounting helpers for non-rq
drivers, from Gu Zheng.
- Kill off artificial limiting of max sectors in a request from
Christoph"
* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
bio: modify __bio_add_page() to accept pages that don't start a new segment
blk-mq: Fix uninitialized kobject at CPU hotplugging
blktrace: don't let the sysfs interface remove trace from running list
blk-mq: Use all available hardware queues
blk-mq: Micro-optimize bt_get()
blk-mq: Fix a race between bt_clear_tag() and bt_get()
blk-mq: Avoid that __bt_get_word() wraps multiple times
blk-mq: Fix a use-after-free
blk-mq: prevent unmapped hw queue from being scheduled
blk-mq: re-check for available tags after running the hardware queue
blk-mq: fix hang in bt_get()
blk-mq: move the kdump check to blk_mq_alloc_tag_set
blk-mq: cleanup tag free handling
blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
blk: introduce generic io stat accounting help function
blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
genhd: check for int overflow in disk_expand_part_tbl()
blk-mq: add blk_mq_free_hctx_request()
blk-mq: export blk_mq_free_request()
blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
...
Merge second patchbomb from Andrew Morton:
- the rest of MM
- misc fs fixes
- add execveat() syscall
- new ratelimit feature for fault-injection
- decompressor updates
- ipc/ updates
- fallocate feature creep
- fsnotify cleanups
- a few other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (99 commits)
cgroups: Documentation: fix trivial typos and wrong paragraph numberings
parisc: percpu: update comments referring to __get_cpu_var
percpu: update local_ops.txt to reflect this_cpu operations
percpu: remove __get_cpu_var and __raw_get_cpu_var macros
fsnotify: remove destroy_list from fsnotify_mark
fsnotify: unify inode and mount marks handling
fallocate: create FAN_MODIFY and IN_MODIFY events
mm/cma: make kmemleak ignore CMA regions
slub: fix cpuset check in get_any_partial
slab: fix cpuset check in fallback_alloc
shmdt: use i_size_read() instead of ->i_size
ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments
ipc/msg: increase MSGMNI, remove scaling
ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM
ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()
lib/decompress.c: consistency of compress formats for kernel image
decompress_bunzip2: off by one in get_next_block()
usr/Kconfig: make initrd compression algorithm selection not expert
fault-inject: add ratelimit option
ratelimit: add initialization macro
...
In current zram, we use DEVICE_ATTR() to define sys device attributes.
SO, we need to set (S_IRUGO | S_IWUSR) permission and other arguments
manually. Linux already provids the macro DEVICE_ATTR_[RW|RO|WO] to
define sys device attribute. It is simple and readable.
This patch uses kernel defined macro DEVICE_ATTR_[RW|RO|WO] to define
zram device attribute.
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In struct zram_table_entry, the element *value* contains obj size and obj
zram flags. Bit 0 to bit (ZRAM_FLAG_SHIFT - 1) represent obj size, and
bit ZRAM_FLAG_SHIFT to the highest bit of unsigned long represent obj
zram_flags. So the first zram flag(ZRAM_ZERO) should be from
ZRAM_FLAG_SHIFT instead of (ZRAM_FLAG_SHIFT + 1).
This patch fixes this cosmetic issue.
Also fix a typo, "page in now accessed" -> "page is now accessed"
Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements rw_page operation for zram block device.
I implemented the feature in zram and tested it. Test bed was the G2, LG
electronic mobile device, whtich has msm8974 processor and 2GB memory.
With a memory allocation test program consuming memory, the system
generates swap.
Operating time of swap_write_page() was measured.
--------------------------------------------------
| | operating time | improvement |
| | (20 runs average) | |
--------------------------------------------------
|with patch | 1061.15 us | +2.4% |
--------------------------------------------------
|without patch| 1087.35 us | |
--------------------------------------------------
Each test(with paged_io,with BIO) result set shows normal distribution and
has equal variance. I mean the two values are valid result to compare. I
can say operation with paged I/O(without BIO) is faster 2.4% with
confidence level 95%.
[minchan@kernel.org: make rw_page opeartion return 0]
[minchan@kernel.org: rely on the bi_end_io for zram_rw_page fails]
[sergey.senozhatsky@gmail.com: code cleanup]
[minchan@kernel.org: add comment]
Signed-off-by: karam.lee <karam.lee@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: <seungho1.park@lge.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes parameter of valid_io_request for common usage. The
purpose of valid_io_request() is to determine if bio request is valid or
not.
This patch use I/O start address and size instead of a BIO parameter for
common usage.
Signed-off-by: karam.lee <karam.lee@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: <seungho1.park@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently rw_page block device operation has been added. This patchset
implements rw_page operation for zram block device and does some clean-up.
This patch (of 3):
Remove an unnecessary parameter(bio) from zram_bvec_rw() and
zram_bvec_read(). zram_bvec_read() doesn't use a bio parameter, so remove
it. zram_bvec_rw() calls a read/write operation not using bio, so a rw
parameter replaces a bio parameter.
Signed-off-by: karam.lee <karam.lee@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: <seungho1.park@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we have a race between the schedule timing out and the command
completing, we could have the task issuing the command exit
nvme_submit_sync_cmd() while the irq is running sync_completion().
If that happens, we could be corrupting memory, since the stack
that held 'cmdinfo' is no longer valid.
Fix this by always calling nvme_abort_cmd_info(). Once that call
completes, we know that we have either run sync_completion() if
the completion came in, or that we will never run it since we now
have special_completion() as the command callback handler.
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This change enables the sunvdc driver to reconnect and recover if a vds
service domain is disconnected or bounced.
By default, it will wait indefinitely for the service domain to become
available again, but will honor a non-zero vdc-timout md property if one
is set. If a timeout is reached, any in-progress I/O's are completed
with -EIO.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Reviewed-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both sunvdc and sunvnet implemented distinct functionality for incrementing
and decrementing dring indexes. Create common functions for use by both
from the sunvnet versions, which were chosen since they will still work
correctly in case a non power of two ring size is used.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Free resources allocated during port/disk probing so that the module may be
successfully reloaded after unloading.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The logic around retrying and erroring IO in nvme_queue_rq() is broken
in a few ways:
- If we fail allocating dma memory for a discard, we return retry. We
have the 'iod' stored in ->special, but we free the 'iod'.
- For a normal request, if we fail dma mapping of setting up prps, we
have the same iod situation. Additionally, we haven't set the callback
for the request yet, so we also potentially leak IOMMU resources.
Get rid of the ->special 'iod' store. The retry is uncommon enough that
it's not worth optimizing for or holding on to resources to attempt to
speed it up. Additionally, it's usually best practice to free any
request related resources when doing retries.
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a lot of infrastructure for virtio 1.0 support.
Notable missing pieces: virtio pci, virtio balloon (needs spec extension),
vhost scsi.
Plus, there are some minor fixes in a couple of places.
Cc: David Miller <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUh1CVAAoJECgfDbjSjVRpWZcH/2+EGPyng7Lca820UHA0cU1U
u4D8CAAwOGaVdnUUo8ox1eon3LNB2UgRtgsl3rBDR3YTgFfNPrfuYdnHO0dYIDc1
lS26NuPrVrTX0lA+OBPe2nlKrsrOkn8aw1kxG9Y0gKtNg/+HAGNW5e2eE7R/LrA5
94XbWZ8g9Yf4GPG1iFmih9vQvvN0E68zcUlojfCnllySgaIEYr8nTiGQBWpRgJat
fCqFAp1HMDZzGJQO+m1/Vw0OftTRVybyfai59e6uUTa8x1djvzPb/1MvREqQjegM
ylSuofIVyj7JPu++FbAjd9mikkb53GSc8ql3YmWNZLdr69rnkzP0GdzQvrdheAo=
=RtrR
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"virtio: virtio 1.0 support, misc patches
This adds a lot of infrastructure for virtio 1.0 support. Notable
missing pieces: virtio pci, virtio balloon (needs spec extension),
vhost scsi.
Plus, there are some minor fixes in a couple of places.
Note: some net drivers are affected by these patches. David said he's
fine with merging these patches through my tree.
Rusty's on vacation, he acked using my tree for these, too"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (70 commits)
virtio_ccw: finalize_features error handling
virtio_ccw: future-proof finalize_features
virtio_pci: rename virtio_pci -> virtio_pci_common
virtio_pci: update file descriptions and copyright
virtio_pci: split out legacy device support
virtio_pci: setup config vector indirectly
virtio_pci: setup vqs indirectly
virtio_pci: delete vqs indirectly
virtio_pci: use priv for vq notification
virtio_pci: free up vq->priv
virtio_pci: fix coding style for structs
virtio_pci: add isr field
virtio: drop legacy_only driver flag
virtio_balloon: drop legacy_only driver flag
virtio_ccw: rev 1 devices set VIRTIO_F_VERSION_1
virtio: allow finalize_features to fail
virtio_ccw: legacy: don't negotiate rev 1/features
virtio: add API to detect legacy devices
virtio_console: fix sparse warnings
vhost: remove unnecessary forward declarations in vhost.h
...
After Hot-remove of a device with a mounted partition,
when the device is hot-added again, the new node reappears
as nvme0n1. Mounting this new node fails with the error:
mount: mount /dev/nvme0n1p1 on /mnt failed: File exists.
The old nodes's FS entries still exist and the kernel can't re-create
procfs and sysfs entries for the new node with the same name.
The patch fixes this issue.
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Indraneel M <indraneel.m@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull VFS changes from Al Viro:
"First pile out of several (there _definitely_ will be more). Stuff in
this one:
- unification of d_splice_alias()/d_materialize_unique()
- iov_iter rewrite
- killing a bunch of ->f_path.dentry users (and f_dentry macro).
Getting that completed will make life much simpler for
unionmount/overlayfs, since then we'll be able to limit the places
sensitive to file _dentry_ to reasonably few. Which allows to have
file_inode(file) pointing to inode in a covered layer, with dentry
pointing to (negative) dentry in union one.
Still not complete, but much closer now.
- crapectomy in lustre (dead code removal, mostly)
- "let's make seq_printf return nothing" preparations
- assorted cleanups and fixes
There _definitely_ will be more piles"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
copy_from_iter_nocache()
new helper: iov_iter_kvec()
csum_and_copy_..._iter()
iov_iter.c: handle ITER_KVEC directly
iov_iter.c: convert copy_to_iter() to iterate_and_advance
iov_iter.c: convert copy_from_iter() to iterate_and_advance
iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
iov_iter.c: convert iov_iter_zero() to iterate_and_advance
iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
iov_iter.c: iterate_and_advance
iov_iter.c: macros for iterating over iov_iter
kill f_dentry macro
dcache: fix kmemcheck warning in switch_names
new helper: audit_file()
nfsd_vfs_write(): use file_inode()
ncpfs: use file_inode()
kill f_dentry uses
lockd: get rid of ->f_path.dentry->d_sb
...
We return an error pointer or the request, not NULL. Half
the call paths got it right, the others didn't. Fix those up.
Signed-off-by: Jens Axboe <axboe@fb.com>
We allocate 'abort_req', but free 'req' in case of an error
submitting the IO.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
flush_op is unambiguously defined by feature_flush:
REQ_FUA | REQ_FLUSH -> BLKIF_OP_WRITE_BARRIER
REQ_FLUSH -> BLKIF_OP_FLUSH_DISKCACHE
0 -> 0
and thus can be removed. This is just a cleanup.
The patch was suggested by Boris Ostrovsky.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced
in d11e61583 and was factored out into blkif_request_flush_valid() in
0f1ca65ee. However:
1) This check in incomplete. In case we negotiated to feature_flush = REQ_FLUSH
and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA is unsupported) FUA request
will still pass the check.
2) blkif_request_flush_valid() is misnamed. It is bool but returns true when
the request is invalid.
3) When blkif_request_flush_valid() fails -EIO is being returned. It seems that
-EOPNOTSUPP is more appropriate here.
Fix all of the above issues.
This patch is based on the original patch by Laszlo Ersek and a comment by
Jeff Moyer.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If a device appears while module is being removed,
driver will get a callback after we've given up
on the major number.
In theory this means this major number can get reused
by something else, resulting in a conflict.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
It's never declared so no need to make it extern.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Based on patch by Cornelia Huck.
Note: for consistency, and to avoid sparse errors,
convert all fields, even those no longer in use
for virtio v1.0.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>