ceph_release_page_vector() kfrees the vector; we shouldn't do it here too.
Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
Previously we were validating the passed-in stripe unit, object size,
and stripe count against each other (and not testing most other stuff).
Instead, make sure that the composed previous layout and new values are valid,
and only send the new values to the MDS. This lets users change the
pool without setting the whole layout, for instance.
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
This reverts commit c9af9fb68e01eb2c2165e1bc45cfeeed510c64e6.
We need to block and truncate all pages in order to reliably invalidate
them. Otherwise, we could:
- have some uptodate pages in the cache
- queue an invalidate
- write(2) locks some pages
- invalidate_work skips them
- write(2) only overwrites part of the page
- page now dirty and uptodate
-> partial leakage of invalidated data
It's not entirely clear why we started skipping locked pages in the first
place. I just ran this through fsx and didn't see any problems.
Signed-off-by: Sage Weil <sage@newdream.net>
The pool allocation failures are masked by the pool; there is no need to
spam the console about them. (That's the whole point of having the pool
in the first place.)
Mark msg allocations whose failure is safely handled as such.
Signed-off-by: Sage Weil <sage@newdream.net>
This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.
Signed-off-by: Sage Weil <sage@newdream.net>
The 'rsize' mount option limits the maximum size of an individual
read(ahead) operation that is sent off to an OSD. This is distinct from
'rasize', which controls the size of the readahead window.
Signed-off-by: Sage Weil <sage@newdream.net>
When we get a ->readpages() aop, submit async reads for all page ranges
in the provided page list. Lock the pages immediately, so that VFS/MM
will block until the reads complete.
Signed-off-by: Sage Weil <sage@newdream.net>
* 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (26 commits)
Check validity of cl_rpcclient in nfs_server_list_show
NFS: Get rid of the nfs_rdata_mempool
NFS: Don't rely on PageError in nfs_readpage_release_partial
NFS: Get rid of unnecessary calls to ClearPageError() in read code
NFS: Get rid of nfs_restart_rpc()
NFS: Get rid of the unused nfs_write_data->flags field
NFS: Get rid of the unused nfs_read_data->flags field
NFSv4: Translate NFS4ERR_BADNAME into ENOENT when applied to a lookup
NFS: Remove the unused "lookupfh()" version of nfs4_proc_lookup()
NFS: Use the inode->i_version to cache NFSv4 change attribute information
SUNRPC: Remove unnecessary export of rpc_sockaddr2uaddr
SUNRPC: Fix rpc_sockaddr2uaddr
nfs/super.c: local functions should be static
pnfsblock: fix writeback deadlock
pnfsblock: fix NULL pointer dereference
pnfs: recoalesce when ld read pagelist fails
pnfs: recoalesce when ld write pagelist fails
pnfs: make _set_lo_fail generic
pnfsblock: add missing rpc_put_mount and path_put
SUNRPC/NFS: make rpc pipe upcall generic
...
* 'for-3.2' of git://linux-nfs.org/~bfields/linux: (103 commits)
nfs41: implement DESTROY_CLIENTID operation
nfsd4: typo logical vs bitwise negate for want_mask
nfsd4: allow NFS4_SHARE_SIGNAL_DELEG_WHEN_RESRC_AVAIL | NFS4_SHARE_PUSH_DELEG_WHEN_UNCONTENDED
nfsd4: seq->status_flags may be used unitialized
nfsd41: use SEQ4_STATUS_BACKCHANNEL_FAULT when cb_sequence is invalid
nfsd4: implement new 4.1 open reclaim types
nfsd4: remove unneeded CLAIM_DELEGATE_CUR workaround
nfsd4: warn on open failure after create
nfsd4: preallocate open stateid in process_open1()
nfsd4: do idr preallocation with stateid allocation
nfsd4: preallocate nfs4_file in process_open1()
nfsd4: clean up open owners on OPEN failure
nfsd4: simplify process_open1 logic
nfsd4: make is_open_owner boolean
nfsd4: centralize renew_client() calls
nfsd4: typo logical vs bitwise negate
nfs: fix bug about IPv6 address scope checking
nfsd4: more robust ignoring of WANT bits in OPEN
nfsd4: move name-length checks to xdr
nfsd4: move access/deny validity checks to xdr code
...
In ext4_file_open, the filesystem records the mountpoint of the first
file that is opened after mounting the filesystem. It does this by
allocating a 64-byte stack buffer, calling d_path() to grab the mount
point through which this file was accessed, and then memcpy()ing 64
bytes into the superblock's s_last_mounted field, starting from the
return value of d_path(), which is stored as "cp". However, if cp >
buf (which it frequently is since path components are prepended
starting at the end of buf) then we can end up copying stack data into
the superblock.
Writing stack variables into the superblock doesn't sound like a great
idea, so use strlcpy instead. Andi Kleen suggested using strlcpy
instead of strncpy.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In commit 8a9ea3237e7e ("Merge git://.../davem/net-next") where my sysfs
changes from the net tree merged with the sysfs rbtree changes from
Mickulas Patocka the conflict resolution failed to preserve the
simplified property that was the point of my changes.
That is sysfs_find_dirent can now say something is a match if and only
s_name and s_ns match what we are looking for, and sysfs_readdir can
simply return all of the directory entries where s_ns matches the
directory that we should be returning.
Now that we are back to exact matches we can tweak sysfs_find_dirent and
the name rb_tree to order sysfs_dirents by s_ns s_name and remove the
second loop in sysfs_find_dirent. However that change seems a bit much
for a conflict resolution so it can come later.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
EOFBLOCK_FL should be updated if called w/o FALLOCATE_FL_KEEP_SIZE
Currently it happens only if new extent was allocated.
TESTCASE:
fallocate test_file -n -l4096
fallocate test_file -l4096
Last fallocate cmd has updated size, but keept EOFBLOCK_FL set. And
fsck will complain about that.
Also remove ping pong in ext4_fallocate() in case of new extents,
where ext4_ext_map_blocks() clear EOFBLOCKS bit, and later
ext4_falloc_update_inode() restore it again.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1745 commits)
dp83640: free packet queues on remove
dp83640: use proper function to free transmit time stamping packets
ipv6: Do not use routes from locally generated RAs
|PATCH net-next] tg3: add tx_dropped counter
be2net: don't create multiple RX/TX rings in multi channel mode
be2net: don't create multiple TXQs in BE2
be2net: refactor VF setup/teardown code into be_vf_setup/clear()
be2net: add vlan/rx-mode/flow-control config to be_setup()
net_sched: cls_flow: use skb_header_pointer()
ipv4: avoid useless call of the function check_peer_pmtu
TCP: remove TCP_DEBUG
net: Fix driver name for mdio-gpio.c
ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT
rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces
ipv4: fix ipsec forward performance regression
jme: fix irq storm after suspend/resume
route: fix ICMP redirect validation
net: hold sock reference while processing tx timestamps
tcp: md5: add more const attributes
Add ethtool -g support to virtio_net
...
Fix up conflicts in:
- drivers/net/Kconfig:
The split-up generated a trivial conflict with removal of a
stale reference to Documentation/networking/net-modules.txt.
Remove it from the new location instead.
- fs/sysfs/dir.c:
Fairly nasty conflicts with the sysfs rb-tree usage, conflicting
with Eric Biederman's changes for tagged directories.
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits)
mm: memory hotplug: Check if pages are correctly reserved on a per-section basis
Revert "memory hotplug: Correct page reservation checking"
Update email address for stable patch submission
dynamic_debug: fix undefined reference to `__netdev_printk'
dynamic_debug: use a single printk() to emit messages
dynamic_debug: remove num_enabled accounting
dynamic_debug: consolidate repetitive struct _ddebug descriptor definitions
uio: Support physical addresses >32 bits on 32-bit systems
sysfs: add unsigned long cast to prevent compile warning
drivers: base: print rejected matches with DEBUG_DRIVER
memory hotplug: Correct page reservation checking
memory hotplug: Refuse to add unaligned memory regions
remove the messy code file Documentation/zh_CN/SubmitChecklist
ARM: mxc: convert device creation to use platform_device_register_full
new helper to create platform devices with dma mask
docs/driver-model: Update device class docs
docs/driver-model: Document device.groups
kobj_uevent: Ignore if some listeners cannot handle message
dynamic_debug: make netif_dbg() call __netdev_printk()
dynamic_debug: make netdev_dbg() call __netdev_printk()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
MAINTAINERS: linux-m32r is moderated for non-subscribers
linux@lists.openrisc.net is moderated for non-subscribers
Drop default from "DM365 codec select" choice
parisc: Kconfig: cleanup Kernel page size default
Kconfig: remove redundant CONFIG_ prefix on two symbols
cris: remove arch/cris/arch-v32/lib/nand_init.S
microblaze: add missing CONFIG_ prefixes
h8300: drop puzzling Kconfig dependencies
MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
tty: drop superfluous dependency in Kconfig
ARM: mxc: fix Kconfig typo 'i.MX51'
Fix file references in Kconfig files
aic7xxx: fix Kconfig references to READMEs
Fix file references in drivers/ide/
thinkpad_acpi: Fix printk typo 'bluestooth'
bcmring: drop commented out line in Kconfig
btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
doc: raw1394: Trivial typo fix
CIFS: Don't free volume_info->UNC until we are entirely done with it.
treewide: Correct spelling of successfully in comments
...
- Both callers(truncate and punch_hole) already aligned left end point
so we no longer need split logic here.
- Remove dead duplicated code.
- Call ext4_ext_dirty only after we have updated eh_entries, otherwise
we'll loose entries update. Regression caused by d583fb87a3ff0
266'th testcase in xfstests (http://patchwork.ozlabs.org/patch/120872)
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
TOMOYO: Fix incomplete read after seek.
Smack: allow to access /smack/access as normal user
TOMOYO: Fix unused kernel config option.
Smack: fix: invalid length set for the result of /smack/access
Smack: compilation fix
Smack: fix for /smack/access output, use string instead of byte
Smack: domain transition protections (v3)
Smack: Provide information for UDS getsockopt(SO_PEERCRED)
Smack: Clean up comments
Smack: Repair processing of fcntl
Smack: Rule list lookup performance
Smack: check permissions from user space (v2)
TOMOYO: Fix quota and garbage collector.
TOMOYO: Remove redundant tasklist_lock.
TOMOYO: Fix domain transition failure warning.
TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
TOMOYO: Simplify garbage collector.
TOMOYO: Fix make namespacecheck warnings.
target: check hex2bin result
encrypted-keys: check hex2bin result
...
Now that we support raid5 Enable it at mount. Raid6 will come next
raid4 is not demanded for so it will probably not be enabled.
(Until some one wants it)
NOTE: That mkfs.exofs had support for raid5/6 since long time
ago. (Making an empty raidX FS is just as easy as raid0 ;-} )
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
The ore need suplied a r4w_get_page/r4w_put_page API
from Filesystem so it can get cache pages to read-into when
writing parial stripes.
Also I commented out and NULLed the .writepage (singular)
vector. Because it gives terrible write pattern to raid
and is apparently not needed. Even in OOM conditions the
system copes (even better) with out it.
TODO: How to specify to write_cache_pages() to start
or include a certain page?
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This is finally the RAID5 Write support.
The bigger part of this patch is not the XOR engine itself, But the
read4write logic, which is a complete mini prepare_for_striping
reading engine that can read scattered pages of a stripe into cache
so it can be used for XOR calculation. That is, if the write was not
stripe aligned.
The main algorithm behind the XOR engine is the 2 dimensional array:
struct __stripe_pages_2d.
A drawing might save 1000 words
---
__stripe_pages_2d
|
n = pages_in_stripe_unit;
w = group_width - parity;
| pages array presented to the XOR lib
| |
V |
__1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
| |
__1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
|
... | ...
|
__1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
^
|
data added columns first then row
---
The pages are put on this array columns first. .i.e:
p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ...
So we are doing a corner turn of the pages.
Note that pages will zigzag down and left. but are put sequentially
in growing order. So when the time comes to XOR the stripe, only the
beginning and end of the array need be checked. We scan the array
and any NULL spot will be field by pages-to-be-read.
The FS that wants to support RAID5 needs to supply an
operations-vector that searches a given page in cache, and specifies
if the page is uptodate or need reading. All these pages to be read
are put on a slave ore_io_state and synchronously read. All the pages
of a stripe are read in one IO, using the scatter gather mechanism.
In write we constrain our IO to only be incomplete on a single
stripe. Meaning either the complete IO is within a single stripe so
we might have pages to read from both beginning or end of the
strip. Or we have some reading to do at beginning but end at strip
boundary. The left over pages are pushed to the next IO by the API
already established by previous work, where an IO offset/length
combination presented to the ORE might get the length truncated and
the user must re-submit the leftover pages. (Both exofs and NFS
support this)
But any ORE user should make it's best effort to align it's IO
before hand and avoid complications. A cached ore_layout->stripe_size
member can be used for that calculation. (NOTE: that ORE demands
that stripe_size may not be bigger then 32bit)
What else? Well read it and tell me.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This patch introduces the first stage of RAID5 support
mainly the skip-over-raid-units when reading. For
writes it inserts BLANK units, into where XOR blocks
should be calculated and written to.
It introduces the new "general raid maths", and the main
additional parameters and components needed for raid5.
Since at this stage it could corrupt future version that
actually do support raid5. The enablement of raid5
mounting and setting of parity-count > 0 is disabled. So
the raid5 code will never be used. Mounting of raid5 is
only enabled later once the basic XOR write is also in.
But if the patch "enable RAID5" is applied this code has
been tested to be able to properly read raid5 volumes
and is according to standard.
Also it has been tested that the new maths still properly
supports RAID0 and grouping code just as before.
(BTW: I have found more bugs in the pnfs-obj RAID math
fixed here)
The ore.c file is getting too big, so new ore_raid.[hc]
files are added that will include the special raid stuff
that are not used in striping and mirrors. In future write
support these will get bigger.
When adding the ore_raid.c to Kbuild file I was forced to
rename ore.ko to libore.ko. Is it possible to keep source
file, say ore.c and module file ore.ko the same even if there
are multiple files inside ore.ko?
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
fs/exofs directory has multiple targets now, of which the
ore.ko will be needed by the pnfs-objects-layout-driver
(fs/nfs/objlayout).
As suggested by: Michal Marek <mmarek@suse.cz> convert
inclusion of exofs/ from obj-$(CONFIG_EXOFS_FS) => obj-$(y).
So ORE can be selected also from fs/nfs/Kconfig
CC: Michal Marek <mmarek@suse.cz>
CC: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
that reduces a traffic and increases a performance.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
that reduces a traffic and increases a performance.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
to handle all lock requests on the client in an exclusive oplock case.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
If we have an oplock and negotiate mandatory locking style we handle
all brlock requests on the client.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Instead of saying all integer argument option should be listed in the beginning
move integer parsing to each option type.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* remove lot of update to different data structure
* add a seperate callback for zero copy request.
* above makes non zero copy code path simpler
* remove conditionalizing TREAD/TREADDIR/TWRITE in the zero copy path
* Fix the dotu p9_check_errors with zero copy. Add sufficient doc around
* Add support for both in and output buffers in zero copy callback
* pin and unpin pages in the same context
* use helpers instead of defining page offset and rest of page ourself
* Fix mem leak in p9_check_errors
* Remove 'E' and 'F' in p9pdu_vwritef
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
According to rfc5661 18.50, implement DESTROY_CLIENTID operation.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
RFC5661 says:
The client may set one or both of
OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and
OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED.
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We should call cifs_all_info_to_fattr in rc == 0 case only.
Cc: <stable@kernel.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Currently code make an impression what grow procedure is very complicated
and some mythical paths, blocks are involved. But in fact grow in depth
it relatively simple procedure:
1) Just create new meta block and copy root data to that block.
2) Convert root from extent to index if old depth == 0
3) Update root block pointer
This patch does:
- Reorganize code to make it more self explanatory
- Do not pass path parameter to new_meta_block() in order to
provoke allocation from inode's group because top-level block
should site closer to it's inode, but not to leaf data block.
[ This happens anyway, due to logic in mballoc; we should drop
the path parameter from new_meta_block() entirely. -- tytso ]
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In cifs_readv_receive we don't update rdata->result to error value
after kmap'ing a page. We should kunmap the page in the no error
case only.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Move the recently added readahead of the indirect pointer
tree during deallocation into its own function in order
that we can use it elsewhere in the future. Also this
fixes the resetting of the "first" variable in the
original patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The two variables being initialised in gfs2_inplace_reserve
to track the file & line number of the caller are never
used, so we might as well remove them.
If something does go wrong, then a stack trace is probably
more useful anyway.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Some items picked up through automated code analysis. A few bits
of unreachable code and two unchecked return values.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2's fallocate code currently goes through the page cache. Since it's only
writing to the end of the file or to holes in it, it doesn't need to, and it
was causing issues on low memory environments. This patch pulls in some of
Steve's block allocation work, and uses it to simply allocate the blocks for
the file, and zero them out at allocation time. It provides a slight
performance increase, and it dramatically simplifies the code.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch improves the performance of delete/unlink
operations in a GFS2 file system where the files are large
by adding a layer of metadata read-ahead for indirect blocks.
Mileage will vary, but on my system, deleting an 8.6G file
dropped from 22 seconds to about 4.5 seconds.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Bob reported:
I found an off-by-one problem with how I coded this section:
It should be:
+ else if (blk >= cur->rd_data0 + cur->rd_data)
In fact, cur->rd_data0 + cur->rd_data is the start of the next
rgrp (the next ri_addr), so without the "=" check it can land on
the wrong rgrp.
In all normal cases, this won't be a problem: you're searching
for a block _within_ the rgrp, which will pass the test properly.
Where it gets into trouble is if you search the rgrps for the
block exactly equal to ri_addr. I don't think anything in the
kernel does this, but I found a place in gfs2-utils gfs2_edit
where it does. So I definitely need to fix it in libgfs2. I'd
like to suggest we fix it in the kernel as well for the sake of
keeping the functions similar.
So this patch fixes the above mentioned off by one error as well
as removing the unused parent pointer.
Reported-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>