Commit Graph

1730 Commits

Author SHA1 Message Date
Dave Chinner
70e60ce715 xfs: convert inode shrinker to per-filesystem contexts
Now the shrinker passes us a context, wire up a shrinker context per
filesystem. This allows us to remove the global mount list and the
locking problems that introduced. It also means that a shrinker call
does not need to traverse clean filesystems before finding a
filesystem with reclaimable inodes.  This significantly reduces
scanning overhead when lots of filesystems are present.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-20 08:07:02 +10:00
Dave Chinner
7f8275d0d6 mm: add context argument to shrinker callback
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-19 14:56:17 +10:00
Dave Chinner
7b6259e7a8 xfs: remove block number from inode lookup code
The block number comes from bulkstat based inode lookups to shortcut
the mapping calculations. We ar enot able to trust anything from
bulkstat, so drop the block number as well so that the correct
lookups and mappings are always done.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-06-24 11:35:17 +10:00
Dave Chinner
1920779e67 xfs: rename XFS_IGET_BULKSTAT to XFS_IGET_UNTRUSTED
Inode numbers may come from somewhere external to the filesystem
(e.g. file handles, bulkstat information) and so are inherently
untrusted. Rename the flag we use for these lookups to make it
obvious we are doing a lookup of an untrusted inode number and need
to verify it completely before trying to read it from disk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-06-24 11:15:47 +10:00
Dave Chinner
7124fe0a5b xfs: validate untrusted inode numbers during lookup
When we decode a handle or do a bulkstat lookup, we are using an
inode number we cannot trust to be valid. If we are deleting inode
chunks from disk (default noikeep mode), then we cannot trust the on
disk inode buffer for any given inode number to correctly reflect
whether the inode has been unlinked as the di_mode nor the
generation number may have been updated on disk.

This is due to the fact that when we delete an inode chunk, we do
not write the clusters back to disk when they are removed - instead
we mark them stale to avoid them being written back potentially over
the top of something that has been subsequently allocated at that
location. The result is that we can have locations of disk that look
like they contain valid inodes but in reality do not. Hence we
cannot simply convert the inode number to a block number and read
the location from disk to determine if the inode is valid or not.

As a result, and XFS_IGET_BULKSTAT lookup needs to actually look the
inode up in the inode allocation btree to determine if the inode
number is valid or not.

It should be noted even on ikeep filesystems, there is the
possibility that blocks on disk may look like valid inode clusters.
e.g. if there are filesystem images hosted on the filesystem. Hence
even for ikeep filesystems we really need to validate that the inode
number is valid before issuing the inode buffer read.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-06-24 11:15:33 +10:00
Christoph Hellwig
7dce11dbac xfs: always use iget in bulkstat
The non-coherent bulkstat versionsthat look directly at the inode
buffers causes various problems with performance optimizations that
make increased use of just logging inodes.  This patch makes bulkstat
always use iget, which should be fast enough for normal use with the
radix-tree based inode cache introduced a while ago.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-06-23 18:11:11 +10:00
Dan Rosenberg
1817176a86 xfs: prevent swapext from operating on write-only files
This patch prevents user "foo" from using the SWAPEXT ioctl to swap
a write-only file owned by user "bar" into a file owned by "foo" and
subsequently reading it.  It does so by checking that the file
descriptors passed to the ioctl are also opened for reading.

Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-06-24 12:07:47 +10:00
Dave Chinner
254c8c2dbf xfs: remove nr_to_write writeback windup.
Now that the background flush code has been fixed, we shouldn't need to
silently multiply the wbc->nr_to_write to get good writeback. Remove
that code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-08 18:12:44 -07:00
Alex Elder
1bf7dbfde8 Merge branch 'master' into for-linus 2010-06-04 13:22:30 -05:00
Christoph Hellwig
f936972949 xfs: improve xfs_isilocked
Use rwsem_is_locked to make the assertations for shared locks work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-06-03 16:22:29 +10:00
Christoph Hellwig
070ecdca54 xfs: skip writeback from reclaim context
Allowing writeback from reclaim context causes massive problems with stack
overflows as we can call into the writeback code which tends to be a heavy
stack user both in the generic code and XFS from random contexts that
perform memory allocations.

Follow the example of btrfs (and in slightly different form ext4) and refuse
to write out data from reclaim context.  This issue should really be handled
by the VM so that we can tune better for this case, but until we get it
sorted out there we have to hack around this in each filesystem with a
complex writeback path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-06-03 16:22:29 +10:00
Dave Chinner
5b257b4a1f xfs: fix race in inode cluster freeing failing to stale inodes
When an inode cluster is freed, it needs to mark all inodes in memory as
XFS_ISTALE before marking the buffer as stale. This is eeded because the inodes
have a different life cycle to the buffer, and once the buffer is torn down
during transaction completion, we must ensure none of the inodes get written
back (which is what XFS_ISTALE does).

Unfortunately, xfs_ifree_cluster() has some bugs that lead to inodes not being
marked with XFS_ISTALE. This shows up when xfs_iflush() is called on these
inodes either during inode reclaim or tail pushing on the AIL.  The buffer is
read back, but no longer contains inodes and so triggers assert failures and
shutdowns. This was reproducable with at run.dbench10 invocation from xfstests.

There are two main causes of xfs_ifree_cluster() failing. The first is simple -
it checks in-memory inodes it finds in the per-ag icache to see if they are
clean without holding the flush lock. if they are clean it skips them
completely. However, If an inode is flushed delwri, it will
appear clean, but is not guaranteed to be written back until the flush lock has
been dropped. Hence we may have raced on the clean check and the inode may
actually be dirty. Hence always mark inodes found in memory stale before we
check properly if they are clean.

The second is more complex, and makes the first problem easier to hit.
Basically the in-memory inode scan is done with full knowledge it can be racing
with inode flushing and AIl tail pushing, which means that inodes that it can't
get the flush lock on might not be attached to the buffer after then in-memory
inode scan due to IO completion occurring. This is actually documented in the
code as "needs better interlocking". i.e. this is a zero-day bug.

Effectively, the in-memory scan must be done while the inode buffer is locked
and Io cannot be issued on it while we do the in-memory inode scan. This
ensures that inodes we couldn't get the flush lock on are guaranteed to be
attached to the cluster buffer, so we can then catch all in-memory inodes and
mark them stale.

Now that the inode cluster buffer is locked before the in-memory scan is done,
there is no need for the two-phase update of the in-memory inodes, so simplify
the code into two loops and remove the allocation of the temporary buffer used
to hold locked inodes across the phases.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-06-03 16:22:29 +10:00
Christoph Hellwig
fb3b504ade xfs: fix access to upper inodes without inode64
If a filesystem is mounted without the inode64 mount option we
should still be able to access inodes not fitting into 32 bits, just
not created new ones.  For this to work we need to make sure the
inode cache radix tree is initialized for all allocation groups, not
just those we plan to allocate inodes from.  This patch makes sure
we initialize the inode cache radix tree for all allocation groups,
and also cleans xfs_initialize_perag up a bit to separate the
inode32 logical from the general perag structure setup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 15:19:56 -05:00
Dave Chinner
9b98b6f3e1 xfs: fix might_sleep() warning when initialising per-ag tree
The use of radix_tree_preload() only works if the radix tree was
initialised without the __GFP_WAIT flag. The per-ag tree uses
GFP_NOFS, so does not trigger allocation of new tree nodes from the
preloaded array. Hence it enters the allocator with a spinlock held
and triggers the might_sleep() warnings.

Reported-by; Chris Mason <chris.mason@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 15:19:50 -05:00
Julia Lawall
38e712ab3d fs/xfs/quota: Add missing mutex_unlock
Add a mutex_unlock missing on the error path.  The use of this lock
is balanced elsewhere in the file.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression E1;
@@

* mutex_lock(E1,...);
  <+... when != E1
  if (...) {
    ... when != E1
*   return ...;
  }
  ...+>
* mutex_unlock(E1,...);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 15:19:41 -05:00
Huang Weiyi
3bd0946eb1 xfs: remove duplicated #include
Remove duplicated #include('s) in
  fs/xfs/linux-2.6/xfs_quotaops.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 15:19:36 -05:00
Li Zefan
f8adb4d574 xfs: convert more trace events to DEFINE_EVENT
Use DECLARE_EVENT_CLASS, and save ~15K:

   text    data     bss     dec     hex filename
 171949   43028      48  215025   347f1 fs/xfs/linux-2.6/xfs_trace.o.orig
 156521   43028      36  199585   30ba1 fs/xfs/linux-2.6/xfs_trace.o

No change in functionality.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 15:19:31 -05:00
Huang Weiyi
292ec4cf35 xfs: xfs_trace.c: remove duplicated #include
Remove duplicated #include('s) in
  fs/xfs/linux-2.6/xfs_trace.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 15:19:24 -05:00
Dave Chinner
07f1a4f5e8 xfs: Check new inode size is OK before preallocating
The new xfsqa test 228 tries to preallocate more space than the
filesystem contains. it should fail, but instead triggers an assert
about lock flags.  The failure is due to the size extension failing
in vmtruncate() due to rlimit being set. Check this before we start
the preallocation to avoid allocating space that will never be used.

Also the path through xfs_vn_allocate already holds the IO lock, so
it should not be present in the lock flags when the setattr fails.
Hence the assert needs to take this into account. This will prevent
other such callers from hitting this incorrect ASSERT.

(Fixed a reference to "newsize" to read "new_size". -Alex)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 15:19:12 -05:00
Christoph Hellwig
fdc07f44c8 xfs: clean up xlog_align
Add suggested cleanups to commit 29db3370a1369541d58d692fbfb168b8a0bd7f41
from review that didn't end up being commited.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 14:58:36 -05:00
Christoph Hellwig
025101dca4 xfs: cleanup log reservation calculactions
Instead of having small helper functions calling big macros do the
calculations for the log reservations directly in the functions.
These are mostly 1:1 from the macros execept that the macros kept
the quota calculations in their callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 14:58:30 -05:00
Eric Sandeen
32891b292d xfs: be more explicit if RT mount fails due to config
Recent testers were slightly confused that a realtime mount failed
due to missing CONFIG_XFS_RT; we can make that a little more
obvious.

V2: drop the else as suggested by Christoph

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 14:58:24 -05:00
Eric Sandeen
657a4cffde xfs: replace E2BIG with EFBIG where appropriate
Many places in the xfs code return E2BIG when they really mean
EFBIG; trying to grow past 16T on a 32 bit machine, for example,
says "Argument list too long" rather than "File too large" which is
not particularly helpful.

Some of these don't make perfect sense as EFBIG either, but still
better than E2BIG IMHO.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-28 14:58:16 -05:00
Christoph Hellwig
7ea8085910 drop unused dentry argument to ->fsync
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 22:05:02 -04:00
Alex Elder
88e88374ee Merge branch 'delayed-logging-for-2.6.35' into for-linus 2010-05-24 11:57:36 -05:00
Dave Chinner
ccf7c23fc1 xfs: Ensure inode allocation buffers are fully replayed
With delayed logging, we can get inode allocation buffers in the
same transaction inode unlink buffers. We don't currently mark inode
allocation buffers in the log, so inode unlink buffers take
precedence over allocation buffers.

The result is that when they are combined into the same checkpoint,
only the unlinked inode chain fields are replayed, resulting in
uninitialised inode buffers being detected when the next inode
modification is replayed.

To fix this, we need to ensure that we do not set the inode buffer
flag in the buffer log item format flags if the inode allocation has
not already hit the log. To avoid requiring a change to log
recovery, we really need to make this a modification that relies
only on in-memory sate.

We can do this by checking during buffer log formatting (while the
CIL cannot be flushed) if we are still in the same sequence when we
commit the unlink transaction as the inode allocation transaction.
If we are, then we do not add the inode buffer flag to the buffer
log format item flags. This means the entire buffer will be
replayed, not just the unlinked fields. We do this while
CIL flusheѕ are locked out to ensure that we don't race with the
sequence numbers changing and hence fail to put the inode buffer
flag in the buffer format flags when we really need to.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:41:22 -05:00
Dave Chinner
df806158b0 xfs: enable background pushing of the CIL
If we let the CIL grow without bound, it will grow large enough to violate
recovery constraints (must be at least one complete transaction in the log at
all times) or take forever to write out through the log buffers. Hence we need
a check during asynchronous transactions as to whether the CIL needs to be
pushed.

We track the amount of log space the CIL consumes, so it is relatively simple
to limit it on a pure size basis. Make the limit the minimum of just under half
the log size (recovery constraint) or 8MB of log space (which is an awful lot
of metadata).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:38:20 -05:00
Dave Chinner
9da1ab181a xfs: forced unmounts need to push the CIL
If the filesystem is being shut down and the there is no log error,
the current code forces out the current log buffers. This code now needs
to push the CIL before it forces out the log buffers to acheive the same
result.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:38:14 -05:00
Dave Chinner
71e330b593 xfs: Introduce delayed logging core code
The delayed logging code only changes in-memory structures and as
such can be enabled and disabled with a mount option. Add the mount
option and emit a warning that this is an experimental feature that
should not be used in production yet.

We also need infrastructure to track committed items that have not
yet been written to the log. This is what the Committed Item List
(CIL) is for.

The log item also needs to be extended to track the current log
vector, the associated memory buffer and it's location in the Commit
Item List. Extend the log item and log vector structures to enable
this tracking.

To maintain the current log format for transactions with delayed
logging, we need to introduce a checkpoint transaction and a context
for tracking each checkpoint from initiation to transaction
completion.  This includes adding a log ticket for tracking space
log required/used by the context checkpoint.

To track all the changes we need an io vector array per log item,
rather than a single array for the entire transaction. Using the new
log vector structure for this requires two passes - the first to
allocate the log vector structures and chain them together, and the
second to fill them out.  This log vector chain can then be passed
to the CIL for formatting, pinning and insertion into the CIL.

Formatting of the log vector chain is relatively simple - it's just
a loop over the iovecs on each log vector, but it is made slightly
more complex because we re-write the iovec after the copy to point
back at the memory buffer we just copied into.

This code also needs to pin log items. If the log item is not
already tracked in this checkpoint context, then it needs to be
pinned. Otherwise it is already pinned and we don't need to pin it
again.

The only other complexity is calculating the amount of new log space
the formatting has consumed. This needs to be accounted to the
transaction in progress, and the accounting is made more complex
becase we need also to steal space from it for log metadata in the
checkpoint transaction. Calculate all this at insert time and update
all the tickets, counters, etc correctly.

Once we've formatted all the log items in the transaction, attach
the busy extents to the checkpoint context so the busy extents live
until checkpoint completion and can be processed at that point in
time. Transactions can then be freed at this point in time.

Now we need to issue checkpoints - we are tracking the amount of log space
used by the items in the CIL, so we can trigger background checkpoints when the
space usage gets to a certain threshold. Otherwise, checkpoints need ot be
triggered when a log synchronisation point is reached - a log force event.

Because the log write code already handles chained log vectors, writing the
transaction is trivial, too. Construct a transaction header, add it
to the head of the chain and write it into the log, then issue a
commit record write. Then we can release the checkpoint log ticket
and attach the context to the log buffer so it can be called during
Io completion to complete the checkpoint.

We also need to allow for synchronising multiple in-flight
checkpoints. This is needed for two things - the first is to ensure
that checkpoint commit records appear in the log in the correct
sequence order (so they are replayed in the correct order). The
second is so that xfs_log_force_lsn() operates correctly and only
flushes and/or waits for the specific sequence it was provided with.

To do this we need a wait variable and a list tracking the
checkpoint commits in progress. We can walk this list and wait for
the checkpoints to change state or complete easily, an this provides
the necessary synchronisation for correct operation in both cases.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:38:03 -05:00
Dave Chinner
ed3b4d6cdc xfs: Improve scalability of busy extent tracking
When we free a metadata extent, we record it in the per-AG busy
extent array so that it is not re-used before the freeing
transaction hits the disk. This array is fixed size, so when it
overflows we make further allocation transactions synchronous
because we cannot track more freed extents until those transactions
hit the disk and are completed. Under heavy mixed allocation and
freeing workloads with large log buffers, we can overflow this array
quite easily.

Further, the array is sparsely populated, which means that inserts
need to search for a free slot, and array searches often have to
search many more slots that are actually used to check all the
busy extents. Quite inefficient, really.

To enable this aspect of extent freeing to scale better, we need
a structure that can grow dynamically. While in other areas of
XFS we have used radix trees, the extents being freed are at random
locations on disk so are better suited to being indexed by an rbtree.

So, use a per-AG rbtree indexed by block number to track busy
extents.  This incures a memory allocation when marking an extent
busy, but should not occur too often in low memory situations. This
should scale to an arbitrary number of extents so should not be a
limitation for features such as in-memory aggregation of
transactions.

However, there are still situations where we can't avoid allocating
busy extents (such as allocation from the AGFL). To minimise the
overhead of such occurences, we need to avoid doing a synchronous
log force while holding the AGF locked to ensure that the previous
transactions are safely on disk before we use the extent. We can do
this by marking the transaction doing the allocation as synchronous
rather issuing a log force.

Because of the locking involved and the ordering of transactions,
the synchronous transaction provides the same guarantees as a
synchronous log force because it ensures that all the prior
transactions are already on disk when the synchronous transaction
hits the disk. i.e. it preserves the free->allocate order of the
extent correctly in recovery.

By doing this, we avoid holding the AGF locked while log writes are
in progress, hence reducing the length of time the lock is held and
therefore we increase the rate at which we can allocate and free
from the allocation group, thereby increasing overall throughput.

The only problem with this approach is that when a metadata buffer is
marked stale (e.g. a directory block is removed), then buffer remains
pinned and locked until the log goes to disk. The issue here is that
if that stale buffer is reallocated in a subsequent transaction, the
attempt to lock that buffer in the transaction will hang waiting
the log to go to disk to unlock and unpin the buffer. Hence if
someone tries to lock a pinned, stale, locked buffer we need to
push on the log to get it unlocked ASAP. Effectively we are trading
off a guaranteed log force for a much less common trigger for log
force to occur.

Ideally we should not reallocate busy extents. That is a much more
complex fix to the problem as it involves direct intervention in the
allocation btree searches in many places. This is left to a future
set of modifications.

Finally, now that we track busy extents in allocated memory, we
don't need the descriptors in the transaction structure to point to
them. We can replace the complex busy chunk infrastructure with a
simple linked list of busy extents. This allows us to remove a large
chunk of code, making the overall change a net reduction in code
size.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:34:00 -05:00
Dave Chinner
955833cf2a xfs: make the log ticket ID available outside the log infrastructure
The ticket ID is needed to uniquely identify transactions when doing busy
extent matching. Delayed logging changes the lifecycle of busy extents with
respect to the transaction structure lifecycle. Hence we can no longer use
the transaction structure as a means of determining the owner of the busy
extent as it may be freed and reused while the busy extent is still active.

This commit provides the infrastructure to access the xlog_tid_t held in the
ticket from a transaction handle. This avoids the need for callers to peek
into the transaction and log structures to find this out.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:52 -05:00
Dave Chinner
169a7b078e xfs: clean up log ticket overrun debug output
Push the error message output when a ticket overrun is detected
into the ticket printing functions. Also remove the debug version
of the code as the production version will still panic just as
effectively on a debug kernel via the panic mask being set.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:46 -05:00
Dave Chinner
c11554104f xfs: Clean up XFS_BLI_* flag namespace
Clean up the buffer log format (XFS_BLI_*) flags because they have a
polluted namespace. They XFS_BLI_ prefix is used for both in-memory
and on-disk flag feilds, but have overlapping values for different
flags. Rename the buffer log format flags to use the XFS_BLF_*
prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed
flags.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:39 -05:00
Dave Chinner
64fc35de60 xfs: modify buffer item reference counting
The buffer log item reference counts used to take referenceѕ for every
transaction, similar to the pin counting. This is symmetric (like the
pin/unpin) with respect to transaction completion, but with dleayed logging
becomes assymetric as the pinning becomes assymetric w.r.t. transaction
completion.

To make both cases the same, allow the buffer pinning to take a reference to
the buffer log item and always drop the reference the transaction has on it
when being unlocked. This is balanced correctly because the unpin operation
always drops a reference to the log item. Hence reference counting becomes
symmetric w.r.t. item pinning as well as w.r.t active transactions and as a
result the reference counting model remain consistent between normal and
delayed logging.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:31 -05:00
Dave Chinner
3383ca5780 xfs: allow log ticket allocation to take allocation flags
Delayed logging currently requires ticket allocation to succeed, so
we need to be able to sleep on allocation. It also should not allow
memory allocation to recurse into the filesystem. hence we need to
pass allocation flags directing the type of allocation the caller
requires.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:17 -05:00
Dave Chinner
524ee36fa4 xfs: Don't reuse the same transaction ID for duplicated transactions.
The transaction ID is written into the log as the unique identifier
for transactions during recover. When duplicating a transaction, we
reuse the log ticket, which means it has the same transaction ID as
the previous transaction.

Rather than regenerating a random transaction ID for the duplicated
transaction, just add one to the current ID so that duplicated
transaction can be easily spotted in the log and during recovery
during problem diagnosis.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:10 -05:00
Linus Torvalds
e8bebe2f71 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
  fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
  get rid of home-grown mutex in cris eeprom.c
  switch ecryptfs_write() to struct inode *, kill on-stack fake files
  switch ecryptfs_get_locked_page() to struct inode *
  simplify access to ecryptfs inodes in ->readpage() and friends
  AFS: Don't put struct file on the stack
  Ban ecryptfs over ecryptfs
  logfs: replace inode uid,gid,mode initialization with helper function
  ufs: replace inode uid,gid,mode initialization with helper function
  udf: replace inode uid,gid,mode init with helper
  ubifs: replace inode uid,gid,mode initialization with helper function
  sysv: replace inode uid,gid,mode initialization with helper function
  reiserfs: replace inode uid,gid,mode initialization with helper function
  ramfs: replace inode uid,gid,mode initialization with helper function
  omfs: replace inode uid,gid,mode initialization with helper function
  bfs: replace inode uid,gid,mode initialization with helper function
  ocfs2: replace inode uid,gid,mode initialization with helper function
  nilfs2: replace inode uid,gid,mode initialization with helper function
  minix: replace inode uid,gid,mode init with helper
  ext4: replace inode uid,gid,mode init with helper
  ...

Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
2010-05-21 19:37:45 -07:00
Stephen Hemminger
46e58764f0 xfs: constify xattr_handler
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:19 -04:00
Jens Axboe
ee9a3607fb Merge branch 'master' into for-2.6.35
Conflicts:
	fs/ext3/fsync.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21 21:27:26 +02:00
Christoph Hellwig
c472b43275 quota: unify ->set_dqblk
Pass the larger struct fs_disk_quota to the ->set_dqblk operation so
that the Q_SETQUOTA and Q_XSETQUOTA operations can be implemented
with a single filesystem operation and we can retire the ->set_xquota
operation.  The additional information (RT-subvolume accounting and
warn counts) are left zero for the VFS quota implementation.

Add new fieldmask values for setting the numer of blocks and inodes
values which is required for the VFS quota, but wasn't for XFS.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-21 19:30:44 +02:00
Christoph Hellwig
b9b2dd36c1 quota: unify ->get_dqblk
Pass the larger struct fs_disk_quota to the ->get_dqblk operation so
that the Q_GETQUOTA and Q_XGETQUOTA operations can be implemented
with a single filesystem operation and we can retire the ->get_xquota
operation.  The additional information (RT-subvolume accounting and
warn counts) are left zero for the VFS quota implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-21 19:30:43 +02:00
Christoph Hellwig
b4ed4626a9 xfs: mark xfs_iomap_write_ helpers static
And also drop a useless argument to xfs_iomap_write_direct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:20 -05:00
Christoph Hellwig
bd1556a146 xfs: clean up end index calculation in xfs_page_state_convert
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:20 -05:00
Christoph Hellwig
2b8f12b7e4 xfs: clean up mapping size calculation in __xfs_get_blocks
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:19 -05:00
Christoph Hellwig
558e689169 xfs: clean up xfs_iomap_valid
Rename all iomap_valid identifiers to imap_valid to fit the new
world order, and clean up xfs_iomap_valid to convert the passed in
offset to blocks instead of the imap values to bytes.  Use the
simpler inode->i_blkbits instead of the XFS macros for this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:19 -05:00
Christoph Hellwig
34a52c6c06 xfs: move I/O type flags into xfs_aops.c
The IOMAP_ flags are now only used inside xfs_aops.c for extent
probing and I/O completion tracking, so more them here, and rename
them to IO_* as there's no mapping involved at all.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:17 -05:00
Christoph Hellwig
207d041602 xfs: kill struct xfs_iomap
Now that struct xfs_iomap contains exactly the same units as struct
xfs_bmbt_irec we can just use the latter directly in the aops code.
Replace the missing IOMAP_NEW flag with a new boolean output
parameter to xfs_iomap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:17 -05:00
Christoph Hellwig
e513182d4d xfs: report iomap_bn in block base
Report the iomap_bn field of struct xfs_iomap in terms of filesystem
blocks instead of in terms of bytes.  Shift the byte conversions
into the caller, and replace the IOMAP_DELAY and IOMAP_HOLE flag
checks with checks for HOLESTARTBLOCK and DELAYSTARTBLOCK.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:17 -05:00
Christoph Hellwig
8699bb0a48 xfs: report iomap_offset and iomap_bsize in block base
Report the iomap_offset and iomap_bsize fields of struct xfs_iomap
in terms of fsblocks instead of in terms of disk blocks.  Shift the
byte conversions into the callers temporarily, but they will
disappear or get cleaned up later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:17 -05:00
Christoph Hellwig
9563b3d899 xfs: remove iomap_delta
The iomap_delta field in struct xfs_iomap just contains the
difference between the offset passed to xfs_iomap and the
iomap_offset.  Just calculate it in the only caller that cares.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:17 -05:00
Christoph Hellwig
046f1685bb xfs: remove iomap_target
Instead of using the iomap_target field in struct xfs_iomap
and the IOMAP_REALTIME flag just use the already existing
xfs_find_bdev_for_inode helper.  There's some fallout as we
need to pass the inode in a few more places, which we also
use to sanitize some calling conventions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:16 -05:00
Christoph Hellwig
826bf0adce xfs: limit xfs_imap_to_bmap to a single mapping
We only call xfs_iomap for single mappings anyway, so remove all
code dealing with multiple mappings from xfs_imap_to_bmap and add
asserts that we never get results that we do not expect.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:16 -05:00
Christoph Hellwig
4a5224d7b1 xfs: simplify buffer to transaction matching
We currenly have a routine xfs_trans_buf_item_match_all which checks
if any log item in a transaction contains a given buffer, and a
second one that only does this check for the first, embedded chunk
of log items.  We only use the second routine if we know we only
have that log item chunk, so get rid of the limited routine and
always use the more complete one.

Also rename the old xfs_trans_buf_item_match_all to
xfs_trans_buf_item_match and update various surrounding comments,
and move the remaining xfs_trans_buf_item_match on top of the file
to avoid a forward prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:16 -05:00
Tao Ma
2d1ff3c75a xfs: Make fiemap work in query mode.
According to Documentation/filesystems/fiemap.txt, If fm_extent_count
is zero, then the fm_extents[] array is ignored (no extents will be
returned), and the fm_mapped_extents count will hold the number of
extents needed.

But as the commit 97db39a1f6 has changed
bmv_count to the caller's input buffer, this number query function can't
work any more. As this commit is written to change bmv_count from
MAXEXTNUM because of ENOMEM.

This patch just try to  set bm.bmv_count to something sane.
Thanks to Dave Chinner <david@fromorbit.com> for the suggestion.

Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2010-05-19 09:58:16 -05:00
Alex Elder
48389ef175 xfs: kill off l_sectbb_mask
There remains only one user of the l_sectbb_mask field in the log
structure.  Just kill it off and compute the mask where needed from
the power-of-2 sector size.

(Only update from last post is to accomodate the changes in the
previous patch in the series.)

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:16 -05:00
Alex Elder
69ce58f08a xfs: record log sector size rather than log2(that)
Change struct log so it keeps track of the size (in basic blocks) of
a log sector in l_sectBBsize rather than the log-base-2 of that
value (previously, l_sectbb_log).  The name was chosen for
consistency with the other fields in the structure that represent
a number of basic blocks.

(Updated so that a variable used in computing and verifying a log's
sector size is named "log2_size".  Also added the "BB" to the
structure field name, based on feedback from Eric Sandeen.  Also
dropped some superfluous parentheses.)

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2010-05-19 09:58:15 -05:00
Christoph Hellwig
1414a6046a xfs: remove dead XFS_LOUD_RECOVERY code
This can't be enabled through the build system and has been dead for
ages.  Note that the CRC patches add back log checksumming, but the
code is quite different from the version removed here anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19 09:58:15 -05:00
Christoph Hellwig
8112e9dc6d xfs: removed unused XFS_QMOPT_ flags
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19 09:58:15 -05:00
Christoph Hellwig
191f8488f9 xfs: remove a few macro indirections in the quota code
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19 09:58:15 -05:00
Christoph Hellwig
8a7b8a89a3 xfs: access quotainfo structure directly
Access fields in m_quotainfo directly instead of hiding them behind the
XFS_QI_* macros.  Add local variables for the quotainfo pointer in places
where we have lots of them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19 09:58:14 -05:00
Christoph Hellwig
37bc5743fd xfs: wait for direct I/O to complete in fsync and write_inode
We need to wait for all pending direct I/O requests before taking care of
metadata in fsync and write_inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19 09:58:14 -05:00
Andrea Gelmini
fce1cad651 xfs: xfs_trace.c: duplicated include
fs/xfs/linux-2.6/xfs_trace.c: xfs_attr_sf.h is included more than once.

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:14 -05:00
Alex Elder
3f943d853d xfs: minor odds and ends in xfs_log_recover.c
Odds and ends in "xfs_log_recover.c".  This patch just contains some
minor things that didn't seem to warrant their own individual
patches:
- In xlog_bread_noalign(), drop an assertion that a pointer is
  non-null (the crash will tell us it was a bad pointer).
- Add a more descriptive header comment for xlog_find_verify_cycle().
- Make a few additions to the comments in xlog_find_head().  Also
  rearrange some expressions in a few spots to produce the same
  result, but in a way that seems more clear what's being computed.

(Updated in response to Dave's review comments.  Note I did not
split this patch like I said I would.)

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:14 -05:00
Alex Elder
e3bb2e30d5 xfs: avoid repeated pointer dereferences
In xlog_find_cycle_start() use a local variable for some repeated
operations rather than constantly accessing the memory location
whose address is passed in.

(This version drops an assertion that a pointer is non-null.)

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:14 -05:00
Alex Elder
9db127edb5 xfs: change a few labels in xfs_log_recover.c
Rename a label used in xlog_find_head() that I thought was poorly
chosen.  Also combine two adjacent labels xlog_find_tail() into a
single label, and give it a more generic name.

(Now using Dave's suggested "validate_head" name for first label.)

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:13 -05:00
Christoph Hellwig
8c38366f99 xfs: enforce synchronous writes in xfs_bwrite
xfs_bwrite is used with the intention of synchronously writing out
buffers, but currently it does not actually clear the async flag if
that's left from previous writes but instead implements async
behaviour if it finds it.  Remove the code handling asynchronous
writes as we've got rid of those entirely outside of the log and
delwri buffers, and make sure that we clear the async and read flags
before writing the buffer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:13 -05:00
Christoph Hellwig
df308bcfec xfs: remove periodic superblock writeback
All modifications to the superblock are done transactional through
xfs_trans_log_buf, so there is no reason to initiate periodic
asynchronous writeback.  This only removes the superblock from the
delwri list and will lead to sub-optimal I/O scheduling.

Cut down xfs_sync_fsdata now that it's only used for synchronous
superblock writes and move the log coverage checks into the two
callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-19 09:58:13 -05:00
Dave Chinner
f983710758 xfs: make the log ticket transaction id random
The transaction ID that is written to the log for a transaction is
currently set by taking the lower 32 bits of the memory address of
the ticket structure.  This is not guaranteed to be unique as
tickets comes from a slab and slots can be reallocated immediately
after being freed. As a result, there is no guarantee of uniqueness
in the ticket ID value.

Fix this by assigning a random number to the ticket ID field so that
it is extremely unlikely that duplicates will occur and remove the
possibility of transactions being mixed up during recovery due to
duplicate IDs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:13 -05:00
Alex Elder
36adecff50 xfs: nothing special about 1-block log sector
There are a number of places where a log sector size of 1 uses
special case code.  The round_up() and round_down() macros
produce the correct result even when the log sector size is 1, and
this eliminates the need for treating this as a special case.

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:12 -05:00
Alex Elder
ff30a6221d xfs: encapsulate bbcount validity checking
Define a function that encapsulates checking the validity of a log
block count.

(Updated from previous version--no longer includes error reporting in the
encapsulated validation function.)

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-05-19 09:58:12 -05:00
Alex Elder
5c17f5339f xfs: kill XLOG_SECTOR_ROUND*()
XLOG_SECTOR_ROUNDUP_BBCOUNT() and XLOG_SECTOR_ROUNDDOWN_BLKNO()
are now fairly simple macro translations.  Just get rid of them in
favor of the round_up() and round_down() macro calls they represent.

Also, in spots in xlog_get_bp() and xlog_write_log_records(),
round_up() was being called with value 1, which just evaluates
to the macro's second argument; so just use that instead.
In the latter case, make use of that value, as long as it's
already been computed.

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-05-19 09:58:12 -05:00
Alex Elder
8511998baa xfs: simplify XLOG_SECTOR_ROUND*()
XLOG_SECTOR_ROUNDUP_BBCOUNT() is defined in "fs/xfs/xfs_log_recover.c"
in an overly-complicated way.  It is basically roundup(), but that
is not at all clear from its definition.  (Actually, there is
another macro round_up() that applies for power-of-two-based masks
which I'll be using here.)

The operands in XLOG_SECTOR_ROUNDUP_BBCOUNT() are basically the
block number (bbs) and the log sector basic block mask
(log->l_sectbb_mask).  I'll call them B and M for this discussion.

The macro computes is value this way:
	M && (B & M) ? (B + M + 1) & ~M : B

Put another way, we can break it into 3 cases:
	1)  ! M          -> B			# 0 mask, no effect
	2)  ! (B & M)    -> B			# sector aligned
	3)  M && (B & M) -> (B + M + 1) & ~M	# round up otherwise

The round_up() macro is cleverly defined using a value, v, and a
power-of-2, p, and the result is the nearest multiple of p greater
than or equal to v.  Its value is computed something like this:
	((v - 1) | (p - 1)) + 1
Let's consider using this in the context of the 3 cases above.

When p = 2^0 = 1, the result boils down to ((v - 1) | 0) + 1, so it
just translates any value v to itself.  That handles case (1) above.

When p = 2^n, n > 0, we know that (p - 1) will be a mask with all n
bits 0..n-1 set.  The condition in this case occurs when none of
those mask bits is set in the value v provided.  If that is the
case, subtracting 1 from v will have 1's in all those lower bits (at
least).  Therefore, OR-ing the mask with that decremented value has
no effect, so adding the 1 back again will just translate the v to
itself.  This handles case (2).

Otherwise, the value v is greater than some multiple of p, and
decrementing it will produce a result greater than or equal to that
multiple.  OR-ing in the mask will produce a value 1 less than the
next multiple of p, so finally adding 1 back will result in the
desired rounded-up value.  This handles case (3).

Hopefully this is convincing.

While I was at it, I converted XLOG_SECTOR_ROUNDDOWN_BLKNO() to use
the round_down() macro.

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-05-19 09:58:12 -05:00
Alex Elder
6881a229f6 xfs: fix min bufsize bugs in two places
This fixes a bug in two places that I found by inspection.  In
xlog_find_verify_cycle() and xlog_write_log_records(), the code
attempts to allocate a buffer to hold as many blocks as possible.
It gives up if the number of blocks to be allocated gets too small.
Right now it uses log->l_sectbb_log as that lower bound, but I'm
sure it's supposed to be the actual log sector size instead.  That
is, the lower bound should be (1 << log->l_sectbb_log).

Also define a simple macro xlog_sectbb(log) to represent the number
of basic blocks in a sector for the given log.

(No change from original submission; I have implemented Christoph's
suggestion about storing l_sectsize rather than l_sectbb_log in
a new, separate patch in this series.)

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-05-19 09:58:12 -05:00
Alex Elder
a0e856b0b4 xfs: add const qualifiers to xfs error function args
Change the tag and file name arguments to xfs_error_report() and
xfs_corruption_error() to use a const qualifier.

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
2010-05-19 09:58:11 -05:00
Christoph Hellwig
74457cf4a3 xfs: remove xfs_dqmarker
The xfs_dqmarker structure does not need to exist anymore. Move the
remaining flags field out of it and remove the structure altogether.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19 09:58:11 -05:00
Dave Chinner
3a8406f6d6 xfs: convert the dquot free list to use list heads
Convert the dquot free list on the filesystem to use listhead
infrastructure rather than the roll-your-own in the quota code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:11 -05:00
Dave Chinner
e6a81f13aa xfs: convert the dquot hash list to use list heads
Convert the dquot hash list on the filesystem to use listhead
infrastructure rather than the roll-your-own in the quota code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:11 -05:00
Dave Chinner
368e136174 xfs: remove duplicate code from dquot reclaim
The dquot shaker and the free-list reclaim code use exactly the same
algorithm but the code is duplicated and slightly different in each
case. Make the shaker code use the single dquot reclaim code to
remove the code duplication.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:11 -05:00
Dave Chinner
3a25404b3f xfs: convert the per-mount dquot list to use list heads
Convert the dquot list on the filesytesm to use listhead
infrastructure rather than the roll-your-own in the quota code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:10 -05:00
Dave Chinner
9abbc539bf xfs: add log item recovery tracing
Currently there is no tracing in log recovery, so it is difficult to
determine what is going on when something goes wrong.

Add tracing for log item recovery to provide visibility into the log
recovery process. The tracing added shows regions being extracted
from the log transactions and added to the transaction hash forming
recovery items, followed by the reordering, cancelling and finally
recovery of the items.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:10 -05:00
Christoph Hellwig
e6b1f27370 xfs: clean up xlog_write_adv_cnt
Replace the awkward xlog_write_adv_cnt with an inline helper that makes
it more obvious that it's modifying it's paramters, and replace the use
of an integer type for "ptr" with a real void pointer.  Also move
xlog_write_adv_cnt to xfs_log_priv.h as it will be used outside of
xfs_log.c in the delayed logging series.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-05-19 09:58:10 -05:00
Dave Chinner
55b66332d0 xfs: introduce new internal log vector structure
The current log IO vector structure is a flat array and not
extensible. To make it possible to keep separate log IO vectors for
individual log items, we need a method of chaining log IO vectors
together.

Introduce a new log vector type that can be used to wrap the
existing log IO vectors on use that internally to the log. This
means that the existing external interface (xfs_log_write) does not
change and hence no changes to the transaction commit code are
required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-05-19 09:58:10 -05:00
Christoph Hellwig
99428ad0f6 xfs: reindent xlog_write
Reindent xlog_write to normal one tab indents and move all variable
declarations into the closest enclosing block.

Split from a bigger patch by Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-05-19 09:58:10 -05:00
Dave Chinner
b5203cd0a4 xfs: factor xlog_write
xlog_write is a mess that takes a lot of effort to understand. It is
a mass of nested loops with 4 space indents to get it to fit in 80 columns
and lots of funky variables that aren't obvious what they mean or do.

Break it down into understandable chunks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-05-19 09:58:09 -05:00
Dave Chinner
9b9fc2b760 xfs: log ticket reservation underestimates the number of iclogs
When allocation a ticket for a transaction, the ticket is initialised with the
worst case log space usage based on the number of bytes the transaction may
consume. Part of this calculation is the number of log headers required for the
iclog space used up by the transaction.

This calculation makes an undocumented assumption that if the transaction uses
the log header space reservation on an iclog, then it consumes either the
entire iclog or it completes. That is - the transaction that is first in an
iclog is the transaction that the log header reservation is accounted to. If
the transaction is larger than the iclog, then it will use the entire iclog
itself. Document this assumption.

Further, the current calculation uses the rule that we can fit iclog_size bytes
of transaction data into an iclog. This is in correct - the amount of space
available in an iclog for transaction data is the size of the iclog minus the
space used for log record headers. This means that the calculation is out by
512 bytes per 32k of log space the transaction can consume. This is rarely an
issue because maximally sized transactions are extremely uncommon, and for 4k
block size filesystems maximal transaction reservations are about 400kb. Hence
the error in this case is less than the size of an iclog, so that makes it even
harder to hit.

However, anyone using larger directory blocks (16k directory blocks push the
maximum transaction size to approx. 900k on a 4k block size filesystem) or
larger block size (e.g. 64k blocks push transactions to the 3-4MB size) could
see the error grow to more than an iclog and at this point the transaction is
guaranteed to get a reservation underrun and shutdown the filesystem.

Fix this by adjusting the calculation to calculate the correct number of iclogs
required and account for them all up front.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:09 -05:00
Dave Chinner
b1c1b5b610 xfs: Clean up xfs_trans_committed code after factoring
Now that the code has been factored, clean up all the remaining
style cruft, simplify the code and re-order functions so that it
doesn't need forward declarations.

Also move the remaining functions that require forward declarations
(xfs_trans_uncommit, xfs_trans_free) so that all the forward
declarations can be removed from the file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:09 -05:00
Dave Chinner
8e646a55ac xfs: update and factor xfs_trans_committed()
The function header to xfs-trans_committed has long had this
comment:

 * THIS SHOULD BE REWRITTEN TO USE xfs_trans_next_item()

To prepare for different methods of committing items, convert the
code to use xfs_trans_next_item() and factor the code into smaller,
more digestible chunks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:09 -05:00
Christoph Hellwig
a3ccd2ca43 xfs: clean up xfs_trans_commit logic even more
> +shut_us_down:
> +	shutdown = XFS_FORCED_SHUTDOWN(mp) ? EIO : 0;
> +	if (!(tp->t_flags & XFS_TRANS_DIRTY) || shutdown) {
> +		xfs_trans_unreserve_and_mod_sb(tp);
> +		/*

This whole area in _xfs_trans_commit is still a complete mess.

So while touching this code, unravel this mess as well to make the
whole flow of the function simpler and clearer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19 09:58:08 -05:00
Dave Chinner
0924378a68 xfs: split out iclog writing from xfs_trans_commit()
Split the the part of xfs_trans_commit() that deals with writing the
transaction into the iclog into a separate function. This isolates the
physical commit process from the logical commit operation and makes
it easier to insert different transaction commit paths without affecting
the existing algorithm adversely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:08 -05:00
Dave Chinner
713bf88bba xfs: fix reservation release commit flag in xfs_bmap_add_attrfork()
xfs_bmap_add_attrfork() passes XFS_TRANS_PERM_LOG_RES to xfs_trans_commit()
to indicate that the commit should release the permanent log reservation
as part of the commit. This is wrong - the correct flag is
XFS_TRANS_RELEASE_LOG_RES - and it is only by the chance that both these
flags have the value of 0x4 that the code is doing the right thing.

Fix it by changing to use the correct flag.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:08 -05:00
Dave Chinner
8e12385086 xfs: remove stale parameter from ->iop_unpin method
The staleness of a object being unpinned can be directly derived
from the object itself - there is no need to extract it from the
object then pass it as a parameter into IOP_UNPIN().

This means we can kill the XFS_LID_BUF_STALE flag - it is set,
checked and cleared in the same places XFS_BLI_STALE flag in the
xfs_buf_log_item so it is now redundant and hence safe to remove.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:08 -05:00
Dave Chinner
4aaf15d1aa xfs: Add inode pin counts to traces
We don't record pin counts in inode events right now, and this makes
it difficult to track down problems related to pinning inodes. Add
the pin count to the inode trace class and add trace events for
pinning and unpinning inodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:08 -05:00
Dave Chinner
43f5efc5b5 xfs: factor log item initialisation
Each log item type does manual initialisation of the log item.
Delayed logging introduces new fields that need initialisation, so
factor all the open coded initialisation into a common function
first.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19 09:58:07 -05:00
Jan Engelhardt
e2a07812e9 xfs: add blockdev name to kthreads
This allows to see in `ps` and similar tools which kthreads are
allotted to which block device/filesystem, similar to what jbd2
does. As the process name is a fixed 16-char array, no extra
space is needed in tasks.

  PID TTY      STAT   TIME COMMAND
    2 ?        S      0:00 [kthreadd]
  197 ?        S      0:00  \_ [jbd2/sda2-8]
  198 ?        S      0:00  \_ [ext4-dio-unwrit]
  204 ?        S      0:00  \_ [flush-8:0]
 2647 ?        S      0:00  \_ [xfs_mru_cache]
 2648 ?        S      0:00  \_ [xfslogd/0]
 2649 ?        S      0:00  \_ [xfsdatad/0]
 2650 ?        S      0:00  \_ [xfsconvertd/0]
 2651 ?        S      0:00  \_ [xfsbufd/ram0]
 2652 ?        S      0:00  \_ [xfsaild/ram0]
 2653 ?        S      0:00  \_ [xfssyncd/ram0]

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19 09:58:07 -05:00
Zhitong Wang
fda168c245 xfs: Fix integer overflow in fs/xfs/linux-2.6/xfs_ioctl*.c
The am_hreq.opcount field in the xfs_attrmulti_by_handle() interface
is not bounded correctly. The opcount is used to determine the size
of the buffer required. The size is bounded, but can overflow and so
the size checks may not be sufficient to catch invalid opcounts.
Fix it by catching opcount values that would cause overflows before
calculating the size.

Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19 09:58:07 -05:00
Dave Chinner
9bf729c0af xfs: add a shrinker to background inode reclaim
On low memory boxes or those with highmem, kernel can OOM before the
background reclaims inodes via xfssyncd. Add a shrinker to run inode
reclaim so that it inode reclaim is expedited when memory is low.

This is more complex than it needs to be because the VM folk don't
want a context added to the shrinker infrastructure. Hence we need
to add a global list of XFS mount structures so the shrinker can
traverse them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-04-29 16:22:13 -05:00
Jens Axboe
7407cf355f Merge branch 'master' into for-2.6.35
Conflicts:
	fs/block_dev.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-29 09:36:24 +02:00
Dmitry Monakhov
fbd9b09a17 blkdev: generalize flags for blkdev_issue_fn functions
The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-28 19:47:36 +02:00
Dave Chinner
dd77ef924c xfs: more swap extent fixes for dynamic fork offsets
A new xfsqa test (226) with a prototype xfs_fsr change to try to
handle dynamic fork offsets better triggers an assertion failure
where the inode data fork is in btree format, yet there is room in
the inode for it to be in extent format. The two inodes look like:

before: ino 0x101 (target), num_extents 11, Max in-fork extents 6, broot size 40, fork offset 96
before: ino 0x115 (temp),  num_extents 5, Max in-fork extents 3, broot size 40, fork offset 56
after: ino 0x101 (target), num_extents 5, Max in-fork extents 6, broot size 40, fork offset 96
after: ino 0x115 (temp), num_extents 11, Max in-fork extents 3, broot size 40, fork offset 56

Basically the target inode ends up with 5 extents in btree format,
but it had space for 6 extents in extent format, so ends up
incorrect. Notably here the broot size is the same, and that is
where the kernel code is going wrong - the btree root will fit, so
it lets the swap go ahead.

The check should not allow the swap to take place if the number of
extents while in btree format is less than the number of extents
that can fit in the inode in extent format. Adding that check will
prevent this swap and corruption from occurring.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-04-26 12:38:51 -05:00
Dave Chinner
f1d486a361 xfs: don't warn on EAGAIN in inode reclaim
Any inode reclaim flush that returns EAGAIN will result in the inode
reclaim being attempted again later. There is no need to issue a
warning into the logs about this situation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-04-16 13:51:44 -05:00