Commit Graph

1786 Commits

Author SHA1 Message Date
Darrick J. Wong
be20e6c67b md: Call blk_queue_flush() to establish flush/fua support
Before 2.6.37, the md layer had a mechanism for catching I/Os with the
barrier flag set, and translating the barrier into barriers for all
the underlying devices.  With 2.6.37, I/O barriers have become plain
old flushes, and the md code was updated to reflect this.  However,
one piece was left out -- the md layer does not tell the block layer
that it supports flushes or FUA access at all, which results in md
silently dropping flush requests.

Since the support already seems there, just add this one piece of
bookkeeping.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-24 16:40:33 +11:00
NeilBrown
8f9e0ee38f md/raid1: really fix recovery looping when single good device fails.
Commit 4044ba58dd supposedly fixed a
problem where if a raid1 with just one good device gets a read-error
during recovery, the recovery would abort and immediately restart in
an infinite loop.

However it depended on raid1_remove_disk removing the spare device
from the array.  But that does not happen in this case.  So add a test
so that in the 'recovery_disabled' case, the device will be removed.

This suitable for any kernel since 2.6.29 which is when
recovery_disabled was introduced.

Cc: stable@kernel.org
Reported-by: Sebastian Färber <faerber@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-24 16:39:46 +11:00
Justin Maggard
c26a44ed1e md: fix return value of rdev_size_change()
When trying to grow an array by enlarging component devices,
rdev_size_store() expects the return value of rdev_size_change() to be
in sectors, but the actual value is returned in KBs.

This functionality was broken by commit
     dd8ac336c1
so this patch is suitable for any kernel since 2.6.30.

Cc: stable@kernel.org
Signed-off-by: Justin Maggard <jmaggard10@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-11-24 16:36:17 +11:00
Mike Snitzer
77304d2aba block: read i_size with i_size_read()
Convert direct reads of an inode's i_size to using i_size_read().

i_size_{read,write} use a seqcount to protect reads from accessing
incomple writes.  Concurrent i_size_write()s require mutual exclussion
to protect the seqcount that is used by i_size_{read,write}.  But
i_size_read() callers do not need to use additional locking.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: NeilBrown <neilb@suse.de>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-11-10 14:40:53 +01:00
NeilBrown
f3ac8bf7ce md: tidy up device searches in read_balance.
The code for searching through the device list to read-balance in
raid1 is rather clumsy and hard to follow.  Try to simplify it a bit.

No important functionality change here.


Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-29 16:40:33 +11:00
NeilBrown
046abeede7 md/raid1: fix some typos in comments.
Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-29 16:40:33 +11:00
NeilBrown
9b19553e0b md/raid1: discard unused variable.
This structure field (flushing_bio_list) is never used, so remove it.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-29 16:40:33 +11:00
NeilBrown
be2a2656ee md: unplug writes to external bitmaps.
When writing to an 'external' bitmap we don't currently unplug the
device before waiting, so we can get a 3msec delay each time;
So use REQ_UNPLUG to force and unplug.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-29 16:40:32 +11:00
NeilBrown
a167f66324 md: use separate bio pool for each md device.
bio_clone and bio_alloc allocate from a common bio pool.
If an md device is stacked with other devices that use this pool, or under
something like swap which uses the pool, then the multiple calls on
the pool can cause deadlocks.

So allocate a local bio pool for each md array and use that rather
than the common pool.

This pool is used both for regular IO and metadata updates.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:36:15 +11:00
NeilBrown
2b193363ef md: change type of first arg to sync_page_io.
Currently sync_page_io takes a 'bdev'.
Every caller passes 'rdev->bdev'.
We will soon want another field out of the rdev in sync_page_io,
So just pass the rdev instead of the bdev out of it.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:36:11 +11:00
NeilBrown
1c4588e9c1 md/raid1: perform mem allocation before disabling writes during resync.
Though this mem alloc is GFP_NOIO an so will not deadlock, it seems
better to do the allocation before 'raise_barrier' which stops any IO
requests while the resync proceeds.

raid10 always uses this order, so it is at least consistent to do the
same in raid1.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:36:09 +11:00
NeilBrown
6746557f03 md: use bio_kmalloc rather than bio_alloc when failure is acceptable.
bio_alloc can never fail (as it uses a mempool) but an block
indefinitely, especially if the caller is holding a reference to a
previously allocated bio.

So these to places which both handle failure and hold multiple bios
should not use bio_alloc, they should use bio_kmalloc.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:36:06 +11:00
NeilBrown
4e78064f42 md: Fix possible deadlock with multiple mempool allocations.
It is not safe to allocate from a mempool while holding an item
previously allocated from that mempool as that can deadlock when the
mempool is close to exhaustion.

So don't use a bio list to collect the bios to write to multiple
devices in raid1 and raid10.
Instead queue each bio as it becomes available so an unplug will
activate all previously allocated bios and so a new bio has a chance
of being allocated.

This means we must set the 'remaining' count to '1' before submitting
any requests, then when all are submitted, decrement 'remaining' and
possible handle the write completion at that point.

Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:34:07 +11:00
Tejun Heo
e804ac780e md: fix and update workqueue usage
Workqueue usage in md has two problems.

* Flush can be used during or depended upon by memory reclaim, but md
  uses the system workqueue for flush_work which may lead to deadlock.

* md depends on flush_scheduled_work() to achieve exclusion against
  completion of removal of previous instances.  flush_scheduled_work()
  may incur unexpected amount of delay and is scheduled to be removed.

This patch adds two workqueues to md - md_wq and md_misc_wq.  The
former is guaranteed to make forward progress under memory pressure
and serves flush_work.  The latter serves as the flush domain for
other works.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:32:29 +11:00
NeilBrown
57dab0bdf6 md: use sector_t in bitmap_get_counter
bitmap_get_counter returns the number of sectors covered
by the counter in a pass-by-reference variable.
In some cases this can be very large, so make it a sector_t
for safety.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:32:26 +11:00
NeilBrown
4b532c9b8c md: remove md_mutex locking.
lock_kernel calls were recently pushed down into open/release
functions.
md doesn't need that protection.
Then the BKL calls were change to md_mutex.  We don't need those
either.
So remove it all.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-28 17:30:21 +11:00
NeilBrown
d97a41dc9c md: Fix regression with raid1 arrays without persistent metadata.
A RAID1 which has no persistent metadata, whether internal or
external, will hang on the first write.
This is caused by commit  070dc6dd71
In that case, MD_CHANGE_PENDING never gets cleared.

So during md_update_sb, is neither persistent or external,
clear MD_CHANGE_PENDING.

This is suitable for 2.6.36-stable.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2010-10-28 17:30:20 +11:00
Andrew Morton
ca1cab37d9 workqueues: s/ON_STACK/ONSTACK/
Silly though it is, completions and wait_queue_heads use foo_ONSTACK
(COMPLETION_INITIALIZER_ONSTACK, DECLARE_COMPLETION_ONSTACK,
__WAIT_QUEUE_HEAD_INIT_ONSTACK and DECLARE_WAIT_QUEUE_HEAD_ONSTACK) so I
guess workqueues should do the same thing.

s/INIT_WORK_ON_STACK/INIT_WORK_ONSTACK/
s/INIT_DELAYED_WORK_ON_STACK/INIT_DELAYED_WORK_ONSTACK/

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:14 -07:00
Linus Torvalds
a2887097f2 Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
  xen-blkfront: disable barrier/flush write support
  Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
  block: remove BLKDEV_IFL_WAIT
  aic7xxx_old: removed unused 'req' variable
  block: remove the BH_Eopnotsupp flag
  block: remove the BLKDEV_IFL_BARRIER flag
  block: remove the WRITE_BARRIER flag
  swap: do not send discards as barriers
  fat: do not send discards as barriers
  ext4: do not send discards as barriers
  jbd2: replace barriers with explicit flush / FUA usage
  jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
  jbd: replace barriers with explicit flush / FUA usage
  nilfs2: replace barriers with explicit flush / FUA usage
  reiserfs: replace barriers with explicit flush / FUA usage
  gfs2: replace barriers with explicit flush / FUA usage
  btrfs: replace barriers with explicit flush / FUA usage
  xfs: replace barriers with explicit flush / FUA usage
  block: pass gfp_mask and flags to sb_issue_discard
  dm: convey that all flushes are processed as empty
  ...
2010-10-22 17:07:18 -07:00
Linus Torvalds
e9dd2b6837 Merge branch 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
  cfq-iosched: Fix a gcc 4.5 warning and put some comments
  block: Turn bvec_k{un,}map_irq() into static inline functions
  block: fix accounting bug on cross partition merges
  block: Make the integrity mapped property a bio flag
  block: Fix double free in blk_integrity_unregister
  block: Ensure physical block size is unsigned int
  blkio-throttle: Fix possible multiplication overflow in iops calculations
  blkio-throttle: limit max iops value to UINT_MAX
  blkio-throttle: There is no need to convert jiffies to milli seconds
  blkio-throttle: Fix link failure failure on i386
  blkio: Recalculate the throttled bio dispatch time upon throttle limit change
  blkio: Add root group to td->tg_list
  blkio: deletion of a cgroup was causes oops
  blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
  block: set the bounce_pfn to the actual DMA limit rather than to max memory
  block: revert bad fix for memory hotplug causing bounces
  Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
  block: set the bounce_pfn to the actual DMA limit rather than to max memory
  block: Prevent hang_check firing during long I/O
  cfq: improve fsync performance for small files
  ...

Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
2010-10-22 17:00:32 -07:00
Linus Torvalds
092e0e7e52 Merge branch 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
  vfs: make no_llseek the default
  vfs: don't use BKL in default_llseek
  llseek: automatically add .llseek fop
  libfs: use generic_file_llseek for simple_attr
  mac80211: disallow seeks in minstrel debug code
  lirc: make chardev nonseekable
  viotape: use noop_llseek
  raw: use explicit llseek file operations
  ibmasmfs: use generic_file_llseek
  spufs: use llseek in all file operations
  arm/omap: use generic_file_llseek in iommu_debug
  lkdtm: use generic_file_llseek in debugfs
  net/wireless: use generic_file_llseek in debugfs
  drm: use noop_llseek
2010-10-22 10:52:56 -07:00
Linus Torvalds
c37927d435 Merge branch 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
  block: autoconvert trivial BKL users to private mutex
  drivers: autoconvert trivial BKL users to private mutex
  ipmi: autoconvert trivial BKL users to private mutex
  mac: autoconvert trivial BKL users to private mutex
  mtd: autoconvert trivial BKL users to private mutex
  scsi: autoconvert trivial BKL users to private mutex

Fix up trivial conflicts (due to addition of private mutex right next to
deletion of a version string) in drivers/char/pcmcia/cm40[04]0_cs.c
2010-10-22 10:49:54 -07:00
Jens Axboe
fa251f8990 Merge branch 'v2.6.36-rc8' into for-2.6.37/barrier
Conflicts:
	block/blk-core.c
	drivers/block/loop.c
	mm/swapfile.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-19 09:13:04 +02:00
Arnd Bergmann
6038f373a3 llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.

The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.

New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time.  Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.

The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.

Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.

Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.

===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
//   but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}

@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}

@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
   *off = E
|
   *off += E
|
   func(..., off, ...)
|
   E = *off
)
...+>
}

@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}

@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
  *off = E
|
  *off += E
|
  func(..., off, ...)
|
  E = *off
)
...+>
}

@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}

@ fops0 @
identifier fops;
@@
struct file_operations fops = {
 ...
};

@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
 .llseek = llseek_f,
...
};

@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
 .read = read_f,
...
};

@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
 .write = write_f,
...
};

@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
 .open = open_f,
...
};

// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
...  .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};

@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
...  .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};

// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
...  .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};

// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};

// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};

@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+	.llseek = default_llseek, /* write accesses f_pos */
};

// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////

@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
 .write = write_f,
 .read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};

@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};

@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};

@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-10-15 15:53:27 +02:00
Vasiliy Kulikov
5c04f5512f md: check return code of read_sb_page
Function read_sb_page may return ERR_PTR(...). Check for it.

Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-07 12:02:50 +11:00
NeilBrown
db8d9d3591 md/raid1: minor bio initialisation improvements.
When performing a resync we pre-allocate some bios and repeatedly use
them.  This requires us to re-initialise them each time.
One field (bi_comp_cpu) and some flags weren't being initiaised
reliably.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-07 12:00:50 +11:00
NeilBrown
7571ae887d md/raid1: avoid overflow in raid1 resync when bitmap is in use.
bitmap_start_sync returns - via a pass-by-reference variable - the
number of sectors before we need to check with the bitmap again.
Since commit ef42567335 this number can be substantially larger,
2^27 is a common value.

Unfortunately it is an 'int' and so when raid1.c:sync_request shifts
it 9 places to the left it becomes 0.  This results in a zero-length
read which the scsi layer justifiably complains about.

This patch just removes the shift so the common case becomes safe with
a trivially-correct patch.

In the next merge window we will convert this 'int' to a 'sector_t'

Reported-by: "George Spelvin" <linux@horizon.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-10-07 11:54:46 +11:00
Arnd Bergmann
2a48fc0ab2 block: autoconvert trivial BKL users to private mutex
The block device drivers have all gained new lock_kernel
calls from a recent pushdown, and some of the drivers
were already using the BKL before.

This turns the BKL into a set of per-driver mutexes.
Still need to check whether this is safe to do.

file=$1
name=$2
if grep -q lock_kernel ${file} ; then
    if grep -q 'include.*linux.mutex.h' ${file} ; then
            sed -i '/include.*<linux\/smp_lock.h>/d' ${file}
    else
            sed -i 's/include.*<linux\/smp_lock.h>.*$/include <linux\/mutex.h>/g' ${file}
    fi
    sed -i ${file} \
        -e "/^#include.*linux.mutex.h/,$ {
                1,/^\(static\|int\|long\)/ {
                     /^\(static\|int\|long\)/istatic DEFINE_MUTEX(${name}_mutex);

} }"  \
    -e "s/\(un\)*lock_kernel\>[ ]*()/mutex_\1lock(\&${name}_mutex)/g" \
    -e '/[      ]*cycle_kernel_lock();/d'
else
    sed -i -e '/include.*\<smp_lock.h\>/d' ${file}  \
                -e '/cycle_kernel_lock()/d'
fi

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-05 15:01:10 +02:00
NeilBrown
ddcf3522cf md: fix v1.x metadata update when a disk is missing.
If an array with 1.x metadata is assembled with the last disk missing,
md doesn't properly record the fact that the disk was missing.

This is unlikely to cause a real problem as the event count will be
different to the count on the missing disk so it won't be included in
the array.  However it could still cause confusion.

So make sure we clear all the relevant slots, not just the early ones.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-09-17 13:53:28 +10:00
NeilBrown
126925c090 md: call md_update_sb even for 'external' metadata arrays.
Now that we depend on md_update_sb to clear variable bits in
mddev->flags (rather than trying not to set them) it is important to
always call md_update_sb when appropriate.

md_check_recovery has this job but explicitly avoids it for ->external
metadata arrays.  This is not longer appropraite, or needed.

However we do want to avoid taking the mddev lock if only
MD_CHANGE_PENDING is set as that is not cleared by md_update_sb for
external-metadata arrays.

Reported-by:  "Kwolek, Adam" <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-09-17 13:53:13 +10:00
Martin K. Petersen
c8bf133682 Consolidate min_not_zero
We have several users of min_not_zero, each of them using their own
definition.  Move the define to kernel.h.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
2010-09-10 20:07:38 +02:00
Mike Snitzer
b372d360df dm: convey that all flushes are processed as empty
Rename __clone_and_map_flush to __clone_and_map_empty_flush for added
clarity.

Simplify logic associated with REQ_FLUSH conditionals.

Introduce a BUG_ON() and add a few more helpful comments to the code
so that it is clear that all flushes are empty.

Cleanup __split_and_process_bio() so that an empty flush isn't processed
by a 'sector_count' focused while loop.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:38 +02:00
Kiyoshi Ueda
05447420f9 dm: fix locking context in queue_io()
Now queue_io() is called from dec_pending(), which may be called with
interrupts disabled, so queue_io() must not enable interrupts
unconditionally and must save/restore the current interrupts status.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:38 +02:00
Tejun Heo
6a8736d10c dm: relax ordering of bio-based flush implementation
Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
against other bio's.  This patch relaxes ordering around flushes.

* A flush bio is no longer deferred to workqueue directly.  It's
  processed like other bio's but __split_and_process_bio() uses
  md->flush_bio as the clone source.  md->flush_bio is initialized to
  empty flush during md initialization and shared for all flushes.

* As a flush bio now travels through the same execution path as other
  bio's, there's no need for dedicated error handling path either.  It
  can use the same error handling path in dec_pending().  Dedicated
  error handling removed along with md->flush_error.

* When dec_pending() detects that a flush has completed, it checks
  whether the original bio has data.  If so, the bio is queued to the
  deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.

* As flush sequencing is handled in the usual issue/completion path,
  dm_wq_work() no longer needs to handle flushes differently.  Now its
  only responsibility is re-issuing deferred bio's the same way as
  _dm_request() would.  REQ_FLUSH handling logic including
  process_flush() is dropped.

* There's no reason for queue_io() and dm_wq_work() write lock
  dm->io_lock.  queue_io() now only uses md->deferred_lock and
  dm_wq_work() read locks dm->io_lock.

* bio's no longer need to be queued on the deferred list while a flush
  is in progress making DMF_QUEUE_IO_TO_THREAD unncessary.  Drop it.

This avoids stalling the device during flushes and simplifies the
implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:38 +02:00
Tejun Heo
29e4013de7 dm: implement REQ_FLUSH/FUA support for request-based dm
This patch converts request-based dm to support the new REQ_FLUSH/FUA.

The original request-based flush implementation depended on
request_queue blocking other requests while a barrier sequence is in
progress, which is no longer true for the new REQ_FLUSH/FUA.

In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special mostly independent path which can issue
flushes to multiple targets and sequence them.  However, the
capability isn't currently in use and adds a lot of complexity.
Moreoever, it's unlikely to be useful in its current form as it
doesn't make sense to be able to send out flushes to multiple targets
when write requests can't be.

This patch rips out special flush code path and deals handles
REQ_FLUSH/FUA requests the same way as other requests.  The only
special treatment is that REQ_FLUSH requests use the block address 0
when finding target, which is enough for now.

* added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
  suggested by Mike Snitzer

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:38 +02:00
Tejun Heo
d87f4c14f2 dm: implement REQ_FLUSH/FUA support for bio-based dm
This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
now deprecated REQ_HARDBARRIER.

* -EOPNOTSUPP handling logic dropped.

* Preflush is handled as before but postflush is dropped and replaced
  with passing down REQ_FUA to member request_queues.  This replaces
  one array wide cache flush w/ member specific FUA writes.

* __split_and_process_bio() now calls __clone_and_map_flush() directly
  for flushes and guarantees all FLUSH bio's going to targets are zero
`  length.

* It's now guaranteed that all FLUSH bio's which are passed onto dm
  targets are zero length.  bio_empty_barrier() tests are replaced
  with REQ_FLUSH tests.

* Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.

* Dropped unlikely() around REQ_FLUSH tests.  Flushes are not unlikely
  enough to be marked with unlikely().

* Block layer now filters out REQ_FLUSH/FUA bio's if the request_queue
  doesn't support cache flushing.  Advertise REQ_FLUSH | REQ_FUA
  capability.

* Request based dm isn't converted yet.  dm_init_request_based_queue()
  resets flush support to 0 for now.  To avoid disturbing request
  based dm code, dm->flush_error is added for bio based dm while
  requested based dm continues to use dm->barrier_error.

Lightly tested linear, stripe, raid1, snap and crypt targets.  Please
proceed with caution as I'm not familiar with the code base.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: dm-devel@redhat.com
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:38 +02:00
Tejun Heo
e9c7469bb4 md: implment REQ_FLUSH/FUA support
This patch converts md to support REQ_FLUSH/FUA instead of now
deprecated REQ_HARDBARRIER.  In the core part (md.c), the following
changes are notable.

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
  processing of other requests and thus there is no reason to mark the
  queue congested while FLUSH/FUA is in progress.

* REQ_FLUSH/FUA failures are final and its users don't need retry
  logic.  Retry logic is removed.

* Preflush needs to be issued to all member devices but FUA writes can
  be handled the same way as other writes - their processing can be
  deferred to request_queue of member devices.  md_barrier_request()
  is renamed to md_flush_request() and simplified accordingly.

For linear, raid0 and multipath, the core changes are enough.  raid1,
5 and 10 need the following conversions.

* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
  request_queues of member devices.  Barrier related logic removed.

* raid5: Queue draining logic dropped.  FUA bit is propagated through
  biodrain and stripe resconstruction such that all the updated parts
  of the stripe are written out with FUA writes if any of the dirtying
  writes was FUA.  preread_active_stripes handling in make_request()
  is updated as suggested by Neil Brown.

* raid10: FUA bit needs to be propagated to write clones.

linear, raid0, 1, 5 and 10 tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:38 +02:00
Tejun Heo
4913efe456 block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()
Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
requests.  Deprecate barrier.  All REQ_HARDBARRIERs are failed with
-EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
blk_queue_flush().

blk_queue_flush() takes combinations of REQ_FLUSH and FUA.  If a
device has write cache and can flush it, it should set REQ_FLUSH.  If
the device can handle FUA writes, it should also set REQ_FUA.

All blk_queue_ordered() users are converted.

* ORDERED_DRAIN is mapped to 0 which is the default value.
* ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
* ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Cc: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:36 +02:00
NeilBrown
070dc6dd71 md: resolve confusion of MD_CHANGE_CLEAN
MD_CHANGE_CLEAN is used for two different purposes and this leads to
confusion.
One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
not used for anything else, so have MD_CHANGE_PENDING take over that
purpose fully.

The two purposes are:
 1/ tell md_update_sb that an update is needed and that it is just a
   clean/dirty transition.
 2/ tell user-space that an transition from clean to dirty is pending
    (something wants to write), and tell te kernel (by clearin the
    flag) that the transition is OK.

The first purpose remains wit MD_CHANGE_CLEAN, the second is moved
fully to MD_CHANGE_PENDING.

This means that various places which conditionally set or cleared
MD_CHANGE_CLEAN no longer need to be conditional.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-08-30 18:06:21 +10:00
Dan Williams
bd52b74626 md: don't clear MD_CHANGE_CLEAN in md_update_sb() for external arrays
If this bit is cleared in md_update_sb() the kernel will allow writes to the
array if userspace triggers md_allow_write(), e.g. through stripe_cache_size,
when mdmon is not active.  When mdmon is active the array transitions to
active-idle bypassing write-pending, setting up a race for mdmon to set the
array clean before a write arrives.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-08-30 18:06:20 +10:00
NeilBrown
7c44ece988 Move .gitignore from drivers/md to lib/raid6
Another missing bit of the raid6 -> /lib move.

Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-08-30 17:35:52 +10:00
NeilBrown
2c7d46ec19 md raid-1/10 Fix bio_rw bit manipulations again
commit 7b6d91daee changed the behaviour
of a few variables in raid1 and raid10 from flags to bit-sets, but
left them as type 'bool' so they did not work.

Change them (back) to unsigned long.
(historical note: see 1ef04fefe2)

Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Jiri Slaby <jslaby@suse.cz> and many others
2010-08-18 16:16:05 +10:00
NeilBrown
6b96562054 md: provide appropriate return value for spare_active functions.
md_check_recovery expects ->spare_active to return 'true' if any
spares were activated, but none of them do, so the consequent change
in 'degraded' is not notified through sysfs.

So count the number of spares activated, subtract it from 'degraded'
just once, and return it.

Reported-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-08-18 12:04:32 +10:00
Adrian Drzewiecki
e6ffbcb6cd md: Notify sysfs when RAID1/5/10 disk is In_sync.
When RAID1 is done syncing disks, it'll update the state
of synced rdevs to In_sync. But it neglected to notify
sysfs that the attribute changed. So any programs that
are waiting for an rdev's state to change will not be
woken.

(raid5/raid10 added by neilb)

Signed-off-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-08-18 11:49:02 +10:00
NeilBrown
3a3a5ddb7a Update recovery_offset even when external metadata is used.
The update of ->recovery_offset in sync_sbs is appropriate even then external
metadata is in use.  However sync_sbs is only called when native
metadata is used.

So move that update in to the top of md_update_sb (which is the only
caller of sync_sbs) before the test on ->external.

This moves the update out of ->write_lock protection, but those fields
only need ->reconfig_mutex protection which they still have.

Also move the test on ->persistent up to where ->external is set as
for metadata update purposes they are the same.

Clear MD_CHANGE_DEVS and MD_CHANGE_CLEAN as they can only be confusing
if ->external is set or ->persistent isn't.

Finally move the update of ->utime down as it is only relevent (like
the ->events update) for native metadata.

Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>
2010-08-18 11:39:38 +10:00
Mike Snitzer
959eb4e559 dm mpath: support discard
Enable discard support in the DM multipath target.

This discard support depends on a few discard-specific fixes to the
block layer's request stacking driver methods.

Discard requests are optional so don't allow a failed discard to trigger
path failures.  If there is a real problem with a given path the
barriers associated with the discard (either before or after the
discard) will cause path failure.  That said, unconditionally passing
discard failures up the stack is not ideal.  This must be fixed once DM
has more information about the nature of the underlying storage failure.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
2010-08-12 04:14:32 +01:00
Mikulas Patocka
7b76ec11fe dm stripe: support discards
The DM core will submit a discard bio to the stripe target for each
stripe in a striped DM device.  The stripe target will determine
stripe-specific portions of the supplied bio to be remapped into
individual (at most 'num_discard_requests' extents).  If a given
stripe-specific discard bio doesn't touch a particular stripe the bio
will be dropped.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:14:26 +01:00
Mike Snitzer
a79245b3e5 dm: split discard requests on target boundaries
Update __clone_and_map_discard to loop across all targets in a DM
device's table when it processes a discard bio.  If a discard crosses a
target boundary it must be split accordingly.

Update __issue_target_requests and __issue_target_request to allow a
cloned discard bio to have a custom start sector and size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:14:24 +01:00
Mikulas Patocka
c96053b767 dm stripe: optimize sector division
Optimize sector division: If the number of stripes is a power of two,
we can do shift and mask instead of division.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:14:21 +01:00
Mikulas Patocka
65988525ab dm stripe: move sector translation to a function
Move sector to stripe translation into a function.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:14:14 +01:00