[PATCH v2 0/1] nilfs2: add missing blkdev_issue_flush() to nilfs_sync

Under normal circumstances nilfs_sync_fs() writes out the super block,
which causes a flush of the underlying block device. But this depends on
the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
last segment crosses a segment boundary. So if only a small amount of
data is written before the call to nilfs_sync_fs(), no flush of the
block device occurs.

In the above case an additional call to blkdev_issue_flush() is needed.
To prevent unnecessary overhead, the new flag THE_NILFS_FLUSHED is
introduced, which is cleared whenever new logs are written and set
whenever the block device is flushed.

Signed-off-by: Andreas Rohner <andreas.rohner-***@public.gmane.org>
---
fs/nilfs2/file.c | 3 ++-
fs/nilfs2/ioctl.c | 3 ++-
fs/nilfs2/segment.c | 2 ++
fs/nilfs2/super.c | 8 ++++++++
fs/nilfs2/the_nilfs.h | 6 ++++++
5 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 2497815..7857460 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -56,7 +56,8 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
mutex_unlock(&inode->i_mutex);

nilfs = inode->i_sb->s_fs_info;
- if (!err && nilfs_test_opt(nilfs, BARRIER)) {
+ if (!err && nilfs_test_opt(nilfs, BARRIER) &&
+ !test_and_set_nilfs_flushed(nilfs)) {
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (err != -EIO)
err = 0;
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 422fb54..dc5d101 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1022,7 +1022,8 @@ static int nilfs_ioctl_sync(struct inode *inode, struct file *filp,
return ret;

nilfs = inode->i_sb->s_fs_info;
- if (nilfs_test_opt(nilfs, BARRIER)) {
+ if (nilfs_test_opt(nilfs, BARRIER) &&
+ !test_and_set_nilfs_flushed(nilfs)) {
ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (ret == -EIO)
return ret;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index a1a1916..54a6be1 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1842,6 +1842,8 @@ static void nilfs_segctor_complete_write(struct nilfs_sc_info *sci)
nilfs_segctor_clear_metadata_dirty(sci);
} else
clear_bit(NILFS_SC_SUPER_ROOT, &sci->sc_flags);
+
+ clear_nilfs_flushed(nilfs);
}

static int nilfs_segctor_wait(struct nilfs_sc_info *sci)
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 228f5bd..332fdf0 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -310,6 +310,7 @@ int nilfs_commit_super(struct super_block *sb, int flag)
nilfs->ns_sbsize));
}
clear_nilfs_sb_dirty(nilfs);
+ set_nilfs_flushed(nilfs);
return nilfs_sync_super(sb, flag);
}

@@ -514,6 +515,13 @@ static int nilfs_sync_fs(struct super_block *sb, int wait)
}
up_write(&nilfs->ns_sem);

+ if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
+ !test_and_set_nilfs_flushed(nilfs)) {
+ err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+ if (err != -EIO)
+ err = 0;
+ }
+
return err;
}

diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index d01ead1..d12a8ce 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -41,6 +41,7 @@ enum {
THE_NILFS_DISCONTINUED, /* 'next' pointer chain has broken */
THE_NILFS_GC_RUNNING, /* gc process is running */
THE_NILFS_SB_DIRTY, /* super block is dirty */
+ THE_NILFS_FLUSHED, /* volatile data was flushed to disk */
};

/**
@@ -202,6 +203,10 @@ struct the_nilfs {
};

#define THE_NILFS_FNS(bit, name) \
+static inline int test_and_set_nilfs_##name(struct the_nilfs *nilfs) \
+{ \
+ return test_and_set_bit(THE_NILFS_##bit, &(nilfs)->ns_flags); \
+} \
static inline void set_nilfs_##name(struct the_nilfs *nilfs) \
{ \
set_bit(THE_NILFS_##bit, &(nilfs)->ns_flags); \
@@ -219,6 +224,7 @@ THE_NILFS_FNS(INIT, init)
THE_NILFS_FNS(DISCONTINUED, discontinued)
THE_NILFS_FNS(GC_RUNNING, gc_running)
THE_NILFS_FNS(SB_DIRTY, sb_dirty)
+THE_NILFS_FNS(FLUSHED, flushed)

/*
* Mount option operations

Ryusuke Konishi

2014-09-01 17:59:39 UTC

Hi Andreas,

Post by Andreas Rohner
Under normal circumstances nilfs_sync_fs() writes out the super block,
which causes a flush of the underlying block device. But this depends on
the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
last segment crosses a segment boundary. So if only a small amount of
data is written before the call to nilfs_sync_fs(), no flush of the
block device occurs.
In the above case an additional call to blkdev_issue_flush() is needed.
To prevent unnecessary overhead, the new flag THE_NILFS_FLUSHED is
introduced, which is cleared whenever new logs are written and set
whenever the block device is flushed.

The patch looks good to me except that I feel the use of atomic
test-and-set bitwise operations something unfavorable (though it's
logically correct). I will try to send this to upstream as is unless
a comment comes to mind.

Thanks,
Ryusuke Konishi

Post by Andreas Rohner
---
fs/nilfs2/file.c | 3 ++-
fs/nilfs2/ioctl.c | 3 ++-
fs/nilfs2/segment.c | 2 ++
fs/nilfs2/super.c | 8 ++++++++
fs/nilfs2/the_nilfs.h | 6 ++++++
5 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 2497815..7857460 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -56,7 +56,8 @@ int nilfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
mutex_unlock(&inode->i_mutex);
nilfs = inode->i_sb->s_fs_info;
- if (!err && nilfs_test_opt(nilfs, BARRIER)) {
+ if (!err && nilfs_test_opt(nilfs, BARRIER) &&
+ !test_and_set_nilfs_flushed(nilfs)) {
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (err != -EIO)
err = 0;
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 422fb54..dc5d101 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1022,7 +1022,8 @@ static int nilfs_ioctl_sync(struct inode *inode, struct file *filp,
return ret;
nilfs = inode->i_sb->s_fs_info;
- if (nilfs_test_opt(nilfs, BARRIER)) {
+ if (nilfs_test_opt(nilfs, BARRIER) &&
+ !test_and_set_nilfs_flushed(nilfs)) {
ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (ret == -EIO)
return ret;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index a1a1916..54a6be1 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1842,6 +1842,8 @@ static void nilfs_segctor_complete_write(struct nilfs_sc_info *sci)
nilfs_segctor_clear_metadata_dirty(sci);
} else
clear_bit(NILFS_SC_SUPER_ROOT, &sci->sc_flags);
+
+ clear_nilfs_flushed(nilfs);
}
static int nilfs_segctor_wait(struct nilfs_sc_info *sci)
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 228f5bd..332fdf0 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -310,6 +310,7 @@ int nilfs_commit_super(struct super_block *sb, int flag)
nilfs->ns_sbsize));
}
clear_nilfs_sb_dirty(nilfs);
+ set_nilfs_flushed(nilfs);
return nilfs_sync_super(sb, flag);
}
@@ -514,6 +515,13 @@ static int nilfs_sync_fs(struct super_block *sb, int wait)
}
up_write(&nilfs->ns_sem);
+ if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
+ !test_and_set_nilfs_flushed(nilfs)) {
+ err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+ if (err != -EIO)
+ err = 0;
+ }
+
return err;
}
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index d01ead1..d12a8ce 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -41,6 +41,7 @@ enum {
THE_NILFS_DISCONTINUED, /* 'next' pointer chain has broken */
THE_NILFS_GC_RUNNING, /* gc process is running */
THE_NILFS_SB_DIRTY, /* super block is dirty */
+ THE_NILFS_FLUSHED, /* volatile data was flushed to disk */
};
/**
@@ -202,6 +203,10 @@ struct the_nilfs {
};
#define THE_NILFS_FNS(bit, name) \
+static inline int test_and_set_nilfs_##name(struct the_nilfs *nilfs) \
+{ \
+ return test_and_set_bit(THE_NILFS_##bit, &(nilfs)->ns_flags); \
+} \
static inline void set_nilfs_##name(struct the_nilfs *nilfs) \
{ \
set_bit(THE_NILFS_##bit, &(nilfs)->ns_flags); \
@@ -219,6 +224,7 @@ THE_NILFS_FNS(INIT, init)
THE_NILFS_FNS(DISCONTINUED, discontinued)
THE_NILFS_FNS(GC_RUNNING, gc_running)
THE_NILFS_FNS(SB_DIRTY, sb_dirty)
+THE_NILFS_FNS(FLUSHED, flushed)
/*
* Mount option operations
--
2.1.0
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Andreas Rohner

2014-09-01 18:43:40 UTC

Hi Ryusuke,

I originally thought, that it is necessary to do it atomically to avoid
a race condition, but I am not so sure about that any more. I think the
only case we have to avoid is, to call set_nilfs_flushed() after
blkdev_issue_flush(), because this could race with the
clear_nilfs_flushed() from the segment construction. So this should also
work:

+ if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
+ !nilfs_flushed(nilfs)) {
+ set_nilfs_flushed(nilfs);
+ err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+ if (err != -EIO)
+ err = 0;
+ }
+

What do you think?

br,
Andreas Rohner

Post by Ryusuke Konishi
Thanks,
Ryusuke Konishi

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Andreas Rohner

2014-09-01 19:18:30 UTC

Post by Andreas Rohner
Hi Ryusuke,

I originally thought, that it is necessary to do it atomically to avoid
a race condition, but I am not so sure about that any more. I think the
only case we have to avoid is, to call set_nilfs_flushed() after
blkdev_issue_flush(), because this could race with the
clear_nilfs_flushed() from the segment construction. So this should also
+ if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
+ !nilfs_flushed(nilfs)) {
+ set_nilfs_flushed(nilfs);
+ err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+ if (err != -EIO)
+ err = 0;
+ }
+

On the other hand, it says in the comments to set_bit(), that it can be
reordered on architectures other than x86. test_and_set_bit() implies a
memory barrier on all architectures. But I don't think the processor
would reorder set_nilfs_flushed() after the external function call to
blkdev_issue_flush(), would it?

/**
* set_bit - Atomically set a bit in memory
* @nr: the bit to set
* @addr: the address to start counting from
*
* This function is atomic and may not be reordered. See __set_bit()
* if you do not require the atomic guarantees.
*
* Note: there are no guarantees that this function will not be reordered
* on non x86 architectures, so if you are writing portable code,
* make sure not to rely on its reordering guarantees.
*/

br,
Andreas Rohner
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Ryusuke Konishi

2014-09-03 00:35:25 UTC

Post by Andreas Rohner
Hi Ryusuke,

I originally thought, that it is necessary to do it atomically to avoid
a race condition, but I am not so sure about that any more. I think the
only case we have to avoid is, to call set_nilfs_flushed() after
blkdev_issue_flush(), because this could race with the
clear_nilfs_flushed() from the segment construction. So this should also
+ if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
+ !nilfs_flushed(nilfs)) {
+ set_nilfs_flushed(nilfs);
+ err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+ if (err != -EIO)
+ err = 0;
+ }
+

I believe compiler doesn't reorder set_bit() operation after an
external function call unless it knows the content of the function and
the function can be optimized. But, yes, set_bit() doesn't imply
memory barrier unlike test_and_set_bit(). As for
blkdev_issue_flush(), it would imply memory barrier by some lock
functions or other primitive used inside it. (I haven't actually
confirmed that the premise is true)

On the other hand, we need explicit barrier operation like
smp_mb__after_atomic() if a certain operation is performed after
set_bit() and the changed bit should be visible to other processors
before the operation.

Regards,
Ryusuke Konishi

Post by Andreas Rohner
/**
* set_bit - Atomically set a bit in memory
*
* This function is atomic and may not be reordered. See __set_bit()
* if you do not require the atomic guarantees.
*
* Note: there are no guarantees that this function will not be reordered
* on non x86 architectures, so if you are writing portable code,
* make sure not to rely on its reordering guarantees.
*/
br,
Andreas Rohner
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Andreas Rohner

2014-09-03 12:32:22 UTC

Post by Andreas Rohner
Hi Ryusuke,

I originally thought, that it is necessary to do it atomically to avoid
a race condition, but I am not so sure about that any more. I think the
only case we have to avoid is, to call set_nilfs_flushed() after
blkdev_issue_flush(), because this could race with the
clear_nilfs_flushed() from the segment construction. So this should also
+ if (wait && !err && nilfs_test_opt(nilfs, BARRIER) &&
+ !nilfs_flushed(nilfs)) {
+ set_nilfs_flushed(nilfs);
+ err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+ if (err != -EIO)
+ err = 0;
+ }
+

Yes blkdev_issue_flush() probably implies a memory barrier.

Post by Ryusuke Konishi
On the other hand, we need explicit barrier operation like
smp_mb__after_atomic() if a certain operation is performed after
set_bit() and the changed bit should be visible to other processors
before the operation.

Great suggestion. I didn't know about those functions. Do we also need a
call to smp_mb__before_atomic() before clear_nilfs_flushed(nilfs) in
segment.c?

I would be happy to provide another version of the patch with
set_nilfs_flushed(nilfs) and smp_mb__after_atomic() if you prefer that
version over the test_and_set_bit approach...

br,
Andreas Rohner
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Ryusuke Konishi

2014-09-07 05:12:32 UTC

Hi Andreas,

Great suggestion. I didn't know about those functions.

I recommend you read Documentation/memory-barries.txt. It's an
excellent document summarizing information on what we should know
about memory synchronization on smp. Documentation/atomic_ops.txt
also contains some information on barriers related to atomic
operations.

Post by Andreas Rohner
Do we also need a call to smp_mb__before_atomic() before
clear_nilfs_flushed(nilfs) in segment.c?

I think the timing restrictions of this flag are not so serve. The
only restrictions that the flag must ensure are:

1) Bioes for log are completed before this flag is cleared.
2) Clearing the flag is propagated to the processor executing
nilfs_sync_fs() and nilfs_sync_file() before log writer returns.

The restriction (1) is guaranteed since nilfs_wait_on_logs() is called
before nilfs_segctor_complete_write(). This sequence appears at
nilfs_segctor_wait() function.

The restriction (2) looks to be satisfied by (at least)
nilfs_segctor_notify() that nilfs_segctor_construct() calls or
nilfs_transaction_unlock() that nilfs_construct_dsync_segment() calls.

Post by Andreas Rohner
I would be happy to provide another version of the patch with
set_nilfs_flushed(nilfs) and smp_mb__after_atomic() if you prefer that
version over the test_and_set_bit approach...

Two additional comments:

- Splitting test_and_set_bit() into test_bit() and set_bit() can
introduce a race condition. Two processors can call test_bit() at
the same time and both can call set_bit() and blkdev_issue_flush().
But, this race is not critical. It only allows duplicate
blkdev_issue_flush() calls in the rare case, and I think it's
ignorable.

- clear_nilfs_flushed() seems to be called more than necessary.
Incomplete logs that the mount time recovery of nilfs doesn't
salvage do not need to be flushed. In this sense, it may be enough
only for logs containing a super root and those for datasync
nilfs_construct_dsync_segment() creates.

By the way, we are using atomic bit operations too much. Even though
{set,clear}_bit() don't imply a memory barrier, they still imply a
lock prefix to protect the flag from other bit operations on ns_flags.
For load and store of integer variables which are properly aligned to
a cache line, modern processors naturally satisfy atomicity without
additional lock operations. I think we can replace the flag with just
an integer variable like "int ns_flushed_device". How do you think ?

Regards,
Ryusuke Konishi
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Andreas Rohner

2014-09-08 19:03:02 UTC

Hi Ryusuke,

Sorry for the late response I was busy over the weekend.

Post by Ryusuke Konishi
Hi Andreas,

Great suggestion. I didn't know about those functions.

Post by Andreas Rohner
Do we also need a call to smp_mb__before_atomic() before
clear_nilfs_flushed(nilfs) in segment.c?

I think the timing restrictions of this flag are not so serve. The
1) Bioes for log are completed before this flag is cleared.
2) Clearing the flag is propagated to the processor executing
nilfs_sync_fs() and nilfs_sync_file() before log writer returns.
The restriction (1) is guaranteed since nilfs_wait_on_logs() is called
before nilfs_segctor_complete_write(). This sequence appears at
nilfs_segctor_wait() function.
The restriction (2) looks to be satisfied by (at least)
nilfs_segctor_notify() that nilfs_segctor_construct() calls or
nilfs_transaction_unlock() that nilfs_construct_dsync_segment() calls.

I agree with both points.

- Splitting test_and_set_bit() into test_bit() and set_bit() can
introduce a race condition. Two processors can call test_bit() at
the same time and both can call set_bit() and blkdev_issue_flush().
But, this race is not critical. It only allows duplicate
blkdev_issue_flush() calls in the rare case, and I think it's
ignorable.

I agree.

Post by Ryusuke Konishi
- clear_nilfs_flushed() seems to be called more than necessary.
Incomplete logs that the mount time recovery of nilfs doesn't
salvage do not need to be flushed. In this sense, it may be enough
only for logs containing a super root and those for datasync
nilfs_construct_dsync_segment() creates.

Yes you are right I will change that as well.

On the other hand it seems to me, that almost any file operation causes
a super root to be written. Even if you use fdatasync(). If the i_mtime
on the inode has to be changed, then NILFS_I_INODE_DIRTY is set and the
fdatasync() turns into a normal sync(), which always writes a super
root. Every write() to a file causes an update of i_mtime. I could only
make fdatasync() work as intended with mmap(), but maybe I am missing
something. So we may not save us a lot of updates by only updating the
flag in case the log contains a super root...

Post by Ryusuke Konishi
By the way, we are using atomic bit operations too much. Even though
{set,clear}_bit() don't imply a memory barrier, they still imply a
lock prefix to protect the flag from other bit operations on ns_flags.
For load and store of integer variables which are properly aligned to
a cache line, modern processors naturally satisfy atomicity without
additional lock operations. I think we can replace the flag with just
an integer variable like "int ns_flushed_device". How do you think ?

I think that is a good idea. I will implement that right away.

br,
Andreas Rohner

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Ryusuke Konishi

2014-09-09 18:55:40 UTC