Discussion:
snapshots contain the same rrd database
Piotr Szymaniak
2014-02-26 13:32:02 UTC
Hi,

I got a system crash after some 160+ days uptime. After a hard reboot I
noticed my rrd database looks corrupted.

So I changed some recent checkpoints to snapshots, mounted them and...
all the rrd files are the same!
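
For reference, this is roughly what I did to get them mounted (a sketch;
checkpoint numbers as in the listing below):

# turn the checkpoints into snapshots so they become mountable
# and won't be reclaimed by the cleaner
chcp ss /dev/sda3 211211
chcp ss /dev/sda3 211219

# mount each snapshot read-only at its own mount point
mkdir /tmp/211211 && mount -t nilfs2 -o ro,cp=211211 /dev/sda3 /tmp/211211
mkdir /tmp/211219 && mount -t nilfs2 -o ro,cp=211219 /dev/sda3 /tmp/211219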

Here's some info about current state:

wloczykij ~ # lscp /dev/sda3 | grep ss
211211 2014-02-22 16:58:11 ss - 119 54904
211219 2014-02-22 18:18:28 ss - 124 54910
211811 2014-02-25 00:39:21 ss - 140 54922
211872 2014-02-25 09:47:16 ss - 160 54922
212008 2014-02-26 01:13:14 ss - 114 54929
212026 2014-02-26 03:22:45 ss - 28 54928
212042 2014-02-26 04:13:48 ss - 29 54928
212045 2014-02-26 04:24:00 ss - 29 54928

wloczykij ~ # mount | grep cp
/dev/sda3 on /tmp/211219 type nilfs2 (ro,cp=211219)
/dev/sda3 on /tmp/211211 type nilfs2 (ro,cp=211211)
/dev/sda3 on /tmp/212026 type nilfs2 (ro,cp=212026)
/dev/sda3 on /tmp/212045 type nilfs2 (ro,cp=212045)

wloczykij ~ # for sumrrd in 211219 211211 212026 212045; do md5sum /tmp/$sumrrd/var/www/grubelek.pl/termometr/temp0.rrd; done
71f60c620a493021bb5e1c32c555abe8 /tmp/211219/var/www/grubelek.pl/termometr/temp0.rrd
71f60c620a493021bb5e1c32c555abe8 /tmp/211211/var/www/grubelek.pl/termometr/temp0.rrd
71f60c620a493021bb5e1c32c555abe8 /tmp/212026/var/www/grubelek.pl/termometr/temp0.rrd
71f60c620a493021bb5e1c32c555abe8 /tmp/212045/var/www/grubelek.pl/termometr/temp0.rrd

This is bad news! What should I do next? All the rrd dumps have the same
modification date:
<lastupdate>1376166602</lastupdate> <!-- 2013-08-10 22:30:02 CEST -->
(that looks like it's from around the previous boot, before the crash?)
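
I pulled the timestamps out of the dumps roughly like this (rrdtool can
also print the last update time directly):

# print the unix time of the most recent update stored in the rrd
rrdtool last /tmp/211219/var/www/grubelek.pl/termometr/temp0.rrd

# or dump the database to XML and look at the header
rrdtool dump /tmp/211219/var/www/grubelek.pl/termometr/temp0.rrd | grep lastupdate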

I have since moved the rrd files to btrfs and made a subvolume snapshot;
after about an hour the rrd files differ:

wloczykij ~ # md5sum /home/services/termometr/temp0.rrd /home/snapshot-2014-02-26/services/termometr/temp0.rrd
2999dc7071d94e701d5246d79ccc488f /home/services/termometr/temp0.rrd
1621f31fb7c27f1f3c0b0d8f0f5ede9e /home/snapshot-2014-02-26/services/termometr/temp0.rrd
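
The btrfs side is nothing fancy (a sketch of what I did; this assumes
/home is itself a subvolume):

# take a read-only snapshot of the subvolume holding the rrd files
btrfs subvolume snapshot -r /home /home/snapshot-2014-02-26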


wloczykij ~ # nilfs-tune -l /dev/sda3
nilfs-tune 2.1.5
Filesystem volume name: (none)
Filesystem UUID: f18e80b1-f3c1-49ec-baa5-39c0edc4c0b9
Filesystem magic number: 0x3434
Filesystem revision #: 2.0
Filesystem features: (none)
Filesystem state: invalid or mounted
Filesystem OS type: Linux
Block size: 4096
Filesystem created: Sat Aug 13 10:36:21 2011
Last mount time: Wed Feb 26 09:33:53 2014
Last write time: Wed Feb 26 14:15:29 2014
Mount count: 59
Maximum mount count: 50
Reserve blocks uid: 0 (user root)
Reserve blocks gid: 0 (group root)
First inode: 11
Inode size: 128
DAT entry size: 32
Checkpoint size: 192
Segment usage size: 16
Number of segments: 465
Device size: 3908042752
First data block: 1
# of blocks per segment: 2048
Reserved segments %: 5
Last checkpoint #: 212170
Last block address: 546866
Last sequence #: 35128
Free blocks count: 227328
Commit interval: 600
# of blks to create seg: 0
CRC seed: 0x1a1e847d
CRC check sum: 0x57f59c5c
CRC check data size: 0x00000118

wloczykij ~ # uname -sr
Linux 3.4.56



Piotr Szymaniak.
--
(...) they acted as if they had discovered the principles governing
quantum physics, and then used them to design a new TV game show - and
then, worse still, concluded that this was all quantum physics was good
for...
-- Stephen King, "Dreamcatcher"
Vyacheslav Dubeyko
2014-02-26 13:54:21 UTC
Hi Piotr,
Post by Piotr Szymaniak
Hi,
I got a system crash after some 160+ days uptime. After a hard reboot I
noticed my rrd database looks corrupted.
So I changed some recent checkpoints to snapshots, mounted them and...
all the rrd files are the same!
To be honest, a few things are unclear to me:
(1) How did you run into the issue?
(2) Did you create the snapshots after the crash?
(3) Did you have any snapshots before the crash?

If you had a crash, there should be some error messages in the system
log. Do you have any? Or were all the error messages lost during the
crash?

Anyway, I need a reproduction path to investigate the issue. Of course,
I am not going to wait 160 days to reproduce it. :) One possibility is
to share a small NILFS2 volume on which the issue reproduces reliably.
But, currently, I don't quite see how I can reproduce the issue.

Thanks,
Vyacheslav Dubeyko.


Piotr Szymaniak
2014-02-26 14:21:40 UTC
Post by Vyacheslav Dubeyko
Hi Piotr,
Post by Piotr Szymaniak
Hi,
I got a system crash after some 160+ days uptime. After a hard reboot I
noticed my rrd database looks corrupted.
So I changed some recent checkpoints to snapshots, mounted them and...
all the rrd files are the same!
(1) How did you run into the issue?
To me it looks like the file hasn't changed since the first boot. Like
it's never written at all? Is there a way to check something like the
file's position on disk in a specific snapshot?

rrds are a bit weird as databases. When created they have some fixed
size, say A. Over time they keep gathering data but always stay at that
size A - the file is rewritten in place and its size never changes.
Maybe this is related?
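
The closest thing I can think of for checking that is filefrag (a
sketch; it relies on the FIEMAP ioctl, which the mounted nilfs2 would
have to support):

# list the physical extents backing the file in two snapshot mounts;
# identical extent lists would mean both snapshots reference the same blocks
filefrag -v /tmp/211219/var/www/grubelek.pl/termometr/temp0.rrd
filefrag -v /tmp/212045/var/www/grubelek.pl/termometr/temp0.rrd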
Post by Vyacheslav Dubeyko
(2) Did you create the snapshots after the crash?
Yes.
Post by Vyacheslav Dubeyko
(3) Did you have any snapshots before the crash?
No.
Post by Vyacheslav Dubeyko
If you had a crash, there should be some error messages in the system
log. Do you have any? Or were all the error messages lost during the
crash?
The crash was related to a process running on a different filesystem. My
syslog contains only garbage, so yes, the messages are lost.
Post by Vyacheslav Dubeyko
Anyway, I need a reproduction path to investigate the issue. Of course,
I am not going to wait 160 days to reproduce it. :) One possibility is
to share a small NILFS2 volume on which the issue reproduces reliably.
But, currently, I don't quite see how I can reproduce the issue.
I suppose this could be related to the "size A" behavior mentioned
above. I'll try to figure out a reproduction path.
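
Something along these lines, maybe (an untested sketch; the loop device,
sizes and file names are made up):

# small scratch nilfs2 volume on a loop device
dd if=/dev/zero of=/tmp/nilfs.img bs=1M count=512
losetup /dev/loop0 /tmp/nilfs.img
mkfs -t nilfs2 /dev/loop0
mkdir -p /mnt/scratch
mount -t nilfs2 /dev/loop0 /mnt/scratch

# mimic an rrd: create a fixed-size file, then rewrite it in place
# (conv=notrunc keeps the size constant, like an rrd update does)
dd if=/dev/urandom of=/mnt/scratch/fake.rrd bs=4k count=16
for i in 1 2 3; do
    dd if=/dev/urandom of=/mnt/scratch/fake.rrd bs=4k count=16 conv=notrunc
    sync
    mkcp -s /dev/loop0    # take a snapshot after each in-place rewrite
done

# then mount some of the snapshots (lscp lists them) and compare
# md5sums of fake.rrd across them - they should all differ
lscp /dev/loop0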

Piotr Szymaniak.
--
There's one saying in it... I don't remember it exactly, but it goes
more or less like this: "A man who feels the wind of change should
build not windbreaks, but windmills."
-- Stephen King, "The Dead Zone"
Vyacheslav Dubeyko
2014-02-26 14:39:01 UTC
Post by Piotr Szymaniak
Post by Vyacheslav Dubeyko
Hi Piotr,
Post by Piotr Szymaniak
Hi,
I got a system crash after some 160+ days uptime. After a hard reboot I
noticed my rrd database looks corrupted.
So I changed some recent checkpoints to snapshots, mounted them and...
all the rrd files are the same!
(1) How did you run into the issue?
To me it looks like the file hasn't changed since the first boot. Like
it's never written at all? Is there a way to check something like the
file's position on disk in a specific snapshot?
rrds are a bit weird as databases. When created they have some fixed
size, say A. Over time they keep gathering data but always stay at that
size A - the file is rewritten in place and its size never changes.
Maybe this is related?
So, as far as I can judge, you can reproduce the issue reliably. And you
suspect that the file is not being written at all. How do the segctord
and nilfs_cleanerd threads behave as processes? Could you check that
they don't eat 100% of the CPU?

If segctord is unable to flush data, then it makes sense to use
"echo t > /proc/sysrq-trigger" to get info about the state of the
processes. The output of this command usually goes into the system log.
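
For example (a sketch; segctord is the in-kernel segment constructor
thread, nilfs_cleanerd the userspace GC daemon):

# check CPU usage and state of the nilfs threads
ps -eo pid,pcpu,state,comm | grep -E 'segctord|nilfs_cleanerd'

# dump all task states into the kernel log, then read them back
echo t > /proc/sysrq-trigger
dmesg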

Thanks,
Vyacheslav Dubeyko.


Ryusuke Konishi
2014-02-27 02:45:47 UTC
Hi Piotr,
Post by Piotr Szymaniak
Hi,
I got a system crash after some 160+ days uptime. After a hard reboot I
noticed my rrd database looks corrupted.
So I changed some recent checkpoints to snapshots, mounted them and...
all the rrd files are the same!
<snip>
Post by Piotr Szymaniak
wloczykij ~ # uname -sr
Linux 3.4.56
This version looks a bit old. The current head of linux-3.4.y is
v3.4.82.

The following important bug fixes are not included in this version:

$ git shortlog v3.4.56..v3.4.82 | grep nilfs
nilfs2: fix segctor bug that causes file system corruption
nilfs2: remove double bio_put() in nilfs_end_bio_write() for BIO_EOPNOTSUPP error
nilfs2: fix issue with counting number of bio requests for BIO_EOPNOTSUPP error detection

Was it a distro kernel?

If you can, please try the latest version; I hope it helps both to avoid
your critical error and to narrow down the cause of the problem.
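
For reference, the full list of nilfs2 changes between those two tags
can be taken from a clone of the stable tree:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
$ cd linux-stable
$ git log --oneline v3.4.56..v3.4.82 -- fs/nilfs2/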

Regards,
Ryusuke Konishi
Piotr Szymaniak
2014-02-27 10:58:07 UTC
Post by Ryusuke Konishi
Hi Piotr,
Post by Piotr Szymaniak
Hi,
I got a system crash after some 160+ days uptime. After a hard reboot I
noticed my rrd database looks corrupted.
So I changed some recent checkpoints to snapshots, mounted them and...
all the rrd files are the same!
<snip>
Post by Piotr Szymaniak
wloczykij ~ # uname -sr
Linux 3.4.56
This version looks a bit old. The current head of linux-3.4.y is
v3.4.82.
Updated.
Post by Ryusuke Konishi
$ git shortlog v3.4.56..v3.4.82 | grep nilfs
nilfs2: fix segctor bug that causes file system corruption
nilfs2: remove double bio_put() in nilfs_end_bio_write() for BIO_EOPNOTSUPP error
nilfs2: fix issue with counting number of bio requests for BIO_EOPNOTSUPP error detection
Was it a distro kernel?
No, it's a hand-built Gentoo kernel.


Piotr Szymaniak.
--
Wait, wait. Someone was saying something to me, but I don't know who
and I don't know what.
-- Rafal Solecki