Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FS#533 - anon_mount 1 in fstab freezes boot in wrt3200acm #8011

Closed
openwrt-bot opened this issue Feb 18, 2017 · 4 comments
Closed

FS#533 - anon_mount 1 in fstab freezes boot in wrt3200acm #8011

openwrt-bot opened this issue Feb 18, 2017 · 4 comments
Labels

Comments

@openwrt-bot
Copy link

hnyman:

I have run into strange trouble with my new WRT3200ACM:

anon_mount 1 option in /etc/config/fstab causes the boot process to hang.
Hang can be triggered manually with "/sbin/block mount" when that option is set.
Kernel log fills from entries "pxa3xx-nand f10d0000.flash: Wait time out!!!"

I am filing this bug report for 17.01, but the bug seems to be also in master.


For some reason enabling the anon_mount option in /etc/config/fstab prevents the boot process from completing. The boot progresses to some stage and router e.g. answers ping, but power led blinks and it does not allow login and does not provide dhcp etc. services.

I have had that option in all my routers (WNDR3700, WNDR3800, R7800) to make USB memory sticks to automount when inserted into the router's USB port. So it is a quite normal setting.

I first compiled a mvebu wrt3200acm firmware image from master with kernel 4.9, and that failed to boot. I thought that the new kernel 4.9 might be the reason. So I installed 17.01.0-rc2 official image and it booted ok. I installed packages, configured, rebooted and everything was ok as long as I did not change that anon_mount to 1 in fstab. After changing that, the next boot again gets stuck. (Editing the option in failsafe enables the router to boot normally.)

Boot completes ok with the output of "block detect" entered into fstab:

config 'global'
        option  anon_swap       '0'
        option  anon_mount      '0'
        option  auto_swap       '1'
        option  auto_mount      '1'
        option  delay_root      '5'
        option  check_fs        '0'

Boot does not complete with anon_mount 1 in /etc/config/fstab:

config 'global'
        option  anon_swap       '0'
        option  anon_mount      '1'
        option  auto_swap       '1'
        option  auto_mount      '1'
        option  delay_root      '5'
        option  check_fs        '0'

I have no good guess about the reason. My initial guess was some connection to ubi1:syscfg being mounted as /tmp/syscfg as that is different than in my other devices, but it is hard to see how that could be the reason. (That mounting is done in /lib/preinit/81_linksys_syscfg )

After some investigation it looks like the reason is actually "/sbin/block mount".

At the boot the init script /etc/init.d/fstab actually runs "/sbin/block mount".
Running that command manually from console (with anon_mount 1) makes the console to hang. From another console window I can see that system & kernel log fills from entries like these:

kern.err kernel: [ 3032.483307] pxa3xx-nand f10d0000.flash: Wait time out!!!
kern.err kernel: [ 3032.683301] pxa3xx-nand f10d0000.flash: Wait time out!!!
kern.err kernel: [ 3032.883296] pxa3xx-nand f10d0000.flash: Wait time out!!!
kern.err kernel: [ 3033.083293] pxa3xx-nand f10d0000.flash: Wait time out!!!
kern.err kernel: [ 3033.283291] pxa3xx-nand f10d0000.flash: Wait time out!!!
kern.err kernel: [ 3033.483287] pxa3xx-nand f10d0000.flash: Wait time out!!!
kern.err kernel: [ 3033.683283] pxa3xx-nand f10d0000.flash: Wait time out!!!

And the "block" process can't even be killed with "kill -9" or "kill -15".
I rebooted the router from power switch to get it killed.

With anon_mount 0 in fstab the same command works ok and produces:

root@LEDE:~# /sbin/block mount
block: /dev/ubiblock0_0 is already mounted on /rom
block: /dev/ubi0_1 is already mounted on /overlay

@openwrt-bot
Copy link
Author

hnyman:

Really strange.
I flashed the other partition with LEDE master with kernel 4.9, and tested fstab anon_mount there. It seemed to work there, so I switched the router to boot again from the same 17.01 firmware that failed yesterday with anon_mount 1. And to my surprise anon_mount 1 works now with the 17.01 firmware, too. (The same firmware as yesterday, as I have not reflashed it)

The only difference is that now the other partition has been flashed once with LEDE, as yesterday there was still the Linksys OEM firmware in the other partition.

Now "/sbin/block mount" works also with anon_mount 1 active, and shows a second ubi partition. Unlike yesterday:

17.01.0-rc2, r3131-42f3c1f

root@LEDE:~# /sbin/block mount
block: /dev/ubiblock0_0 is already mounted on /rom
block: /dev/ubi0_1 is already mounted on /overlay
block: /dev/ubi1_0 is already mounted on /mnt/ubi1_0

root@LEDE:~# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/root                 2560      2560         0 100% /rom
tmpfs                   256812       592    256220   0% /tmp
/dev/ubi0_1              58604      5512     50064  10% /overlay
overlayfs:/overlay       58604      5512     50064  10% /
ubi1:syscfg              71908       476     67724   1% /tmp/syscfg
tmpfs                      512         0       512   0% /dev
/dev/ubi1_0              71908       476     67724   1% /mnt/ubi1_0

It looks like the syscfg partition gets now mounted both as /tmp/syscfg and /mnt/ubi1_0

So, just a hypothesis, but possibly /sbin/block misdetects something when it scans the partitions, if the other partition contains the Linksys OEM firmware. Possibly some naming or formatting is slightly different?

EDIT:
Other alternative is that the sysupgrade with settings saved (to master with 4.9) has changed the syscfg partition somehow, as I think that the mvebu sysupgrade utilises that partition for saving the settings in sysupgrade.

@openwrt-bot
Copy link
Author

hnyman:

Just reporting here that forum user MagicSimi has reproduced the error condition:
https://forum.lede-project.org/t/wrt3200acm-fstab-problem-with-anon-mount/1673/8

He has similar setup as I had when the error materialised: Router has been flashed once with LEDE and the other partition still contains Linksys OEM firmware.

MagicSimi:

Ups, with the proper setup I could reproduce it as well. Helps to set "anon_mount 1".

Based on his experience, having additional defined mount sections in fstab does not change things:

I was playing around with the USB stick inserted, having a mount section for it, and/or having it mounted, when running "/sbin/block mount". That didn't show any difference, always the same error "pxa3xx-nand f10d0000.flash: Wait time out!!!".

So my guess is that something in the name or structure of the Linksys OEM firmware partition causes "/sbin/block mount" to get confused and try something that causes itself (or the NAND driver) to get stuck in a loop, where the device is unsuccessfully polled every 0.2 seconds (like my kernel log shows).

@openwrt-bot
Copy link
Author

hnyman:

This looks like a nasty bug as it will hit newcomer users who have flashed LEDE just once and are configuring their router. Simply adding "anon_mount 1" option to fstab will make router unbootable. Failsafe will work, but may be hard to find by new users :-(

Summary of findings so far:

  • fstab init script calls fstools "/sbin/block mount", which hangs permanently at boot (or if issued from command line later) if there is /etc/config/fstab containing "anon_mount 1" option.

  • It hangs when it tries to auto-mount the inactive rootfs partition that contains the original Linksys OEM contents. The partition contains raw jffs2 instead of ubi.

  • The bug will not get triggered if both firmwares have been flashed with LEDE or Openwrt, as then also the inactive rootfs contains ubi.

The hang can be seen from kernel log if the command was issued from console and another console is used to read the log. The log gets filled with this at 0.2s intervals:

kern.err kernel: [ 3032.483307] pxa3xx-nand f10d0000.flash: Wait time out!!!

So far this has been found in WRT3200ACM, but the same will likely happen with all WRT1900/1200/3200ACx devices that have dual firmware partitions.


Longer explanation:

I added debug statements to fstools block, and MagicSimi tested in his device:
https://forum.lede-project.org/t/wrt3200acm-fstab-problem-with-anon-mount/1673/13

Based on testing, the router is trying to automount "mtdblock6" and gets stuck there.

block: list_for_each_entry starts
block: mount_device for mtdblock6
block: mount_device for mtdblock6 auto_mount test
block: mount_device for mtdblock6 extroot test
block: mount_device for mtdblock6 check anon_mount
block: mount_device for mtdblock6 try anon_mount
block: handle_mount starts
...hang...

cat /proc/mtd:

dev: size erasesize name
mtd0: 00200000 00020000 "u-boot"
mtd1: 00020000 00020000 "u_env"
mtd2: 00040000 00020000 "s_env"
mtd3: 00040000 00020000 "devinfo"
mtd4: 001e0000 00020000 "sysdiag"
mtd5: 05000000 00020000 "kernel1"
mtd6: 04a00000 00020000 "rootfs1"
mtd7: 05000000 00020000 "kernel2"
mtd8: 04a00000 00020000 "ubi"
mtd9: 05600000 00020000 "syscfg"
mtd10: 005c0000 00020000 "unused_area"

/sbin/block info:
/dev/mtdblock6: TYPE="jffs2"
/dev/ubiblock0_0: UUID="643ef9c1-d0d092d2-0fa9f12b-3da3502a" VERSION="4.0" MOUNT="/rom" TYPE="squashfs"
/dev/ubi0_1: UUID="a9709d2e-35ba-4444-a544-aa10a4b4133e" VERSION="w4r0" MOUNT="/overlay" TYPE="ubifs"
/dev/ubi1_0: UUID="e16dd61c-066d-45ee-9b2b-0a92e84b06bb" VERSION="w4r0" TYPE="ubifs"
/dev/sda1: UUID="856ba000-4a72-41c5-aa1c-d374766efdf3" VERSION="1" TYPE="swap"
/dev/sda2: UUID="d99873c8-a42f-4f19-9f5b-68571ba47558" VERSION="1.0" MOUNT="/mnt/sda2" TYPE="ext3"

"block info" shows that it is a "jffs2" partition.

After that finding, I checked the binary copy of my own router's mtd partitions (which I made immediately after flashing LEDE successfully first time), and the corresponding partition really seems to contain raw jffs2 data. Starting with magic 0x1985.

85 19 01 e0 31 00 00 00 1d d9

Apparently fstools block still tries to mount it, and then the kernel driver hangs for some reason. It expects to be find ubi? The real reason for the hang is so far unclear. Something in the flash driver? jffs2 should not be mounted at all in a NAND system?

I have added more debug print commands to block and hopefully MagicSimi will test it again. That should reveal the parameters passed to the mount function etc. So far it likely that the hnag happens at the call to the system mount function.

It looks like fstools block mount should be somehow modified.

@openwrt-bot
Copy link
Author

hnyman:

Update: detailed info on the parameters used by the fstools block mount

The bug can also be triggered just by running this in a router that still has Linksys OEM firmware in the other partition:

mount -t jffs2 /dev/mtdblock6 /mnt/mtdblock6

(mtdblock6 or mtdblock8 depending on which partition contains OEM firmware)

It never finishes, doesn't give an error. Kernel log floods with
"pxa3xx-nand f10d0000.flash: Wait time out!!!" message.

This sounds like a fix is needed from fstools specialists and/or mvebu gurus. Somehow this wrong(?) jffs2 partition with OEM contents should be skipped from the automatic mount tries.

Like I said earlier, this is a nasty bug as it hits newcomer users who have flashed LEDE just once and are configuring their router. Simply adding "anon_mount 1" option to fstab will make that router unbootable.


Debugging from MagicSimi:

With "anon_mount 1":

root@LEDE:/tmp# /tmp/block mount block: list_for_each_entry starts block: mount_device for mtdblock6, type=jffs2, version=(null), label=(null), uuid=(null) block: mount_device for mtdblock6 auto_mount test block: mount_device for mtdblock6 extroot test block: mount_device for mtdblock6 check anon_mount block: mount_device for mtdblock6 try anon_mount block: handle_mount mount source=/dev/mtdblock6, target=/mnt/mtdblock6, fstype=jffs2 ...hang...

With "anon_mount 0":

root@LEDE:/tmp# /tmp/block mount block: list_for_each_entry starts block: mount_device for mtdblock6, type=jffs2, version=(null), label=(null), uuid=(null) block: mount_device for mtdblock6 auto_mount test block: mount_device for mtdblock6 extroot test block: mount_device for mtdblock6 check anon_mount block: mount_device for mtdblock6 ends block: mount_device for ubiblock0_0, type=squashfs, version=4.0, label=(null), uuid=643ef9c1-d0d092d2-0fa9f12b-3da3502a block: mount_device for ubiblock0_0 auto_mount test block: /dev/ubiblock0_0 is already mounted on /rom

block: mount_device for ubi0_1, type=ubifs, version=w4r0,
label=(null), uuid=a9709d2e-35ba-4444-a544-aa10a4b4133e
block: mount_device for ubi0_1 auto_mount test
block: /dev/ubi0_1 is already mounted on /overlay

block: mount_device for ubi1_0, type=ubifs, version=w4r0,
label=(null), uuid=e16dd61c-066d-45ee-9b2b-0a92e84b06bb
block: mount_device for ubi1_0 auto_mount test
block: mount_device for ubi1_0 extroot test
block: mount_device for ubi1_0 check anon_mount
block: mount_device for ubi1_0 ends
block: list_for_each_entry ends

The bug can also be triggered just by running

mount -t jffs2 /dev/mtdblock6 /mnt/mtdblock6

It never finishes, doesn't give an error that /mnt/mtdblock6 doesn't exist. In the log is the usual "pxa3xx-nand f10d0000.flash: Wait time out!!!" message.

It seems to be a problem with /dev/mtdblock6, which (if I understand the partition layout correctly) is the partition where the original linksys firmware is located.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant