ramips/mt7621: SQUASHFS filesystem corruption #9085

Open
openwrt-bot opened this issue Oct 20, 2021 · 66 comments
Labels
  • bug: issue report with a confirmed bug
  • flyspray
  • kernel: pull request/issue with Linux kernel related changes
  • release/21.02: pull request/issue targeted (also) for OpenWrt 21.02 release
  • release/22.03: pull request/issue targeted (also) for OpenWrt 22.03 release
  • target/ramips: pull request/issue for ramips target

Comments

@openwrt-bot

crowston:

Supply the following if possible:

  • Device problem occurs on

Western Digital My Net N750

  • Software versions of OpenWrt/LEDE release, packages, etc.

openwrt-21.02.0

strongswan, dnscrypt-proxy2, avahi-utils, luci-app-ddns

  • Steps to reproduce

I installed openwrt-21.02.0-ath79-generic-wd_mynet-n750-squashfs-sysupgrade.bin on a Western Digital My Net N750 that had been running openwrt-19.

The router seemed okay initially but after power cycling, it started reporting errors:

Oct 17 12:20:37 router2 kernel: [ 38.613970] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 12:20:37 router2 kernel: [ 38.621029] SQUASHFS error: squashfs_read_data failed to read block 0x23686e
Oct 17 12:20:37 router2 kernel: [ 38.628199] SQUASHFS error: Unable to read fragment cache entry [23686e]
Oct 17 12:20:37 router2 kernel: [ 38.635010] SQUASHFS error: Unable to read page, block 23686e, size 16b28

The filesystem problem would leave some random file damaged, so different services would fail. Over time, the router became less and less functional as various files became inaccessible and after a few cycles, wouldn't boot at all.

I wondered if there was a problem with my old configuration on the new release (though I'm not sure how that could damage the squashfs), so I reinstalled a few more times in different ways, e.g., doing a factory install (openwrt-21.02.0-ath79-generic-wd_mynet-n750-squashfs-factory.bin and then the upgrade) instead of just the upgrade, and configuring from scratch rather than from the backup. But each time I had the same problem with the router.

It wasn't the same block on different installs, I noticed, but it seemed to be consistent for a particular installation attempt.

Oct 17 16:11:14 router2 kernel: [ 53.182571] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 16:11:14 router2 kernel: [ 53.189582] SQUASHFS error: squashfs_read_data failed to read block 0x21e9e6
Oct 17 16:11:14 router2 kernel: [ 53.196749] SQUASHFS error: Unable to read fragment cache entry [21e9e6]
Oct 17 16:11:14 router2 kernel: [ 53.203559] SQUASHFS error: Unable to read page, block 21e9e6, size fd9c

Once there were two blocks (I think this is a reboot of the install above):

Oct 17 16:29:04 router2 kernel: [ 78.505075] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 16:29:04 router2 kernel: [ 78.512103] SQUASHFS error: squashfs_read_data failed to read block 0x1e6e76
Oct 17 16:29:05 router2 kernel: [ 79.111366] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 16:29:05 router2 kernel: [ 79.118386] SQUASHFS error: squashfs_read_data failed to read block 0x21e9e6
Oct 17 16:29:05 router2 kernel: [ 79.125565] SQUASHFS error: Unable to read fragment cache entry [21e9e6]
Oct 17 16:29:05 router2 kernel: [ 79.132445] SQUASHFS error: Unable to read page, block 21e9e6, size fd9c

One time there was first a jffs error, followed by lots of squashfs errors. Sorry, I don't have the log for that one.

I now realize that I should have tried power cycling a clean install a few times to see if there were errors right away or if they only happened after files were installed/changed.

To check whether the router was just having a hardware problem, I reinstalled openwrt-19.07.8 and configured it the same. I have not seen any errors after a few power cycles, which points to a problem with the new release. I did not see any bug reports on this tracker that mention squashfs problems and googling, I did not find any useful discussions, hence this bug report.

I guess it could be that the new release uses a bad bit of memory that the earlier release managed to miss. I looked for but didn't find a memory test utility, so I don't know how to examine that possibility. Though the fact that it was different blocks each time makes it not sound like a hardware problem.
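To check whether the same block keeps failing within one installation, the block offsets can be tallied from a saved log. The following is a minimal sketch, not a tool used in this report; the function name `failing_blocks` and the sample text are illustrative, and the pattern only covers the two "failed to read block" message variants that appear in this thread.

```python
import re
from collections import Counter

# Matches both message variants seen in this thread:
#   "squashfs_read_data failed to read block 0x23686e"
#   "Failed to read block 0x283332: -5"
BLOCK_RE = re.compile(
    r"SQUASHFS error: .*?[Ff]ailed to read block (0x[0-9a-fA-F]+)"
)


def failing_blocks(log_text):
    """Count how often each block offset appears in 'failed to read block' lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = BLOCK_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


# Sample lines copied from the log excerpts above (timestamps shortened).
sample = """\
[ 78.512103] SQUASHFS error: squashfs_read_data failed to read block 0x1e6e76
[ 79.118386] SQUASHFS error: squashfs_read_data failed to read block 0x21e9e6
[ 79.125565] SQUASHFS error: Unable to read fragment cache entry [21e9e6]
"""

for offset, n in sorted(failing_blocks(sample).items()):
    print(f"block 0x{offset:x}: {n} read failure(s)")
```

Running it over the logs of several boots of one installation would show whether the set of failing offsets stays the same or changes from boot to boot.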

@openwrt-bot
Author

crowston:

I tried installing on a different router and, after a few power cycles, saw the same SQUASHFS errors, suggesting it's not just bad memory:

Fri Oct 22 11:30:14 2021 kern.err kernel: [ 97.569402] SQUASHFS error: xz decompression failed, data probably corrupt
Fri Oct 22 11:30:14 2021 kern.err kernel: [ 97.576445] SQUASHFS error: squashfs_read_data failed to read block 0x4b5872
Fri Oct 22 11:30:14 2021 kern.err kernel: [ 97.584696] SQUASHFS error: xz decompression failed, data probably corrupt
Fri Oct 22 11:30:14 2021 kern.err kernel: [ 97.591837] SQUASHFS error: squashfs_read_data failed to read block 0x4b5872

But most of the time it seems to work fine.

@openwrt-bot
Author

M95D:

I have this exact problem with a WRT1900ACv1 running OpenWrt built from git master. It won't boot at all with the new firmware.

@openwrt-bot
Author

M95D:

More debugging:

Apparently, the image is not correctly written to flash. Reading back the squashfs and trying to mount it on an x86 Gentoo Linux machine gives the same decompression errors.

See the attachment for details.
The router is booted from the working firmware (mtd5); mtd7 is the new, defective firmware.

@openwrt-bot
Author

M95D:

Even more debugging:

I extracted the squashfs from the original firmware image that was uploaded to the router. It is identical to the copy read back from flash, except for some extra 0xFF bytes at the end (the ubifs read back from the router's mtd is larger, probably because it extends to the end of the erase block).

So, it's not a flash write issue, and it's not a hardware defect.
Both squashfs images can be extracted with the unsquashfs tool without any errors. So, there must be something wrong with the kernel xz decompressor. This affects both my router and my x64 Gentoo machine. Both kernels are v5.10.
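The comparison described above (identical content, with only 0xFF padding at the end of the flash readback) can be expressed as a small check. This is a sketch under assumptions: the function name is made up for illustration, and it works on byte strings already loaded into memory rather than on any particular image or device path.

```python
def same_apart_from_ff_padding(original: bytes, readback: bytes) -> bool:
    """Return True if readback is original followed only by 0xFF bytes."""
    if len(readback) < len(original):
        return False  # readback is truncated, so it cannot contain the image
    if readback[:len(original)] != original:
        return False  # the image content itself differs
    tail = readback[len(original):]
    # An erased flash region reads back as 0xFF, so only that value is accepted.
    return tail.count(0xFF) == len(tail)


# Tiny in-memory demonstration with made-up bytes.
image = b"hsqs-example-data"
print(same_apart_from_ff_padding(image, image + b"\xff" * 8))  # padded copy
print(same_apart_from_ff_padding(image, image + b"\x00"))      # non-0xFF tail
```

With real files, both arguments could be read with `pathlib.Path(...).read_bytes()`; the file names would depend on how the images were dumped.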

@openwrt-bot
Author

M95D:

It seems that the ARM BCJ filter decoder is needed in the kernel, even on the desktop. Having only the x86 BCJ filter decoder won't help.

Maybe a warning should be put somewhere to alert users who alter the default kernel config.
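One way to spot a missing BCJ decoder is to scan a kernel `.config` for the `CONFIG_XZ_DEC_*` options. The sketch below assumes the option names from the kernel's `lib/xz/Kconfig` as of the 5.10 series (later kernels add more); the helper name and the sample config text are illustrative only, and which filter a given image actually needs depends on how it was compressed.

```python
import re

# BCJ filter decoders selectable in lib/xz/Kconfig (5.10-era list; assumption:
# newer kernels may offer additional architectures).
BCJ_OPTIONS = ["X86", "POWERPC", "IA64", "ARM", "ARMTHUMB", "SPARC"]


def enabled_bcj_filters(config_text):
    """Return the set of BCJ decoders marked '=y' in a kernel .config."""
    enabled = set()
    for line in config_text.splitlines():
        match = re.fullmatch(r"CONFIG_XZ_DEC_(\w+)=y", line.strip())
        if match and match.group(1) in BCJ_OPTIONS:
            enabled.add(match.group(1))
    return enabled


# Made-up config fragment resembling a desktop kernel with only the x86 filter.
sample_config = """\
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
# CONFIG_XZ_DEC_ARM is not set
CONFIG_XZ_DEC_BCJ=y
"""

missing = [name for name in BCJ_OPTIONS
           if name not in enabled_bcj_filters(sample_config)]
print("BCJ decoders not built in:", ", ".join(missing))
```

On a running system the config text might come from `/proc/config.gz` or the build tree, if available.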

@openwrt-bot
Author

brianmercer:

My WD Mynet N750 is also unstable and also displays these same errors in the log.

@openwrt-bot
Author

danak6jq:

I am also seeing this on a WD MyNet N750, starting with 21.02.1. I made an attempt to build a kernel/image with ARM BCJ pinned to the kernel and it did not make a difference.

@ShapeShifter499

I'm seeing this issue with a fresh download of 21.02.2 from https://firmware-selector.openwrt.org/?version=21.02.2&target=ath79%2Fgeneric&id=wd_mynet-n750

I also have a WD MyNet N750

@M95D
Contributor

M95D commented Mar 7, 2022

Someone found the true problem:
https://forum.openwrt.org/t/patch-squashfs-data-probably-corrupt/70480

@EccoB

EccoB commented Mar 10, 2022

I also ran into this issue after updating my WD MyNet N750 from a 19.x version to 21.02.2 r16495-bf0c965af0. After about five days, I get a high CPU load and the same read errors:

kern.err kernel: [ 1177.557521] SQUASHFS error: Unable to read fragment cache entry [270732]
kern.err kernel: [ 1177.564383] SQUASHFS error: Unable to read page, block 270732, size 137e8

I re-flashed the version and for the moment it works fine again.

@ShapeShifter499

@EccoB have you power cycled it yet?

I find it weird that it can run initially but that, at least in my experience, a power cycle causes issues. I never had that issue with OpenWrt 19.x.

@EccoB

EccoB commented Mar 12, 2022

@ShapeShifter499 Until now I had not, and there had been no errors.
Now, after a single reboot, the logs already look pretty bad, indicating corrupted memory. If I find an old 19.x version, I will try to revert to it. I would expect similar problems with the old version if this is really due to the flash memory.

[    7.054434] IPv6: ADDRCONF(NETDEV_CHANGE): eth0.1: link becomes ready
[   11.305538] jffs2: error: (599) verify_xattr_ref: node CRC failed at 0x8089c8, read=0xfd7fff7b, calc=0x8445ca05
[   11.315861] jffs2: error: (599) verify_xattr_ref: node CRC failed at 0x8088f0, read=0xfd7aef7f, calc=0xae5ef7f6
[   11.326169] jffs2: error: (599) verify_xattr_ref: node CRC failed at 0x80876c, read=0xfd7be7fb, calc=0xa93333ef
[   11.336466] jffs2: error: (599) verify_xattr_ref: node CRC failed at 0x803dec, read=0xff7bfffe, calc=0xdec45f10
[   11.346762] jffs2: error: (599) verify_xattr_ref: node CRC failed at 0x803d14, read=0xfd7af7fb, calc=0xf3b2a6fa
[   11.357142] jffs2: error: (599) verify_xattr_ref: node CRC failed at 0x803b90, read=0xfdfeef7e, calc=0xf3b2a6fa
[   11.367775] jffs2: notice: (599) jffs2_build_xattr_subsystem: complete building xattr subsystem, 24 of xdatum (14 unchecked, 7 orphan) and 25 of xref (3 dead, 0 orphan) found.
[   11.384407] jffs2: notice: (599) jffs2_get_inode_nodes: Node header CRC failed at 0x8088ac. {fdff,e77a,fd7ae77e,fdffe77e}
[   11.397055] mount_root: switching to jffs2 overlay
[   11.404605] jffs2: error: (600) do_verify_xattr_datum: node CRC failed at 0x80897c, read=0xfdfef7ff, calc=0x21946102
[   11.415369] jffs2: error: (600) do_verify_xattr_datum: node CRC failed at 0x808720, read=0xfdfef7ff, calc=0xd89d70f5
[   11.426105] jffs2: error: (600) do_verify_xattr_datum: node CRC failed at 0x803da0, read=0xfdfef7ff, calc=0x4c9c347
[   11.436764] jffs2: error: (600) do_verify_xattr_datum: node CRC failed at 0x803b44, read=0xfdfef7ff, calc=0xf0b38b4a
[   11.448140] overlayfs: upper fs does not support tmpfile.
[   11.470901] jffs2: notice: (485) jffs2_get_inode_nodes: Node header CRC failed at 0x807520. {fdff,e77b,fd7ae77b,fd7eff7b}
[   11.482420] jffs2: notice: (485) jffs2_get_inode_nodes: Node header CRC failed at 0x807030. {fdff,e77b,fd7ae77e,ff7af7fa}
[...]
[   14.753171] crng init done
[   14.985636] jffs2: notice: (600) jffs2_get_inode_nodes: Node header CRC failed at 0x803ad4. {fdff,e77a,fd7ae77e,fdffe77e}
[   14.996784] jffs2: warning: (600) jffs2_do_read_inode_internal: no data nodes found for ino #48
[   15.005623] jffs2: Returned error for crccheck of ino #48. Expect badness...
[   15.205575] jffs2: Node CRC 4e4d8e0c != calculated CRC a2ebca7d for node at 0080a4b4
[   15.293646] jffs2: Node CRC a25051a8 != calculated CRC af4fc269 for node at 00008e5c
[   15.769624] jffs2: notice: (600) jffs2_get_inode_nodes: Node header CRC failed at 0x805c44. {fdff,e77b,fd7ae77a,fffee7fb}
[...]
[   27.067284] jffs2: notice: (600) jffs2_get_inode_nodes: Node header CRC failed at 0x8053e4. {fdff,e77a,fd7ae77e,fdffe77e}
[   27.389615] jffs2: notice: (600) jffs2_get_inode_nodes: Node header CRC failed at 0x8053a0. {fdff,e77a,fd7ae77e,fdffe77e}
[   27.652358] jffs2: notice: (600) jffs2_get_inode_nodes: Node header CRC failed at 0x80535c. {fdff,e77a,fd7ae77e,fdffe77e}
[   27.877672] jffs2: notice: (600) jffs2_get_inode_nodes: Node header CRC failed at 0x8052ac. {fdff,e77a,fd7ae7ff,fd7bfffa}
[   28.109609] jffs2: notice: (600) jffs2_get_inode_nodes: Node header CRC failed at 0x805230. {fdff,e77a,fd7ae77e,fdffe77e}
[   28.120748] jffs2: warning: (600) jffs2_do_read_inode_internal: no data nodes found for ino #60
[   28.129595] jffs2: Returned error for crccheck of ino #60. Expect badness...
[...]
[   30.220763] jffs2: warning: (600) jffs2_do_read_inode_internal: no data nodes found for ino #61
[...]
[   31.644770] jffs2: warning: (600) jffs2_do_read_inode_internal: no data nodes found for ino #62
[...]
[   33.164750] jffs2: warning: (600) jffs2_do_read_inode_internal: no data nodes found for ino #63
(at least 20 data nodes are bad)

@EccoB

EccoB commented Mar 12, 2022

The router was in bad shape (see my last post): LuCI reported that the password was not set (which shouldn't be the case), and there were lots of CRC errors.
Flashing a sysupgrade image via the LuCI interface had no effect; the version stayed the same even after trying to flash an old OpenWrt 19 sysupgrade image.

  • Reflashed firmware 21.02 via recovery mode (I did not have an old 19.x firmware file and was afraid of mixing an old sysupgrade with new firmware), then installed the sysupgrade via LuCI.
  • With the default configuration: checked the kernel logs, power cycled at least 3 times, and checked the logs again: everything was good.
  • Restoring the config file via LuCI worked flawlessly. I already noticed one line in the kernel log, but cannot judge whether it means anything:
    [ 11.296985] jffs2: notice: (599) jffs2_build_xattr_subsystem: complete building xattr subsystem, 13 of xdatum (7 unchecked, 2 orphan) and 16 of xref (2 dead, 0 orphan) found.
  • Power Cycle
    [ 11.304601] jffs2: notice: (599) jffs2_build_xattr_subsystem: complete building xattr subsystem, 14 of xdatum (7 unchecked, 3 orphan) and 17 of xref (3 dead, 0 orphan) found.
  • Power Cycle
    [ 11.298356] jffs2: notice: (599) jffs2_build_xattr_subsystem: complete building xattr subsystem, 15 of xdatum (7 unchecked, 4 orphan) and 18 of xref (4 dead, 0 orphan) found.
  • Power Cycle, put it back in Network
    [ 11.293568] jffs2: notice: (599) jffs2_build_xattr_subsystem: complete building xattr subsystem, 16 of xdatum (7 unchecked, 5 orphan) and 19 of xref (5 dead, 0 orphan) found.
    Installed the mosquitto MQTT broker and spammed it with thousands of messages (I had the impression that the router used to become more unstable with it installed)
  • Reboot
    [ 11.307845] jffs2: notice: (599) jffs2_build_xattr_subsystem: complete building xattr subsystem, 25 of xdatum (18 unchecked, 6 orphan) and 31 of xref (6 dead, 0 orphan) found.
  • Everything still fine at the moment.
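The `jffs2_build_xattr_subsystem` notices in the list above can be compared across boots by extracting their counters. This is only a sketch for tracking the numbers; the function name and dictionary keys are invented for illustration, and whether growing orphan or dead counts indicate a real problem is not established in this thread.

```python
import re

# Pattern for the counter part of the jffs2 xattr notice quoted above.
XATTR_RE = re.compile(
    r"(\d+) of xdatum \((\d+) unchecked, (\d+) orphan\) "
    r"and (\d+) of xref \((\d+) dead, (\d+) orphan\)"
)

KEYS = ("xdatum", "xdatum_unchecked", "xdatum_orphan",
        "xref", "xref_dead", "xref_orphan")


def xattr_summary(line):
    """Return the six counters of one notice as a dict, or None if no match."""
    match = XATTR_RE.search(line)
    if not match:
        return None
    return dict(zip(KEYS, map(int, match.groups())))


# Counter portions copied from the first three notices in the list above.
boots = [
    "complete building xattr subsystem, 13 of xdatum (7 unchecked, 2 orphan) "
    "and 16 of xref (2 dead, 0 orphan) found.",
    "complete building xattr subsystem, 14 of xdatum (7 unchecked, 3 orphan) "
    "and 17 of xref (3 dead, 0 orphan) found.",
    "complete building xattr subsystem, 15 of xdatum (7 unchecked, 4 orphan) "
    "and 18 of xref (4 dead, 0 orphan) found.",
]

for number, line in enumerate(boots, start=1):
    s = xattr_summary(line)
    print(f"boot {number}: xdatum orphan={s['xdatum_orphan']}, "
          f"xref dead={s['xref_dead']}")
```

Feeding it one notice per boot gives a compact view of how the counters move after each power cycle.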

Over the next few days, I will monitor the behaviour and document any issues. If there is anything I can do to help investigate further, please let me know.

@lpyparmentier

Hello, I recently went through the same issue with my edgerouter-x:

root@edgerouterx:~# cat /etc/openwrt_release 
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='21.02.0'
DISTRIB_REVISION='r16279-5cc0535800'
DISTRIB_TARGET='ramips/mt7621'
DISTRIB_ARCH='mipsel_24kc'
DISTRIB_DESCRIPTION='OpenWrt 21.02.0 r16279-5cc0535800'
DISTRIB_TAINTS=''

Sorry, I reinstalled everything and did not take the time to save the logs; I will come back if it happens again. I did nothing special except disable the uhttpd service and reboot. Then I noticed that clients didn't get their IPs (DNS issue), and when I looked in the logs (dmesg) I saw a lot of SQUASHFS errors.

@ynezz added the release/21.02 and bug labels on Mar 14, 2022

@jlpapple

jlpapple commented Mar 27, 2022

Examples of various SQUASHFS and jffs2 errors from my N750, running the March 7 snapshot. I do not encounter any errors running 19.07.x.

04:31:17 2022 kern.notice kernel: [ 0.000000] Linux version 5.10.103 (builder@buildhost) (mips-openwrt-linux-musl-gcc (OpenWrt GCC 11.2.0 r19069-98113220fa) 11.2.0, GNU ld (GNU Binutils) 2.37) #0 Mon Mar 7 20:44:53 2022
04:31:20 2022 kern.err kernel: [ 25.484464] SQUASHFS error: xz decompression failed, data probably corrupt
04:31:20 2022 kern.err kernel: [ 25.491513] SQUASHFS error: Failed to read block 0x283332: -5
04:31:20 2022 kern.err kernel: [ 25.497372] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:20 2022 kern.err kernel: [ 25.504166] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:20 2022 kern.err kernel: [ 25.511079] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:20 2022 kern.err kernel: [ 25.517888] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:20 2022 kern.err kernel: [ 25.524800] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:20 2022 kern.err kernel: [ 25.531606] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:20 2022 kern.err kernel: [ 25.538529] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:20 2022 kern.err kernel: [ 25.545322] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:20 2022 kern.err kernel: [ 25.552245] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:20 2022 kern.err kernel: [ 25.559058] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:21 2022 kern.err kernel: [ 25.800381] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:21 2022 kern.err kernel: [ 25.807246] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:21 2022 kern.err kernel: [ 25.832165] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:21 2022 kern.err kernel: [ 25.839033] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:21 2022 kern.err kernel: [ 25.856978] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:21 2022 kern.err kernel: [ 25.863789] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:21 2022 kern.err kernel: [ 25.888582] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:21 2022 kern.err kernel: [ 25.895399] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:45 2022 kern.err kernel: [ 50.532357] SQUASHFS error: xz decompression failed, data probably corrupt
04:31:45 2022 kern.err kernel: [ 50.539418] SQUASHFS error: Failed to read block 0x283332: -5
04:31:45 2022 kern.err kernel: [ 50.545253] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:45 2022 kern.err kernel: [ 50.552073] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:45 2022 daemon.notice netifd: wan (2443): udhcpc: broadcasting select for 67.82.48.98, server 167.206.148.47
04:31:46 2022 kern.err kernel: [ 50.756863] SQUASHFS error: xz decompression failed, data probably corrupt
04:31:46 2022 kern.err kernel: [ 50.763889] SQUASHFS error: Failed to read block 0x2bf5a: -5
04:31:46 2022 kern.err kernel: [ 50.769666] SQUASHFS error: Unable to read fragment cache entry [2bf5a]
04:31:46 2022 kern.err kernel: [ 50.776388] SQUASHFS error: Unable to read page, block 2bf5a, size 11c14
04:31:46 2022 kern.err kernel: [ 50.783227] SQUASHFS error: Unable to read fragment cache entry [2bf5a]
04:31:46 2022 kern.err kernel: [ 50.789944] SQUASHFS error: Unable to read page, block 2bf5a, size 11c14
04:31:46 2022 user.notice ucitrack: Setting up /etc/config/network reload dependency on /etc/config/dhcp
04:31:46 2022 kern.err kernel: [ 50.909423] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:46 2022 kern.err kernel: [ 50.916238] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:46 2022 user.notice ucitrack: Setting up /etc/config/wireless reload dependency on /etc/config/network
04:31:46 2022 daemon.notice netifd: wan (2443): udhcpc: lease of 67.82.48.98 obtained from 167.206.148.47, lease time 43200
04:31:46 2022 kern.err kernel: [ 51.006440] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:46 2022 kern.err kernel: [ 51.013306] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:46 2022 kern.err kernel: [ 51.098070] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:46 2022 kern.err kernel: [ 51.104883] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:46 2022 kern.err kernel: [ 51.185265] SQUASHFS error: Unable to read fragment cache entry [283332]
04:31:46 2022 kern.err kernel: [ 51.192140] SQUASHFS error: Unable to read page, block 283332, size 16968
04:31:46 2022 user.notice ucitrack: Setting up /etc/config/firewall reload dependency on /etc/config/luci-splash
04:31:47 2022 user.notice ucitrack: Setting up /etc/config/firewall reload dependency on /etc/config/qos
04:31:47 2022 kern.err kernel: [ 52.021806] SQUASHFS error: Unable to read fragment cache entry [2bf5a]
04:31:47 2022 kern.err kernel: [ 52.028579] SQUASHFS error: Unable to read page, block 2bf5a, size 11c14
04:31:47 2022 kern.err kernel: [ 52.035389] SQUASHFS error: Unable to read fragment cache entry [2bf5a]
04:31:47 2022 kern.err kernel: [ 52.042122] SQUASHFS error: Unable to read page, block 2bf5a, size 11c14
04:31:47 2022 user.notice ucitrack: Setting up /etc/config/firewall reload dependency on /etc/config/miniupnpd
04:31:47 2022 daemon.notice netifd: wan (2443): /lib/netifd/dhcp.script: line 22: ipcalc.sh: I/O error
04:31:47 2022 kern.err kernel: [ 52.275504] SQUASHFS error: Unable to read fragment cache entry [2bf5a]
04:31:47 2022 kern.err kernel: [ 52.282274] SQUASHFS error: Unable to read page, block 2bf5a, size 11c14
04:31:47 2022 user.notice ucitrack: Setting up /etc/config/firewall reload dependency on /etc/config/sqm
04:31:47 2022 daemon.notice netifd: wan (2443): /lib/netifd/dhcp.script: line 27: ipcalc.sh: I/O error
13:54:50 2022 kern.notice kernel: [ 679.887110] jffs2: notice: (1078) jffs2_get_inode_nodes: Node header CRC failed at 0x007378. {0b00,ffff,00000044,a4ef223e}
13:54:50 2022 user.info : luci: accepted login on /admin/network/firewall for root from 192.168.1.174
13:54:50 2022 kern.notice kernel: [ 680.357551] jffs2: notice: (4090) jffs2_get_inode_nodes: Node header CRC failed at 0x95d248. {e595,ffff,00000044,a4ef223e}

@plunet

plunet commented May 4, 2022

I'm seeing a similar thing on a Ubiquiti ER-X which has been stable and running 21.02.1 for many months.

cat /etc/openwrt_release
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='21.02.1'
DISTRIB_REVISION='r16325-88151b8303'
DISTRIB_TARGET='ramips/mt7621'
DISTRIB_ARCH='mipsel_24kc'
DISTRIB_DESCRIPTION='OpenWrt 21.02.1 r16325-88151b8303'
DISTRIB_TAINTS=''

root@OpenWrt:/tmp# uptime
 13:10:50 up 59 days,  8:38,  load average: 0.01, 0.03, 0.00

I noticed today that LuCI and uhttpd are not running:

root@OpenWrt:/tmp# /etc/init.d/uhttpd start
root@OpenWrt:/tmp# Wed May  4 13:15:43 2022 kern.err kernel: [5129016.776653] SQUASHFS error: xz decompression failed, data probably corrupt
Wed May  4 13:15:43 2022 kern.err kernel: [5129016.790750] SQUASHFS error: squashfs_read_data failed to read block 0x161c32
Wed May  4 13:15:43 2022 kern.err kernel: [5129016.851875] SQUASHFS error: xz decompression failed, data probably corrupt
Wed May  4 13:15:43 2022 kern.err kernel: [5129016.865978] SQUASHFS error: squashfs_read_data failed to read block 0x161c32
Wed May  4 13:15:43 2022 daemon.info procd: Instance uhttpd::instance1 s in a crash loop 10 crashes, 0 seconds since last crash

@vertuxt

vertuxt commented May 12, 2022

Same here: I changed some config and rebooted. Now the ER-X is stuck in a boot loop after running stably for ~2 years.

3: System Boot system code via Flash.
## Booting image at bfd40000 ...
   Image Name:   MIPS OpenWrt Linux-5.4.154
   Image Type:   MIPS Linux Kernel Image (uncompressed)
   Data Size:    2367034 Bytes =  2.3 MB
   Load Address: 80001000
   Entry Point:  80001000
.....................................   Verifying Checksum ... OK
OK
No initrd
## Transferring control to Linux (at address 80001000) ...
## Giving linux memsize in MB, 256

Starting kernel ...



OpenWrt kernel loader for MIPS based SoC
Copyright (C) 2011 Gabor Juhos <juhosg@openwrt.org>
Decompressing kernel... done!
Starting kernel at 80001000...

[    0.000000] Linux version 5.4.154 (builder@buildhost) (gcc version 8.4.0 (OpenWrt GCC 8.4.0 r16325-88151b8303)) #0 SMP Sun Oct 24 09:01:35 2021
[    0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
[    0.000000] printk: bootconsole [early0] enabled
[    0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
[    0.000000] MIPS: machine is Ubiquiti EdgeRouter X
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] VPE topology {2,2} total 4
[    0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000000fffffff]
[    0.000000]   HighMem  empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000000fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000000fffffff]
[    0.000000] percpu: Embedded 14 pages/cpu s26768 r8192 d22384 u57344
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 64960
[    0.000000] Kernel command line: console=ttyS0,57600 rootfstype=squashfs,jffs2
[    0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[    0.000000] Writing ErrCtl register=00049340
[    0.000000] Readback ErrCtl register=00049340
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 250792K/262144K available (6089K kernel code, 210K rwdata, 748K rodata, 1260K init, 238K bss, 11352K reserved, 0K cma-reserved, 0K highmem)
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 256
[    0.000000] random: get_random_bytes called from 0x806e5a3c with crng_init=0
[    0.000000] CPU Clock: 880MHz
[    0.000000] clocksource: GIC: mask: 0xffffffffffffffff max_cycles: 0xcaf478abb4, max_idle_ns: 440795247997 ns
[    0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 4343773742 ns
[    0.000009] sched_clock: 32 bits at 440MHz, resolution 2ns, wraps every 4880645118ns
[    0.015502] Calibrating delay loop... 583.68 BogoMIPS (lpj=1167360)
[    0.055845] pid_max: default: 32768 minimum: 301
[    0.065197] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.079603] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.097734] rcu: Hierarchical SRCU implementation.
[    0.107843] smp: Bringing up secondary CPUs ...
[    2.194788] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    2.194800] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    2.194813] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    2.194914] CPU1 revision is: 0001992f (MIPS 1004Kc)
[    0.145017] Synchronize counters for CPU 1: done.
[    2.285843] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    2.285853] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    2.285861] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    2.285918] CPU2 revision is: 0001992f (MIPS 1004Kc)
[    0.239469] Synchronize counters for CPU 2: done.
[    2.376968] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[    2.376978] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[    2.376987] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[    2.377048] CPU3 revision is: 0001992f (MIPS 1004Kc)
[    0.327069] Synchronize counters for CPU 3: done.
[    0.386680] smp: Brought up 1 node, 4 CPUs
[    0.399024] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.418312] futex hash table entries: 1024 (order: 3, 32768 bytes, linear)
[    0.432146] pinctrl core: initialized pinctrl subsystem
[    0.444102] NET: Registered protocol family 16
[    0.475286] workqueue: max_active 576 requested for napi_workq is out of range, clamping between 1 and 512
[    0.496036] clocksource: Switched to clocksource GIC
[    0.507386] thermal_sys: Registered thermal governor 'step_wise'
[    0.507842] NET: Registered protocol family 2
[    0.528464] IP idents hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.543595] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 6144 bytes, linear)
[    0.560338] TCP established hash table entries: 2048 (order: 1, 8192 bytes, linear)
[    0.575469] TCP bind hash table entries: 2048 (order: 2, 16384 bytes, linear)
[    0.589641] TCP: Hash tables configured (established 2048 bind 2048)
[    0.602390] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.615281] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.629392] NET: Registered protocol family 1
[    0.637968] PCI: CLS 0 bytes, default 32
[    0.735983] 4 CPUs re-calibrate udelay(lpj = 1167360)
[    0.747494] workingset: timestamp_bits=14 max_order=16 bucket_order=2
[    0.765135] random: fast init done
[    0.773416] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.784910] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[    0.806141] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
[    0.822666] mt7621_gpio 1e000600.gpio: registering 32 gpios
[    0.833992] mt7621_gpio 1e000600.gpio: registering 32 gpios
[    0.845314] mt7621_gpio 1e000600.gpio: registering 32 gpios
[    0.857176] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
[    0.873578] printk: console [ttyS0] disabled
[    0.882039] 1e000c00.uartlite: ttyS0 at MMIO 0x1e000c00 (irq = 19, base_baud = 3125000) is a 16550A
[    0.899958] printk: console [ttyS0] enabled
[    0.899958] printk: console [ttyS0] enabled
[    0.916512] printk: bootconsole [early0] disabled
[    0.916512] printk: bootconsole [early0] disabled
[    0.937833] mt7621-nand 1e003000.nand: Using programmed access timing: 31c07388
[    0.952680] nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
[    0.965335] nand: Macronix MX30LF2G18AC
[    0.972974] nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[    0.988055] mt7621-nand 1e003000.nand: ECC strength adjusted to 4 bits
[    1.001087] mt7621-nand 1e003000.nand: Using programmed access timing: 21005134
[    1.015647] mt7621-nand 1e003000.nand: Using programmed access timing: 21005134
[    1.030208] Scanning device for bad blocks
[    1.604645] Bad eraseblock 446 at 0x0000037c0000
[    3.014983] Bad eraseblock 1551 at 0x00000c1e0000
[    3.286361] Bad eraseblock 1758 at 0x00000dbc0000
[    3.305267] Bad eraseblock 1766 at 0x00000dcc0000
[    3.671255] 6 fixed-partitions partitions found on MTD device mt7621-nand
[    3.684784] Creating 6 MTD partitions on "mt7621-nand":
[    3.695200] 0x000000000000-0x000000080000 : "u-boot"
[    3.706593] 0x000000080000-0x0000000e0000 : "u-boot-env"
[    3.718474] 0x0000000e0000-0x000000140000 : "factory"
[    3.729978] 0x000000140000-0x000000440000 : "kernel1"
[    3.741378] 0x000000440000-0x000000740000 : "kernel2"
[    3.752983] 0x000000740000-0x00000ff00000 : "ubi"
[    3.766349] libphy: Fixed MDIO Bus: probed
[    3.802483] libphy: mdio: probed
[    3.809195] mt7530 mdio-bus:1f: MT7530 adapts as multi-chip module
[    3.825773] mtk_soc_eth 1e100000.ethernet dsa: mediatek frame engine at 0xbe100000, irq 20
[    3.846779] NET: Registered protocol family 10
[    3.857197] Segment Routing with IPv6
[    3.864666] NET: Registered protocol family 17
[    3.873620] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[    3.899664] 8021q: 802.1Q VLAN Support v1.8
[    3.909868] mt7530 mdio-bus:1f: MT7530 adapts as multi-chip module
[    3.931845] libphy: dsa slave smi: probed
[    3.940310] mt7530 mdio-bus:1f eth0 (uninitialized): PHY [dsa-0.0:00] driver [Generic PHY]
[    3.958261] mt7530 mdio-bus:1f eth1 (uninitialized): PHY [dsa-0.0:01] driver [Generic PHY]
[    3.976315] mt7530 mdio-bus:1f eth2 (uninitialized): PHY [dsa-0.0:02] driver [Generic PHY]
[    3.994207] mt7530 mdio-bus:1f eth3 (uninitialized): PHY [dsa-0.0:03] driver [Generic PHY]
[    4.012273] mt7530 mdio-bus:1f eth4 (uninitialized): PHY [dsa-0.0:04] driver [Generic PHY]
[    4.030272] mt7530 mdio-bus:1f: configuring for fixed/rgmii link mode
[    4.047924] DSA: tree 0 setup
[    4.055068] UBI: auto-attach mtd5
[    4.061721] ubi0: attaching mtd5
[    4.068498] mt7530 mdio-bus:1f: Link is Up - 1Gbps/Full - flow control off
[    6.587532] ubi0: scanning is finished
[    6.614055] ubi0: attached mtd5 (name "ubi", size 247 MiB)
[    6.625014] ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
[    6.638702] ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
[    6.652222] ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
[    6.666085] ubi0: good PEBs: 1978, bad PEBs: 4, corrupted PEBs: 0
[    6.678217] ubi0: user volume: 2, internal volumes: 1, max. volumes count: 128
[    6.692608] ubi0: max/mean erase counter: 2/0, WL threshold: 4096, image sequence number: 23324735
[    6.710448] ubi0: available PEBs: 0, total reserved PEBs: 1978, PEBs reserved for bad PEB handling: 36
[    6.729009] ubi0: background thread "ubi_bgt0d" started, PID 467
[    6.731441] block ubiblock0_0: created from ubi0:0(rootfs)
[    6.751934] ubiblock: device ubiblock0_0 (rootfs) set to be root filesystem
[    6.765816] hctosys: unable to open rtc device (rtc0)
[    6.782057] VFS: Mounted root (squashfs filesystem) readonly on device 254:0.
[    6.800700] Freeing unused kernel memory: 1260K
[    6.809742] This architecture does not have kernel memory protection.
[    6.822569] Run /sbin/init as init process
[    6.961655] SQUASHFS error: xz decompression failed, data probably corrupt
[    6.975369] SQUASHFS error: squashfs_read_data failed to read block 0x61422
[    7.011259] SQUASHFS error: xz decompression failed, data probably corrupt
[    7.024970] SQUASHFS error: squashfs_read_data failed to read block 0x61422
[    7.039002] Starting init: /sbin/init exists but couldn't execute it (error -5)
[    7.053610] Run /etc/init as init process
[    7.061766] Run /bin/init as init process
[    7.071906] Run /bin/sh as init process
[    7.233750] SQUASHFS error: xz decompression failed, data probably corrupt
[    7.247464] SQUASHFS error: squashfs_read_data failed to read block 0x61422
[    7.261506] Starting init: /bin/sh exists but couldn't execute it (error -5)
[    7.275591] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[    7.303812] Rebooting in 1 seconds..
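
As a cross-check of the log above, here is a minimal Python sketch (not part of the original report) that maps the four reported bad eraseblocks onto the MTD partition table. The partition boundaries, the 128 KiB erase size and the eraseblock numbers are copied from the log; the helper names `partition_of`, `locate_bad_blocks` and `ubi_peb_count` are made up for illustration.

```python
ERASE_SIZE = 128 * 1024  # "erase size: 128 KiB" from the nand line in the log

# (start, end, name) copied from the "Creating 6 MTD partitions" lines
PARTITIONS = [
    (0x000000000000, 0x000000080000, "u-boot"),
    (0x000000080000, 0x0000000E0000, "u-boot-env"),
    (0x0000000E0000, 0x000000140000, "factory"),
    (0x000000140000, 0x000000440000, "kernel1"),
    (0x000000440000, 0x000000740000, "kernel2"),
    (0x000000740000, 0x00000FF00000, "ubi"),
]

# Eraseblock numbers from the "Bad eraseblock N at ..." lines
BAD_ERASEBLOCKS = [446, 1551, 1758, 1766]


def partition_of(offset):
    """Return the name of the partition containing a flash byte offset."""
    for start, end, name in PARTITIONS:
        if start <= offset < end:
            return name
    return None


def locate_bad_blocks():
    """Map each bad eraseblock number to (byte offset, partition name)."""
    return {
        blk: (blk * ERASE_SIZE, partition_of(blk * ERASE_SIZE))
        for blk in BAD_ERASEBLOCKS
    }


def ubi_peb_count():
    """Number of physical eraseblocks covered by the "ubi" partition."""
    start, end, _ = PARTITIONS[-1]
    return (end - start) // ERASE_SIZE


if __name__ == "__main__":
    for blk, (off, name) in locate_bad_blocks().items():
        print(f"eraseblock {blk} at {off:#x} -> {name}")
    print("PEBs in ubi partition:", ubi_peb_count())
```

Under these assumptions all four bad blocks fall inside the "ubi" partition, and the partition spans 1982 eraseblocks, which is consistent with the "good PEBs: 1978, bad PEBs: 4" line. In other words, the factory bad blocks are already accounted for by UBI, and the log by itself does not tie them to the SQUASHFS errors.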

@vjorlikowski

I am seeing the same sort of thing here on the GL.iNet GL-B1300 with 21.02.1

According to dmesg, my storage configuration looks like this:

[    0.616046] spi_qup 78b5000.spi: IN:block:16, fifo:64, OUT:block:16, fifo:64
[    0.618456] spi-nor spi0.0: mx25l25635e (32768 Kbytes)
[    0.624336] 9 fixed-partitions partitions found on MTD device spi0.0
[    0.629188] Creating 9 MTD partitions on "spi0.0":
[    0.635733] 0x000000000000-0x000000040000 : "SBL1"
[    0.641255] 0x000000040000-0x000000060000 : "MIBIB"
[    0.645880] 0x000000060000-0x0000000c0000 : "QSEE"
[    0.650752] 0x0000000c0000-0x0000000d0000 : "CDT"
[    0.655560] 0x0000000d0000-0x0000000e0000 : "DDRPARAMS"
[    0.660417] 0x0000000e0000-0x0000000f0000 : "APPSBLENV"
[    0.665309] 0x0000000f0000-0x000000170000 : "APPSBL"
[    0.670629] 0x000000170000-0x000000180000 : "ART"
[    0.675740] 0x000000180000-0x000002000000 : "firmware"
[    0.680854] 2 fit-fw partitions found on MTD device firmware
[    0.684571] Creating 2 MTD partitions on "firmware":
[    0.690460] 0x000000000000-0x000000390000 : "kernel"
[    0.696267] 0x0000003884d4-0x000001e80000 : "rootfs"
[    0.701138] mtd: device 10 (rootfs) set to be root filesystem
[    0.705488] 1 squashfs-split partitions found on MTD device rootfs
[    0.710928] 0x0000006f0000-0x000001e80000 : "rootfs_data"

My errors occur against the rootfs (mtdblock10).

When this occurs to me, I start seeing errors similar to:

Sun May 15 14:37:12 2022 kern.err kernel: [533320.106127] blk_update_request: I/O error, dev mtdblock10, sector 2594 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:15 2022 kern.err kernel: [533323.226313] blk_update_request: I/O error, dev mtdblock10, sector 3488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:16 2022 kern.err kernel: [533324.269099] blk_update_request: I/O error, dev mtdblock10, sector 2596 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:19 2022 kern.err kernel: [533327.385955] blk_update_request: I/O error, dev mtdblock10, sector 3490 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:20 2022 kern.err kernel: [533328.426176] blk_update_request: I/O error, dev mtdblock10, sector 2598 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:23 2022 kern.err kernel: [533331.546073] blk_update_request: I/O error, dev mtdblock10, sector 3492 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:25 2022 kern.err kernel: [533333.627892] blk_update_request: I/O error, dev mtdblock10, sector 2600 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:27 2022 kern.err kernel: [533335.707119] blk_update_request: I/O error, dev mtdblock10, sector 3494 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:28 2022 kern.err kernel: [533336.745936] blk_update_request: I/O error, dev mtdblock10, sector 2602 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:30 2022 kern.err kernel: [533338.825761] blk_update_request: I/O error, dev mtdblock10, sector 3496 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:31 2022 kern.err kernel: [533339.865779] blk_update_request: I/O error, dev mtdblock10, sector 2604 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Sun May 15 14:37:33 2022 kern.err kernel: [533340.905699] blk_update_request: I/O error, dev mtdblock10, sector 3498 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

which then progress to:

Sun May 15 14:37:33 2022 kern.err kernel: [533340.910849] SQUASHFS error: squashfs_read_data failed to read block 0x130f86
Sun May 15 14:37:33 2022 kern.err kernel: [533340.935308] SQUASHFS error: squashfs_read_data failed to read block 0x1b407e
Sun May 15 14:37:33 2022 kern.err kernel: [533340.935350] SQUASHFS error: Unable to read fragment cache entry [1b407e]
Sun May 15 14:37:33 2022 kern.err kernel: [533340.941430] SQUASHFS error: Unable to read page, block 1b407e, size c9c4
Sun May 15 14:37:33 2022 kern.err kernel: [533340.948634] SQUASHFS error: Unable to read fragment cache entry [1b407e]
Sun May 15 14:37:33 2022 kern.err kernel: [533340.955139] SQUASHFS error: Unable to read page, block 1b407e, size c9c4
Sun May 15 14:37:33 2022 kern.err kernel: [533340.975622] SQUASHFS error: xz decompression failed, data probably corrupt
Sun May 15 14:37:33 2022 kern.err kernel: [533340.975666] SQUASHFS error: squashfs_read_data failed to read block 0x130f86
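
The two sets of messages above can be lined up with a little arithmetic. The following Python sketch is only an illustration and rests on two assumptions: that the block values in the SQUASHFS messages are byte offsets from the start of the filesystem, and that the `blk_update_request` sectors are 512-byte units counted from the start of the same device (mtdblock10). The function name `sector_span` is hypothetical.

```python
SECTOR_SIZE = 512  # assumed block-layer sector size for blk_update_request


def sector_span(byte_offset, length):
    """Return (first, last) 512-byte sector touched by a byte range."""
    first = byte_offset // SECTOR_SIZE
    last = (byte_offset + length - 1) // SECTOR_SIZE
    return first, last


# "Unable to read page, block 1b407e, size c9c4" from the SQUASHFS messages
BLOCK_OFFSET = 0x1B407E
BLOCK_SIZE = 0xC9C4

# One of the two interleaved sector sequences from the I/O error messages
FAILED_SECTORS = [3488, 3490, 3492, 3494, 3496, 3498]

if __name__ == "__main__":
    first, last = sector_span(BLOCK_OFFSET, BLOCK_SIZE)
    print(f"squashfs block {BLOCK_OFFSET:#x} covers sectors {first}..{last}")
    for sector in FAILED_SECTORS:
        inside = first <= sector <= last
        print(f"sector {sector}: {'inside' if inside else 'outside'} that block")
```

If those assumptions hold, the failing sectors 3488 to 3498 sit at the very start of the compressed block at 0x1b407e, so the SQUASHFS errors would be a direct consequence of the failed reads rather than a separate fault.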

Squashfs caches the read failures until the hardware is rebooted, whereupon everything is once again "fine": after the reboot I can run read checks against the entire rootfs without hitting any obvious storage errors. The read errors appear at seemingly random times but, once squashfs has cached them, only a reboot resolves the situation.

This is clearly some issue with the storage controller that the caching in squashfs makes worse.
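
A read check of the kind described above could look like the following Python sketch. It is an illustration only, not the procedure the commenter actually used: it reads a device or image several times and compares SHA-256 digests. The function names are hypothetical, and the device path you pass in (for example the rootfs block device) is your own choice. Note that repeated reads may be served from the page cache, so on a real device the cache would need to be dropped between passes for the comparison to say anything about the flash itself.

```python
import hashlib


def digest_pass(path, chunk_size=64 * 1024):
    """Read the whole file or device once and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as dev:
        while True:
            chunk = dev.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def repeated_read_check(path, passes=3):
    """Read `path` several times.

    Returns (consistent, results). Each entry in results is either a hex
    digest or a string starting with "read error:" if that pass failed.
    """
    results = []
    for _ in range(passes):
        try:
            results.append(digest_pass(path))
        except OSError as exc:
            results.append(f"read error: {exc}")
    consistent = (
        len(set(results)) == 1 and not results[0].startswith("read error:")
    )
    return consistent, results
```

A pass that raises an I/O error or produces a different digest from the others points at unstable reads from the storage, which is the behaviour described in this comment.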

@ynezz added the labels target/ramips, kernel and release/22.03 on May 23, 2022
@ynezz changed the title from "FS#4100 - SQUASHFS errors with OpenWrt 21.02" to "ramips/mt7621: SQUASHFS filesystem corruption" on May 23, 2022
@vjorlikowski

@ynezz This issue is occurring for me as well, and not on hardware that is ramips-based (GL.iNet GL-B1300).
Squashfs starts reporting unrecoverable corruption (that resolves on reboot), due to some instability with the underlying storage.

Should I open a separate bug for the issue, for my hardware?

@dkadioglu

dkadioglu commented Jun 16, 2022

The same happened to my EdgeRouter X SFP on Monday, but I only got home today to investigate. A reboot didn't resolve it; I still got the same SQUASHFS errors. I then reflashed the same build and the router works again as it should, without SQUASHFS errors.
As far as I understand it, this error is difficult to reproduce and can vary in how it shows up. If there is anything I can do to help investigate further, please ask. I am a little nervous that the next failure (possibly a permanent one) could come at any time...

DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='SNAPSHOT'
DISTRIB_REVISION='r18376-15d0c4d5cd'
DISTRIB_TARGET='ramips/mt7621'
DISTRIB_ARCH='mipsel_24kc'
DISTRIB_DESCRIPTION='OpenWrt SNAPSHOT r18376-15d0c4d5cd'
DISTRIB_TAINTS='override'

@M95D
Contributor

M95D commented Jun 16, 2022

@jlpapple

@ynezz This issue is occurring for me as well, and not on hardware that is ramips-based (GL.iNet GL-B1300). Squashfs starts reporting unrecoverable corruption (that resolves on reboot), due to some instability with the underlying storage.

Should I open a separate bug for the issue, for my hardware?

I agree, this issue should not be exclusively titled or tagged as an mt7621 platform issue. Frankly, I think the original title should be restored, as the issue also exists on ath79/Atheros devices.

@Spudz76
Contributor

Spudz76 commented Dec 14, 2022

The readahead code those patches fix wasn't added until after 5.15, so they wouldn't apply.

I did find this reference to a possible earlier bug that was never fixed, and it also looks like it's within the xz decompressor, so my suspicion that switching to ZSTD gets rid of the problem by going around it could be true.

@532910
Contributor

532910 commented Dec 14, 2022

Should the target/ramips tag be removed and the title updated?

@Spudz76
Contributor

Spudz76 commented Dec 14, 2022

I don't know whether non-MIPS32 or even non-mt7621 targets have the same problem; it seems we'd have heard about it a lot more if it affected all CPU types, or even all MIPS32 types. Changing the default compressor to ZSTD across the board might happen sooner than anyone figuring out the bug; it's almost always the same or better than XZ anyway, especially at a 1024 KB block size. We need at least one or two more acks that switching to ZSTD avoids the problem, so that we can rule out a general squashfs bug and confirm it as an XZ bug. The idea of switching to ZSTD by default has already been floated, and this would be as good a reason as any to make that easy change.

@532910
Contributor

532910 commented Dec 14, 2022

I have the same issue on er (target: octeon, platform:ubnt_edgerouter) with emmc

@Spudz76
Contributor

Spudz76 commented Dec 14, 2022

If that's also with XZ, then try ZSTD instead and see whether it avoids the issue on that platform too (MIPS64, it seems). It could still be an any-MIPS rather than an any-CPU problem.

Extroot on USB also works around the issue, but that is possibly just another way of avoiding the XZ compressor. It would probably still be buggy with squashfs+xz on USB storage.

@Mossop

Mossop commented Dec 15, 2022

Can anyone explain how to build the firmware to use zstd?

@Spudz76
Contributor

Spudz76 commented Dec 15, 2022

Oops, I build from my own set of patches on top of master, including #11381 which puts it all in the menuconfig

So I guess build from that PR because I forgot it's not selectable in regular master (thus default-for-all XZ)

@532910
Contributor

532910 commented Dec 15, 2022

Is it possible to re-compress xz image into zstd?

@Spudz76
Contributor

Spudz76 commented Dec 15, 2022

The kernel doesn't have the ZSTD decompressor without using the above PR, or if it gets merged.

So yes you could forcibly compress it with ZSTD but the kernel won't know how to unpack it unless you also hack the Kconfig defaults a bit to match. So you might as well use the PR which has that all done already and a menu option.
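
To see which compressor an existing squashfs image uses before deciding whether to rebuild, the compression ID can be read from the squashfs superblock. The Python sketch below is an illustration under stated assumptions: it expects the first 32 bytes of a little-endian squashfs 4.x filesystem, and the function name `squashfs_compressor` is hypothetical. On an OpenWrt sysupgrade image the squashfs part usually sits at some offset inside the file, so it has to be located or extracted first; that step is not shown.

```python
import struct

SQUASHFS_MAGIC = 0x73717368  # the bytes b"hsqs" read as a little-endian u32

# Compression IDs as used by squashfs 4.x
COMPRESSORS = {1: "gzip", 2: "lzma", 3: "lzo", 4: "xz", 5: "lz4", 6: "zstd"}


def squashfs_compressor(header: bytes):
    """Parse the start of a squashfs superblock.

    Returns (compressor name, block size in bytes, (major, minor) version).
    Raises ValueError if the data does not look like little-endian squashfs.
    """
    if len(header) < 32:
        raise ValueError("need at least 32 bytes of superblock")
    (magic,) = struct.unpack_from("<I", header, 0)
    if magic != SQUASHFS_MAGIC:
        raise ValueError("squashfs magic not found at offset 0")
    (block_size,) = struct.unpack_from("<I", header, 12)
    (comp_id,) = struct.unpack_from("<H", header, 20)
    major, minor = struct.unpack_from("<HH", header, 28)
    name = COMPRESSORS.get(comp_id, f"unknown ({comp_id})")
    return name, block_size, (major, minor)
```

Typical use would be reading the first 32 bytes of an extracted rootfs, for example `squashfs_compressor(open("root.squashfs", "rb").read(32))`, where `root.squashfs` is a placeholder file name.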

@csharper2005
Contributor

@Spudz76 hi! The patch above was merged. Do you have any examples of how to apply the fix for specific devices?

@Spudz76
Contributor

Spudz76 commented Jan 31, 2023

Oops, the one that needs to be merged for this is #11328

That one was for initramfs where squashfs is not used. I think it's possible to switch to the other method, but this corruption occurs in squashfs mode so it needs the similar patch applied.

edit: I do need to rebase and clean that one up though so it might be accepted.

@Spudz76
Contributor

Spudz76 commented Jan 31, 2023

For reference my config contains the relevant settings, that work with #11328 (which is about to be updated, doing a compile and run test now):

CONFIG_USES_INITRAMFS=y
# CONFIG_TARGET_ROOTFS_INITRAMFS is not set
CONFIG_USES_SQUASHFS=y
CONFIG_TARGET_ROOTFS_SQUASHFS=y
# CONFIG_BUSYBOX_DEFAULT_FEATURE_VOLUMEID_SQUASHFS is not set
# CONFIG_KERNEL_SQUASHFS_COMP_DEFAULT_UNSPEC is not set
CONFIG_KERNEL_SQUASHFS_COMP_DEFAULT_ZSTD=y
CONFIG_KERNEL_SQUASHFS_FRAGMENT_CACHE_SIZE=3
# CONFIG_KERNEL_SQUASHFS_GZIP is not set
# CONFIG_KERNEL_SQUASHFS_LZ4 is not set
# CONFIG_KERNEL_SQUASHFS_LZO is not set
# CONFIG_KERNEL_SQUASHFS_XATTR is not set
# CONFIG_KERNEL_SQUASHFS_XZ is not set
CONFIG_KERNEL_SQUASHFS_ZSTD=y
CONFIG_TARGET_SQUASHFS_BLOCK_SIZE=1024
# CONFIG_TOOLS_SQUASHFS_GZIP is not set
# CONFIG_TOOLS_SQUASHFS_LZ4 is not set
# CONFIG_TOOLS_SQUASHFS_LZO is not set
# CONFIG_TOOLS_SQUASHFS_XATTR is not set
# CONFIG_TOOLS_SQUASHFS_XZ is not set
CONFIG_TOOLS_SQUASHFS_ZSTD=y
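
One thing the settings above have to get right is that the kernel-side and tools-side compressors agree, since an image packed with a compressor the kernel lacks cannot be mounted. The Python sketch below is a hedged illustration of checking that in a `.config` file. The `CONFIG_KERNEL_SQUASHFS_*` and `CONFIG_TOOLS_SQUASHFS_*` symbol names are taken from the listing above (they come from the PR, not from stock OpenWrt), and the function names are hypothetical.

```python
def parse_config(text):
    """Map CONFIG_* symbols to their value.

    Lines of the form "# CONFIG_X is not set" map CONFIG_X to None.
    """
    suffix = " is not set"
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            symbols[line[2:-len(suffix)]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            symbols[key] = value
    return symbols


def enabled_compressors(symbols, prefix):
    """List compressors set to 'y' under e.g. CONFIG_KERNEL_SQUASHFS."""
    names = ("GZIP", "LZ4", "LZO", "XZ", "ZSTD")
    return sorted(n for n in names if symbols.get(f"{prefix}_{n}") == "y")


def compressors_match(symbols):
    """True if the kernel and the host tools enable the same compressors."""
    kernel = enabled_compressors(symbols, "CONFIG_KERNEL_SQUASHFS")
    tools = enabled_compressors(symbols, "CONFIG_TOOLS_SQUASHFS")
    return kernel == tools
```

For the configuration listed in this comment, both sides would report only ZSTD, so `compressors_match` would return `True`.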

@dmsza

dmsza commented Mar 28, 2023

I was experiencing this problem on a TP-Link RE305 v3 (mt76x8) with OpenWrt 22.03 (it was so bad I had to revert to factory firmware). Today I noticed the following commit in OpenWrt git master from 3/27:

mediatek: add support for SPI calibration

Newer MediaTek's SoCs need SPI calibration routines for SPI to work
reliably. Import patches for that from MediaTek's SDK.

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=540377010532d9840ebe0778c2af991bd4b67052

I just did a new build from master and after 4 hours of uptime the issue has not yet appeared.

Hopefully the above commit may have finally fixed this issue.

@dkadioglu

dkadioglu commented Apr 5, 2023

Hopefully the above commit may have finally fixed this issue.

@d-me3 How is it so far? Has the commit improved anything?

@dmsza

dmsza commented Apr 5, 2023

@dkadioglu Yes, I ran a master snapshot build that includes the above commit for ~5 days on the RE305v3 that had this issue, and the issue disappeared. I can say it improved things a lot or even solved the problem.

@rnhmjoj

rnhmjoj commented Apr 28, 2023

@dangowrt Can you confirm 5403770 solves the squashfs corruption and close the issue?

@Djfe
Contributor

Djfe commented Apr 28, 2023

Would it make sense to backport this patch, or is 5.10 too old? The mentioned commit adds patches for 5.15 only.

@jlpapple

@dangowrt Can you confirm 5403770 solves the squashfs corruption and close the issue?

Reminder, this issue affects other platforms, including ath79. That commit only addresses mediatek.

@DocSniper

The problem still seems to be there. We are currently testing the latest snapshots on the Archer AX23 (mt7621) and despite this patch we continue to have the SquashFS corruption issue. A user just created an image with a different compression of the SquashFS and so far it looks good. Since the relation of the error to compression has already been mentioned several times, it might make sense to switch from xz to zstd throughout the project.

@csharper2005
Contributor

@dangowrt Can you confirm 5403770 solves the squashfs corruption and close the issue?

This commit is about SPI. The issue affects NAND routers too. Some users ran into squashfs corruption only after several months or even a year.

@Spudz76
Contributor

Spudz76 commented Apr 29, 2023

I have not had any corruption since I switched to ZSTD.

@csharper2005
Contributor

I have not had any corruption since I switched to ZSTD.

Do we have any devices switched to ZSTD in official OpenWrt? I would like to look at an example and switch some Xiaomi and Sercomm-based devices to ZSTD too.

@532910
Contributor

532910 commented Jul 14, 2023

Unfortunately, zstd doesn't solve my issue:

[   14.634616] SQUASHFS error: zstd decompression error: 14
[   14.639973] SQUASHFS error: zstd decompression failed, data probably corrupt
[   14.647053] SQUASHFS error: Failed to read block 0x66: -5
[   14.652479] SQUASHFS error: Unable to read data cache entry [66]
[   14.658502] SQUASHFS error: Unable to read page, block 66, size 1ce97
[   14.665015] SQUASHFS error: Unable to read data cache entry [66]
[   14.671049] SQUASHFS error: Unable to read page, block 66, size 1ce97
[   14.677543] SQUASHFS error: Unable to read data cache entry [66]
[   14.683582] SQUASHFS error: Unable to read page, block 66, size 1ce97
[   14.944738] random: jshn: uninitialized urandom read (4 bytes read)

Though it's cavium octeon in my case, and not mt7621.
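
When comparing logs like the one above across devices, it can help to see whether every error points at the same block. The Python sketch below is a rough illustration and not something used in this thread: it counts how often each block appears in `SQUASHFS error` lines. The regular expression is an assumption based only on the message formats quoted in this issue, and the function name `blocks_mentioned` is hypothetical.

```python
import re
from collections import Counter

# Matches "block 0x66", "block 66" and "entry [66]" in SQUASHFS error lines.
BLOCK_RE = re.compile(
    r"SQUASHFS error: .*?(?:block (?:0x)?|entry \[)([0-9a-fA-F]+)"
)


def blocks_mentioned(log_text):
    """Count SQUASHFS error lines per block offset (parsed as hex)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = BLOCK_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts
```

Applied to the zstd log in this comment, every line that names a block refers to block 0x66, which suggests a single unreadable region rather than scattered damage.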

@jlpapple

Update: I installed today’s SNAPSHOT (Aug 14 2023) on a WD MyNet N750 and the SquashFS errors persist.

It survived about 4 reboots in a 12 hour period before the errors appeared. Running kernel 5.15.126, default install, no overlay.

The persistence of this issue without a solution (other than a USB overlay) is a real shame.

@Spudz76
Contributor

Spudz76 commented Aug 15, 2023

There is a solution: use ZSTD. Unfortunately my patches that make it easier to select have not been updated in a while, mainly because I kept them updated for months without acceptance, so I decided nobody cares and I have better things to do. The default squashfs compression method, XZ, seems buggy on this hardware for some unknown reason (missing cache flushes?). It did not seem worth it to hunt down the XZ bug, since ZSTD is superior in nearly all ways anyhow.

Of course you could also rebase my slightly outdated PR #11328 for yourself (package renaming and acceptance of some of my other patches in the interim made it a mess), or examine it for how to just hardcode-switch the compression method. I had nothing but random corruption over time with the default compression, and zero issues once I switched to ZSTD.

Probably this platform should just be hard switched to ZSTD even if my menu-config stuff isn't assimilated, then no end-user has to do anything special to "fix" it. But I'm not sure if that's even that easy to do in a platform/board profile without half the menu patches that add the knobs.

Perhaps I'll blow a day on rebasing the thing someday. Or perhaps not...

@ShapeShifter499

That's a shame; I have three WD MyNet N750s lying around. I'm not confident enough in my skills to figure it out. In the meantime I've just been using the bootstrap method, keeping most of the install on a connected USB drive.

@jlpapple

FYI, a commit to address this issue on the ipq40xx platform was submitted in September, see below.

Can someone review the code and submit commits for other platforms like ath79, etc? I'd do it, but I don't have the requisite coding skills.

https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=98d325aaf8bef992cc92e94feb14fe271d370dc0

@ShapeShifter499

Has anyone been able to test @jlpapple's idea?
