ramips/mt7621: SQUASHFS filesystem corruption #9085
Comments
crowston: I tried installing on a different router and after a few powercycles saw the same SQUASHFS errors, suggesting it's not just bad memory: Fri Oct 22 11:30:14 2021 kern.err kernel: [ 97.569402] SQUASHFS error: xz decompression failed, data probably corrupt But most of the time it seems to work fine. |
M95D: I have this exact problem with WRT1900ACv1, OpenWRT built from git master. It won't boot at all with the new firmware. |
M95D: More debugging: Apparently, the image is not correctly written to flash. Reading back the squashfs and trying to mount it on a x86 Gentoo linux gives the same decompression errors. See attachment for details. |
M95D: Even more debugging: I extracted the squashfs from the original firmware image that was uploaded to the router. They are identical, except for some extra 0xFF at the end (ubifs read back from the router's mtd is larger, probably because it extends until the end of the erase block). So, it's not a flash write issue, and it's not a hardware defect. |
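The comparison described above (original squashfs versus the copy read back from flash, identical apart from trailing 0xFF) could be scripted roughly as follows. This is only a sketch: the function name and the file names in the usage note are hypothetical, and it assumes both images fit in memory.

```python
def same_payload(original: bytes, readback: bytes) -> bool:
    """Return True if `readback` starts with `original` and any extra
    bytes after it are all 0xFF (the value of erased flash)."""
    if len(readback) < len(original):
        # The read-back copy is truncated, so it cannot match.
        return False
    head = readback[:len(original)]
    tail = readback[len(original):]
    # rstrip removes every trailing 0xFF; an all-0xFF tail becomes empty.
    return head == original and tail.rstrip(b"\xff") == b""
```

Usage might look like `same_payload(open("root.squashfs", "rb").read(), open("mtd_readback.bin", "rb").read())`, where both file names are placeholders for the extracted rootfs and the dump taken from the router.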
M95D: It seems that the ARM BCJ filter decoder is needed in the kernel, even on the desktop. Having only the x86 BCJ filter decoder won't help. Maybe a warning should be put somewhere to alert users who alter the default kernel config. |
brianmercer: My WD Mynet N750 is also unstable and also displays these same errors in the log. |
danak6jq: I am also seeing this on a WD MyNet N750, starting with 21.02.1. I made an attempt to build a kernel/image with ARM BCJ pinned to the kernel and it did not make a difference. |
I'm also seeing this issue on a WD MyNet N750, with a fresh download of 21.02.2 from https://firmware-selector.openwrt.org/?version=21.02.2&target=ath79%2Fgeneric&id=wd_mynet-n750 |
Someone found the true problem: |
I also ran into the issue after updating my WD MyNet N750 to 21.02.2 r16495-bf0c965af0 from a 19.x version. After about five days, I get a high CPU load and the same read errors: kern.err kernel: [ 1177.557521] SQUASHFS error: Unable to read fragment cache entry [270732] I re-flashed the same version and for the moment it works fine again. |
@EccoB have you power cycled it yet? I find it weird that it can run initially but that, at least in my experience, a power cycle causes issues. I never had that issue with OpenWrt 19.x. |
@ShapeShifter499 Till now, I did not and there were no errors so far.
|
The router was screwed (see last post): LuCI reported that the password was not set (which shouldn't be the case), and there were lots of CRC errors.
Over the next days, I will monitor the behaviour and document if there are any issues. If there is something I can do for further investigation you may tell me. |
Hello, I recently went through the same issue with my edgerouter-x:
root@edgerouterx:~# cat /etc/openwrt_release
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='21.02.0'
DISTRIB_REVISION='r16279-5cc0535800'
DISTRIB_TARGET='ramips/mt7621'
DISTRIB_ARCH='mipsel_24kc'
DISTRIB_DESCRIPTION='OpenWrt 21.02.0 r16279-5cc0535800'
DISTRIB_TAINTS=''
Sorry, I reinstalled everything and did not take the time to save logs; I will come back if it happens again. I did nothing special except disable the uhttpd service and reboot. Then I noticed that clients don't get their IPs (DNS issue), and when I looked in the logs (dmesg) I had a lot of SQUASHFS errors. |
Examples of various SQUASHFS, jffs2 errors from my N750, running the March 7 snapshot. I do not encounter any errors running 19.07.X
|
I'm seeing a similar thing on a Ubiquiti ER-X which has been stable and running 21.02.1 for many months.
I noticed today that LuCI and uhttpd are not running
|
Same here: I changed some config and rebooted. Now the ER-X is stuck in a boot loop after running stably for ~2 years
|
I am seeing the same sort of thing here on the GL.iNet GL-B1300 with 21.02.1. According to dmesg, my storage configuration looks like this:
My errors occur against the rootfs (mtdblock10). When this occurs to me, I start seeing errors similar to:
which then progress to:
Squashfs caches the read failures until the hardware is rebooted, whereupon everything is once again "fine": after the reboot I am able to perform read checks against the entire rootfs without encountering any obvious storage errors. The read errors seem to appear at random, but once squashfs has cached them, only a reboot resolves the situation. To me this looks like an issue with the storage controller that the caching in squashfs makes worse. |
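A read check of the whole rootfs, as mentioned above, could be approximated by hashing the device several times and comparing the digests. This is a minimal sketch, not the procedure the poster used; the function names are made up for illustration, and on a live system the page cache may satisfy repeated reads, so identical digests do not prove the flash was actually read again.

```python
import hashlib

def digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of everything readable at `path`, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def reads_are_stable(path: str, passes: int = 3) -> bool:
    """True if `passes` consecutive full reads give the same digest."""
    return len({digest(path) for _ in range(passes)}) == 1
```

On the device from this comment the call would be something like `reads_are_stable("/dev/mtdblock10")`; a `False` result would point at reads that differ between passes rather than at data that is permanently wrong on flash.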
@ynezz This issue is occurring for me as well, and not on hardware that is ramips-based (GL.iNet GL-B1300). Should I open a separate bug for the issue, for my hardware? |
The same happened to my Edgerouter X SFP on Monday. However, just came home today to investigate. A reboot didn't resolve it, still the same SQUASHFS errors. I then reflashed the same build and the router works again as it should, without SQUASHFS errors.
|
See if this helps: |
I agree, this issue should not be exclusively titled or tagged as a mt7621 platform issue. Frankly, I think the original title should be restored, as the issue also exists on Ath79/Atheros devices. |
The readahead stuff those patches fix wasn't added until after 5.15, so they wouldn't apply. I did find this reference to a possible earlier bug that was never fixed, and it also looks like it's within the xz decompressor, so my suspicion that switching to ZSTD gets rid of the problem by going around it could be true. |
Should the |
I don't know if non-MIPS32 or even non-mt7621 have the same problem, seems like we'd have heard of it a lot more if it was on all CPU types, or even all MIPS32 types. Changing the overall default compressor to ZSTD in general might happen sooner than figuring out the bug, it's almost always the same or better than XZ anyway, especially at 1024KB blocksize. Need at least one or two other acks that switching to ZSTD avoids the problem, and eliminate that it's a general squashfs bug, confirm it as a compressor bug in XZ. The idea of switching to ZSTD by default was already punted around, this would be as good a reason as any to just make that easy change. |
I have the same issue on er (target: octeon, platform:ubnt_edgerouter) with emmc |
If that's also with XZ, then try out ZSTD instead and see if it avoids the issue on that platform also (it looks like MIPS64). It could still be any-MIPS rather than any-CPU. Extroot on USB also works around the issue, but that is possibly just another way to avoid the XZ compressor. It would probably still be buggy if it were squashfs+xz on USB storage. |
Can anyone explain how to build the firmware to use zstd? |
Oops, I build from my own set of patches on top of So I guess build from that PR because I forgot it's not selectable in regular |
Is it possible to re-compress an xz image as zstd? |
The kernel doesn't have the ZSTD decompressor without using the above PR, or if it gets merged. So yes you could forcibly compress it with ZSTD but the kernel won't know how to unpack it unless you also hack the Kconfig defaults a bit to match. So you might as well use the PR which has that all done already and a menu option. |
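When experimenting with a different compressor, it can help to confirm which one a given rootfs actually declares. The sketch below reads the compression ID from a SquashFS 4.x superblock (16-bit little-endian field at byte offset 20). It assumes the input is the bare squashfs, not a full sysupgrade image in which the rootfs sits at some offset after the kernel; the function name is hypothetical.

```python
import struct

SQUASHFS_MAGIC = 0x73717368  # the bytes "hsqs" read as a little-endian u32
COMPRESSORS = {1: "gzip", 2: "lzma", 3: "lzo", 4: "xz", 5: "lz4", 6: "zstd"}

def squashfs_compressor(header: bytes) -> str:
    """Name the compressor declared in a SquashFS 4.x superblock."""
    if len(header) < 22:
        raise ValueError("need at least the first 22 bytes of the image")
    (magic,) = struct.unpack_from("<I", header, 0)
    if magic != SQUASHFS_MAGIC:
        raise ValueError("no little-endian SquashFS magic at offset 0")
    (comp_id,) = struct.unpack_from("<H", header, 20)
    return COMPRESSORS.get(comp_id, f"unknown ({comp_id})")
```

For example, `squashfs_compressor(open("root.squashfs", "rb").read(32))`, with `root.squashfs` standing in for an extracted rootfs. Note that this only reports what the image says; the running kernel still needs the matching decompressor built in, as explained above.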
@Spudz76 hi! The patch above was merged. Do you have any examples of how to apply the fix for specific devices? |
Oops, the one that needs to be merged for this is #11328. The merged one was for initramfs, where squashfs is not used. I think it's possible to switch to the other method, but this corruption occurs in squashfs mode, so it needs the similar patch applied. Edit: I do need to rebase and clean that one up, though, so it might be accepted. |
For reference my config contains the relevant settings, that work with #11328 (which is about to be updated, doing a compile and run test now):
|
I was experiencing this problem on a TP-Link RE305 v3 (mt76x8) with OpenWrt 22.03 (it was so bad I had to revert to factory firmware). Today I've noticed the following commit in OpenWrt git master from 3/27:
I just did a new build from master and after 4 hours of uptime the issue has not yet appeared. Hopefully the above commit may have finally fixed this issue. |
@d-me3 How is it so far? Has the commit improved anything? |
@dkadioglu Yes, I ran a master snapshot build with the above commit on the RE305v3 that had this issue for ~5 days, and the issue disappeared. I can say it improved things a lot or even solved the problem. |
Would it make sense to backport this patch, or is 5.10 too old? The mentioned commit adds patches for 5.15 only. |
The problem still seems to be there. We are currently testing the latest snapshots on the Archer AX23 (mt7621) and despite this patch we continue to have the SquashFS corruption issue. A user just created an image with a different compression of the SquashFS and so far it looks good. Since the relation of the error to compression has already been mentioned several times, it might make sense to switch from xz to zstd throughout the project. |
I have not had any corruption since I switched to ZSTD. |
Do we have any devices switched to ZSTD in official OpenWrt? I would like to take a look at the example and switch to ZSTD some Xiaomi and Sercomm-based devices too. |
Unfortunately, zstd doesn't solve my issue:
Though it's cavium octeon in my case, and not mt7621. |
Update: I installed today’s SNAPSHOT (Aug 14 2023) on a WD MyNet N750 and the SquashFS errors persist. It survived about 4 reboots in a 12 hour period before the errors appeared. Running kernel 5.15.126, default install, no overlay. The persistence of this issue without a solution (other than a USB overlay) is a real shame. |
There is a solution, use ZSTD. Unfortunately my patches which make it easier to select have not been updated in a while, mainly because I kept them updated for months without acceptance so I decided nobody cares, and I have better things to do. Default squashfs compression method zlib seems buggy on this hardware for some unknown reason (missing cache flushes?). It did not seem worth it at all to find the zlib bug since ZSTD is superior in nearly all ways anyhow. Of course you could also rebase my slightly outdated PR #11328 for yourself (package renaming and acceptance of some of my other patches in the interim made it a mess), or examine it for how to just hardcode-switch the compression method. I had nothing but random corruption over time with the default compression, and zero issues once I switched to ZSTD. Probably this platform should just be hard switched to ZSTD even if my menu-config stuff isn't assimilated, then no end-user has to do anything special to "fix" it. But I'm not sure if that's even that easy to do in a platform/board profile without half the menu patches that add the knobs. Perhaps I'll blow a day on rebasing the thing someday. Or perhaps not... |
That's a shame, I have three WD MyNet N750s lying around. I'm not confident in my skills to figure it out. In the meantime I've just been using the bootstrap method by having most of the install on a connected USB drive. |
FYI, a commit to address this issue on the ipq40xx platform was submitted in September, see below. Can someone review the code and submit commits for other platforms like ath79, etc? I'd do it, but I don't have the requisite coding skills. https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=98d325aaf8bef992cc92e94feb14fe271d370dc0 |
Has anyone been able to test @jlpapple's idea? |
crowston:
Supply the following if possible:
Device: Western Digital My Net N750
OpenWrt release: openwrt-21.02.0
Installed packages: strongswan, dnscrypt-proxy2, avahi-utils, luci-app-ddns
I installed openwrt-21.02.0-ath79-generic-wd_mynet-n750-squashfs-sysupgrade.bin on a Western Digital My Net N750 that had been running openwrt-19.
The router seemed okay initially but after power cycling, it started reporting errors:
Oct 17 12:20:37 router2 kernel: [ 38.613970] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 12:20:37 router2 kernel: [ 38.621029] SQUASHFS error: squashfs_read_data failed to read block 0x23686e
Oct 17 12:20:37 router2 kernel: [ 38.628199] SQUASHFS error: Unable to read fragment cache entry [23686e]
Oct 17 12:20:37 router2 kernel: [ 38.635010] SQUASHFS error: Unable to read page, block 23686e, size 16b28
The filesystem problem would leave some random file damaged, so different services would fail. Over time, the router became less and less functional as various files became inaccessible and after a few cycles, wouldn't boot at all.
I wondered if there was a problem with my old configuration on the new release (though I'm not sure how that could damage the squashfs), so I reinstalled a few more times in different ways, e.g., doing a factory install (openwrt-21.02.0-ath79-generic-wd_mynet-n750-squashfs-factory.bin and then the upgrade) instead of just the upgrade, and configuring from scratch rather than from the backup. But each time I had the same problem with the router.
It wasn't the same block on different installs, I noticed, but it seemed to be consistent for a particular installation attempt.
Oct 17 16:11:14 router2 kernel: [ 53.182571] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 16:11:14 router2 kernel: [ 53.189582] SQUASHFS error: squashfs_read_data failed to read block 0x21e9e6
Oct 17 16:11:14 router2 kernel: [ 53.196749] SQUASHFS error: Unable to read fragment cache entry [21e9e6]
Oct 17 16:11:14 router2 kernel: [ 53.203559] SQUASHFS error: Unable to read page, block 21e9e6, size fd9c
Once there were two blocks (I think this is a reboot of the install above):
Oct 17 16:29:04 router2 kernel: [ 78.505075] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 16:29:04 router2 kernel: [ 78.512103] SQUASHFS error: squashfs_read_data failed to read block 0x1e6e76
Oct 17 16:29:05 router2 kernel: [ 79.111366] SQUASHFS error: xz decompression failed, data probably corrupt
Oct 17 16:29:05 router2 kernel: [ 79.118386] SQUASHFS error: squashfs_read_data failed to read block 0x21e9e6
Oct 17 16:29:05 router2 kernel: [ 79.125565] SQUASHFS error: Unable to read fragment cache entry [21e9e6]
Oct 17 16:29:05 router2 kernel: [ 79.132445] SQUASHFS error: Unable to read page, block 21e9e6, size fd9c
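To check whether the same blocks fail across reboots of one installation, as observed above, the failing block addresses can be pulled out of the kernel log and counted. A small sketch, assuming the log lines have the format shown in this report; the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the "failed to read block" lines shown in the logs above.
BLOCK_RE = re.compile(
    r"SQUASHFS error: squashfs_read_data failed to read block (0x[0-9a-fA-F]+)"
)

def failed_blocks(log_text: str) -> Counter:
    """Count how often each squashfs block address failed to read."""
    return Counter(int(m.group(1), 16) for m in BLOCK_RE.finditer(log_text))

sample = """\
Oct 17 16:29:04 router2 kernel: [ 78.512103] SQUASHFS error: squashfs_read_data failed to read block 0x1e6e76
Oct 17 16:29:05 router2 kernel: [ 79.118386] SQUASHFS error: squashfs_read_data failed to read block 0x21e9e6
"""
print(sorted(hex(block) for block in failed_blocks(sample)))
```

Feeding it the concatenated logs from several boots would show whether a block such as 0x21e9e6 keeps recurring, which is what the two excerpts above suggest.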
One time there was first a jffs error, followed by lots of squashfs errors. Sorry, I don't have the log for that one.
I now realize that I should have tried power cycling a clean install a few times to see if there were errors right away or if they only happened after files were installed/changed.
To check whether the router was just having a hardware problem, I reinstalled openwrt-19.07.8 and configured it the same. I have not seen any errors after a few power cycles, which points to a problem with the new release. I did not see any bug reports on this tracker that mention squashfs problems and googling, I did not find any useful discussions, hence this bug report.
I guess it could be that the new release uses a bad bit of memory that the earlier release managed to miss. I looked for but didn't find a memory test utility, so I don't know how to examine that possibility. Though the fact that it was different blocks each time makes it not sound like a hardware problem.