FS#1242 - SATA broken on kernel 4.9 on mt7621 #6488
Comments
valdi74: I can confirm this bug on a Xiaomi Mi Router 3G with a USB SATA HDD. Large downloaded files (10 GB) are sometimes broken (20-30% of files): the md5sum doesn't match. There was no log entry when the error occurred. Tested with:
|
neheb: Let's see. Not a SATA issue. Not a pcie issue (USB is not connected through pcie). Sounds like a bug introduced in the port to 4.9. Maybe a CPU issue? |
HeadLessHUN: Hi there! I also hit this bug on a Xiaomi Mi Router 3G with different HDDs using the ext4 filesystem. I'm running OpenWrt SNAPSHOT r5629-23bba9c, which ships the 4.9.72 kernel. I haven't seen any kernel log that might be relevant to this problem, except when mysql tries to access some block and can't read it...
It is a very annoying bug; I hope it will be fixed ASAP. |
neheb: Kernel 4.14 should be coming soon. Hopefully it fixes this issue. For all I know, the kernel config could be the issue. Testing is needed... |
neheb: Can you guys test http://lists.infradead.org/pipermail/lede-dev/2018-January/010795.html ? |
HeadLessHUN: I'll try it out, but it shouldn't have any impact because it was added to the generic config in June: [[https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;h=b47fd7656336162360ebf66147326763ddae3f8d;hp=415c47de79ada7496c39f435df0b0523472aee58|External Link]]. Did you change anything else in the master branch? |
neheb: Yeah I did a diff between config-4.4 and config-4.9 and removed newly introduced CONFIGs. It worked. I have firmware on 4.9 that does not show this issue. Unfortunately, I lost the exact config. I'm currently testing a new one but unfortunately, this testing of bad kernels destroyed my btrfs array. Now I need to rebuild it... diff --git a/target/linux/ramips/mt7621/config-4.9 b/target/linux/ramips/mt7621/config-4.9 and yes, I attributed the error to the wrong CONFIG. |
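A minimal sketch of the config-4.4 versus config-4.9 comparison described above, for anyone who wants to repeat it. It assumes both files use the usual kernel config line forms (`CONFIG_FOO=value` and `# CONFIG_FOO is not set`); the function names are made up for illustration and are not part of the OpenWrt build system.

```python
import re

# "CONFIG_FOO=value" lines and "# CONFIG_FOO is not set" lines.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Return {symbol: value}; symbols marked 'is not set' map to 'n'."""
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            symbols[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            symbols[match.group(1)] = "n"
    return symbols


def new_symbols(old_text, new_text):
    """Symbols that appear in the new config but not in the old one."""
    old, new = parse_kconfig(old_text), parse_kconfig(new_text)
    return {name: value for name, value in new.items() if name not in old}


def changed_symbols(old_text, new_text):
    """Symbols present in both configs whose value differs: {name: (old, new)}."""
    old, new = parse_kconfig(old_text), parse_kconfig(new_text)
    return {
        name: (old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    }
```

Feeding it the contents of `config-4.4` and `config-4.9` lists the candidates to remove. Running the same comparison between two generated `.config` files from the build directory shows whether a removal actually changed the kernel that gets built.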
HeadLessHUN: I commented out these lines
and inserted this line into target/linux/ramips/mt7621/config-4.9.
I built it and the problem was not solved... There is a lot of corruption within a few hours of uptime. |
neheb: I removed a bunch of CONFIG settings from config-4.9, but after inspecting the actual generated .config file in the build directory, there's no difference. So it seems this is a dead end... In other news, I don't seem to have these issues anymore. I don't know why. The only answer I have is that it was fixed upstream. I can't see what would have done that though... I have working firmware from 4.9.75. I need to do more testing, but this seems to be gone. Even if it's placebo, try this patch. It may work, it may not...
|
HeadLessHUN: I'm on 4.9.77 r5917-36f1978 and the issue is still there... Are the removed config options really not needed by anything? OpenVPN, for example, needs crypto support. |
neheb: Like I said, this is no-op as all of those options end up in the resulting kernel .config anyway. But I tried it on one of my builds and it seems to have worked? If something breaks you'll instantly know. |
neheb: I gave up. What I did was probably placebo. Just gonna keep ramips at 4.4 in my tree. Hoping 4.14 (which should come soon) fixes it but I wouldn't hold my breath. If you can, run a ramips unit for several days and compare "md5sum /dev/mtdblock[0123456]" |
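A rough sketch of that comparison in Python, for units where a script is easier than collecting `md5sum` output by hand. The helper names and the JSON baseline file are assumptions for illustration; only the idea of hashing the same devices days apart comes from the comment above.

```python
import hashlib
import json


def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file or block device, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each path to its current MD5 digest."""
    return {path: md5_of(path) for path in paths}


def save_snapshot(snap, out_path):
    """Store a snapshot as JSON so it survives a reboot."""
    with open(out_path, "w") as handle:
        json.dump(snap, handle, indent=2, sort_keys=True)


def load_snapshot(in_path):
    """Load a snapshot written by save_snapshot."""
    with open(in_path) as handle:
        return json.load(handle)


def changed_paths(baseline, current):
    """Paths whose digest differs from the baseline (or is missing now)."""
    return sorted(p for p in baseline if current.get(p) != baseline[p])
```

One would call `snapshot` on `/dev/mtdblock0` through `/dev/mtdblock6` once after boot, save the result, and compare against a fresh snapshot a few days later. Partitions that are legitimately written to will show up as changed, so only read-only partitions give a meaningful signal.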
HeadLessHUN: I will try to run it for several days, save the md5 of every mtdblock, and share the results with you. The priority of this bug should be increased, though... Some of the sums will change anyway, for example because of the overlayfs, so it should be tested on partitions that are not changing... |
HeadLessHUN: Nah, it's getting corrupted (I mean my HDDs). Is it possible to build a snapshot image with the 4.4 kernel? Or is my only option to backport the device to LEDE 17.01 stable? |
neheb: I'm using 4.4 with trunk. Just copy patches-4.4 and config-4.4 from 17.01 and change the Makefile to use 4.4. |
neheb: @HeadLessHUN a little birdie told me that disabling CONFIG_HIGHMEM fixes this. Could be good to try out. diff --git a/target/linux/ramips/mt7621/config-4.9 b/target/linux/ramips/mt7621/config-4.9 |
easyteacher: @neheb Does disabling CONFIG_HIGHMEM really work? Have you tested it? I found a new config introduced in kernel 4.5 [[https://cateee.net/lkddb/web-lkddb/IO_STRICT_DEVMEM.html|CONFIG_IO_STRICT_DEVMEM: Filter I/O access to /dev/mem]] And will enabling CONFIG_DM_VERITY help? |
neheb: No idea. I've tried it on the 4.4 kernel and it seems to work well. I'm using it for the sd card though (the mmc driver breaks when using the HighMem zone). Could also help here since the issue for me happens after 15+ hours. Maybe when something else tries using the HighMem zone. I don't think those two options have any impact. |
easyteacher: [[https://events.static.linuxfound.org/sites/events/files/slides/Shuah_Khan_dma_map_error.pdf|Detecting silent data corruptions and memory leaks using DMA Debug API]] I found a document possibly related to the bug. To debug, set CONFIG_DMA_API_DEBUG=y. Currently I have no idea how to use it. |
neheb: It seems drivers must be manually modified to use it. |
valdi74: Maybe [[https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=79126770868995faa8656f6687a88d385802e34b|this]] is the solution to our problem? |
neheb: Yes. |
neheb:
Basically, with kernel 4.9 there's some weird issue where after several hours (around 18), the SATA controller starts returning bad data. On 4.4, this is not a problem.
I've avoided reporting this problem to kernel.org since ramips is quite LEDE specific. Could be a pcie issue for all I know.
The data on the actual hard drive is fine. It's just bad data that's being returned. Maybe bit errors or something.
The way I test this is by using transmission with its Verify feature. Last I tested with adm + ext4, a torrent that verified at 100% verified at 91% 3 days later.
btrfs is more vocal since it reports silent data corruption and throws checksum mismatch errors in dmesg quite frequently after a few hours.
I currently work around the issue by running kernel 4.4, but this is not a long term solution.
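Since the data on the drive is fine and only the returned data is bad, reading the same file twice and comparing per-block hashes can show where the reads disagree. The sketch below is one hedged way to do that without Transmission; the function names and the 4096-byte block size are assumptions, not something taken from this report.

```python
import hashlib


def block_digests(path, block_size=4096):
    """SHA-1 digest of every block_size-byte block of the file, in order."""
    digests = []
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digests.append(hashlib.sha1(block).hexdigest())
    return digests


def mismatched_blocks(first, second):
    """Indices of blocks that differ between two passes over the same file.

    A block that exists in only one pass also counts as a mismatch.
    """
    length = max(len(first), len(second))
    return [
        index
        for index in range(length)
        if index >= len(first)
        or index >= len(second)
        or first[index] != second[index]
    ]
```

Multiplying a mismatching index by the block size gives the byte offset of the bad read. A second pass taken right after the first may be served from the page cache instead of the device, so the passes are only meaningful when they are hours apart or the file is much larger than RAM.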