FS#1926 - MTD partition offset not correctly mapped when bad eraseblocks present #7298
Comments
jwh7: Given the severity of this bug, shouldn't the priority be raised? |
rmilecki:
This is very vague. This report is entirely about a problem with reading WiFi EEPROM data from the flash partition "factory". R6220 uses the following entries in the DTS: If you look at mt76_get_of_eeprom() you'll see it simply uses mtd_read() to read the flash content holding the EEPROM. So in the R6220 case it gets translated into: Now, mtd internally will calculate an absolute offset and will ask the NAND driver to read flash content from 0x2e00000 and 0x2e08000. The real problem is the NAND flash driver. It contains something called BMT, which is some crazy translation of NAND pages. It tries to be smart and handle bad blocks magically, as if they didn't exist. It completely doesn't fit the fixed partitioning layout that R6220 uses. Apparently when the NAND flash driver gets a request to read a page with 0x2e00000 flash data, it returns the page that contains 0x2e20000 data. There is nothing wrong with the mt76 driver or the mtd subsystem. All "solutions" like adjusting partitions or the mediatek,mtd-eeprom offset are only hacky workarounds for the unexpected NAND driver behavior. |
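The translation described above can be modelled with a short sketch. This is a toy model, not the NAND driver's actual code: it assumes the n-th requested eraseblock is mapped to the n-th good eraseblock counted from the start of the flash, which is consistent with the reported symptom. The function name `shifted_offset` is hypothetical; the 128 KiB erase size and bad eraseblock 266 are taken from the R6220 kernel log in this report.

```python
ERASE_SIZE = 128 * 1024  # 128 KiB eraseblock, from the R6220 kernel log


def shifted_offset(requested, bad_blocks, erase_size=ERASE_SIZE):
    """Return the flash offset a skip-bad-block layer would actually read.

    Toy model (assumption): the n-th requested block maps to the n-th
    good block counted from the start of the flash.
    """
    logical_block, within = divmod(requested, erase_size)
    bad = set(bad_blocks)
    physical = 0
    good_seen = 0
    while True:
        if physical not in bad:
            if good_seen == logical_block:
                return physical * erase_size + within
            good_seen += 1
        physical += 1


# One bad eraseblock (266) below the "factory" partition, as in the report.
print(hex(shifted_offset(0x2E00000, [266])))  # 0x2e20000
print(hex(shifted_offset(0x2E08000, [266])))  # 0x2e28000
```

With no bad eraseblock below the requested offset, the function returns the offset unchanged, which is what a fixed partition layout expects in every case.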
jow-: Can you please do a local build with the following change applied and see if it fixes the issue? Before flashing an image with this change, make sure you're able to recover the device through TFTP if needed.
|
dksl3: This patch solved the issue!!! |
jow-: Great - thanks for confirming. I am still waiting for my test hardware to arrive in order to do some more thorough fixing of the NAND driver. I am not sure if there's still some shifting / retry logic left in the write path. |
ptpt52: Disabling shift_on_bbt on NAND flash is a bad idea. It may cause data written to the flash to fail or be corrupted, depending on the location of the bad block and the location where the data is written. |
jogo:
This is just how NAND flash works. NAND-aware filesystems (and FTLs) expect this to happen and can handle it. And they expect the NAND controller driver to write/read where it is told to, and report if it fails, or if it needed to correct bit errors. Not internally remap to some random location. |
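As a rough illustration of the contract described above, here is a toy sketch in which the driver reads exactly the page it is asked for and only reports the outcome, while the caller decides how to react. All names (`ToyNand`, `ReadResult`, `ReadError`, `upper_layer_read`) and the reaction strings are hypothetical and do not correspond to any real kernel API.

```python
from dataclasses import dataclass


class ReadError(Exception):
    """The driver could not return valid data for the requested page."""


@dataclass
class ReadResult:
    data: bytes
    corrected_bitflips: int  # bit errors the ECC step fixed during this read


class ToyNand:
    """Hypothetical driver: one entry per page, no hidden remapping."""

    def __init__(self, pages, bitflips=None, failing=()):
        self.pages = pages
        self.bitflips = bitflips or {}
        self.failing = set(failing)

    def read(self, page):
        if page in self.failing:
            raise ReadError(f"uncorrectable read at page {page}")
        return ReadResult(self.pages[page], self.bitflips.get(page, 0))


def upper_layer_read(nand, page):
    """The caller, not the driver, decides what to do with each outcome."""
    try:
        result = nand.read(page)
    except ReadError:
        return None, "relocate: treat the block as bad"
    if result.corrected_bitflips:
        return result.data, "scrub: rewrite the data elsewhere soon"
    return result.data, "ok"


nand = ToyNand([b"A", b"B", b"C"], bitflips={1: 2}, failing=[2])
for page in range(3):
    print(page, upper_layer_read(nand, page))
```

The point of the sketch is only the division of labour: page 1 is still read from page 1 even though bit errors were corrected, and page 2 fails loudly instead of silently returning data from another location.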
jwh7: @Jo-Philipp Wich: were you able to use that test hardware for further investigation/improvements of this fix? Thanks! |
superfes: Just FYI, I've applied this patch to my R6220 and it is able to keep my 5G network up. Eventually the router becomes a little unstable and I have to reboot it (though I built from master, so it may have nothing to do with this patch). I don't know what everybody else's experience with this bug is, but for me the 5G network would shut down after ~20 minutes or so, and to get it back I'd have to reboot the router. |
frost242: Hello, |
jwh7:
@Jo-Philipp Wich: Just checking in if this testing has been planned? Thanks! |
th0m4s: I can also confirm that the firmware from https://github.com/jayanta525/openwrt-netgear-r6220-100ins is working fine. |
ptpt52: test ok with this patch |
bjonglez: The patch was committed to master: https://git.openwrt.org/527832e54bf3bc4d699a145ae66f34230246f0a9 It probably needs a backport to 19.07, and possibly also 18.06? |
jwh7: This is ported to 19.07 now: |
Ingvix: Hey, I'd like to know whether this patch is already in some prebuilt packages I can use to update my router, or whether I would currently need to do some building mumbo jumbo (which isn't really in my comfort zone) to get it onto my system? |
jwh7: @Ingvix You can use the daily snapshots, following the R6220 device page install instructions, and then manually install LuCI (see docs) and whatever other needed packages. |
jwh7: More links... |
Ingvix: Thanks a lot @jeremy. I'll look into it. |
Colani1200: The problem is back in recent snapshot with kernel 5.4: |
dksl3:
Device problem occurs on
Netgear R6220
Software versions of OpenWrt/LEDE
OpenWrt 18.06.1, r7258-5eb055306f
When OpenWrt detects a bad eraseblock, all following offsets are shifted by one eraseblock.
I'll try to explain better this issue with an example.
We have this situation in kernel log:
[ 2.853468] nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xf1
[ 2.866112] nand: Macronix NAND 128MiB 3,3V 8-bit
[ 2.875473] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB si4
[ 2.890555] Scanning device for bad blocks
[ 2.969549] Bad eraseblock 266 at 0x000002140000
[ 3.096049] Bad eraseblock 708 at 0x000005880000
[ 3.189001] 6 fixed-partitions partitions found on MTD device MT7621-NAND
[ 3.202518] Creating 6 MTD partitions on "MT7621-NAND":
[ 3.212922] 0x000000000000-0x000000100000 : "u-boot"
[ 3.223925] 0x000000100000-0x000000200000 : "SC PID"
[ 3.234878] 0x000000200000-0x000000600000 : "kernel"
[ 3.245854] 0x000000600000-0x000002200000 : "ubi"
[ 3.256476] 0x000002e00000-0x000002f00000 : "factory"
[ 3.267585] 0x000004200000-0x000007e00000 : "reserved"
[ 3.279423] [mtk_nand] probe successfully!
As you can see, there are 2 bad eraseblocks. Let's set aside the last one, since it is at the end of the flash.
The kernel states that the 'factory' partition starts at 0x2e00000 (that's correct), but in reality OpenWrt will search for the partition at 0x2e20000 (0x2e00000 + (1 * 128 KiB)).
People who have 3 bad eraseblocks before the factory partition reported that their mtd4 (factory) partition content reflects what is in NAND at 0x2e60000 (0x2e00000 + (3 * 128 KiB)).
This issue led to the wrong belief that there is more than one flash layout for this device, as reported in [[https://openwrt.org/toh/netgear/netgear_r6220|OpenWrt device page]] too.
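The arithmetic above can be reproduced from the kernel log with a small script. This is a sketch under the report's own formula (declared start plus one eraseblock per bad eraseblock located below it); the helper names `parse_log` and `apparent_start` are hypothetical, and the log lines are copied from this report.

```python
import re

ERASE_SIZE = 128 * 1024  # "erase size: 128 KiB" in the kernel log

KERNEL_LOG = """\
[    2.969549] Bad eraseblock 266 at 0x000002140000
[    3.096049] Bad eraseblock 708 at 0x000005880000
[    3.212922] 0x000000000000-0x000000100000 : "u-boot"
[    3.223925] 0x000000100000-0x000000200000 : "SC PID"
[    3.234878] 0x000000200000-0x000000600000 : "kernel"
[    3.245854] 0x000000600000-0x000002200000 : "ubi"
[    3.256476] 0x000002e00000-0x000002f00000 : "factory"
[    3.267585] 0x000004200000-0x000007e00000 : "reserved"
"""

BAD_RE = re.compile(r"Bad eraseblock (\d+) at 0x([0-9a-fA-F]+)")
PART_RE = re.compile(r'0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+) : "([^"]+)"')


def parse_log(text):
    """Extract bad eraseblock offsets and (name, start) partition pairs."""
    bad_offsets = [int(m.group(2), 16) for m in BAD_RE.finditer(text)]
    partitions = [(m.group(3), int(m.group(1), 16)) for m in PART_RE.finditer(text)]
    return bad_offsets, partitions


def apparent_start(start, bad_offsets, erase_size=ERASE_SIZE):
    """Declared start plus one eraseblock per bad block located below it."""
    bad_below = sum(1 for offset in bad_offsets if offset < start)
    return start + bad_below * erase_size


bad_offsets, partitions = parse_log(KERNEL_LOG)
for name, start in partitions:
    shifted = apparent_start(start, bad_offsets)
    print(f"{name}: declared {hex(start)}, apparent {hex(shifted)}")
# The "factory" line reads: factory: declared 0x2e00000, apparent 0x2e20000
```

Partitions that lie entirely below the first bad eraseblock are unaffected, which may explain why only some devices appeared to have a different layout.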
A rapid check with
sc_nand r
from the U-Boot prompt can confirm this behavior.