FS#1926 - MTD partition offset not correctly mapped when bad eraseblocks present #7298

Closed
openwrt-bot opened this issue Nov 2, 2018 · 20 comments

@openwrt-bot

dksl3:

  • Device problem occurs on
    Netgear R6220

  • Software versions of OpenWrt/LEDE
    OpenWrt 18.06.1, r7258-5eb055306f

When OpenWrt detects a bad eraseblock, all following offsets are shifted by one eraseblock.

I'll try to explain this issue better with an example.
Consider this kernel log:

[ 2.853468] nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xf1
[ 2.866112] nand: Macronix NAND 128MiB 3,3V 8-bit
[ 2.875473] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB si4
[ 2.890555] Scanning device for bad blocks
[ 2.969549] Bad eraseblock 266 at 0x000002140000
[ 3.096049] Bad eraseblock 708 at 0x000005880000
[ 3.189001] 6 fixed-partitions partitions found on MTD device MT7621-NAND
[ 3.202518] Creating 6 MTD partitions on "MT7621-NAND":
[ 3.212922] 0x000000000000-0x000000100000 : "u-boot"
[ 3.223925] 0x000000100000-0x000000200000 : "SC PID"
[ 3.234878] 0x000000200000-0x000000600000 : "kernel"
[ 3.245854] 0x000000600000-0x000002200000 : "ubi"
[ 3.256476] 0x000002e00000-0x000002f00000 : "factory"
[ 3.267585] 0x000004200000-0x000007e00000 : "reserved"
[ 3.279423] [mtk_nand] probe successfully!

As you can see, there are 2 bad eraseblocks. Let's ignore the last one, since it is near the end of the flash.
The kernel states that the 'factory' partition starts at 0x2e00000 (which is correct), but in reality OpenWrt will search for the partition at 0x2e20000 (2e00000 + (1 * 128KiB)).
People who have 3 bad eraseblocks before the factory partition report that the content of their mtd4 (factory) partition matches what is in NAND at 0x2e60000 (0x2e00000 + (3 * 128KiB)).
This issue led to the mistaken belief that there is more than one flash layout for this device, as also reported on the OpenWrt device page (https://openwrt.org/toh/netgear/netgear_r6220).
A quick check with sc_nand r from the U-Boot prompt confirms this behavior.
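
To make the arithmetic above concrete, here is a minimal sketch in Python. It only models the observation in this report (one eraseblock of shift per bad block located below the requested address); it is not taken from the mtk_nand driver, and the function name shifted_offset is made up for illustration.

```python
ERASE_SIZE = 128 * 1024  # 128 KiB eraseblocks, as printed in the kernel log


def shifted_offset(offset, bad_block_addrs, erase_size=ERASE_SIZE):
    """Return the address whose content shows up at `offset`.

    Assumption: every bad eraseblock below `offset` pushes the data
    one eraseblock further, as described in this report.
    """
    bad_before = sum(1 for addr in bad_block_addrs if addr < offset)
    return offset + bad_before * erase_size


bad_blocks = [0x2140000, 0x5880000]  # "Bad eraseblock" lines from the log
factory = 0x2E00000                  # start of the "factory" partition

print(hex(shifted_offset(factory, bad_blocks)))  # prints 0x2e20000
```

Only the bad block at 0x2140000 lies below the factory partition, so the shift is a single eraseblock. With three bad blocks below 0x2e00000 the same formula gives 0x2e60000.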

@openwrt-bot

jwh7:

Given the severity of this bug, shouldn't the priority be raised?

@openwrt-bot

rmilecki:

OpenWrt will search for the partition at 0x2e20000 (2e00000 + (1 * 128KiB))

This is very vague.


This whole report is about a problem with reading WiFi EEPROM data from the "factory" flash partition. The R6220 uses the following entries in its DTS:
mediatek,mtd-eeprom = <&factory 0x0000>;
mediatek,mtd-eeprom = <&factory 0x8000>;

If you look at mt76_get_of_eeprom(), you'll see it simply uses mtd_read() to read the EEPROM content from flash. So in the R6220 case this translates into:
mtd_read(mtd, 0x0000, len, ..., ...)
mtd_read(mtd, 0x8000, len, ..., ...)


Now, mtd internally calculates an absolute offset and asks the NAND driver to read flash content from 0x2e00000 and 0x2e08000.
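
As a rough illustration of that translation (a sketch in Python, not the kernel's actual code), the partition-relative offset is simply added to the partition start taken from the kernel log. The PARTITIONS table and the absolute_offset helper are hypothetical names used only here.

```python
# Partition boundaries copied from the kernel log in the report.
PARTITIONS = {
    "factory": (0x2E00000, 0x2F00000),
}


def absolute_offset(partition, relative):
    """Translate a partition-relative offset into an absolute flash offset."""
    start, end = PARTITIONS[partition]
    if not 0 <= relative < end - start:
        raise ValueError("offset lies outside the partition")
    return start + relative


for rel in (0x0000, 0x8000):  # the two mediatek,mtd-eeprom offsets
    print(hex(absolute_offset("factory", rel)))
```

This yields 0x2e00000 and 0x2e08000, the addresses the NAND driver is asked to read.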

The real problem is the NAND flash driver. It contains something called BMT, which is some crazy translation of NAND pages. It tries to be smart and handle bad blocks magically, as if they didn't exist. That doesn't fit the fixed partitioning layout that the R6220 uses at all.

Apparently, when the NAND flash driver gets a request to read the page holding the flash data at 0x2e00000, it returns the page that contains the data at 0x2e20000. There is nothing wrong with the mt76 driver or the mtd subsystem.

All "solutions" like adjusting partitions or the mediatek,mtd-eeprom offset are only hacky workarounds for the unexpected NAND driver behavior.

@openwrt-bot

jow-:

Can you please do a local build with the following change applied and see if it fixes the issue?

Before flashing an image with this change, make sure you're able to recover the device through TFTP if needed.

diff --git a/target/linux/ramips/patches-4.14/0039-mtd-add-mt7621-nand-support.patch b/target/linux/ramips/patches-4.14/0039-mtd-add-mt7621-nand-support.patch
index d50e689110..03b2b36db9 100644
--- a/target/linux/ramips/patches-4.14/0039-mtd-add-mt7621-nand-support.patch
+++ b/target/linux/ramips/patches-4.14/0039-mtd-add-mt7621-nand-support.patch
@@ -3578,7 +3578,7 @@ Signed-off-by: John Crispin blogic@openwrt.org
 +	if (!err) {
 +		MSG(INIT, "[mtk_nand] probe successfully!\n");
 +		nand_disable_clock();
-+		shift_on_bbt = 1;
++		shift_on_bbt = 0;
 +		if (load_fact_bbt(mtd) == 0) {
 +			int i;
 +			for (i = 0; i < 0x100; i++)

@openwrt-bot

dksl3:

This patch solved the issue!!!

@openwrt-bot

jow-:

Great - thanks for confirming. I am still waiting for my test hardware to arrive in order to do some more thorough fixing of the NAND driver. I am not sure if there's still some shifting / retry logic left in the write path.

@openwrt-bot

ptpt52:

Disabling shift_on_bbt on NAND flash is a bad idea.

This may cause the data written on the flash to fail or be corrupted.

It depends on the location of the bad block and the location where the data is written.

@openwrt-bot

jogo:

This may cause the data written on the flash to fail or be corrupted.

This is just how NAND flash works. NAND-aware filesystems (and FTLs) expect this to happen and can handle it. They expect the NAND controller driver to write and read where it is told to, and to report if an operation fails or if it needed to correct bit errors, not to remap internally to some random location.
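
The division of labour described here can be sketched with a toy model in Python. Everything in it (ToyNand, BadBlockError, write_skipping_bad) is hypothetical and far simpler than what UBI or a real driver does; it only shows a driver that writes exactly where it is told and reports failure, with the upper layer deciding where the data goes instead.

```python
class BadBlockError(Exception):
    """Raised by the toy driver when asked to write to a bad eraseblock."""


class ToyNand:
    """Toy driver: writes exactly where told, never remaps."""

    def __init__(self, num_blocks, bad_blocks):
        self.bad = set(bad_blocks)
        self.blocks = [None] * num_blocks

    def write_block(self, index, data):
        if not 0 <= index < len(self.blocks):
            raise IndexError("block index out of range")
        if index in self.bad:
            raise BadBlockError(index)  # report, do not hide
        self.blocks[index] = data


def write_skipping_bad(nand, first_block, chunks):
    """Toy upper layer: on a reported failure, try the next block."""
    placed = []
    index = first_block
    for chunk in chunks:
        while True:
            try:
                nand.write_block(index, chunk)
            except BadBlockError:
                index += 1  # the upper layer picks a new location
                continue
            placed.append(index)
            index += 1
            break
    return placed


nand = ToyNand(num_blocks=8, bad_blocks=[2])
print(write_skipping_bad(nand, 1, ["a", "b", "c"]))  # prints [1, 3, 4]
```

Because the upper layer chose the blocks itself, it knows where each chunk ended up, and reads of a fixed address such as a partition start are never silently redirected.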

@openwrt-bot

jwh7:

@Jo-Philipp Wich: were you able to use that test hardware for further investigation/improvements of this fix? Thanks!

@openwrt-bot

superfes:

Just FYI, I've applied this patch to my R6220, and it now keeps my 5 GHz network up.

The router still becomes a little unstable eventually and I have to reboot it (though I built from master, so it may have nothing to do with this patch).

I don't know what everybody else's experience with this bug is, but for me the 5 GHz network would shut down after ~20 minutes or so, and I'd have to reboot the router to get it back.

@openwrt-bot

frost242:

Hello,
I've applied this patch too, on the 18.06.2 branch. The bad eraseblock problem went away and OpenWrt was able to read the MAC addresses from the flash.
The router is also rock stable and works flawlessly as our home WiFi AP.

@openwrt-bot

jwh7:

I am still waiting for my test hardware to arrive in order to do some more thorough fixing of the NAND driver.

@Jo-Philipp Wich: Just checking in: has this testing been planned? Thanks!

@openwrt-bot

th0m4s:

I can also confirm that the firmware from https://github.com/jayanta525/openwrt-netgear-r6220-100ins is working fine.

@openwrt-bot

ptpt52:

Tested OK with this patch.

@openwrt-bot

bjonglez:

The patch was committed to master: https://git.openwrt.org/527832e54bf3bc4d699a145ae66f34230246f0a9

It probably needs a backport to 19.07, and possibly also 18.06?

@openwrt-bot

jwh7:

This is ported to 19.07 now:
https://git.openwrt.org/b8b62b8506f5465331e749799c36ef49160036f4

@openwrt-bot

Ingvix:

Hey, I'd like to know if this patch is already in some prebuilt packages I can use to update my router, or would I currently need to do some building mumbo jumbo (which isn't really in my comfort zone) to get it onto my system?

@openwrt-bot

jwh7:

@Ingvix You can use the daily snapshots, following the R6220 device page install instructions, and then manually install LuCI (see docs) and whatever other needed packages.

@openwrt-bot

jwh7:

More links...

@openwrt-bot

Ingvix:

Thanks a lot @jeremy. I'll look into it.

@openwrt-bot

Colani1200:

The problem is back in a recent snapshot with kernel 5.4:

https://bugs.openwrt.org/index.php?do=details&task_id=3582
