OpenWrt/LEDE Project

Attached to Project: OpenWrt/LEDE Project
Opened by Marco - 02.11.2018

FS#1926 - MTD partition offset not correctly mapped when bad eraseblocks present

- Device problem occurs on

        Netgear R6220

- Software versions of OpenWrt/LEDE

        OpenWrt 18.06.1, r7258-5eb055306f

When OpenWrt detects a bad eraseblock, all following offsets are shifted by one eraseblock.

I’ll try to explain this issue better with an example.
We have this situation in the kernel log:

[    2.853468] nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xf1         
[    2.866112] nand: Macronix NAND 128MiB 3,3V 8-bit                            
[    2.875473] nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB si4
[    2.890555] Scanning device for bad blocks                                   
[    2.969549] Bad eraseblock 266 at 0x000002140000                             
[    3.096049] Bad eraseblock 708 at 0x000005880000                             
[    3.189001] 6 fixed-partitions partitions found on MTD device MT7621-NAND    
[    3.202518] Creating 6 MTD partitions on "MT7621-NAND":                      
[    3.212922] 0x000000000000-0x000000100000 : "u-boot"                         
[    3.223925] 0x000000100000-0x000000200000 : "SC PID"                         
[    3.234878] 0x000000200000-0x000000600000 : "kernel"                         
[    3.245854] 0x000000600000-0x000002200000 : "ubi"                            
[    3.256476] 0x000002e00000-0x000002f00000 : "factory"                        
[    3.267585] 0x000004200000-0x000007e00000 : "reserved"                       
[    3.279423] [mtk_nand] probe successfully!                                   
 

As you can see, there are 2 bad eraseblocks. Let’s ignore the last one, since it is towards the end of the flash.
The kernel states that the ‘factory’ partition starts at 0x2e00000 (that’s correct), but in reality OpenWrt reads the partition from 0x2e20000 (0x2e00000 + (1 * 128 KiB)).
People who have 3 bad eraseblocks before the factory partition reported that the content of their mtd4 (factory) partition reflects what is in NAND at 0x2e60000 (0x2e00000 + (3 * 128 KiB)).
This issue led to the wrong belief that there is more than one flash layout for this device, as the OpenWrt device page also reports.
A quick check with

sc_nand r

from the U-Boot prompt can confirm this behavior.
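
The arithmetic in this report can be sketched as a small Python model. It is only an illustration under a stated assumption: every bad eraseblock located below the requested offset adds one eraseblock (128 KiB) of shift. The name `shifted_offset` is hypothetical and not part of OpenWrt; the offsets are taken from the kernel log above.

```python
ERASE_SIZE = 128 * 1024  # erase size reported in the kernel log (128 KiB)

# Offsets of the bad eraseblocks reported in the kernel log.
BAD_BLOCK_OFFSETS = [0x2140000, 0x5880000]

FACTORY_START = 0x2E00000  # nominal start of the "factory" partition


def shifted_offset(nominal, bad_block_offsets, erase_size=ERASE_SIZE):
    """Return the offset that is effectively read (hypothetical model).

    Assumption: each bad eraseblock below `nominal` shifts the read
    position forward by exactly one eraseblock.
    """
    bad_before = sum(1 for bad in bad_block_offsets if bad < nominal)
    return nominal + bad_before * erase_size


print(hex(shifted_offset(FACTORY_START, BAD_BLOCK_OFFSETS)))
```

With the single bad eraseblock that lies below the factory partition in this log, the model yields 0x2e20000, which matches the observed behavior. The bad eraseblock at 0x5880000 lies above the partition start and does not contribute.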


Jeremy commented on 09.01.2019 20:33

Given the severity of this bug, shouldn't the priority be raised?

Project Manager
Rafał Miłecki commented on 10.01.2019 11:55
"OpenWrt will search for the partition at 0x2e20000 (2e00000 + (1 * 128KiB))"

This is very vague.


This whole report is about a problem with reading WiFi EEPROM data from the flash partition "factory". The R6220 uses the following entries in the DTS:

mediatek,mtd-eeprom = <&factory 0x0000>;
mediatek,mtd-eeprom = <&factory 0x8000>;

If you look at mt76_get_of_eeprom(), you'll see it simply uses mtd_read() to read the flash content holding the EEPROM data. So in the R6220 case it gets translated into:

mtd_read(mtd, 0x0000, len, ..., ...)
mtd_read(mtd, 0x8000, len, ..., ...)

Now, mtd internally calculates an absolute offset and asks the NAND driver to read flash content from 0x2e00000 and 0x2e08000.
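
The offset calculation described here amounts to adding the partition-relative offset to the partition's start address. A minimal sketch in Python, using the "factory" boundaries from the kernel log; `absolute_offset` is a hypothetical helper for illustration, not the kernel's mtd code.

```python
# "factory" partition boundaries as printed in the kernel log.
FACTORY_BASE = 0x2E00000
FACTORY_END = 0x2F00000


def absolute_offset(base, end, relative):
    """Translate a partition-relative offset into an absolute flash offset."""
    if relative < 0 or base + relative >= end:
        raise ValueError("offset lies outside the partition")
    return base + relative


# The two EEPROM offsets from the DTS entries.
for relative in (0x0000, 0x8000):
    print(hex(absolute_offset(FACTORY_BASE, FACTORY_END, relative)))
```

The two results are 0x2e00000 and 0x2e08000, the addresses the NAND driver is asked to read.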

The real problem is the NAND flash driver. It contains something called BMT, which is some crazy translation of NAND pages. It tries to be smart and handle bad blocks magically, as if they didn't exist. This doesn't fit the fixed partitioning layout that the R6220 uses at all.

Apparently, when the NAND flash driver gets a request to read the page holding the flash data at 0x2e00000, it returns the page that contains the data at 0x2e20000. There is nothing wrong with the mt76 driver or the mtd subsystem.
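
One way to picture this behavior is a toy model in which the driver, when shifting is enabled, maps a requested block to the n-th good block instead of the n-th physical block. This is an assumption made for illustration and not the actual mtk_nand code; `physical_block` is a hypothetical function, and its `shift_on_bbt` parameter merely borrows the name of the driver flag discussed later in this thread.

```python
ERASE_SIZE = 128 * 1024   # 128 KiB erase size, from the kernel log
BAD_BLOCKS = {266, 708}   # bad eraseblock numbers, from the kernel log


def physical_block(logical, bad_blocks, shift_on_bbt):
    """Map a requested block number to the block that is actually read."""
    if not shift_on_bbt:
        return logical  # read exactly where the caller asked
    physical = -1
    good_seen = -1
    # Walk the flash until `logical + 1` good blocks have been passed,
    # skipping every block marked bad.
    while good_seen < logical:
        physical += 1
        if physical not in bad_blocks:
            good_seen += 1
    return physical


factory_block = 0x2E00000 // ERASE_SIZE
for shift in (True, False):
    block = physical_block(factory_block, BAD_BLOCKS, shift)
    print(shift, hex(block * ERASE_SIZE))
```

Under this model the read lands at 0x2e20000 with shifting enabled and at 0x2e00000 with it disabled, so a fixed partition table only lines up with the flash content in the second case.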

All "solutions" like adjusting partitions or the mediatek,mtd-eeprom offset are only hacky workarounds for the unexpected NAND driver behavior.

Admin
Jo-Philipp Wich commented on 10.01.2019 16:56

Can you please do a local build with the following change applied and see if it fixes the issue?

Before flashing an image with this change, make sure you're able to recover the device through TFTP if needed.

diff --git a/target/linux/ramips/patches-4.14/0039-mtd-add-mt7621-nand-support.patch b/target/linux/ramips/patches-4.14/0039-mtd-add-mt7621-nand-support.patch
index d50e689110..03b2b36db9 100644
--- a/target/linux/ramips/patches-4.14/0039-mtd-add-mt7621-nand-support.patch
+++ b/target/linux/ramips/patches-4.14/0039-mtd-add-mt7621-nand-support.patch
@@ -3578,7 +3578,7 @@ Signed-off-by: John Crispin <blogic@openwrt.org>
 +      if (!err) {
 +              MSG(INIT, "[mtk_nand] probe successfully!\n");
 +              nand_disable_clock();
-+              shift_on_bbt = 1;
++              shift_on_bbt = 0;
 +              if (load_fact_bbt(mtd) == 0) {
 +                      int i;
 +                      for (i = 0; i < 0x100; i++)

Marco commented on 14.01.2019 17:26

This patch solved the issue!!!

Admin
Jo-Philipp Wich commented on 16.01.2019 06:32

Great - thanks for confirming. I am still waiting for my test hardware to arrive in order to do some more thorough fixing of the NAND driver. I am not sure if there's still some shifting / retry logic left in the write path.

Chen Minqiang commented on 20.01.2019 06:25

Disabling shift_on_bbt on NAND flash is a bad idea.

This may cause the data written on the flash to fail or be corrupted.

It depends on the location of the bad block and the location where the data is written.

Project Manager
Jonas Gorski commented on 21.01.2019 10:28
"This may cause the data written on the flash to fail or be corrupted."

This is just how NAND flash works. NAND-aware filesystems (and FTLs) expect this to happen and can handle it. And they expect the NAND controller driver to write/read where it is told to, and to report if that fails or if it needed to correct bit errors. Not to internally remap to some random location.

Jeremy commented on 20.02.2019 18:05

@Jo-Philipp Wich: were you able to use that test hardware for further investigation/improvements of this fix? Thanks!

Aaron Nixon commented on 18.03.2019 23:28

Just FYI, I've applied this patch to my R6220 and it is able to keep my 5 GHz network up.

The router still gets a little unstable eventually and I have to reboot it (though I built from master, so it may have nothing to do with this patch).

I don't know what everybody else's experience with this bug is, but for me the 5 GHz network would shut down after ~20 minutes or so, and to get it back I'd have to reboot the router.
