FS#4158 - Master regression - boot loop due to kernel panic (mt7621) #9143

Closed
openwrt-bot opened this issue Nov 27, 2021 · 8 comments
Labels
flyspray kernel pull request/issue with Linux kernel related changes

Comments

@openwrt-bot

dsouza:

I built master for the Archer C6 v3.2 yesterday, and the device no longer boots: it is stuck in a boot loop caused by a kernel panic.

Running an OpenWrt SNAPSHOT custom build from master (r18195-d1c7df9c4b).

To reproduce, just install the build and reboot the device.

With a UART attached, I can see the kernel panic below:

(...)
[ 0.797651] CPU 1 Unable to handle kernel paging request at virtual address 5050404, epc == 80588ef8, ra == 801fe360
[ 0.808162] Oops[#1]:
[ 0.810387] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.4.159 #0
[ 0.816345] $ 0 : 00000000 00000001 8fc304a4 00000108
[ 0.821525] $ 4 : 05050404 80621000 8064d18f 00000061
[ 0.826712] $ 8 : fffffffc 80594b3c 00000045 006d6873
[ 0.831893] $12 : 015ede76 08fca8f3 9715a5ed 5c2e1039
[ 0.837079] $16 : 8ff9cc00 8fc2523c 05050404 8064d184
[ 0.842261] $20 : 0000000b 8fc06e00 8ff9cc8c 806ebd24
[ 0.847448] $24 : 00000010 76ec2f43
[ 0.852629] $28 : 8fc40000 8fc41d50 38e38e39 801fe360
[ 0.857816] Hi : 00000000
[ 0.860665] Lo : 006c0400
[ 0.863546] epc : 80588ef8 strlen+0x0/0x2c
[ 0.867771] ra : 801fe360 insert_header+0x140/0x4f8
[ 0.872847] Status: 11007c03 KERNEL EXL IE
[ 0.876997] Cause : 40800008 (ExcCode 02)
[ 0.880969] BadVA : 05050404
[ 0.883821] PrId : 0001992f (MIPS 1004Kc)
[ 0.887882] Modules linked in:
[ 0.890910] Process swapper/0 (pid: 1, threadinfo=(ptrval), task=(ptrval), ts=00000000)
[ 0.898940] Stack : 08fca8f3 00000000 2ab4a599 00000dc0 00000000 8fc06e30 8f06e00 801fc880
[ 0.907235] 80862190 00000000 00000000 8fe57007 00000000 8ff9cc00 8f06e00 80860000
[ 0.915528] 8fc06e00 00000001 00000000 801feaec 806f0000 80830000 8024aec 00000000
[ 0.923822] 8064d008 8fd64a00 806eb18c 8fc06e00 8063ab2c 8063ab50 0000001 8fc06e00
[ 0.932118] 8fe57007 806ebc4c 8fe57000 806ebc04 00000001 806eb18c 8030000 80830000
[ 0.940412] ...
[ 0.942832] Call Trace:
[ 0.945261] [<80588ef8>] strlen+0x0/0x2c
[ 0.949146] [<801fe360>] insert_header+0x140/0x4f8
[ 0.953897] [<801feaec>] __register_sysctl_table+0x30c/0x630
[ 0.959516] [<801ff154>] __register_sysctl_paths+0xf4/0x1e8
[ 0.965067] [<8070de10>] ipc_sysctl_init+0x14/0x24
[ 0.969793] [<800015c8>] do_one_initcall+0x50/0x1a8
[ 0.974641] [<806fbeec>] kernel_init_freeable+0x1ec/0x2d0
[ 0.979997] [<80594e78>] kernel_init+0x10/0xf8
[ 0.984398] [<80006478>] ret_from_kernel_thread+0x14/0x1c
[ 0.989755] Code: a066ffff 1000fff7 00000000 <80820000> 10400007 00000000 00801025 80430001 1460fffe
[ 0.999424]
[ 1.000995] ---[ end trace d1818afedd9795ac ]---

Reverting to a build I did a couple of weeks ago (also from master) solves the problem.

I believe I have already identified the root cause.

It seems that the amount of RAM is sometimes detected incorrectly. When the boot fails, the boot log reports a "HighMem" zone that does not exist on this device:

Wrong HighMem Memory Detected causes Kernel Panic during boot:

(...)
[ 0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x0000000000000000-0x000000000fffffff]
[ 0.000000] HighMem [mem 0x0000000010000000-0x0000000023ffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000000000-0x000000001bffffff]
[ 0.000000] node 0: [mem 0x0000000020000000-0x0000000023ffffff]
(...)
[ 0.000000] Memory: 510004K/524288K available (5739K kernel code, 200K rwdata, 1196K rodata, 1236K init, 226K bss, 14284K reserved, 0K cma-reserved, 262144K highmem)

Per the log above, the detected RAM size is 512MB, when in fact this device has only 128MB. When this happens, the boot fails with a kernel panic.

After a couple of power cycles, the memory is correctly identified as 128MB (no HighMem) per below and the device boots OK:

Correct Memory Size Detected boots OK:

(...)
[ 0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x0000000000000000-0x0000000007ffffff]
[ 0.000000] HighMem empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000000000-0x0000000007ffffff]
(...)
[ 0.000000] Memory: 120916K/131072K available (5739K kernel code, 200K rwdata, 1196K rodata, 1236K init, 226K bss, 10156K reserved, 0K cma-reserved, 0K highmem)
(...)

Full log is attached.

@openwrt-bot
Author

dsouza:

I just did a test, and as expected, disabling HighMem support in **./target/linux/ramips/mt7621/config-5.4** is a temporary workaround:

CONFIG_HIGHMEM=n

HighMem was enabled in the previous builds, and they worked. I believe some kernel change or patch is causing memory detection to fail. I've tracked this to the following code, but it has not changed, and I don't understand its logic well enough to work out why it started failing:

/arch/mips/ralink/mt7621.c

static void __init mt7621_memory_detect(void)
{
	void *dm = &detect_magic;
	phys_addr_t size;

	for (size = 32 * SZ_1M; size < 256 * SZ_1M; size <<= 1) {
		if (!__builtin_memcmp(dm, dm + size, sizeof(detect_magic)))
			break;
	}

	if ((size == 256 * SZ_1M) &&
	    (CPHYSADDR(dm + size) < MT7621_LOWMEM_MAX_SIZE) &&
	    __builtin_memcmp(dm, dm + size, sizeof(detect_magic))) {
		memblock_add(MT7621_LOWMEM_BASE, MT7621_LOWMEM_MAX_SIZE);
		memblock_add(MT7621_HIGHMEM_BASE, MT7621_HIGHMEM_SIZE);
	} else {
		memblock_add(MT7621_LOWMEM_BASE, size);
	}
}

@openwrt-bot
Author

dsouza:

Per @lutchann's instructions in the forum, I've re-enabled "CONFIG_HIGHMEM=y" and removed the patch target/linux/ramips/patches-5.4/105-mt7621-memory-detect.patch.

This solved this issue.

Therefore it seems that in fact this patch is the culprit.

@openwrt-bot
Author

lutchann:

For the record here, as I mentioned in the forum, without further testing we don't know if the memory detection patch is buggy or if removing it simply perturbs the kernel build in a way that avoids running into an unrelated problem. Unless this can be reproduced reliably (or sharper eyes than mine see a bug in the implementation) I don't know if we can draw any conclusions.

@openwrt-bot
Author

dsouza:

Right. If there is any additional test I can do just let me know.

For now it's no big deal, since it only affects the snapshot builds for the Archer C6 v3 and very likely also the Archer A6 v3 (the same hardware with different branding).

This might become a more widespread issue if this patch makes it into the stable branch, where it may (or may not) affect other mt7621-based devices.

@openwrt-bot
Author

dsouza:

Just posting here what I just posted in the forum.

I think I spoke too soon when I said that removing the patch fixed the problem. In a rush to test it last night, I just removed the patch and rebooted the router. It rebooted without a boot loop, so I concluded it was OK.

But today, on closer inspection of the log, I noticed that the RAM size is being detected incorrectly: 256MB instead of 128MB (see below).

I will try now going back to my previous fix by setting "CONFIG_HIGHMEM=n" to check if it has a different outcome.

[ 0.000000] Linux version 5.4.162 (dsouza@dsouza00) (gcc version 11.2.0 (OpenWrt GCC 11.2.0 r18233-0a4f5d06c2)) #0 SMP Sun Nov 28 20:15:10 2021
[ 0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
[ 0.000000] printk: bootconsole [early0] enabled
[ 0.000000] CPU0 revision is: 0001992f (MIPS 1004Kc)
[ 0.000000] MIPS: machine is TP-Link Archer C6 v3
[ 0.000000] Initrd not found or empty - disabling initrd
[ 0.000000] VPE topology {2,2} total 4
[ 0.000000] Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
[ 0.000000] Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
[ 0.000000] MIPS secondary cache 256kB, 8-way, linesize 32 bytes.
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x0000000000000000-0x000000000fffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000000000-0x000000001bffffff]
[ 0.000000] node 0: [mem 0x0000000020000000-0x0000000023ffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000023ffffff]
[ 0.000000] On node 0 totalpages: 65536
[ 0.000000] Normal zone: 576 pages used for memmap
[ 0.000000] Normal zone: 0 pages reserved
[ 0.000000] Normal zone: 65536 pages, LIFO batch:15
[ 0.000000] percpu: Embedded 14 pages/cpu s26480 r8192 d22672 u57344
[ 0.000000] pcpu-alloc: s26480 r8192 d22672 u57344 alloc=14*4096
[ 0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 64960
[ 0.000000] Kernel command line: console=ttyS0,115200n8 rootfstype=squashfs,jffs2
[ 0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[ 0.000000] Writing ErrCtl register=0005a010
[ 0.000000] Readback ErrCtl register=0005a010
[ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.000000] Memory: 250320K/262144K available (6107K kernel code, 201K rwdata, 1240K rodata, 1272K init, 206K bss, 11824K reserved, 0K cma-reserved)

@openwrt-bot
Author

dsouza:

I did a new build and tested with "CONFIG_HIGHMEM=n". While 3 of 4 devices correctly detected 128MiB of RAM, 1 device wrongly detected 256MiB.

So for now, unfortunately, there is no reliable workaround. I'm reverting to the previous build that was working OK (r18104-2f95dd8ff0).

@openwrt-bot
Author

dsouza:

As @981213 [posted in the forum](https://forum.openwrt.org/t/master-regression-boot-loop-due-to-kernel-panic-on-latest-snapshot-mt7621-archer-c6-v3/113081/12?u=dsouza), the solution below fixed the problem by hardcoding the memory size to 128MB in the .dtsi:

--- a/target/linux/ramips/dts/mt7621_tplink_archer-x6-v3.dtsi
+++ b/target/linux/ramips/dts/mt7621_tplink_archer-x6-v3.dtsi
@@ -18,6 +18,11 @@
 		bootargs = "console=ttyS0,115200n8";
 	};
 
+	memory@0 {
+		device_type = "memory";
+		reg = <0x0 0x8000000>;
+	};
+
 	keys {
 		compatible = "gpio-keys";

Since memory size detection started failing on this device, perhaps the above workaround should be a permanent fix for Archer A6/C6 v3.x.

@aparcar added the kernel (pull request/issue with Linux kernel related changes) label Feb 22, 2022
@981213
Member

981213 commented Mar 7, 2022

Should be fixed by 2f024b7

@981213 closed this as completed Mar 7, 2022