OpenWrt/LEDE Project

  • Status Closed
  • Percent Complete 100%
  • Task Type Bug Report
  • Category Base system
  • Assigned To No-one
  • Operating System All
  • Severity High
  • Priority Very Low
  • Reported Version Trunk
  • Due in Version Undecided
  • Due Date Undecided
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by Russell Senior - 25.03.2020
Last edited by Hauke Mehrtens - 04.07.2020

FS#2928 - TP-Link TL-WDR3600 v1 on kernel 5.4 boot-loops since change to GCC 8.4.0

- Device problem occurs on

TP-Link TL-WDR3600 v1

- Software versions of OpenWrt/LEDE release, packages, etc.

Since reboot-12646-gdb70077668 (“toolchain: Update GCC 8 to version 8.4.0”), the WDR3600 boot-loops on kernel 5.4 with the following message:

Starting kernel ...

[    0.000000] Linux version 5.4.24 (openwrt@hawg) (gcc version 8.4.0 (OpenWrt GCC 8.4.0 r12683-8c33debb52)) #0 Sat Mar 21 21:35:45 2020
[    0.000000] printk: bootconsole [early0] enabled
[    0.000000] CPU0 revision is: 0001974c (MIPS 74Kc)
[    0.000000] MIPS: machine is TP-Link TL-WDR3600 v1
[    0.000000] SoC: Atheros AR9344 rev 2
[    0.000000] Initrd not found or empty - disabling initrd
[    0.000000] Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
[    0.000000] Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x0000000007ffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x0000000007ffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x0000000007ffffff]
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 32480
[    0.000000] Kernel command line: console=ttyS0,115200 rootfstype=squashfs,jffs2
[    0.000000] Dentry cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[    0.000000] Inode-cache hash table entries: 8192 (order: 3, 32768 bytes, linear)
[    0.000000] Writing ErrCtl register=00000000
[    0.000000] Readback ErrCtl register=00000000
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 122384K/131072K available (4681K kernel code, 187K rwdata, 1080K rodata, 1212K init, 196K bss, 8688K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS: 51
[    0.000000] random: get_random_bytes called from start_kernel+0x32c/0x51c with crng_init=0
[    0.000000] CPU clock: 560.000 MHz
[    0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6825930166 ns
[    0.000009] sched_clock: 32 bits at 280MHz, resolution 3ns, wraps every 7669584382ns
[    0.008305] Calibrating delay loop... 278.93 BogoMIPS (lpj=1394688)
[    0.084927] pid_max: default: 32768 minimum: 301
[    0.089999] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.097796] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.107070] Kernel panic - not syncing: Unexpected DSP exception
[    0.113470] Rebooting in 1 seconds..
Closed by  Hauke Mehrtens
04.07.2020 14:13
Reason for closing:  Fixed
Additional comments about closing:  

This was fixed in https://git.openwrt.org/4bb5e331a781c2d4f3040c70df328b1ef90f1871
The DSPen bit in the c0_status register was not set because of a hazard between mtc0 and mfc0.
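
To illustrate how a hazard between mtc0 and mfc0 can lose a status bit, here is a toy model in plain C. It is only a sketch, not kernel code and not the fix referenced above. It assumes the "DSPen bit" corresponds to Status.MX (ST0_MX in Linux's mipsregs.h), and the struct and all toy_* names are hypothetical. In the model, a write to the register stays invisible to reads until a later write or a hazard barrier (ehb) retires it, so a second read-modify-write that follows too closely reads a stale value and overwrites the first update.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as defined in Linux's arch/mips/include/asm/mipsregs.h. */
#define ST0_IE 0x00000001u /* interrupt enable, used here only as "some other bit" */
#define ST0_MX 0x01000000u /* assumed to be the DSP enable bit discussed above */

/* Hypothetical toy model of c0_status: a value written by mtc0 stays
 * in flight until a later write or a hazard barrier retires it. */
struct toy_cp0 {
    uint32_t status;  /* value an mfc0 would read right now */
    uint32_t pending; /* value written by mtc0, not yet visible */
    int in_flight;    /* non-zero while `pending` is not yet visible */
};

/* Hazard barrier: make any in-flight write visible. */
static void toy_ehb(struct toy_cp0 *c)
{
    assert(c != NULL);
    if (c->in_flight) {
        c->status = c->pending;
        c->in_flight = 0;
    }
}

/* Read the currently visible value (may be stale). */
static uint32_t toy_mfc0(const struct toy_cp0 *c)
{
    assert(c != NULL);
    return c->status;
}

/* Queue a write; an older in-flight write retires first (in order). */
static void toy_mtc0(struct toy_cp0 *c, uint32_t val)
{
    toy_ehb(c);
    c->pending = val;
    c->in_flight = 1;
}

/* Two back-to-back read-modify-write updates; returns the final status. */
static uint32_t toy_two_updates(struct toy_cp0 *c, int use_barrier)
{
    toy_mtc0(c, toy_mfc0(c) | ST0_MX); /* first update: set MX */
    if (use_barrier)
        toy_ehb(c);                    /* wait until the write is visible */
    toy_mtc0(c, toy_mfc0(c) | ST0_IE); /* second update: set another bit */
    toy_ehb(c);
    return toy_mfc0(c);
}
```

Starting from a zeroed struct toy_cp0, toy_two_updates(&c, 0) ends with only ST0_IE set (the MX update is lost), while toy_two_updates(&c, 1) ends with both ST0_MX and ST0_IE set. Real hardware timing is more subtle than this model.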

realmicu commented on 27.03.2020 23:11

I'm experiencing the same issue on a Netgear WNDR4300 (SoC: AR9344). Images compiled with the previous GCC version, 8.3.0, were OK, while 8.4.0 produces invalid code. What worked for me was switching GCC from version 8.4.0 to 9.3.0:

CONFIG_TARGET_ath79=y
CONFIG_TARGET_ath79_nand=y
CONFIG_TARGET_ath79_nand_DEVICE_netgear_wndr4300=y
CONFIG_DEVEL=y
CONFIG_TOOLCHAINOPTS=y
CONFIG_CCACHE=y
CONFIG_COLLECT_KERNEL_DEBUG=y
# CONFIG_GCC_USE_VERSION_8 is not set
CONFIG_GCC_USE_VERSION_9=y
CONFIG_GCC_VERSION="9.3.0"
CONFIG_GCC_VERSION_9=y
CONFIG_IMAGEOPT=y
CONFIG_LINUX_5_4=y
CONFIG_TESTING_KERNEL=y
Steve Brown commented on 29.03.2020 15:32

Reverting 7000f11c23e23cf11f96 ("toolchain: Update GCC 8 to version 8.4.0") fixes the problem on my TP-Link Archer A7 v5.

Russell Senior commented on 29.03.2020 21:32

Fwiw, this is the .config stub I used while bisecting:

CONFIG_TARGET_ath79=y
CONFIG_TARGET_ath79_generic=y
CONFIG_TARGET_ath79_generic_DEVICE_tplink_tl-wdr3600-v1=y
CONFIG_DEVEL=y
CONFIG_BUILD_LOG=y
# CONFIG_BUSYBOX_CONFIG_BRCTL is not set
# CONFIG_BUSYBOX_CONFIG_FREE is not set
# CONFIG_BUSYBOX_CONFIG_PGREP is not set
# CONFIG_BUSYBOX_CONFIG_TOP is not set
# CONFIG_BUSYBOX_CONFIG_UPTIME is not set
# CONFIG_PACKAGE_6relayd is not set
# CONFIG_PACKAGE_firewall is not set
# CONFIG_PACKAGE_firewall3 is not set
CONFIG_PACKAGE_iptables-mod-ipopt=y
CONFIG_PACKAGE_iptables-mod-nat-extra=y
# CONFIG_PACKAGE_odhcp6c is not set
# CONFIG_PACKAGE_ppp is not set
# CONFIG_PACKAGE_ppp-mod-pppoe is not set
CONFIG_TESTING_KERNEL=y
Project Manager
Hauke Mehrtens commented on 29.03.2020 22:34

I can reproduce it on a TP-Link TL-WDR4300 v1 with an AR9344.

It is happening in the save_dsp() function:
https://elixir.bootlin.com/linux/v5.4.28/source/arch/mips/include/asm/dsp.h#L50
which is called by arch_dup_task_struct():
https://elixir.bootlin.com/linux/v5.4.28/source/arch/mips/kernel/process.c#L110

The AR9344 says it supports the DSP extension:

root@OpenWrt:/# cat /proc/cpuinfo 
system type             : Atheros AR9344 rev 2
machine                 : TP-Link TL-WDR4300 v1
processor               : 0
cpu model               : MIPS 74Kc V4.12
BogoMIPS                : 278.78
wait instruction        : yes
microsecond timers      : yes
tlb_entries             : 32
extra interrupt vector  : yes
hardware watchpoint     : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
isa                     : mips1 mips2 mips32r1 mips32r2
ASEs implemented        : mips16 dsp dsp2
Options implemented     : tlb 4kex 4k_cache prefetch mcheck ejtag llsc dc_aliases perf_cntr_intr_bit nan_legacy nan_2008 perf
shadow register sets    : 1
kscratch registers      : 0
package                 : 0
core                    : 0
VCED exceptions         : not available
VCEI exceptions         : not available

root@OpenWrt:/# 

I added this wrapper function in between, so that the save_dsp() code shows up as a separate symbol in the disassembly:

void my_save_dsp(void)
{
	save_dsp(current);
}

The working assembler for kernel 4.19 looks like this:

80067b40 <my_save_dsp.part.8>:
80067b40:       8f830000        lw      v1,0(gp)
80067b44:       00202810        mfhi    a1,$ac1
80067b48:       00202012        mflo    a0,$ac1
80067b4c:       ac65057c        sw      a1,1404(v1)
80067b50:       8f830000        lw      v1,0(gp)
80067b54:       00403810        mfhi    a3,$ac2
80067b58:       00403012        mflo    a2,$ac2
80067b5c:       ac640580        sw      a0,1408(v1)
80067b60:       8f830000        lw      v1,0(gp)
80067b64:       00602810        mfhi    a1,$ac3
80067b68:       00602012        mflo    a0,$ac3
80067b6c:       ac670584        sw      a3,1412(v1)
80067b70:       8f830000        lw      v1,0(gp)
80067b74:       ac660588        sw      a2,1416(v1)
80067b78:       8f830000        lw      v1,0(gp)
80067b7c:       ac65058c        sw      a1,1420(v1)
80067b80:       8f830000        lw      v1,0(gp)
80067b84:       ac640590        sw      a0,1424(v1)
80067b88:       7c3f1cb8        rddsp   v1,0x3f
80067b8c:       8f820000        lw      v0,0(gp)
80067b90:       03e00008        jr      ra
80067b94:       ac430594        sw      v1,1428(v0)

The crashing assembler for kernel 5.4 looks like this:

80066db0 <my_save_dsp.part.7>:
80066db0:       8f830000        lw      v1,0(gp)
80066db4:       00202810        mfhi    a1,$ac1
80066db8:       00202012        mflo    a0,$ac1
80066dbc:       ac65048c        sw      a1,1164(v1)
80066dc0:       8f830000        lw      v1,0(gp)
80066dc4:       00403810        mfhi    a3,$ac2
80066dc8:       00403012        mflo    a2,$ac2
80066dcc:       ac640490        sw      a0,1168(v1)
80066dd0:       8f830000        lw      v1,0(gp)
80066dd4:       00602810        mfhi    a1,$ac3
80066dd8:       00602012        mflo    a0,$ac3
80066ddc:       ac670494        sw      a3,1172(v1)
80066de0:       8f830000        lw      v1,0(gp)
80066de4:       ac660498        sw      a2,1176(v1)
80066de8:       8f830000        lw      v1,0(gp)
80066dec:       ac65049c        sw      a1,1180(v1)
80066df0:       8f830000        lw      v1,0(gp)
80066df4:       ac6404a0        sw      a0,1184(v1)
80066df8:       7c3f1cb8        rddsp   v1,0x3f
80066dfc:       8f820000        lw      v0,0(gp)
80066e00:       03e00008        jr      ra
80066e04:       ac4304a4        sw      v1,1188(v0)

The two listings look almost identical: apart from the addresses and the store offsets, the instructions are the same. Is some initialization needed for the DSP extension?
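
The comparison of the two listings can be checked mechanically. The sketch below uses the instruction words copied from the 4.19 and 5.4 dumps above and tests whether they are identical once the 16-bit offsets of the sw instructions are ignored. The function name and the array names are made up for this example, and the helper only handles the small positive offsets seen here.

```c
#include <assert.h>
#include <stdint.h>

#define N_INSNS 22
#define MIPS_OP_SW 0x2bu /* major opcode of `sw` (bits 31..26) */

/* Instruction words from the kernel 4.19 listing. */
static const uint32_t insns_419[N_INSNS] = {
    0x8f830000, 0x00202810, 0x00202012, 0xac65057c,
    0x8f830000, 0x00403810, 0x00403012, 0xac640580,
    0x8f830000, 0x00602810, 0x00602012, 0xac670584,
    0x8f830000, 0xac660588, 0x8f830000, 0xac65058c,
    0x8f830000, 0xac640590, 0x7c3f1cb8, 0x8f820000,
    0x03e00008, 0xac430594,
};

/* Instruction words from the kernel 5.4 listing. */
static const uint32_t insns_54[N_INSNS] = {
    0x8f830000, 0x00202810, 0x00202012, 0xac65048c,
    0x8f830000, 0x00403810, 0x00403012, 0xac640490,
    0x8f830000, 0x00602810, 0x00602012, 0xac670494,
    0x8f830000, 0xac660498, 0x8f830000, 0xac65049c,
    0x8f830000, 0xac6404a0, 0x7c3f1cb8, 0x8f820000,
    0x03e00008, 0xac4304a4,
};

/* Returns 1 if a and b are identical except for the offsets of `sw`
 * instructions and all those offsets differ by the same amount, which
 * is stored in *delta. Returns 0 otherwise. */
static int same_except_sw_offsets(const uint32_t *a, const uint32_t *b,
                                  int n, int *delta)
{
    int have_delta = 0;

    assert(n > 0);
    for (int i = 0; i < n; i++) {
        if ((a[i] >> 26) != MIPS_OP_SW) {
            if (a[i] != b[i])
                return 0; /* a non-store instruction differs */
            continue;
        }
        if ((a[i] >> 16) != (b[i] >> 16))
            return 0; /* opcode, base or source register differs */
        int d = (int)(a[i] & 0xffff) - (int)(b[i] & 0xffff);
        if (have_delta && d != *delta)
            return 0; /* offsets are not shifted uniformly */
        *delta = d;
        have_delta = 1;
    }
    return 1;
}
```

Calling same_except_sw_offsets(insns_419, insns_54, N_INSNS, &delta) returns 1 with delta set to 240, so every store lands 240 bytes lower in the 5.4 build, which would be consistent with a different structure layout rather than different code generation for this function.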

This commit from Linux 4.20 looks interesting:
https://git.kernel.org/linus/edbb4233e7efc37dbebb10f7774b38c64080dd66

Project Manager
Hauke Mehrtens commented on 30.03.2020 22:44

I did a git bisect, and it has been broken since this kernel commit:
http://git.kernel.org/linus/9012d011660ea5cf2a623e1de207a2bc0ca6936d

As this commit changes some compiler optimizations, I assume the problem is related to a compiler bug.

Bjørn Mork commented on 03.04.2020 15:24

I can confirm this issue on a Ubiquiti UniFi AP AC Pro, so I don't think there is any reason to limit this bug to a specific device. It's probably target-wide.

Based on the kernel commit and error location @Hauke pointed out, I tried forcibly inlining the dsp_init functions. And that solved the problem for me. See attached patch.

Still needs someone to figure out why, and write a proper commit message explaining it all...
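
For readers unfamiliar with the term, "forcibly inlining" can be sketched at the C level as follows. This is not the attached patch, whose contents are not reproduced on this page; the toy_* names are hypothetical stand-ins. In the kernel, __always_inline expands to inline __attribute__((__always_inline__)), whereas a plain inline is only a hint once the compiler is allowed to decide for itself (CONFIG_OPTIMIZE_INLINING).

```c
#include <stdint.h>

/* Same attribute spelling the kernel uses behind __always_inline. */
#define toy_always_inline inline __attribute__((__always_inline__))

/* Plain `inline`: the compiler may still emit an out-of-line copy. */
static inline uint32_t toy_init_hint(uint32_t v)
{
    return v & 0x3f;
}

/* Forced inlining: GCC expands the body at every call site. */
static toy_always_inline uint32_t toy_init_forced(uint32_t v)
{
    return v & 0x3f;
}

/* Both helpers compute the same value; only code placement can differ. */
uint32_t toy_caller(uint32_t v)
{
    return toy_init_hint(v) + toy_init_forced(v);
}
```

The result of toy_caller() is the same either way. What changes is where the helper's instructions end up, which matters when they must run in a particular context, as the DSP accesses in this bug appear to.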

Project Manager
Hauke Mehrtens commented on 05.04.2020 13:44

Thank you, Bjørn Mork, that is helpful.

We had already backported the CONFIG_OPTIMIZE_INLINING change to kernel 4.19, but we did not see the problem there. I did an additional bisect and found that these two changes are also needed to cause this problem:

Since the first commit (linked below the log), the system hangs like this:

[    0.000000] CPU clock: 560.000 MHz
[    0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6825930166 ns
[    0.000008] sched_clock: 32 bits at 280MHz, resolution 3ns, wraps every 7669584382ns

https://git.kernel.org/linus/172dcd935c34b022729f45a7bbaae5cc05231533

This hang was fixed by the following commit, added 3 commits later. From that commit on, we see the DSP exception:
https://git.kernel.org/linus/de56d4c1da3e68f0ca468a55f6677bef3cee6e10

Both tests were done with this patch from OpenWrt applied manually:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=target/linux/generic/pending-4.19/220-optimize_inlining.patch;h=ae032709d2729d23c3485c0a4e9ecbbfebd6d6a6;hb=HEAD

My assumption is that the kernel did not handle DSP exceptions correctly before and this was fixed by these patches from Paul.

Project Manager
Hauke Mehrtens commented on 06.04.2020 21:35

When I revert this GCC commit it works again:
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=9fe0f3b6468871448bf40751a4f30cf20118ce6a

I created a bug report for GCC:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94506

I reverted a commit in GCC here:
https://patchwork.ozlabs.org/patch/1267087/

I also see this problem with an unmodified upstream kernel.

Thomas Walker commented on 16.04.2020 17:34

FWIW, I have 100% confirmed the problem on an Archer C7 v2, and that reverting to GCC 8.3 works. Is anyone following through with Jakub in the GCC bugzilla?

xnoreq commented on 23.04.2020 19:21

Can confirm as well. Tried flashing master with 5.4 on an Archer C7 v5 last week. Didn't boot.

Admin
Petr Štetiar commented on 03.07.2020 07:24
