OpenWrt/LEDE Project

  • Status Unconfirmed
  • Percent Complete 0%
  • Task Type Bug Report
  • Category Kernel
  • Assigned To No-one
  • Operating System All
  • Severity Medium
  • Priority Very Low
  • Reported Version openwrt-18.06
  • Due in Version Undecided
  • Due Date Undecided
Attached to Project: OpenWrt/LEDE Project
Opened by Dmitry Ershov - 22.10.2018

FS#1905 - Kernel bug: brcm47xx/mips74k reboots after: bgmac_bcma bcma0:1 eth0: Found oversized packet...

Device

OpenWrt version

Stable Release OpenWrt 18.06.1

# cat /etc/openwrt_release 
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='18.06.1'
DISTRIB_REVISION='r7258-5eb055306f'
DISTRIB_TARGET='brcm47xx/mips74k'
DISTRIB_ARCH='mipsel_74kc'
DISTRIB_DESCRIPTION='OpenWrt 18.06.1 r7258-5eb055306f'
DISTRIB_TAINTS=''
Linux OpenWrt 4.14.63 #0 Wed Aug 15 20:42:39 2018 mips GNU/Linux

What does it do that it should not do

Sometimes the router hangs and reboots.
This can happen several times a day.
The log contains these entries:

Oct 22 12:50:50 OpenWrt kernel: [249137.210802] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 56, DMA issue!
Oct 22 12:50:50 OpenWrt kernel: [249137.254221] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 121, DMA issue!
Oct 22 12:50:50 OpenWrt kernel: [249137.295788] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 314, DMA issue!
Oct 22 12:50:50 OpenWrt kernel: [249137.320500] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 509, DMA issue!

A few entries from the crashlog:

<3>[249137.254221] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 121, DMA issue!
<3>[249137.295788] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 314, DMA issue!
<3>[249137.320500] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 509, DMA issue!
<0>[249137.334933] skbuff: skb_over_panic: text:80238000 len:1753 put:1753 head:86862a80 data:86862a80 tail:0x86863159 end:0x868630e0 dev:<NULL>
<4>[249137.347856] Kernel bug detected[#1]:
<4>[249137.351653] CPU: 0 PID: 7 Comm: ksoftirqd/0 Tainted: G        W       4.14.63 #0

Full crashlog attached.
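
The numbers in the skb_over_panic line can be checked by hand: tail should equal data + len, and the overrun is tail - end. Below is a minimal Python sketch of that arithmetic, meant to be run on a workstation against a saved crashlog. The function name `parse_skb_over_panic` and the regular expression are illustrative assumptions based only on the line quoted in this report; they are not part of any kernel tooling.

```python
import re

# Field layout assumed from the skb_over_panic line quoted in this report.
_PANIC_RE = re.compile(
    r"len:(?P<len>\d+)\s+put:(?P<put>\d+)\s+"
    r"head:(?P<head>[0-9a-f]+)\s+data:(?P<data>[0-9a-f]+)\s+"
    r"tail:0x(?P<tail>[0-9a-f]+)\s+end:0x(?P<end>[0-9a-f]+)"
)


def parse_skb_over_panic(line):
    """Extract buffer geometry from one skb_over_panic log line (hypothetical helper)."""
    m = _PANIC_RE.search(line)
    if m is None:
        raise ValueError("not an skb_over_panic line")
    length = int(m.group("len"))
    data = int(m.group("data"), 16)
    tail = int(m.group("tail"), 16)
    end = int(m.group("end"), 16)
    return {
        "len": length,                               # bytes the driver tried to put
        "capacity": end - data,                      # bytes available from data to end
        "overrun": tail - end,                       # bytes written past the end
        "tail_consistent": tail == data + length,    # sanity check on the log line
    }


if __name__ == "__main__":
    sample = (
        "skbuff: skb_over_panic: text:80238000 len:1753 put:1753 "
        "head:86862a80 data:86862a80 tail:0x86863159 end:0x868630e0 dev:<NULL>"
    )
    print(parse_skb_over_panic(sample))
```

For the line in this report, this gives a capacity of 1632 bytes, a put of 1753 bytes, and therefore an overrun of 121 bytes.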

Steps to reproduce

To reproduce the bug, I filled the nf_conntrack table with thousands of connections.

The table stayed full for several seconds:

net.netfilter.nf_conntrack_max = 16384
net.netfilter.nf_conntrack_count = 16383

log:

nf_conntrack: nf_conntrack: table full, dropping packet
nf_conntrack: nf_conntrack: table full, dropping packet
...

then several messages:

bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 262, DMA issue!
...

and the router rebooted.
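
To watch how close the table is to its limit while reproducing, the two sysctl values above can be turned into a fill ratio. This is a minimal Python sketch that parses saved `key = value` output on a workstation; the function name `conntrack_fill` is a hypothetical helper, and the only inputs it assumes are the two keys shown in this report.

```python
def conntrack_fill(sysctl_text):
    """Return (count, limit, ratio) from sysctl-style 'key = value' lines.

    Hypothetical helper: expects the nf_conntrack_count and
    nf_conntrack_max keys exactly as quoted in this report.
    """
    values = {}
    for line in sysctl_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that are not 'key = value'
            values[key.strip()] = value.strip()
    count = int(values["net.netfilter.nf_conntrack_count"])
    limit = int(values["net.netfilter.nf_conntrack_max"])
    return count, limit, count / limit


if __name__ == "__main__":
    sample = (
        "net.netfilter.nf_conntrack_max = 16384\n"
        "net.netfilter.nf_conntrack_count = 16383\n"
    )
    count, limit, ratio = conntrack_fill(sample)
    print(f"{count}/{limit} entries used ({ratio:.2%})")
```

With the values from this report the table is one entry short of full, which matches the "table full, dropping packet" messages in the log.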

Another way to reproduce

Running a bandwidth test at http://www.speedtest.net produces the same error messages in the log:

...
Oct 22 14:46:08 OpenWrt kernel: [ 6908.757065] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 118, DMA issue!
Oct 22 14:46:08 OpenWrt kernel: [ 6908.770969] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 183, DMA issue!
Oct 22 14:46:08 OpenWrt kernel: [ 6908.800488] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 442, DMA issue!
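
When comparing several logs, it can help to pull the slot numbers and kernel timestamps out of these messages and see how tightly they cluster. The following is a rough Python sketch for that; the helper names `oversized_events` and `burst_span_ms` and the regular expression are assumptions derived from the message format quoted in this report, not from the driver source.

```python
import re

# Pattern assumed from the bgmac messages quoted in this report.
_OVERSIZED_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+bgmac_bcma \S+ (?P<dev>\S+): "
    r"Found oversized packet at slot (?P<slot>\d+), DMA issue!"
)


def oversized_events(log_text):
    """Return a list of (kernel_timestamp_seconds, slot) tuples (hypothetical helper)."""
    events = []
    for line in log_text.splitlines():
        m = _OVERSIZED_RE.search(line)
        if m:
            events.append((float(m.group("ts")), int(m.group("slot"))))
    return events


def burst_span_ms(events):
    """Time between the first and last event, in milliseconds."""
    if not events:
        return 0.0
    times = [ts for ts, _slot in events]
    return (max(times) - min(times)) * 1000.0


if __name__ == "__main__":
    sample = """\
Oct 22 14:46:08 OpenWrt kernel: [ 6908.757065] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 118, DMA issue!
Oct 22 14:46:08 OpenWrt kernel: [ 6908.770969] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 183, DMA issue!
Oct 22 14:46:08 OpenWrt kernel: [ 6908.800488] bgmac_bcma bcma0:1 eth0: Found oversized packet at slot 442, DMA issue!
"""
    events = oversized_events(sample)
    print([slot for _ts, slot in events], f"{burst_span_ms(events):.1f} ms")
```

For the three lines above, the slots are 118, 183 and 442, and the messages fall within about 43 ms of each other.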

What I have already done to workaround/fix the problem

I tried decreasing the value of nf_conntrack_tcp_timeout_established
from

net.netfilter.nf_conntrack_tcp_timeout_established = 7440

to

net.netfilter.nf_conntrack_tcp_timeout_established = 900

Typical connection count afterwards:

net.netfilter.nf_conntrack_count = 5130 (...7000)

but the router still reboots sometimes.

Additional info

  • Wireless disabled
  • IPv6 disabled
  • VLANs used

Koen Vandeputte (Project Manager) commented on 23.10.2018 09:06

Does this also occur on the latest state of the 18.06 branch or on master?

Thanks.

Dmitry Ershov commented on 23.10.2018 13:02

Sorry, I'm afraid of bricking the router, so for now I can only check the stable release 18.06.1.
I could check a release candidate or something similar.

Jaroslav Škarvada commented on 29.10.2018 23:07

I am not sure whether I am also affected by this, because I have a different crashlog, but the reproducer is the same: just running speedtest.net makes the router reboot. I have 18.06.1 on an Asus Wl-500gp. The reboot can happen dozens of times per day, which is really annoying. I will try downgrading to the 17 release. My crashlog:

<1>[ 665.717726] Data bus error, epc == 8000cb10, ra == 80005ff4
<4>[ 665.723387] Oops[#1]:
<4>[ 665.725710] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 4.14.63 #0
<4>[ 665.731961] task: 81822100 task.stack: 81840000
<4>[ 665.736541] $ 0 : 00000000 1000dc01 00000000 00000000
<4>[ 665.741852] $ 4 : 81840000 00800010 00000000 fffffffe
<4>[ 665.747159] $ 8 : 1000dc01 1000001e 00000001 00000200
<4>[ 665.752465] $12 : ffffffff 00009d60 00001d00 0000058f
<4>[ 665.757772] $16 : 81841d30 8025a1f4 8c820014 8a6eb109
<4>[ 665.763077] $20 : 00000000 8025d5d0 80463f60 80418dd8
<4>[ 665.768383] $24 : 00000000 802b5bcc
<4>[ 665.773690] $28 : 81840000 81841d00 0000000a 80005ff4
<4>[ 665.778999] Hi : 01c7b4a1
<4>[ 665.781915] Lo : 6c4117c0
<4>[ 665.784839] epc : 8000cb10 0x8000cb10
<4>[ 665.788728] ra : 80005ff4 0x80005ff4
<4>[ 665.792603] Status: 1000dc03 KERNEL EXL IE
<4>[ 665.796857] Cause : 0080001c (ExcCode 07)
<4>[ 665.800917] PrId : 00029006 (Broadcom BMIPS3300)
<4>[ 665.805666] Modules linked in: pppoe ppp_async b43 pppox ppp_generic nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables usb_storage uhci_hcd sd_mod scsi_mod ext4 jbd2 mbcache crc16 crc32c_generic crypto_hash leds_gpio ohci_platform ohci_hcd ehci_platform ehci_hcd gpio_button_hotplug
<4>[ 665.877876] usbcore nls_base usb_common ssb_hcd
<4>[ 665.882580] Process ksoftirqd/0 (pid: 7, threadinfo=81840000, task=81822100, tls=00000000)
<4>[ 665.890904] Stack : 00000008 802b09b4 00000720 00000720 00000000 80ecb94c 80ecb920 81826f00
<4>[ 665.899376] 00000001 00000100 80463f60 80005ff4 8042a3d8 802b0178 8092a240 80412a3c
<4>[ 665.907851] 8194d000 80269e48 00000160 00000160 00000000 1000dc01 00000001 00000009
<4>[ 665.916325] 8a6eb0f5 0000000d 80421dfc 00000000 00000002 81841e10 00000001 00000200
<4>[ 665.924800] ffffffff 00009d60 00001d00 0000058f 00000000 80ecb94c 80ecb920 81826f00
<4>[ 665.933273] ...
<4>[ 665.935761] Call Trace:
<4>[ 665.935786] [<802b09b4>] 0x802b09b4
<4>[ 665.941845] [<80005ff4>] 0x80005ff4
<4>[ 665.945383] [<802b0178>] 0x802b0178
<4>[ 665.948942] [<80269e48>] 0x80269e48
<4>[ 665.952536] [<802b5bcc>] 0x802b5bcc
<4>[ 665.956073] [<800c9028>] 0x800c9028
<4>[ 665.959635] [<8025d5d0>] 0x8025d5d0
<4>[ 665.963170] [<8025a1f4>] 0x8025a1f4
<4>[ 665.966731] [<80259534>] 0x80259534
<4>[ 665.970292] [<8025d5d0>] 0x8025d5d0
<4>[ 665.973853] [<8026b1d0>] 0x8026b1d0
<4>[ 665.977388] [<8039d2f0>] 0x8039d2f0
<4>[ 665.980945] [<8003bc90>] 0x8003bc90
<4>[ 665.984499] [<8003bc90>] 0x8003bc90
<4>[ 665.988033] [<80020f00>] 0x80020f00
<4>[ 665.991594] [<8003bdf8>] 0x8003bdf8
<4>[ 665.995129] [<80399d68>] 0x80399d68
<4>[ 665.998723] [<80038aa0>] 0x80038aa0
<4>[ 666.002259] [<80038974>] 0x80038974
<4>[ 666.005819] [<80038974>] 0x80038974
<4>[ 666.009371] [<80005688>] 0x80005688
<4>[ 666.012912]
<4>[ 666.014429] Code: 00431024 144000ca 00000000 <8a760003> 9a760000 24130000 166000b6 00000000 1000000f
<4>[ 666.024326]
<4>[ 666.026090] ---[ end trace 38aad1d57e98abb5 ]---

Jaroslav Škarvada commented on 30.10.2018 00:00

I downgraded to 17.01.6 and it seems to work, i.e. it no longer reboots when running speedtest.net.

Jaroslav Škarvada commented on 30.10.2018 01:04

Sorry for the noise; in my case it's https://dev.archive.openwrt.org/ticket/11091

Dmitry Ershov commented on 30.10.2018 10:16

Jaroslav Škarvada, thank you for the info and ticket.

I have just attached two more crashlogs for comparison.

Jaroslav Škarvada commented on 30.10.2018 12:36

Hmm, I don't see the 'Found oversized packet at slot' message in the log. That is probably because your device uses BGMAC_BCMA and mine uses the b44 driver. Maybe it's related (e.g. the core of the problem lies somewhere higher and is common to BRCM SoCs), or maybe it is a completely unrelated problem (I cannot judge at the moment). The fact is that my device starts rebooting under heavy network load on the WAN, which even speedtest.net can trigger (17.01.6 seems a bit more stable for me, but the problem still occurs). I have temporarily worked around the problem by shaping the WAN speed to 20 MBit. I am going to bisect, because it worked OK with the ancient 2.4 kernels and I have some reports that it also worked with some 2.6 kernels. I will probably open another bug report for it (because the archived ticket 11091 seems no longer valid).
