FS#2026 - kernel ipq806x Oops #8517

Closed
openwrt-bot opened this issue Dec 26, 2018 · 5 comments
@openwrt-bot

por:

Device: netgear,r7800
Version: 18.06.1

Steps to reproduce:
The crash will happen with 100% certainty, but what exactly causes the problem is a little hard to say. It looks like large internet frames on WAN interface somehow trigger the error.

The WAN interface (eth.2) is untagged; on the LAN ports, besides 1 untagged VLAN, 3 untagged VLANs exist. The only (possibly) relevant extra installed package is ip-tiny; it isn't used for any custom configuration, though.

From the syslog (from boot till Oops, seemingly non-related stuff redacted out):

Wed Dec 26 15:18:27 2018 kern.info kernel: [ 0.000000] Booting Linux on physical CPU 0x0
Wed Dec 26 15:18:27 2018 kern.notice kernel: [ 0.000000] Linux version 4.14.63 (buildbot@builds-03.infra.lede-project.org) (gcc version 7.3.0 (OpenWrt GCC 7.3.0 r7102-3f3a2c9)) #0 SMP Thu Aug 16 07:51:15 2018
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 0.000000] CPU: ARMv7 Processor [512f04d0] revision 0 (ARMv7), cr=10c5787d
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 0.000000] CPU: div instructions available: patching division code
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 0.000000] OF: fdt: Machine model: Netgear Nighthawk X4S R7800
...
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 1.304591] libphy: GPIO Bitbanged MDIO: probed
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 1.325983] switch0: Atheros AR8337 rev. 2 switch registered on gpio-0
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.210243] libphy: Fixed MDIO Bus: probed
Wed Dec 26 15:18:27 2018 kern.warn kernel: [ 2.212378] ipq806x-gmac-dwmac 37200000.ethernet: PTP uses main clock
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.213549] stmmac - user ID: 0x10, Synopsys ID: 0x37
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.219747] ipq806x-gmac-dwmac 37200000.ethernet: Ring mode enabled
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.224876] ipq806x-gmac-dwmac 37200000.ethernet: DMA HW capability register supported
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.230859] ipq806x-gmac-dwmac 37200000.ethernet: Enhanced/Alternate descriptors
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.238924] ipq806x-gmac-dwmac 37200000.ethernet: Enabled extended descriptors
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.246483] ipq806x-gmac-dwmac 37200000.ethernet: RX Checksum Offload Engine supported
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.253511] ipq806x-gmac-dwmac 37200000.ethernet: COE Type 2
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.261323] ipq806x-gmac-dwmac 37200000.ethernet: TX Checksum insertion supported
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.267227] ipq806x-gmac-dwmac 37200000.ethernet: Wake-Up On Lan supported
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.274601] ipq806x-gmac-dwmac 37200000.ethernet: Enable RX Mitigation via HW Watchdog Timer
Wed Dec 26 15:18:27 2018 kern.warn kernel: [ 2.282886] ipq806x-gmac-dwmac 37400000.ethernet: PTP uses main clock
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.290081] stmmac - user ID: 0x10, Synopsys ID: 0x37
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.296320] ipq806x-gmac-dwmac 37400000.ethernet: Ring mode enabled
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.301257] ipq806x-gmac-dwmac 37400000.ethernet: DMA HW capability register supported
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.307399] ipq806x-gmac-dwmac 37400000.ethernet: Enhanced/Alternate descriptors
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.315397] ipq806x-gmac-dwmac 37400000.ethernet: Enabled extended descriptors
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.322956] ipq806x-gmac-dwmac 37400000.ethernet: RX Checksum Offload Engine supported
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.329903] ipq806x-gmac-dwmac 37400000.ethernet: COE Type 2
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.337865] ipq806x-gmac-dwmac 37400000.ethernet: TX Checksum insertion supported
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.343699] ipq806x-gmac-dwmac 37400000.ethernet: Wake-Up On Lan supported
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.350995] ipq806x-gmac-dwmac 37400000.ethernet: Enable RX Mitigation via HW Watchdog Timer
...
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 2.428253] 8021q: 802.1Q VLAN Support v1.8
...
Wed Dec 26 15:18:27 2018 user.info kernel: [ 3.839793] init: - watchdog -
...
Wed Dec 26 15:18:27 2018 user.info kernel: [ 4.407941] kmodloader: loading kernel modules from /etc/modules-boot.d/*
...
Wed Dec 26 15:18:27 2018 user.info kernel: [ 5.088295] kmodloader: done loading kernel modules from /etc/modules-boot.d/*
Wed Dec 26 15:18:27 2018 user.info kernel: [ 5.098476] init: - preinit -
...
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 6.799339] Generic PHY fixed-0:01: attached PHY driver [Generic PHY] (mii_bus:phy_addr=fixed-0:01, irq=POLL)
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 6.800863] dwmac1000: Master AXI performs any burst length
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 6.808373] ipq806x-gmac-dwmac 37400000.ethernet eth1: IEEE 1588-2008 Advanced Timestamp supported
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 6.813856] ipq806x-gmac-dwmac 37400000.ethernet eth1: registered PTP clock
Wed Dec 26 15:18:27 2018 kern.info kernel: [ 7.834378] ipq806x-gmac-dwmac 37400000.ethernet eth1: Link is Up - 1Gbps/Full - flow control off
...
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.015710] Generic PHY fixed-0:01: attached PHY driver [Generic PHY] (mii_bus:phy_addr=fixed-0:01, irq=POLL)
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.016654] dwmac1000: Master AXI performs any burst length
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.016672] ipq806x-gmac-dwmac 37400000.ethernet eth1: IEEE 1588-2008 Advanced Timestamp supported
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.016873] ipq806x-gmac-dwmac 37400000.ethernet eth1: registered PTP clock
...
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.020482] device eth1.1 entered promiscuous mode
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.020487] device eth1 entered promiscuous mode
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.066212] device eth1.127 entered promiscuous mode
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.120009] device eth1.34 entered promiscuous mode
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.162731] device eth1.10 entered promiscuous mode
...
Wed Dec 26 15:18:31 2018 daemon.notice netifd: Interface 'wan' is enabled
Wed Dec 26 15:18:31 2018 daemon.notice netifd: Interface 'wan' is setting up now
Wed Dec 26 15:18:31 2018 daemon.notice netifd: Interface 'wan' is now up
...
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.174821] Generic PHY fixed-0:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=fixed-0:00, irq=POLL)
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.175880] dwmac1000: Master AXI performs any burst length
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.175902] ipq806x-gmac-dwmac 37200000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
Wed Dec 26 15:18:31 2018 kern.info kernel: [ 35.176061] ipq806x-gmac-dwmac 37200000.ethernet eth0: registered PTP clock
...
Wed Dec 26 15:18:32 2018 kern.info kernel: [ 36.072469] ipq806x-gmac-dwmac 37400000.ethernet eth1: Link is Up - 1Gbps/Full - flow control off
Wed Dec 26 15:18:32 2018 daemon.notice netifd: Network device 'eth1' link is up
Wed Dec 26 15:18:32 2018 daemon.notice netifd: VLAN 'eth1.1' link is up
Wed Dec 26 15:18:32 2018 daemon.notice netifd: VLAN 'eth1.10' link is up
Wed Dec 26 15:18:32 2018 daemon.notice netifd: VLAN 'eth1.34' link is up
Wed Dec 26 15:18:32 2018 daemon.notice netifd: VLAN 'eth1.127' link is up
...
Wed Dec 26 15:18:32 2018 daemon.notice netifd: Network device 'eth0' link is up
Wed Dec 26 15:18:32 2018 daemon.notice netifd: Interface 'wan' has link connectivity
Wed Dec 26 15:18:32 2018 kern.info kernel: [ 36.232391] ipq806x-gmac-dwmac 37200000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
...
Wed Dec 26 15:18:47 2018 daemon.info procd: - init complete -
...
Wed Dec 26 18:09:06 2018 kern.err kernel: [ 1488.045291] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:09:06 2018 kern.err kernel: [ 1488.546134] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:09:07 2018 kern.err kernel: [ 1489.546246] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:09:09 2018 kern.err kernel: [ 1491.546705] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:09:13 2018 kern.err kernel: [ 1495.548558] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:09:17 2018 kern.err kernel: [ 1499.548584] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:09:21 2018 kern.err kernel: [ 1503.548245] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:09:25 2018 kern.err kernel: [ 1507.549643] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:09:29 2018 kern.err kernel: [ 1511.551169] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:09:33 2018 kern.err kernel: [ 1515.552148] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:09:37 2018 kern.err kernel: [ 1519.553088] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1990 larger than size (1536)
Wed Dec 26 18:24:05 2018 kern.alert kernel: [ 2387.041393] Unable to handle kernel paging request at virtual address 64616f86
Wed Dec 26 18:24:05 2018 kern.alert kernel: [ 2387.041483] pgd = c0204000
Wed Dec 26 18:24:05 2018 kern.alert kernel: [ 2387.047548] [64616f86] *pgd=00000000
@openwrt-bot

por:

Curious detail: the ipq806x-gmac-dwmac messages refer to faults on eth0, but these faults occur only when the switch is connected to eth1 ...

please IGNORE

@openwrt-bot

por:

This OOPS also happens on a tplink-c2600, so it concerns more devices of the ipq806x target.

Also on the c2600 the OOPS is preceded by a number of messages like
kern.err kernel: [ 92.235675] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1994 larger than size (1536)
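
To see how often these messages occur before the OOPS, a saved syslog can be scanned for them. A minimal sketch in Python, assuming the message format quoted above; the function name `find_oversized_frames` and the sample lines are only for illustration:

```python
import re

# Matches the stmmac message quoted above, e.g.
# "... eth0: len 1994 larger than size (1536)"
OVERSIZE_RE = re.compile(
    r"(?P<dev>\S+): len (?P<len>\d+) larger than size \((?P<size>\d+)\)"
)

def find_oversized_frames(lines):
    """Return (device, frame_len, buffer_size) for each matching log line."""
    hits = []
    for line in lines:
        m = OVERSIZE_RE.search(line)
        if m:
            hits.append((m.group("dev"), int(m.group("len")), int(m.group("size"))))
    return hits

sample = [
    "kern.err kernel: [ 92.235675] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1994 larger than size (1536)",
    "kern.info kernel: [ 35.020487] device eth1 entered promiscuous mode",
]
print(find_oversized_frames(sample))
```

On a real device the lines could come from a saved copy of the syslog instead of the inline sample.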

The error happens within a minute after WAN is connected to the modem.

A test where WAN was mapped to a port on eth1 also resulted in the OOPS.

This OOPS is triggered when WAN is connected to a UBEE cable modem - presumably that device outputs frames with an incorrect len. Within a few days the ISP will change the modem for another DOCSIS3 modem. (A test behind a Compal modem did not trigger the OOPS).

@openwrt-bot

por:

According to the OpenWrt forum the error also hits the NBG6817.

@openwrt-bot

por:

In drivers/net/ethernet/stmicro/stmmac/stmmac_main.c, when the length of a received frame is larger than the size of the DMA buffer, the message mentioned above is logged and the code breaks out of the loop that processes arrived frames [1].

This break was introduced in the commit "stmmac: fix oversized frame reception" (commit e527c4a769d375ac0472450c52bde29087f49cd9) [2].

Speculation 1:
Some of the other conditions are followed by an else branch within the loop instead of a break out of the loop, so maybe breaking out is not the correct step.
Since the OOPS is always preceded by several occurrences of oversized frames, possibly some needed resource handling (DMA?) is skipped.

As an aside, the test on the frame length is directly followed by a test (LLC frame type) that subtracts a few bytes from the length. Could it be that this test should come before the test on the frame length?
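
A quick check suggests that swapping the two tests would not change the outcome for the frames logged in this report. The sketch below is a toy model, not driver code: it assumes the "few bytes" are the 4-byte Ethernet FCS (an assumption), and it takes the buffer size and frame lengths from the log messages above.

```python
ETH_FCS_LEN = 4       # assumed number of bytes subtracted for the frame type
DMA_BUF_SIZE = 1536   # buffer size reported in the log messages

def oversized(frame_len, strip_first):
    """Model the length test with the subtraction applied before or after it."""
    if strip_first:
        frame_len -= ETH_FCS_LEN
    return frame_len > DMA_BUF_SIZE

for frame_len in (1990, 1994):   # lengths seen on the R7800 and the C2600
    print(frame_len, oversized(frame_len, False), oversized(frame_len, True))
```

Under this assumption only frames of 1537 to 1540 bytes would be classified differently by the two orderings, so the order of the tests alone would not explain the frames seen here.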

Speculation 2:
Possibly the frame length is not correctly determined.

[1] https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L3398

[2] torvalds/linux@e527c4a

@openwrt-bot

hnyman:

Just adding here the best analysis & possible fixes that I have seen so far:

discussion by "ewald" in forum:

https://forum.openwrt.org/t/nbg6817-openwrt-rebooting-constantly/23036/51

(and a few earlier messages in the same thread)

I have been trying to fix stmmac_main.c by adding code to properly deal with these unexpectedly large packets, but it requires a lot of code changes due to the way the rx ring buffer sizes are (pre)allocated based on MTU size.
The kernel panic that I hit was due to a missing call to set the proper dma size (basically dma would overrun the buffer based on the real size of the packet), after which an skb buffer free would free illegal memory. There is now a patch for this problem: here. I managed to fix this and 2 other issues, but kept hitting new bugs like starvation (basically the ethernet port hangs).
I believe the correct fix for this is now posted here. Most hangs went away, but some remained...
When you configure a larger MTU, the driver allocates a 2K, 4K or 16K buffer and in this way manages to bypass a number of defects due to the extra headroom.
That is why with MTU=4088 my router has been performing flawlessly for the last 18 days, despite all the stress tests thrown at it.

I think it's worth back-porting the master branch patches (a long list...) to 4.14.93 (or whatever is the current release). There are 20 or so major code changes/patches submitted since the 4.14 version. If I have some time I will try to generate a patch set and build a kernel.
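
The MTU workaround described in the quote can be illustrated with a small model. This is only a sketch: the buffer sizes are the ones named in the quote plus the 1536-byte default from the log, and the selection rule (smallest listed buffer that holds the MTU) is an assumption; the driver's real rule may differ. The name `pick_buffer_size` is hypothetical.

```python
# Buffer sizes named in the quoted post, plus the 1536-byte default from the log.
BUFFER_SIZES = (1536, 2048, 4096, 16384)

def pick_buffer_size(mtu):
    """Hypothetical rule: smallest listed buffer that can hold `mtu` bytes."""
    for size in BUFFER_SIZES:
        if mtu <= size:
            return size
    raise ValueError("MTU larger than any listed buffer size")

def frame_fits(frame_len, mtu):
    """True if a received frame of frame_len bytes fits the buffer chosen for mtu."""
    return frame_len <= pick_buffer_size(mtu)

print(pick_buffer_size(1500), frame_fits(1990, 1500))   # default MTU
print(pick_buffer_size(4088), frame_fits(1990, 4088))   # MTU from the post
```

In this model the 1990-byte frames from the log exceed the default buffer but fit once the MTU is raised to 4088, which is consistent with the headroom explanation in the quote.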

Possible fixes:

torvalds/linux@fa0be0a#diff-11cf855fef243c84239e6e5a90c50fd1

torvalds/linux@4205c88#diff-11cf855fef243c84239e6e5a90c50fd1
