FS#3054 - TL-WR841ND v5 (AR7240 rev 2) eth0: tx timeout #8372
Comments
thedukesd: The last person dear to me died tonight.
thedukesd: Even without … it took considerably longer to trigger it (more than 24 h compared to 4 min - 3 h), and it looks like it recovered without affecting the switch's normal operation. I will keep it under observation for longer and see whether the switch ends up stuck again. So far, after 50 hours of uptime, I triggered it only once, and it didn't appear to affect the LAN switch's normal operation. L.E. #1: L.E. #2: tx timeout history:
It's clear to me that removing pdata->use_flow_control = 1; improves the situation a lot, but it looks like something is still wrong (either a hardware bug or a software bug). There is basically nothing else I can add to this problem.
klukonin: I can confirm this issue on the 19.07 branch and current master with a UniFi AC Mesh device, with this error in dmesg: WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x26c/0x274 I see … And after a few seconds the router stops responding to any network packets.
klukonin: UPD. This issue occurs ONLY in 100BASE-TX Ethernet mode under heavy multi-stream TCP traffic. 1000BASE-T works smoothly without any errors.
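To check whether a device is running in the 100 Mb/s mode mentioned above, the negotiated speed can usually be read from sysfs. This is only a rough sketch: the function names `link_speed` and `is_100mbit` are made up for illustration, and on routers with a built-in switch eth0 is the CPU port, so the value may not reflect what an individual LAN port negotiated.

```shell
#!/bin/sh
# Hypothetical helpers; only the sysfs "speed" attribute is a real kernel interface.

# Print the reported link speed in Mb/s.
# $1: interface name (default eth0)
# $2: sysfs root (default /sys; overridable so the function can be tested)
link_speed() {
    iface="${1:-eth0}"
    root="${2:-/sys}"
    cat "$root/class/net/$iface/speed" 2>/dev/null
}

# Succeed only when the link reports 100 Mb/s, the mode the comment
# above associates with the tx timeout.
is_100mbit() {
    [ "$(link_speed "$@")" = "100" ]
}

# Example:
#   is_100mbit eth0 && echo "eth0 is at 100 Mb/s"
```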
klukonin: Here is my log with this issue.
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.074088] ------------[ cut here ]------------
thedukesd:
Device: TL-WR841ND v5 (SoC AR7240 rev 2)
Firmware: OpenWrt 18.06-SNAPSHOT r8004-1f0679f54d / LuCI openwrt-18.06 branch (git-20.115.52343-1e59ed1)
I use this router as a wireless client (client only, not a repeater). Two clients are connected to the 841ND v5 LAN switch (they are plugged in before the router is powered up), and nothing is connected to the WAN port. High sustained traffic (main router <-> 841ND v5 as wireless client <-> 1 client connected to the 841ND v5 LAN switch) triggers the following on the 841ND v5:
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.693553] ------------[ cut here ]------------
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.698221] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 0x802a9738
Sun Apr 26 11:48:59 2020 kern.info kernel: [25447.705315] NETDEV WATCHDOG: eth0 (ag71xx): transmit queue 0 timed out
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.711854] Modules linked in: ath9k ath9k_common ath9k_hw ath nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables gpio_button_hotplug
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.764877] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.219 #0
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.770718] Stack : 80447522 00000032 00000000 00000001 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.779167] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.787617] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.796095] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.804546] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.812979] ...
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.815465] Call Trace:
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.817785] [<8006ab8c>] 0x8006ab8c
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.821283] [<8006ab8c>] 0x8006ab8c
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.824808] [<8007f9a4>] 0x8007f9a4
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.828310] [<802a9738>] 0x802a9738
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.831824] [<8007f9dc>] 0x8007f9dc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.835360] [<802a9738>] 0x802a9738
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.838862] [<80092d00>] 0x80092d00
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.842368] [<802a95c4>] 0x802a95c4
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.845880] [<800b36fc>] 0x800b36fc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.849387] [<8006d690>] 0x8006d690
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.852894] [<800b3944>] 0x800b3944
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.856407] [<800aa878>] 0x800aa878
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.859918] [<800823cc>] 0x800823cc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.863417] [<800ae4bc>] 0x800ae4bc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.866934] [<800a9fbc>] 0x800a9fbc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.870439] [<801e3eb8>] 0x801e3eb8
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.873947] [<800660b8>] 0x800660b8
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.877438]
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.878936] ---[ end trace 3c20d976bd0e1077 ]---
Sun Apr 26 11:48:59 2020 kern.info kernel: [25447.883580] eth0: tx timeout
After this the router's LAN switch is unusable: no useful traffic passes through it. In some cases it recovers after some minutes (10+ minutes), but usually the switch just remains stuck. Unplugging and replugging the Ethernet cable in the LAN switch or resetting/powering off has no effect (the switch remains unusable: the client is unable to get an IP from DHCP, manually setting one in the right subnet changes nothing, and you can't ping the 841ND v5 or the main router; clearly the 841ND v5 switch is stuck). The 841ND v5 can still be accessed over wireless, and internet addresses can be pinged from the router (the idea is that only the Ethernet switch is affected). The only way to fix it is to issue
ifup lan
from wireless. But it's just a matter of time until the switch crashes again. After the second crash only
eth0: tx timeout
is visible in dmesg/system log.
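The manual recovery described above could be automated with a small watchdog on the router. The following is a minimal sketch, not a tested fix: the function names `count_tx_timeouts` and `recover_if_new_timeouts` are made up for illustration, and it assumes the kernel log is read with `dmesg` and that `ifup lan` is enough to revive the switch, as reported above.

```shell
#!/bin/sh
# Hypothetical helper names; only the "eth0: tx timeout" message and the
# "ifup lan" command come from the report.

# Count "eth0: tx timeout" lines in log text read from stdin.
count_tx_timeouts() {
    # grep -c prints 0 but exits non-zero when nothing matches.
    grep -c 'eth0: tx timeout' || true
}

# Read log text from stdin, compare the timeout count with the previously
# seen count ($1), and run the recovery command ($2..., default "ifup lan")
# if the count increased. Prints the current count for the caller to store.
recover_if_new_timeouts() {
    prev="$1"; shift
    [ "$#" -gt 0 ] || set -- ifup lan
    cur="$(count_tx_timeouts)"
    if [ "$cur" -gt "$prev" ]; then
        "$@" >/dev/null 2>&1
    fi
    echo "$cur"
}

# Example loop (started from an init script or a wireless-side shell):
#   seen=0
#   while sleep 30; do
#       seen="$(dmesg | recover_if_new_timeouts "$seen")"
#   done
```

This only papers over the symptom; it does not address whatever makes the transmit queue stall in the first place.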
For example, a speedtest on https://ipv6-test.com/speedtest/ can trigger it. Downloading torrents at high speed can trigger it. It does not trigger 100% of the time, but it will happen sooner or later. In 12 hours I can trigger it at least 3 times just by doing my normal stuff on the internet.
Further investigation pointed to this bug report: https://bugs.openwrt.org/index.php?do=details&task_id=106
After that we had the following commit: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=83997146e76d4097e30facf6ad89e5fa3bd7c65b
that disabled flow control for AR934x (not for ar7240).
Flow control was enabled for multiple SoCs in the following commit: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=26b8db253745b0591bfffa21f02323428f11a88f
Taking into consideration the age of the commit that enabled flow control, all OpenWrt 18.06.x releases are affected.
I decided to remove:
pdata->use_flow_control = 1;
from the following section of the file /openwrt/target/linux/ar71xx/files/arch/mips/ath79/dev-eth.c
case ATH79_SOC_AR7240:
if (id == 0) {
pdata->reset_bit |= AR71XX_RESET_GE0_PHY;
pdata->set_speed = ath79_set_speed_dummy;
and compile OpenWrt 18.06 for my TL-WR841ND v5 and flash it. At the moment, with pdata->use_flow_control = 1; removed from the above section, the firmware has 23 hours of uptime and the eth0: tx timeout issue appears to be gone; I am no longer able to trigger it no matter what I try.
Looking at the source code, flow control is enabled in OpenWrt 19.07 ar71xx too, so OpenWrt 19.07 ar71xx is probably affected as well. I don't know where to look for ath79.
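For anyone who wants to repeat the edit before building, it can be scripted as a filter. This is a sketch under stated assumptions: the function name `strip_ar7240_flow_control` is made up, and it assumes the assignment sits between the `case ATH79_SOC_AR7240:` label and the next `break;` in dev-eth.c, which should be checked against the actual file first.

```shell
#!/bin/sh
# Hypothetical helper: drop "pdata->use_flow_control = 1;" only inside the
# AR7240 case, leaving the same assignment for other SoCs untouched.
# Assumption: the case body ends at the first "break;" after the label.
strip_ar7240_flow_control() {
    awk '
        /case ATH79_SOC_AR7240:/                   { in_case = 1 }
        in_case && /pdata->use_flow_control = 1;/  { next }
        in_case && /break;/                        { in_case = 0 }
        { print }
    '
}

# Example (writes a modified copy; review it with diff before replacing the original):
#   strip_ar7240_flow_control < dev-eth.c > dev-eth.c.new
#   diff -u dev-eth.c dev-eth.c.new
```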