Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FS#3054 - TL-WR841ND v5 (AR7240 rev 2) eth0: tx timeout #8372

Open
openwrt-bot opened this issue Apr 27, 2020 · 5 comments
Open

FS#3054 - TL-WR841ND v5 (AR7240 rev 2) eth0: tx timeout #8372

openwrt-bot opened this issue Apr 27, 2020 · 5 comments
Labels
flyspray kernel pull request/issue with Linux kernel related changes

Comments

@openwrt-bot
Copy link

thedukesd:

Device: TL-WR841ND v5 (SoC AR7240 rev 2)
Firmware: OpenWrt 18.06-SNAPSHOT r8004-1f0679f54d / LuCI openwrt-18.06 branch (git-20.115.52343-1e59ed1)

I use this router as wireless client (it's only client, it's not repeater). There are 2 clients connected on the 841ND v5 lan switch (the clients are plugged before the router is powerup), nothing connected on the wan port. High sustained traffic (main router <-> 841ND v5 as wireless client <-> 1 client connected to the 841ND v5 lan switch) is triggering the following on the 841ND v5:

Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.693553] ------------[ cut here ]------------
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.698221] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 0x802a9738
Sun Apr 26 11:48:59 2020 kern.info kernel: [25447.705315] NETDEV WATCHDOG: eth0 (ag71xx): transmit queue 0 timed out
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.711854] Modules linked in: ath9k ath9k_common ath9k_hw ath nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables gpio_button_hotplug
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.764877] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.219 #0
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.770718] Stack : 80447522 00000032 00000000 00000001 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.779167] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.787617] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.796095] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.804546] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.812979] ...
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.815465] Call Trace:
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.817785] [<8006ab8c>] 0x8006ab8c
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.821283] [<8006ab8c>] 0x8006ab8c
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.824808] [<8007f9a4>] 0x8007f9a4
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.828310] [<802a9738>] 0x802a9738
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.831824] [<8007f9dc>] 0x8007f9dc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.835360] [<802a9738>] 0x802a9738
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.838862] [<80092d00>] 0x80092d00
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.842368] [<802a95c4>] 0x802a95c4
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.845880] [<800b36fc>] 0x800b36fc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.849387] [<8006d690>] 0x8006d690
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.852894] [<800b3944>] 0x800b3944
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.856407] [<800aa878>] 0x800aa878
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.859918] [<800823cc>] 0x800823cc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.863417] [<800ae4bc>] 0x800ae4bc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.866934] [<800a9fbc>] 0x800a9fbc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.870439] [<801e3eb8>] 0x801e3eb8
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.873947] [<800660b8>] 0x800660b8
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.877438]
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.878936] ---[ end trace 3c20d976bd0e1077 ]---
Sun Apr 26 11:48:59 2020 kern.info kernel: [25447.883580] eth0: tx timeout

after this the router lan switch is unusable. After the crash there is no useful traffic going on the lan switch. In some cases after some minutes (10+ minutes) it recovers but usualy the switch just remain stuck, unpluging and repluging the eth cable in the lan switch or reseting/powering off has no effect (the switch remain unusable; client is unable to get an ip from dhcp and manual setting one in the right class changes nothing, you can't ping the 841ND v5 or the main router, clearly the 841ND v5 switch is stuck) The 841ND V5 can be accessed from wireless, internet addresses can be pinged from the router (the ideea is that only the eth switch is affected). Only way to fix it is to issue
ifup lan
from wireless. But it's just matter of time until the switch will crash again. After second crash only
eth0: tx timeout
is visible in dmesg/sys log.
For example a speedtest on https://ipv6-test.com/speedtest/ can trigger it. Downloading torrents at high speed can trigger it. It's not an 100% success rate to trigger it but it will happen sooner or later. In 12 hours I can trigger it at least 3 times by just doing my normal stuff on the internet.

Further investigation pointed to this bug report: https://bugs.openwrt.org/index.php?do=details&task_id=106
After that we had the following commit: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=83997146e76d4097e30facf6ad89e5fa3bd7c65b
that disabled flow control for AR934x (not for ar7240).
Flow control was enabled for multiple SoCs in the following commit: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=26b8db253745b0591bfffa21f02323428f11a88f
Taking into consideration the age of the commit that enabled flow control all OpenWRT 18.6.x are affected.
I decided to remove:
pdata->use_flow_control = 1;
from the following section of the file /openwrt/target/linux/ar71xx/files/arch/mips/ath79/dev-eth.c

case ATH79_SOC_AR7240:
if (id == 0) {
pdata->reset_bit |= AR71XX_RESET_GE0_PHY;
pdata->set_speed = ath79_set_speed_dummy;

		pdata->phy_mask = BIT(4);
	} else {
		pdata->reset_bit |= AR71XX_RESET_GE1_PHY;
		pdata->set_speed = ath79_set_speed_dummy;

		pdata->speed = SPEED_1000;
		pdata->duplex = DUPLEX_FULL;
		pdata->switch_data = &ath79_switch_data;
		pdata->use_flow_control = 1;

		ath79_switch_data.phy_poll_mask |= BIT(4);
	}
	pdata->has_gbit = 1;
	pdata->is_ar724x = 1;
	if (ath79_soc == ATH79_SOC_AR7240)
		pdata->is_ar7240 = 1;
	break;</code>

and compile OpenWRT 18.06 for my TL-WR841ND v5 and flash it. Atm with the firmware with pdata->use_flow_control = 1; removed from the above section I have 23 hours uptime and the eth0: tx timeout issue looks to be gone, I'm no longer capable to trigger it no matter what I try.

By looking at the source code in OpenWRT 19.07 ar71xx flow control is enable there too so OpenWRT 19.07 ar71xx is probably affected also. I don't know where to look for ath79.

@openwrt-bot
Copy link
Author

thedukesd:

the last person dear to me died tonight.
taking into account the situation I will not be able to check for some time (I'm sorry but I can't specify how long, this event is marked with the highest risk for me) what is happening in this bug report.
I hope you understand the situation.

@openwrt-bot
Copy link
Author

thedukesd:

Even without
pdata->use_flow_control = 1;
I still got the tx timeout today but it didn't look to affect the eth switch normal operation, in fact I didn't notice it until now when I decided to check the 841ND v5 logs:
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.480324] ------------[ cut here ]------------
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.485116] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 0x802a9738
Tue Apr 28 14:43:02 2020 kern.info kernel: [152891.492355] NETDEV WATCHDOG: eth0 (ag71xx): transmit queue 0 timed out
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.498981] Modules linked in: ath9k ath9k_common ath9k_hw ath nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables gpio_button_hotplug
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.552102] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.219 #0
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.558037] Stack : 80447522 00000032 00000000 00000001 00000000 00000000 00000000 00000000
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.566578] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.575116] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.583646] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.592192] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.600730] ...
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.603284] Call Trace:
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.605720] [<8006ab8c>] 0x8006ab8c
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.609312] [<8006ab8c>] 0x8006ab8c
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.612925] [<8007f9a4>] 0x8007f9a4
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.616512] [<802a9738>] 0x802a9738
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.620110] [<8007f9dc>] 0x8007f9dc
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.623729] [<802a9738>] 0x802a9738
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.627317] [<80092d88>] 0x80092d88
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.630930] [<802a95c4>] 0x802a95c4
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.634517] [<800b36fc>] 0x800b36fc
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.638108] [<8006d690>] 0x8006d690
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.641717] [<800b3944>] 0x800b3944
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.645304] [<800aa878>] 0x800aa878
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.648904] [<800823cc>] 0x800823cc
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.652507] [<800ae4bc>] 0x800ae4bc
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.656092] [<800a9fbc>] 0x800a9fbc
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.659683] [<801e3eb8>] 0x801e3eb8
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.663288] [<800660b8>] 0x800660b8
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.666865]
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.668451] ---[ end trace 059bf2a9bcb83f00 ]---
Tue Apr 28 14:43:02 2020 kern.info kernel: [152891.673185] eth0: tx timeout

Took considerable longer to trigger it (more than 24h compared to 4 min - 3h) and it looks like it recovered without affecting the switch normal operation.

I will just keep it under observation for more and see if the switch end up stuck again or not.

So far after 50 hours of uptime I triggered only 1 time and it didn't look to affect the lan switch normal operation.

L.E. #1:
2 more eth0: tx timeout but none of them affected the normal switch operation (only when I checked the router log I noticed it happened):
Thu Apr 30 14:52:59 2020 kern.info kernel: [326285.474663] eth0: tx timeout
Thu Apr 30 15:27:19 2020 kern.info kernel: [328345.438975] eth0: tx timeout

uptime now 91 hours.

L.E. #2:
uptime: 14d 17h 56m 42s
5d 20h 7m 47s ago switch hanged without an tx timeout error. I had to ifup lan from wireless to fix it.

tx timeout history:

[130930.802621] ------------[ cut here ]------------
[130930.807411] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 0x802a9738
[130930.814604] NETDEV WATCHDOG: eth0 (ag71xx): transmit queue 0 timed out
[130930.821225] Modules linked in: ath9k ath9k_common ath9k_hw ath nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables gpio_button_hotplug
[130930.874415] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.219 #0
[130930.880350] Stack : 80447522 00000032 00000000 00000001 00000000 00000000 00000000 00000000
[130930.888892] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[130930.897429] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[130930.905960] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[130930.914497] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[130930.923027] ...
[130930.925580] Call Trace:
[130930.928020] [<8006ab8c>] 0x8006ab8c
[130930.931608] [<8006ab8c>] 0x8006ab8c
[130930.935228] [<8007f9a4>] 0x8007f9a4
[130930.938816] [<802a9738>] 0x802a9738
[130930.942427] [<8007f9dc>] 0x8007f9dc
[130930.946076] [<802a9738>] 0x802a9738
[130930.949664] [<80092d88>] 0x80092d88
[130930.953285] [<802a95c4>] 0x802a95c4
[130930.956874] [<800b36fc>] 0x800b36fc
[130930.960470] [<8006d690>] 0x8006d690
[130930.964089] [<800b3944>] 0x800b3944
[130930.967679] [<800aa878>] 0x800aa878
[130930.971286] [<800823cc>] 0x800823cc
[130930.974905] [<800ae4bc>] 0x800ae4bc
[130930.978492] [<800a9fbc>] 0x800a9fbc
[130930.982084] [<801e3eb8>] 0x801e3eb8
[130930.985688] [<800660b8>] 0x800660b8
[130930.989273]
[130930.990860] ---[ end trace a181cb91ce161968 ]---
[130930.995595] eth0: tx timeout
[301667.859350] eth0: tx timeout
[397771.241942] eth0: tx timeout
....
[794204.433351] eth0: tx timeout

It's clear for me that removing pdata->use_flow_control = 1; is improving a lot the situation but it looks like there is still something wrong (either it's a hardware bug or a software bug).
The remaining reasons that are causing the tx timeout are outside of what I can debug on my own.
While current situation is way better than before (having to connect to router every couple of hours daily to do an ifup was totaly not acceptable, now I have to do that 1 time every couple of days) it's still not ideal.

There is basicaly nothing else I can add to this problem.

@openwrt-bot
Copy link
Author

klukonin:

I can confirm this issue for 19.07 branch and current master with UniFi AC Mesh device.|

With such error in dmesg.

WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x26c/0x274

I see
eth0: tx timeout

And after some seconds router stops responding any network packet.

@openwrt-bot
Copy link
Author

klukonin:

UPD. This issue is related to 100BASE-TX ethernet mode and heavy tcp multistream traffic ONLY.

1000baseT is working smooth without any errors.

@openwrt-bot
Copy link
Author

klukonin:

Here is my log with this issue.

Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.074088] ------------[ cut here ]------------
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.078899] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0x164/0x274
Mon Dec 7 03:42:31 2020 kern.info kernel: [ 1151.087467] NETDEV WATCHDOG: eth0 (ag71xx): transmit queue 0 timed out
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.094211] Modules linked in: ath9k ath9k_common ath9k_hw ath10k_pci ath10k_core ath mac80211 cfg80211 compat ohci_platform ohci_hcd ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.113015] CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.209 #0
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.119132] Stack : 80520000 804dbbbc 00000000 00000000 804b1edc 87c07e04 80505c8c 805058a7
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.127780] 804adfb0 00000000 80653670 00000140 80510eb4 00000001 87c07db8 214d684b
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.136425] 00000000 00000000 80650000 00004ed8 00000000 00000000 00000007 00000000
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.145071] 000000ef f07a2f11 000000ee 00000000 00000000 00000009 00000000 804d5f34
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.153709] 80325350 00000140 80510eb4 80510e74 00000002 80272984 00000000 80650000
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.162346] ...
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.164880] Call Trace:
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.167426] [<8006aa9c>] show_stack+0x58/0x100
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.172032] [<800852a0>] __warn+0xe4/0x13c
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.176273] [<80085328>] warn_slowpath_fmt+0x30/0x3c
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.181410] [<80325350>] dev_watchdog+0x164/0x274
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.186302] [<800bed0c>] call_timer_fn.isra.26+0x24/0x84
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.191788] [<800beee0>] run_timer_softirq+0x174/0x1ec
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.197125] [<80417448>] __do_softirq+0xe8/0x2bc
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.201909] [<80221890>] plat_irq_dispatch+0xc0/0x120
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.207139] [<800658d8>] handle_int+0x138/0x144
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.211824] [<800673cc>] r4k_wait_irqoff+0x18/0x24
Mon Dec 7 03:42:31 2020 kern.warn kernel: [ 1151.216781] ---[ end trace ce3288942b2c25d6 ]---
Mon Dec 7 03:42:31 2020 kern.info kernel: [ 1151.221548] eth0: tx timeout
Mon Dec 7 03:42:41 2020 kern.info kernel: [ 1161.074089] eth0: tx timeout
Mon Dec 7 03:42:51 2020 kern.info kernel: [ 1171.074089] eth0: tx timeout
Mon Dec 7 03:44:01 2020 kern.info kernel: [ 1241.074087] eth0: tx timeout
Mon Dec 7 03:44:16 2020 kern.info kernel: [ 1256.114092] eth0: tx timeout
client_loop: send disconnect: Broken pipe

@aparcar aparcar added the kernel pull request/issue with Linux kernel related changes label Feb 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flyspray kernel pull request/issue with Linux kernel related changes
Projects
None yet
Development

No branches or pull requests

2 participants