OpenWrt/LEDE Project

  • Status Unconfirmed
  • Percent Complete
    0%
  • Task Type Bug Report
  • Category Kernel
  • Assigned To No-one
  • Operating System All
  • Severity Medium
  • Priority Very Low
  • Reported Version openwrt-18.06
  • Due in Version Undecided
  • Due Date Undecided
  • Private
Attached to Project: OpenWrt/LEDE Project
Opened by Serbanescu Daniel - 27.04.2020

FS#3054 - TL-WR841ND v5 (AR7240 rev 2) eth0: tx timeout

Device: TL-WR841ND v5 (SoC AR7240 rev 2)
Firmware: OpenWrt 18.06-SNAPSHOT r8004-1f0679f54d / LuCI openwrt-18.06 branch (git-20.115.52343-1e59ed1)

I use this router as wireless client (it’s only client, it’s not repeater). There are 2 clients connected on the 841ND v5 lan switch (the clients are plugged before the router is powerup), nothing connected on the wan port. High sustained traffic (main router ↔ 841ND v5 as wireless client ↔ 1 client connected to the 841ND v5 lan switch) is triggering the following on the 841ND v5:

Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.693553] ------------[ cut here ]------------
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.698221] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 0x802a9738
Sun Apr 26 11:48:59 2020 kern.info kernel: [25447.705315] NETDEV WATCHDOG: eth0 (ag71xx): transmit queue 0 timed out
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.711854] Modules linked in: ath9k ath9k_common ath9k_hw ath nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables gpio_button_hotplug
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.764877] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.219 #0
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.770718] Stack : 80447522 00000032 00000000 00000001 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.779167]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.787617]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.796095]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.804546]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.812979]         ...
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.815465] Call Trace:
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.817785] [<8006ab8c>] 0x8006ab8c
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.821283] [<8006ab8c>] 0x8006ab8c
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.824808] [<8007f9a4>] 0x8007f9a4
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.828310] [<802a9738>] 0x802a9738
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.831824] [<8007f9dc>] 0x8007f9dc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.835360] [<802a9738>] 0x802a9738
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.838862] [<80092d00>] 0x80092d00
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.842368] [<802a95c4>] 0x802a95c4
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.845880] [<800b36fc>] 0x800b36fc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.849387] [<8006d690>] 0x8006d690
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.852894] [<800b3944>] 0x800b3944
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.856407] [<800aa878>] 0x800aa878
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.859918] [<800823cc>] 0x800823cc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.863417] [<800ae4bc>] 0x800ae4bc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.866934] [<800a9fbc>] 0x800a9fbc
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.870439] [<801e3eb8>] 0x801e3eb8
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.873947] [<800660b8>] 0x800660b8
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.877438]
Sun Apr 26 11:48:59 2020 kern.warn kernel: [25447.878936] ---[ end trace 3c20d976bd0e1077 ]---
Sun Apr 26 11:48:59 2020 kern.info kernel: [25447.883580] eth0: tx timeout

after this the router lan switch is unusable. After the crash there is no useful traffic going on the lan switch. In some cases after some minutes (10+ minutes) it recovers but usualy the switch just remain stuck, unpluging and repluging the eth cable in the lan switch or reseting/powering off has no effect (the switch remain unusable; client is unable to get an ip from dhcp and manual setting one in the right class changes nothing, you can’t ping the 841ND v5 or the main router, clearly the 841ND v5 switch is stuck) The 841ND V5 can be accessed from wireless, internet addresses can be pinged from the router (the ideea is that only the eth switch is affected). Only way to fix it is to issue

ifup lan

from wireless. But it’s just matter of time until the switch will crash again. After second crash only

eth0: tx timeout

is visible in dmesg/sys log.
For example a speedtest on https://ipv6-test.com/speedtest/ can trigger it. Downloading torrents at high speed can trigger it. It’s not an 100% success rate to trigger it but it will happen sooner or later. In 12 hours I can trigger it at least 3 times by just doing my normal stuff on the internet.

Further investigation pointed to this bug report: https://bugs.openwrt.org/index.php?do=details&task_id=106 After that we had the following commit: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=83997146e76d4097e30facf6ad89e5fa3bd7c65b that disabled flow control for AR934x (not for ar7240).
Flow control was enabled for multiple SoCs in the following commit: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=26b8db253745b0591bfffa21f02323428f11a88f Taking into consideration the age of the commit that enabled flow control all OpenWRT 18.6.x are affected.
I decided to remove:

pdata->use_flow_control = 1;

from the following section of the file /openwrt/target/linux/ar71xx/files/arch/mips/ath79/dev-eth.c

	case ATH79_SOC_AR7240:
		if (id == 0) {
			pdata->reset_bit |= AR71XX_RESET_GE0_PHY;
			pdata->set_speed = ath79_set_speed_dummy;

			pdata->phy_mask = BIT(4);
		} else {
			pdata->reset_bit |= AR71XX_RESET_GE1_PHY;
			pdata->set_speed = ath79_set_speed_dummy;

			pdata->speed = SPEED_1000;
			pdata->duplex = DUPLEX_FULL;
			pdata->switch_data = &ath79_switch_data;
			pdata->use_flow_control = 1;

			ath79_switch_data.phy_poll_mask |= BIT(4);
		}
		pdata->has_gbit = 1;
		pdata->is_ar724x = 1;
		if (ath79_soc == ATH79_SOC_AR7240)
			pdata->is_ar7240 = 1;
		break;

and compile OpenWRT 18.06 for my TL-WR841ND v5 and flash it. Atm with the firmware with pdata→use_flow_control = 1; removed from the above section I have 23 hours uptime and the eth0: tx timeout issue looks to be gone, I’m no longer capable to trigger it no matter what I try.

By looking at the source code in OpenWRT 19.07 ar71xx flow control is enable there too so OpenWRT 19.07 ar71xx is probably affected also. I don’t know where to look for ath79.

Serbanescu Daniel commented on 27.04.2020 20:13

the last person dear to me died tonight.
taking into account the situation I will not be able to check for some time (I'm sorry but I can't specify how long, this event is marked with the highest risk for me) what is happening in this bug report.
I hope you understand the situation.

Serbanescu Daniel commented on 28.04.2020 19:36

Even without

pdata->use_flow_control = 1;


I still got the tx timeout today but it didn't look to affect the eth switch normal operation, in fact I didn't notice it until now when I decided to check the 841ND v5 logs:

Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.480324] ------------[ cut here ]------------
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.485116] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 0x802a9738
Tue Apr 28 14:43:02 2020 kern.info kernel: [152891.492355] NETDEV WATCHDOG: eth0 (ag71xx): transmit queue 0 timed out
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.498981] Modules linked in: ath9k ath9k_common ath9k_hw ath nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables gpio_button_hotplug
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.552102] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.219 #0
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.558037] Stack : 80447522 00000032 00000000 00000001 00000000 00000000 00000000 00000000
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.566578]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.575116]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.583646]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.592192]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.600730]         ...
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.603284] Call Trace:
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.605720] [<8006ab8c>] 0x8006ab8c
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.609312] [<8006ab8c>] 0x8006ab8c
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.612925] [<8007f9a4>] 0x8007f9a4
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.616512] [<802a9738>] 0x802a9738
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.620110] [<8007f9dc>] 0x8007f9dc
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.623729] [<802a9738>] 0x802a9738
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.627317] [<80092d88>] 0x80092d88
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.630930] [<802a95c4>] 0x802a95c4
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.634517] [<800b36fc>] 0x800b36fc
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.638108] [<8006d690>] 0x8006d690
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.641717] [<800b3944>] 0x800b3944
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.645304] [<800aa878>] 0x800aa878
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.648904] [<800823cc>] 0x800823cc
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.652507] [<800ae4bc>] 0x800ae4bc
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.656092] [<800a9fbc>] 0x800a9fbc
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.659683] [<801e3eb8>] 0x801e3eb8
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.663288] [<800660b8>] 0x800660b8
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.666865]
Tue Apr 28 14:43:02 2020 kern.warn kernel: [152891.668451] ---[ end trace 059bf2a9bcb83f00 ]---
Tue Apr 28 14:43:02 2020 kern.info kernel: [152891.673185] eth0: tx timeout

Took considerable longer to trigger it (more than 24h compared to 4 min - 3h) and it looks like it recovered without affecting the switch normal operation.

I will just keep it under observation for more and see if the switch end up stuck again or not.

So far after 50 hours of uptime I triggered only 1 time and it didn't look to affect the lan switch normal operation.

L.E. #1:
2 more eth0: tx timeout but none of them affected the normal switch operation (only when I checked the router log I noticed it happened):

Thu Apr 30 14:52:59 2020 kern.info kernel: [326285.474663] eth0: tx timeout
Thu Apr 30 15:27:19 2020 kern.info kernel: [328345.438975] eth0: tx timeout

uptime now 91 hours.

L.E. #2:
uptime: 14d 17h 56m 42s
5d 20h 7m 47s ago switch hanged without an tx timeout error. I had to ifup lan from wireless to fix it.

tx timeout history:

[130930.802621] ------------[ cut here ]------------
[130930.807411] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 0x802a9738
[130930.814604] NETDEV WATCHDOG: eth0 (ag71xx): transmit queue 0 timed out
[130930.821225] Modules linked in: ath9k ath9k_common ath9k_hw ath nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables gpio_button_hotplug
[130930.874415] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.219 #0
[130930.880350] Stack : 80447522 00000032 00000000 00000001 00000000 00000000 00000000 00000000
[130930.888892]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[130930.897429]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[130930.905960]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[130930.914497]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[130930.923027]         ...
[130930.925580] Call Trace:
[130930.928020] [<8006ab8c>] 0x8006ab8c
[130930.931608] [<8006ab8c>] 0x8006ab8c
[130930.935228] [<8007f9a4>] 0x8007f9a4
[130930.938816] [<802a9738>] 0x802a9738
[130930.942427] [<8007f9dc>] 0x8007f9dc
[130930.946076] [<802a9738>] 0x802a9738
[130930.949664] [<80092d88>] 0x80092d88
[130930.953285] [<802a95c4>] 0x802a95c4
[130930.956874] [<800b36fc>] 0x800b36fc
[130930.960470] [<8006d690>] 0x8006d690
[130930.964089] [<800b3944>] 0x800b3944
[130930.967679] [<800aa878>] 0x800aa878
[130930.971286] [<800823cc>] 0x800823cc
[130930.974905] [<800ae4bc>] 0x800ae4bc
[130930.978492] [<800a9fbc>] 0x800a9fbc
[130930.982084] [<801e3eb8>] 0x801e3eb8
[130930.985688] [<800660b8>] 0x800660b8
[130930.989273] 
[130930.990860] ---[ end trace a181cb91ce161968 ]---
[130930.995595] eth0: tx timeout
[301667.859350] eth0: tx timeout
[397771.241942] eth0: tx timeout
....
[794204.433351] eth0: tx timeout

It's clear for me that removing pdata→use_flow_control = 1; is improving a lot the situation but it looks like there is still something wrong (either it's a hardware bug or a software bug).
The remaining reasons that are causing the tx timeout are outside of what I can debug on my own.
While current situation is way better than before (having to connect to router every couple of hours daily to do an ifup was totaly not acceptable, now I have to do that 1 time every couple of days) it's still not ideal.

There is basicaly nothing else I can add to this problem.

Loading...

Available keyboard shortcuts

Tasklist

Task Details

Task Editing