FS#9 - Kernel panic with SQM scripts #7087

Closed
openwrt-bot opened this issue Jun 13, 2016 · 23 comments

@openwrt-bot

Ansuel:

When I select any script other than simplest.qos in the SQM list, the log fills up with kernel panic messages.
If you need more information, tell me.
To reproduce: I have a WDR3600.
Just install the sqm package, select the script, and watch the log.
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.147796] ------------[ cut here ]------------
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.147817] WARNING: CPU: 0 PID: 20621 at net/sched/sch_hfsc.c:1426 0x871e9e6c()
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.147825] Modules linked in: ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_NETMAP xt_LOG xt_HL xt_DSC
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148143] CPU: 0 PID: 20621 Comm: dropbear Tainted: G W 4.4.12 #1
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148151] Stack : 803dc584 00000000 00000001 80430000 8782a080 80426f63 803bdcb0 0000508d
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148151] 80493790 873e9f9 873e9c68 0000000a 00000100 800a6854 803c32bc 80420000
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148151] 00000003 873e9f9 803c16c8 8500fad4 00000100 800a4820 00000000 00000000
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148151] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148151] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148151] ...
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148240] Call Trace:
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148255] [<80071a94>] show_stack+0x50/0x84
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148271] [<80081628>] warn_slowpath_common+0xa0/0xd0
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148286] [<800816dc>] warn_slowpath_null+0x18/0x24
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148315] [<871e9e6c>] 0x871e9e6c
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148326]
Mon Jun 13 21:13:07 2016 kern.warn kernel: [30780.148335] ---[ end trace ce6ca764afdc3e14 ]---

This is the log entry; I get a lot of these.

@openwrt-bot

neheb:

I think I got a similar issue on my ramips device when it switched from the 3.18 kernel to 4.4. You should try using sched_cake and see if it causes the same problem.

@openwrt-bot

trismo:

With kernel 4.4.13 on ar71xx/WDR4300 (same hardware), I didn't have any problem.
REVISION='r707' (https://git.lede-project.org/?p=source.git;a=commit;h=3ee6c17cd14ec1fed0b0491542c499c03fc6d211)

@openwrt-bot

Ansuel:

OK, today I updated the router and I still have the same problem... any ideas?

@openwrt-bot

mirlang:

I can reproduce it on my RSPro (ar71xx) and on my WDR4900 (mpc85xx). It just needs a certain amount of UDP traffic (I'm not sure whether it depends on UDP or just on small packets)... There are lots of warnings in the log, and after some time the router becomes inaccessible, then stops forwarding, and sometimes even reboots.

It's definitely related to HFSC (on linux-4.4)... not sure where to report this bug :(

@openwrt-bot

Ansuel:

Exactly, me too... so I reported it right at the source of the project.
If you just set the script, nothing happens, but as soon as there is Internet traffic the log fills up with these messages.

@openwrt-bot

neheb:

This is a kernel problem which has been reported upstream: https://bugzilla.kernel.org/show_bug.cgi?id=109581

If you want to keep using SQM, switch to the cake shaper (requires kmod-sched-cake). It replaces HFSC as well as the other ones (except fq_codel). No crashes here.

Edit: you could also try disabling TSO on your Ethernet interfaces. It might work. Requires ethtool.

@openwrt-bot

Ansuel:

For me, TSO is already disabled.
I will try the cake shaper... what is the difference between them?

@openwrt-bot

Ansuel:

OK, I have now tried the cake shaper and installed the extra experimental SQM scripts...
I'm now using the cake shaper with the test triple WAN script and I don't get any errors at all.
I get an A for bufferbloat, and I think it's working because I was downloading while I was running the test.

So how do we alert the devs that they need to make cake the default, because the others are broken?

@openwrt-bot

neheb:

Cake is still out of tree, and you need to install it separately from sqm. It won't be the default soon.

The difference between cake and the default SQM setup is that cake has lower CPU usage.

@openwrt-bot

Ansuel:

If it's better than the default one, why isn't it included?

@openwrt-bot

moeller0:

Dear All,

Please let me try to answer a few of the questions above:

Why is cake not the default?
Cake is not per se "better" than HTB+fq_codel, and it is still under (more or less) active development, so it is certainly not yet ready to become the default. simple.qos, with its combination of HTB and fq_codel, is still the recommended default, so unless you want to participate in active debugging and development, please stick to simple.qos or simplest.qos.

Why is cake not included?
Cake just became available as an easy to install kernel module in LEDE to allow wider testing. But since it is not (yet) recommended as default it also is not installed by default (people might be unhappy if sqm-scripts would install unnecessary modules wasting their space).

Does cake use less CPU than HTB+fq_codel?
Some time last year, tests with an earlier version of cake indicated that in a CPU-limited situation cake might allow a higher overall shaper bandwidth than the default HTB+fq_codel combination. More recent tests are less conclusive. In particular, they showed that HTB behaves differently from HFSC and cake when CPU-limited: HTB keeps the added latency low while sacrificing more bandwidth (sometimes considerably more), whereas the other two shapers sacrifice less bandwidth but also increase the latency under load (that is, they show more bufferbloat).

What about sqm-scripts-extra scripts?
The sqm-scripts-extra scripts are really just for wider testing; please do not rely on them being available for long.

[OT] sqm-scripts-extra: What is the difference between the LAN and WAN variants?
Cake promises much better isolation of internal host IPs from each other than the other qdiscs. But to implement per-internal-host-IP fairness, cake needs to see the internal IPs. In the typical home situation, sqm gets instantiated on the WAN interface, which typically also performs NAT for IPv4. Cake will only be able to see one internal address if instantiated there, so per-internal-IP isolation degrades into the default per-flow isolation. To allow testing whether cake's two relevant isolation options (triple and dual) actually work in the real world, the LAN scripts are prepared to be instantiated on the internal LAN interfaces of a home router, since on the LAN ports the internal IPs are still visible. Please note that the bridged WLAN interfaces will typically not be covered by the shaping, which makes the LAN variant scripts pure testing devices rather than generally recommended solutions. The ideal test would be to hook up another switch/dumb AP behind the shaped LAN port, mix traffic from different hosts, and see whether, for example, heavy BitTorrent use still badly affects the connections of other internal hosts. If anybody actually tests this, please report any results as issues under https://github.com/tohojo/sqm-scripts. Thanks in advance.
[/OT]
Best Regards
M.
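
The NAT effect described in the [OT] section above can be illustrated with a small idealized model. This is only a sketch in Python and not how cake is implemented: the function name `shares_per_host`, the example addresses, and the assumption of perfectly even splitting are all made up for illustration.

```python
from collections import Counter


def shares_per_host(flows, behind_nat):
    """Idealized bandwidth share per internal host (hypothetical model).

    flows: one internal source IP per active flow.
    behind_nat=False: the shaper sees internal IPs and splits capacity
        evenly between hosts (per-host fairness).
    behind_nat=True: every flow carries the same source address, so only
        per-flow fairness is left and a host's share grows with its flow count.
    """
    flows_per_host = Counter(flows)
    if behind_nat:
        return {host: n / len(flows) for host, n in flows_per_host.items()}
    return {host: 1 / len(flows_per_host) for host in flows_per_host}


# One host running 9 torrent flows, another host with a single flow.
flows = ["192.168.1.10"] * 9 + ["192.168.1.20"]
print(shares_per_host(flows, behind_nat=False))  # shaper on the LAN side
print(shares_per_host(flows, behind_nat=True))   # shaper on the WAN side, after NAT
```

In this model the single-flow host keeps half of the link while the shaper can see internal IPs, but only a tenth of it once NAT hides them, which is the degradation to per-flow isolation mentioned above.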

@openwrt-bot

Ansuel:

Thanks for the explanation. Currently I'm using the triple WAN script with cake.
My connection is PPPoE with ATM overhead, so is the way I set the SQM settings wrong?

And you didn't explain the WAN variants. Are they the same?

@openwrt-bot

diizzyy:

Also affects qos-scripts on trunk r1242 (ramips, MT7621, DIR-860L B1)

[ 3034.313000] ------------[ cut here ]------------
[ 3034.322000] WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1426 0x86921ea0()
[ 3034.336000] Modules linked in: ifb qcserial pppoe ppp_async option iptable_nat cdc_mbim usb_wwan sierra_net sierra rndis_host qmi_wwan pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE huawei_cdc_ncm cdc_subset cdc_ncm cdc_ether cdc_eem xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY usbserial usbnet usblp slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt cdc_wdm act_connmark nf_conntrack act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress mt7603e mt76x2e mt76 mac80211 cfg80211 compat ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables nfsd nfsv3 nfs tun loop vfat fat lockd sunrpc grace nls_utf8 nls_iso8859_15 nls_iso8859_1 nls_cp437 usb_storage leds_gpio xhci_mtk xhci_plat_hcd xhci_pci xhci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 jbd2 mbcache exfat usbcore nls_base usb_common mii crypto_hash [last unloaded: ifb]
[ 3034.573000] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.15 #5
[ 3034.585000] Stack : 00000000 00000000 804b6862 00000033 00000000 00000000 80460000 804d0000
[ 3034.585000] 8783bf10 8045dc83 803db648 00000003 00000000 804b367c 86d9dc68 86b6a480
[ 3034.585000] 00000008 8006349c 80460000 804d0000 804621d8 804621dc 803dff70 87865c04
[ 3034.585000] 00000003 80061228 86d9dc68 86b6a480 00000008 00000000 00000000 00865c04
[ 3034.585000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 3034.585000] ...
[ 3034.656000] Call Trace:
[ 3034.661000] [<800165c8>] show_stack+0x50/0x84
[ 3034.669000] [<801b5200>] dump_stack+0x84/0xbc
[ 3034.678000] [<8002be00>] warn_slowpath_common+0xa0/0xd0
[ 3034.688000] [<8002beb4>] warn_slowpath_null+0x18/0x24
[ 3034.698000] [<86921ea0>] 0x86921ea0

@openwrt-bot

diizzyy:

Doesn't occur on trunk r1122 (ar71xx, AR7242, Mikrotik RB750GL) running qos-scripts.

@openwrt-bot

mirlang:

Nack, diizzyy: I can still kill my router (meaning it stops forwarding traffic) on 4.4.19 just by downloading a well-seeded torrent, and the slowpath warnings are still there... It doesn't happen with HTB instead of HFSC.

@openwrt-bot

diizzyy:

I didn't say it was fixed; it just seems to occur only on certain hardware/SoCs. Also, please state your trunk version and hardware.

@openwrt-bot

dtaht:

I have never been huge on hfsc. HTB is much better tested, as is "cake".

@openwrt-bot

eTomm:

Hello, I posted a crash in FS#277. I was using fq_codel with nxt_routed_hfsc.qos. Unfortunately, I'm not able to use cake or simple.qos because they cut my bandwidth from 20 to 2 Mbit/s. I could not find any configuration that avoids this, apart from the one above.

Obviously, nxt_routed_hfsc.qos crashes on the Linksys EA8500. Same error here, and then the router stops accepting traffic from the WAN and LAN interfaces.

@openwrt-bot

moeller0:

@tommaso Ercole HTB should be similar enough to hfsc that a drop from 20 to 2 might indicate some other bug in sqm-scripts. Maybe you could help me debug this? Potentially through the sqm-scripts github site?

@openwrt-bot

eTomm:

If it is not difficult... I think my wife gave me an ultimatum for my "networking" tests

@openwrt-bot

hnyman:

One more me-too report of HFSC crashes. I tested a new R7800 with different qdiscs, and HFSC seems to cause problems.

Netgear R7800, IPQ8065 SoC (ipq806x platform in LEDE).
LEDE Reboot r2154, kernel 4.4.30

There are a few slightly different variations of the crash, but all are at net/sched/sch_hfsc.c:1426 hfsc_dequeue+0x188/0x568

Example below:

[67086.803277] ------------[ cut here ]------------
[67086.806968] WARNING: CPU: 0 PID: 3 at net/sched/sch_hfsc.c:1426 hfsc_dequeue+0x188/0x568 sch_hfsc
[67086.811517] Modules linked in: pppoe ppp_async iptable_nat ip6table_nat pptp pppox ppp_mppe ppp_generic nf_nat_ipv6 nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_esp xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY usbserial slhc nf_reject_ipv4 nf_nat_rtsp nf_nat_redirect nf_nat_masquerade_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtsp nf_conntrack_rtcache iptable_raw iptable_mangle iptable_filter ipt_ah ipt_ECN ip_tables crc_ccitt sch_cake em_cmp sch_teql em_nbyte sch_htb sch_tbf sch_dsmark sch_pie sch_gred em_meta cls_basic act_ipt sch_prio em_text sch_codel sch_sfq act_police sch_fq sch_red act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ath10k_pci ath10k_core ath mac80211 cfg80211 compat ledtrig_usbport xt_set ip_set_list_set ip_set_hash_netiface ip_set_hash_netport ip_set_hash_netnet ip_set_hash_net ip_set_hash_netportnet ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink ip6t_NPT ip6t_MASQUERADE nf_nat_masquerade_ipv6 nf_nat nf_conntrack ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables msdos ip_gre gre ifb sit tunnel4 ip_tunnel tun vfat fat ntfs hfsplus cifs nls_utf8 nls_iso8859_15 nls_iso8859_1 nls_cp850 nls_cp437 nls_cp1250 sha256_generic sha1_generic md5 md4 hmac ecb des_generic usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom dwc3_of_simple ohci_platform ohci_hcd phy_qcom_dwc3 ahci ehci_platform ehci_hcd sd_mod ahci_platform libahci_platform libahci libata scsi_mod gpio_button_hotplug ext4 jbd2 mbcache usbcore nls_base usb_common cryptomgr aead crypto_null crc32c_generic crypto_hash
[67086.999791] CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: G W 4.4.30 #0
[67087.000143] Hardware name: Qualcomm (Flattened Device Tree)
[67087.007284] [] (unwind_backtrace) from [] (show_stack+0x14/0x20)
[67087.012830] [] (show_stack) from [] (dump_stack+0x8c/0xa0)
[67087.020816] [] (dump_stack) from [] (warn_slowpath_common+0xa4/0xd0)
[67087.027845] [] (warn_slowpath_common) from [] (warn_slowpath_null+0x1c/0x24)
[67087.036121] [] (warn_slowpath_null) from [] (hfsc_dequeue+0x188/0x568 [sch_hfsc])
[67087.044895] [] (hfsc_dequeue [sch_hfsc]) from [] (__qdisc_run+0xcc/0x1b4)
[67087.053983] [] (__qdisc_run) from [] (net_tx_action+0xf4/0x180)
[67087.062488] [] (net_tx_action) from [] (__do_softirq+0xdc/0x230)
[67087.069948] [] (__do_softirq) from [] (run_ksoftirqd+0x34/0x64)
[67087.077937] [] (run_ksoftirqd) from [] (smpboot_thread_fn+0x190/0x1b8)
[67087.085321] [] (smpboot_thread_fn) from [] (kthread+0xf8/0x100)
[67087.093648] [] (kthread) from [] (ret_from_fork+0x14/0x3c)
[67087.101259] ---[ end trace 829b74d04ace8079 ]---

Ps. Somebody with edit rights might add "HFSC qdisc" to be visible in the bug title.

@openwrt-bot

moeller0:

@tommaso Ah, I have the same issue: my family appreciates me not taking down our internet connection for testing, so you have my sympathies. So if this is too inconvenient, stick to HFSC (but do try cake if possible; while far from perfect, it still has a number of great ideas that make it worth testing).

@hannum Nyman: I concur. HFSC is the root cause, and SQM is only implicated because one of its (non-default) scripts actually sets up an HFSC instance; it might be the messenger, but it is not the cause. Maybe "Kernel panic with HFSC (triggered by SQM scripts)" would be a better name ;)

@openwrt-bot

DoubleQ:

Looks like I have the same problem, but the router seems stable.
I'm not sure how to get other timestamps at the front of each line.
I'm on an early Archer C5/C7 with LEDE:
Linux Archer-Lede 4.4.30 #0 Wed Nov 9 11:17:52 2016 mips GNU/Linux

Found the following in my dmesg.
[54837.883304] ------------[ cut here ]------------
[54837.888030] WARNING: CPU: 0 PID: 3 at net/core/dev.c:4837 net_rx_action+0x138/0x2c8()
[54837.896015] Modules linked in: pppoe ppp_async iptable_nat ath9k pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE ath9k_common xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_id xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_raw iptable_mangle iptable_filter ip_tables crc_ccitt ath9k_hw ath10k_pci ath10k_core ath mac80211 cfg80211 compat ledtrig_usbport ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common
[54837.965308] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.4.30 #0
[54837.971493] Stack : 803ec784 00000000 00000001 80440000 87c28c80 80438e63 803cde2c 00000003
[54837.971493] 804a379c 00000040 00000042 00000102 00000001 800a72f0 803d3490 80430000
[54837.971493] 00000003 00000040 803d189c 87c41d3c 00000001 800a526c 00000000 00000000
[54837.971493] 00000001 801f4b00 00000000 00000000 00000000 00000000 00000000 00000000
[54837.971493] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[54837.971493] ...
[54838.007839] Call Trace:
[54838.010339] [<80071c38>] show_stack+0x50/0x84
[54838.014768] [<800819b8>] warn_slowpath_common+0xa0/0xd0
[54838.020077] [<80081a70>] warn_slowpath_null+0x18/0x24
[54838.025208] [<8027810c>] net_rx_action+0x138/0x2c8
[54838.030085] [<80083f34>] __do_softirq+0x250/0x298
[54838.034861] [<80083fa4>] run_ksoftirqd+0x28/0x60
[54838.039555] [<8009a96c>] smpboot_thread_fn+0x158/0x188
[54838.044776] [<8009839c>] kthread+0xd8/0xec
[54838.048941] [<80060878>] ret_from_kernel_thread+0x14/0x1c
[54838.054416]
[54838.055932] ---[ end trace 53e447d7b63cfb5f ]---
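
To see at a glance which warnings a saved log contains (for example, whether they come from net/sched/sch_hfsc.c:1426 like the earlier reports, or from somewhere else like the net/core/dev.c:4837 trace above), something like the following Python sketch could be run on a PC against a copy of the dmesg output. The function name `count_warnings` and the regular expression are illustrative assumptions based only on the log lines quoted in this thread; other kernel versions may format the header differently.

```python
import re
from collections import Counter

# Matches the "WARNING: CPU: n PID: n at <file>:<line>" header of a kernel warning.
WARNING_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+):(\d+)")


def count_warnings(log_text):
    """Count kernel WARNING headers per (source file, line number)."""
    return Counter(
        (match.group(1), int(match.group(2)))
        for match in WARNING_RE.finditer(log_text)
    )


# Header lines copied from the reports in this thread.
sample = """\
[67086.806968] WARNING: CPU: 0 PID: 3 at net/sched/sch_hfsc.c:1426 hfsc_dequeue+0x188/0x568 sch_hfsc
[ 3034.322000] WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1426 0x86921ea0()
[54837.888030] WARNING: CPU: 0 PID: 3 at net/core/dev.c:4837 net_rx_action+0x138/0x2c8()
"""

for (path, line), hits in count_warnings(sample).most_common():
    print(f"{hits:4d}  {path}:{line}")
```

Grouping by file and line makes it easier to tell HFSC warnings apart from unrelated ones before adding a me-too report.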
