FS#2943 - Kernel panic with dev_net(skb->dev)==NULL in nf_xfrm_me_harder #7717

Open
openwrt-bot opened this issue Mar 29, 2020 · 5 comments
Labels
flyspray kernel pull request/issue with Linux kernel related changes


@openwrt-bot

tete1030:

Hi, I have some questions about debugging random reboots caused by a kernel panic.

I get random reboots when running software that inserts prerouting iptables rules to redirect traffic. My device is an x86_64 router. I compiled my OpenWrt release myself from a forked OpenWrt source tree at https://github.com/coolsnowwolf/lede . Its kernel version is 4.19.108.

The software that triggers this problem is called OpenClash. It acts as a transparent proxy: it inserts prerouting rules that redirect all TCP traffic from computers on the LAN to its own listening port and sends that traffic through a proxy.

Whenever this software is running, I get random reboots 1-2 times per day. The saved log files showed nothing abnormal, because the crash happened in the kernel and the machine rebooted quickly. So I had to compile the netconsole kernel module to capture the dmesg output at the moment of the crash. You can see the crash logs in crash_dmesg.txt.

The crash happens in the nf_xfrm_me_harder function. Disassembling the crashing code gives crash_code.png (the highlighted line is the faulting instruction). The crash log reports an illegal memory access at 000000000000113c, and the disassembly shows that the kernel was accessing [rax+0x113c], so I think the problem is rax==0, which should not happen.
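The address arithmetic behind that conclusion can be sketched in user-space C. This is only an illustration: `struct mock_net` and `field_address` are hypothetical names, and the padding is chosen so the field lands at offset 0x113c as seen in the disassembly; the layout of the real `struct net` depends on the kernel configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct net: the padding places the field
 * at offset 0x113c, the displacement seen in the faulting instruction. */
struct mock_net {
    char pad[0x113c];
    unsigned int policy_count_out;
};

/* Compile-time check that the mock layout matches the assumed offset. */
static_assert(offsetof(struct mock_net, policy_count_out) == 0x113c,
              "mock field must sit at offset 0x113c");

/* Address the CPU dereferences when reading base->policy_count_out:
 * the base pointer (rax in the crash) plus the field offset. */
static uintptr_t field_address(uintptr_t base)
{
    return base + offsetof(struct mock_net, policy_count_out);
}
```

With a base of 0, `field_address(0)` evaluates to 0x113c, which is the faulting address in the crash log. That is consistent with the base pointer being NULL, although it does not by itself prove which pointer was NULL.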

The source code that causes the crash actually comes from a patch that is also included in trunk OpenWrt: https://github.com/openwrt/openwrt/blob/master/target/linux/generic/pending-4.19/616-net_optimize_xfrm_calls.patch

With the patch applied, the function nf_xfrm_me_harder looks like this:

int nf_xfrm_me_harder(struct net *net, struct sk_buff *skb, unsigned int family)
{
    struct flowi fl;
    unsigned int hh_len;
    struct dst_entry *dst;
    struct sock *sk = skb->sk;
    int err;

    if (skb->dev && !dev_net(skb->dev)->xfrm.policy_count[XFRM_POLICY_OUT]) // <------- crash
        return 0;

    err = xfrm_decode_session(skb, &fl, family);
    if (err < 0)
        return err;

This means dev_net(skb->dev) is sometimes `NULL`.

I'm not familiar with the networking internals of the Linux kernel, so I'm not sure how to find out why it is NULL. Can this problem be safely worked around by checking the pointer's validity like this?

if (skb->dev && dev_net(skb->dev) && !dev_net(skb->dev)->xfrm.policy_count[XFRM_POLICY_OUT])

If not, can anyone give me some advice on how to debug this problem? I understand that it may be difficult for you developers to figure out what is happening merely from my description, especially since I'm not using trunk OpenWrt, so I would love to dig into it myself.

You can find other information about my router in dmesg.txt.
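For illustration, the guarded condition asked about above can be modelled in user space roughly as follows. Everything here is a hypothetical mock (`mock_net`, `mock_dev`, `mock_skb`, `mock_dev_net`, `can_skip_xfrm`), not kernel code, and it only shows the control flow of the extra check. Whether such a check is the right fix in the kernel is a separate question that this sketch does not answer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mock of the policy direction indices. */
enum { MOCK_POLICY_IN, MOCK_POLICY_OUT, MOCK_POLICY_MAX };

/* Hypothetical mocks of struct net, struct net_device and struct sk_buff. */
struct mock_net { unsigned int policy_count[MOCK_POLICY_MAX]; };
struct mock_dev { struct mock_net *net; };
struct mock_skb { struct mock_dev *dev; };

/* Mock of dev_net(): returns the namespace pointer stored in the device. */
static struct mock_net *mock_dev_net(const struct mock_dev *dev)
{
    return dev->net;
}

/* True only when the device and its namespace are both non-NULL and no
 * outbound policies exist. In every other case the caller would fall
 * through to the full lookup instead of dereferencing a NULL pointer. */
static bool can_skip_xfrm(const struct mock_skb *skb)
{
    struct mock_net *net;

    if (!skb->dev)
        return false;
    net = mock_dev_net(skb->dev);
    if (!net)
        return false;
    return net->policy_count[MOCK_POLICY_OUT] == 0;
}
```

The design point is that a NULL pointer at either level disables only the early return (the optimization), so the slower path is taken rather than crashing.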

Thanks a lot.

@openwrt-bot

tete1030:

openclash.png shows iptables rules added by OpenClash.

@openwrt-bot

tete1030:

Another crash, with the same backtrace from `tcp_xmit_retransmit_queue` to `nf_xfrm_me_harder`.

@openwrt-bot

ynezz:

Proposed fix https://patchwork.ozlabs.org/patch/1264576/

@openwrt-bot

tete1030:

Hi,

Thank you for your help. I'll give it a try and tell you my result.

I have found another similar ticket, which was also caused by this patch (an earlier version of it) and also occurred when redirect iptables rules were used heavily:
https://dev.archive.openwrt.org/ticket/18462.html

And here is its fix:
https://dev.archive.openwrt.org/changeset/43567.html

This makes me curious: is it normal for `skb->dev` or `dev_net(skb->dev)` to be NULL? Or is this a bug somewhere deeper in other parts of the kernel?

@openwrt-bot

tete1030:

Hi Petr,

I can confirm that it has fixed my problem. No performance degradation has been observed.

Thanks.

@aparcar aparcar added the kernel pull request/issue with Linux kernel related changes label Feb 22, 2022