FS#766 - Intermittent SIGSEGV crash of dnsmasq-full #5741

Closed
openwrt-bot opened this issue May 7, 2017 · 8 comments

@openwrt-bot

guidosarducci:

I've noticed the following several times within the last day or so:
[1461327.495159] do_page_fault(): sending SIGSEGV to dnsmasq for invalid read access from 00000000
[1461327.504081] epc = 0040f28d in dnsmasq[400000+2c000]
[1461327.509252] ra = 0040f273 in dnsmasq[400000+2c000]

I'm running the latest LEDE stable, with all updates applied as of 2017-05-05, and have been using DNSSEC for a while:

  • LEDE Reboot 17.01.1 r3316-7eb58cf109
  • D-Link DIR-835 rev. A1
  • dnsmasq-full - 2.76-6

The most recent upgrade in the same time frame was to odhcpd-2017-04-28-9268ca65-1.

After several restarts by procd and subsequent crashes, dnsmasq will be disabled, leaving me without name resolution until I notice and restart manually.

To get a little more info, I rebuilt the stable LEDE and dnsmasq-full with a "-g" CFLAG option. After installing this package, I captured the following crash details:

[1562749.817613] do_page_fault(): sending SIGSEGV to dnsmasq for invalid read access from 00000000
[1562749.826522] epc = 0040f295 in dnsmasq[400000+2c000]
[1562749.831681] ra = 0040f27b in dnsmasq[400000+2c000]

Checking further with gdb yields:
(gdb) info line *0x0040f27b
Line 278 of "forward.c" starts at address 0x40f275 <forward_query+204>
and ends at 0x40f281 <forward_query+216>.

(gdb) info line *0x0040f295
Line 281 of "forward.c" starts at address 0x40f295 <forward_query+236>
and ends at 0x40f29b <forward_query+242>.
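
As an aside on reading the kernel lines above: each `epc`/`ra` line carries the address plus the base and size of the mapping it falls in. The sketch below is only an illustration (the struct and function names are made up, not from any tool) of pulling those fields apart and computing the offset into the image. For the dnsmasq binary the absolute address was passed straight to gdb above; for a shared library the offset is more likely what you would want to look up.

```c
#include <assert.h>
#include <stdio.h>

/* One "epc = ..." or "ra = ..." line from the kernel log (hypothetical type). */
struct fault_addr {
    unsigned long pc;    /* faulting or return address */
    unsigned long base;  /* start of the mapping the address falls in */
    unsigned long size;  /* length of that mapping */
    char image[32];      /* name of the mapped image, e.g. "dnsmasq" */
};

/* Parse e.g. "epc = 0040f295 in dnsmasq[400000+2c000]".
 * Returns 0 on success, -1 if the line does not match or the
 * address lies outside the reported mapping. */
int parse_fault_line(const char *line, struct fault_addr *out)
{
    char reg[8];

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%7s = %lx in %31[^[][%lx+%lx]",
               reg, &out->pc, out->image, &out->base, &out->size) != 5)
        return -1;
    if (out->pc < out->base || out->pc - out->base >= out->size)
        return -1;
    return 0;
}

/* Offset of the address within its image. */
unsigned long fault_offset(const struct fault_addr *fa)
{
    return fa->pc - fa->base;
}
```

For the `0040f295` line above this yields an offset of `0xf295` into the dnsmasq image.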

And the relevant source (forward.c) looks like:
275 blockdata_retrieve(forward->stash, forward->stash_len, (void *)header);
276 plen = forward->stash_len;
277
278 if (find_pseudoheader(header, plen, NULL, &pheader, &is_sign, NULL) && !is_sign)
279 PUTSHORT(SAFE_PKTSZ, pheader);
280
281 if (forward->sentto->addr.sa.sa_family == AF_INET)
282 log_query(F_NOEXTRA | F_DNSSEC | F_IPV4, "retry", (struct all_addr *)&forward->sentto->addr.in.sin_addr, "dnssec");
283 #ifdef HAVE_IPV6
284 else
285 log_query(F_NOEXTRA | F_DNSSEC | F_IPV6, "retry", (struct all_addr

Any similar reports from others? I'll keep monitoring in the meantime but this is difficult to reproduce on demand. It seems to happen more with web browsing.
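
A note on what the excerpt suggests: lines 275-276 already dereference `forward` without faulting, so a read from 00000000 at line 281 is consistent with `forward->sentto` being NULL at that point. The following is only a sketch of the kind of guard that would avoid the fault, using trimmed-down stand-in types (`frec_sketch`, `server_sketch` and `retry_family` are hypothetical names, not dnsmasq's real definitions, and this is not the upstream fix).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the upstream server record. */
struct server_sketch {
    struct sockaddr sa;            /* address of the upstream server */
};

/* Hypothetical stand-in for the forwarding record. */
struct frec_sketch {
    struct server_sketch *sentto;  /* upstream the query was last sent to; may be NULL */
};

/* Returns the address family to use for the "retry" log line, or
 * AF_UNSPEC when no upstream has been recorded yet. */
int retry_family(const struct frec_sketch *forward)
{
    assert(forward != NULL);
    if (forward->sentto == NULL)
        return AF_UNSPEC;          /* nothing to log; avoids the NULL read */
    return forward->sentto->sa.sa_family;
}
```

A caller would skip the retry log line when `AF_UNSPEC` comes back instead of touching `sentto`.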

@openwrt-bot

guidosarducci:

After a little more investigation, I can confirm this is a bug that also exists in the latest lede/master, which uses dnsmasq-2.77test5. It is easily triggered by a common Mozilla DNS query and appears related to using split DNS together with DNSSEC.

A minimal, standalone dnsmasq.conf that is vulnerable:
listen-address=192.168.1.1
port=55553
bind-interfaces
no-daemon
no-hosts
no-resolv
log-queries=extra
server=8.8.8.8
server=/cloudfront.net/50.22.147.234
dnssec
dnssec-check-unsigned
trust-anchor=.,19036,8,2,49AAC11D7B6F6446702E54A1607371607A1A41855200FD2CE1CDDE32F24E8FB5
trust-anchor=.,20326,8,2,E06D44B80B8F1D39A95C0B0D7C65D08458E880409BBC683457104237C7F8EC8D

Removing either of these config lines results in no SIGSEGV:
server=/cloudfront.net/50.22.147.234
dnssec-check-unsigned

The bug can be triggered simply from a DNS client (e.g. a blank Firefox page!):
ubuntu$ nslookup -port=55553 tiles-cloudfront.cdn.mozilla.net 192.168.1.1
;; Question section mismatch: got cloudfront.net/DS/IN
;; connection timed out; no servers could be reached
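
For anyone wanting to trigger this without nslookup, here is a minimal sketch of a client that builds the same kind of plain A/IN query and sends it over UDP. The function names are made up for illustration; it assumes a name without a trailing dot, sets only the RD flag, adds no EDNS0 record, and does not wait for a reply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Encode an A/IN query for `name` into buf. Returns the packet length,
 * or 0 if the name is malformed or the buffer is too small. */
size_t build_a_query(const char *name, uint16_t id, uint8_t *buf, size_t buflen)
{
    assert(name != NULL && buf != NULL);
    size_t namelen = strlen(name);
    /* 12-byte header + encoded name (namelen + 2) + QTYPE and QCLASS */
    if (namelen == 0 || namelen > 253 || buflen < 12 + namelen + 2 + 4)
        return 0;

    memset(buf, 0, 12);
    buf[0] = (uint8_t)(id >> 8);
    buf[1] = (uint8_t)(id & 0xff);
    buf[2] = 0x01;                      /* RD: recursion desired */
    buf[5] = 0x01;                      /* QDCOUNT = 1 */

    size_t pos = 12;
    const char *label = name;
    for (;;) {
        const char *dot = strchr(label, '.');
        size_t len = dot ? (size_t)(dot - label) : strlen(label);
        if (len == 0 || len > 63)
            return 0;                   /* empty or over-long label */
        buf[pos++] = (uint8_t)len;      /* length byte, then the label itself */
        memcpy(buf + pos, label, len);
        pos += len;
        if (dot == NULL)
            break;
        label = dot + 1;
    }
    buf[pos++] = 0;                     /* root label terminates the name */
    buf[pos++] = 0; buf[pos++] = 1;     /* QTYPE  = A  */
    buf[pos++] = 0; buf[pos++] = 1;     /* QCLASS = IN */
    return pos;
}

/* Fire-and-forget UDP send. Returns 0 on success, -1 on error. */
int send_query(const char *server_ip, uint16_t port, const uint8_t *pkt, size_t len)
{
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, server_ip, &dst.sin_addr) != 1)
        return -1;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    ssize_t n = sendto(fd, pkt, len, 0, (const struct sockaddr *)&dst, sizeof dst);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```

Usage would be along the lines of `n = build_a_query("tiles-cloudfront.cdn.mozilla.net", 0x1234, pkt, sizeof pkt)` followed by `send_query("192.168.1.1", 55553, pkt, n)`. The query for that name comes to 50 bytes, which is consistent with the `plen@entry=50` seen in the backtrace.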

I also captured a dnsmasq core file from my router and ran it through gdb:
ubuntu$ ./staging_dir/toolchain-mips_24kc_gcc-5.4.0_musl-1.1.16/bin/mips-openwrt-linux-gdb -d ./build_dir/target-mips_24kc_musl-1.1.16/dnsmasq-full/dnsmasq-2.77test5/src/ -n ./staging_dir/target-mips_24kc_musl-1.1.16/root-ar71xx/usr/sbin/dnsmasq dnsmasq.757.11.1494218146.core
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
...
Reading symbols from ./staging_dir/target-mips_24kc_musl-1.1.16/root-ar71xx/usr/sbin/dnsmasq...done.
[New LWP 757]
...
Core was generated by `dnsmasq -C crash-dnsmasq.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 forward_query (udpfd=<optimized out>, udpaddr=udpaddr@entry=0x7fc1d930,
    dst_addr=<optimized out>, dst_iface=dst_iface@entry=0,
    header=header@entry=0x7c8010, plen=43, plen@entry=50,
    now=now@entry=1494218146, forward=0x77cabd90, ad_reqd=ad_reqd@entry=0,
    do_bit=do_bit@entry=0) at forward.c:281
281 if (forward->sentto->addr.sa.sa_family == AF_INET)
(gdb) bt
#0 forward_query (udpfd=<optimized out>, udpaddr=udpaddr@entry=0x7fc1d930,
    dst_addr=<optimized out>, dst_iface=dst_iface@entry=0,
    header=header@entry=0x7c8010, plen=43, plen@entry=50,
    now=now@entry=1494218146, forward=0x77cabd90, ad_reqd=ad_reqd@entry=0,
    do_bit=do_bit@entry=0) at forward.c:281
#1 0x00410275 in receive_query (listen=listen@entry=0x77cbffe0,
    now=now@entry=1494218146) at forward.c:1443
#2 0x00412825 in check_dns_listeners (now=now@entry=1494218146)
    at dnsmasq.c:1565
#3 0x004047db in main (argc=<optimized out>, argv=<optimized out>)
    at dnsmasq.c:1044
(gdb)

The dnsmasq config file, log file, and client log are attached. I'm not sure I can go any further, so would appreciate the dnsmasq package maintainer taking a look and advising.

Thanks!

@openwrt-bot

None:

I've forwarded your message, including the replication procedure, to the dnsmasq list. I was able to replicate with ease following your instructions; in fact, all I needed to do was add server=/cloudfront.net/50.22.147.234 to my existing config. This makes me think it's a particular type of server that's provoking the issue.

Let's see what happens

Kevin

@openwrt-bot

guidosarducci:

Hehe, very optimistic of you to close this...

I saw the update from Simon Kelley (thank you!) on the Dnsmasq-discuss mailing list and built an updated LEDE dnsmasq-2.77rc1 package to test. (see required patch attached)

The prior minimal test-case passed, but the original production config file now creates a horrible SIGSEGV crash-loop (log attached):
Mon May 8 22:59:46 2017 kern.info kernel: [1738736.539480] do_page_fault(): sending SIGSEGV to dnsmasq for invalid read access from 00000000
Mon May 8 22:59:46 2017 kern.info kernel: [1738736.548375] epc = 0040e79b in dnsmasq[400000+2d000]
Mon May 8 22:59:46 2017 kern.info kernel: [1738736.553564] ra = 0040e773 in dnsmasq[400000+2d000]

Stack trace indicates something to do with logging:
(gdb) core-file dnsmasq.18906.11.1494309586.core
[New LWP 18906]
...
Core was generated by `dnsmasq -C /var/etc/dnsmasq.conf.cfg02411c --no-daemon'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0040e79b in search_servers (now=now@entry=1494309586,
    addrpp=addrpp@entry=0x0, qtype=qtype@entry=32768, qdomain=<optimized out>,
    type=type@entry=0x7fd02c74, domain=domain@entry=0x7fd02c78,
    norebind=norebind@entry=0x0) at forward.c:222
222 log_query(logflags | flags | F_CONFIG | F_FORWARD, qdomain, *addrpp, NULL);
(gdb) bt
#0 0x0040e79b in search_servers (now=now@entry=1494309586,
    addrpp=addrpp@entry=0x0, qtype=qtype@entry=32768, qdomain=<optimized out>,
    type=type@entry=0x7fd02c74, domain=domain@entry=0x7fd02c78,
    norebind=norebind@entry=0x0) at forward.c:222
#1 0x00410759 in reply_query (fd=<optimized out>, family=<optimized out>,
    now=now@entry=1494309586) at forward.c:938
#2 0x004127dd in check_dns_listeners (now=now@entry=1494309586)
    at dnsmasq.c:1560
#3 0x004047db in main (argc=<optimized out>, argv=<optimized out>)
    at dnsmasq.c:1044
(gdb) print logflags
$1 = 32800
(gdb) print flags
$2 = <optimized out>
(gdb) print *qdomain
value has been optimized out
(gdb) print addrpp
$3 = (struct all_addr **) 0x0
(gdb)
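
Reading the session above: `addrpp` is 0x0 on entry, and line 222 evaluates `*addrpp` to build the log call, which would explain the read from 00000000. A sketch of the defensive pattern for such an optional out-parameter follows; `addr_sketch` and `addr_for_log` are hypothetical names used only for illustration, and this is not necessarily how the upstream fix handles it.

```c
#include <stddef.h>

/* Hypothetical stand-in for dnsmasq's struct all_addr. */
struct addr_sketch {
    unsigned char bytes[16];
};

/* Return the address to pass to the logger, or NULL when the caller
 * did not ask for an address (addrpp == NULL) or none was found
 * (*addrpp == NULL). The pointer is only dereferenced when non-NULL. */
const struct addr_sketch *addr_for_log(struct addr_sketch **addrpp)
{
    return addrpp != NULL ? *addrpp : NULL;
}
```

The log call would then take `addr_for_log(addrpp)` in place of `*addrpp`, assuming the logger tolerates a NULL address.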

This turns out to be easy to reproduce. Simply add domain-needed to the prior standalone config file.
Then trigger the crash from a client with:
$ nslookup -port=55553 google.com 192.168.1.1
;; connection timed out; no servers could be reached

I attached all the relevant logs, configs and patches.

@openwrt-bot

dedeckeh:

The bug record has been closed as this is an upstream issue in the dnsmasq project; meaning the issue has to be reported on the dnsmasq mailing list and needs to be fixed by the dnsmasq maintainer. Therefore it makes no sense to keep this bug record open on the Lede project.

@openwrt-bot

guidosarducci:

@Kevin Darbyshire-Bryant:
I wanted to report that I tested the patch Simon created in response to my second bug report, and can confirm this now resolves the issue. Thank you for liaising with the DNSMASQ mailing list, and please pass on this note and my thanks also to Simon Kelley.

Since these crash-loop bugs are fairly serious and difficult to troubleshoot, do you expect someone would be able to back-port the fixes to LEDE-17.01 once Simon releases dnsmasq-2.77 in the near future?

Thanks,
Tony Ambardar


@hans Dedecker:
I was asked to open this ticket by Matthias Schiffer, and rightly so. As a bug present in a //base// package of the LEDE stable release, it should clearly be reported here according to LEDE policy: [[https://lede-project.org/bugs|LEDE Project - Reporting Bugs]]. And until it is fixed in a LEDE package the issue can impact others and should remain open.

Your suggestions that the issue is unrelated to LEDE and doesn't belong here are misleading, unhelpful, and serve to dissuade others from volunteering their time to improve LEDE. I'd like to think that is not your intention. Am I wrong to think so?

@openwrt-bot

None:

No problem. Thanks for doing the hard work with gdb! I've already got a pull request on standby for when Simon tags rc2. We'll see how quickly the release happens, and I guess someone will make a decision on how to get that into LEDE 17.01.

@openwrt-bot

dedeckeh:

@tony Ambardar:
It was by no means intended to discourage people from improving LEDE.
But the major issue here is that the upstream dnsmasq owner won't read the Flyspray bug records; it is therefore more advisable to report the issue directly upstream, or someone has to find the time to act as liaison, as Kevin Darbyshire-Bryant did.
At the same time, we don't have a clear policy on how to handle bugs in core packages which are not owned by LEDE; I agree it's not a bad idea to keep the bug open as long as we can keep it maintainable.

@openwrt-bot

ckujau:

For the record, this is still an issue with 17.01.2 and Dnsmasq version 2.77:

kernel: [ 2860.890789]
kernel: [ 2860.890789] do_page_fault(): sending SIGSEGV to dnsmasq for invalid write access to 00552000
kernel: [ 2860.899402] epc = 77cd488c in libc.so[77c62000+92000]
kernel: [ 2860.904552] ra = 00406c41 in dnsmasq[400000+21000]
kernel: [ 2860.909537]

I came across this one while playing around with //dnseval// from the [[https://github.com/farrokhi/dnsdiag|dnsdiag]] package. Simply calling //dnseval foo// was enough to make //dnsmasq// crash :-|

But as this also crashes the latest git checkout of //dnsmasq//, I have reported it to the [[http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2017q3/011692.html|dnsmasq-discuss]] mailing list.

If somebody wants to take a stab at the MIPS core dump (attached), please do, as I don't have a LEDE build environment set up yet.
