FS#251 - sending SIGSEGV to dnsmasq for invalid read access from 00000000 #5482
Comments
IronicSven: I've got the same message on my TP-Link TL-WR1043N/ND v1 with r2109. |
pmalecka: Same here on a MikroTik 493G (r2155). The SIGSEGV also happens for: Sun Nov 13 02:27:07 2016 kern.info kernel: [49320.120778] Sun Nov 13 08:26:00 2016 kern.info kernel: [70852.962664] |
mkresin: It seems to me that the SIGSEGV is related to busybox, since the return address (ra) always points to busybox and always to the same position in busybox. Would any of you please compile an image with the following extra option in menuconfig:
Base system --->
    <*> busybox --->
        [*] Customize busybox options --->
            Busybox Settings --->
                Debugging Options --->
                    [*] Build BusyBox with extra Debugging symbols
This **might** print the name of the function that is called in busybox instead of the (not really helpful) position of the function in the binary. |
mamarley: I did a build with that option and it has been running for a day or so now (more than long enough to reproduce it in the past) and so far there are no segfaults at all. Stupid Heisenbug… |
NeoRaider: I'm seeing the same issue, unfortunately also without debug symbols. I haven't had a closer look yet, but here's some GDB output:
#0 0x00439ff1 in nonblock_immune_read ()
(gdb) bt
#0 0x00439ff1 in nonblock_immune_read ()
#1 0x0041bc23 in argstr ()
#2 0x0041bd89 in expandarg ()
#3 0x0041e7c3 in evalfor ()
#4 0x0041dc37 in evaltreenr ()
#5 0x0041dc37 in evaltreenr ()
#6 0x0041e117 in cmdloop ()
#7 0x0041f8e3 in ash_main ()
#8 0x00407879 in run_applet_no_and_exit ()
#9 0x004078f1 in main ()
(gdb) info registers
zero at v0 v1 a0 a1 a2 a3
R0 00000000 80480000 00000000 fffffffc 00000000 7fda119c 00000080 00000000
t0 t1 t2 t3 t4 t5 t6 t7
R8 00000000 80f7fa80 00000001 00000000 8104217c 00000024 804a0000 ffffff80
s0 s1 s2 s3 s4 s5 s6 s7
R16 00000000 00000003 00000003 0040789d 77292000 77292000 77294500 77295e94
t8 t9 k0 k1 gp sp s8 ra
R24 00000000 772144f8 00000000 00000000 7729b2b0 7fda1108 00000000 00439fe5
sr lo hi bad cause pc
0000dc13 02400000 000f4537 00000000 00800008 00439ff1
fsr fir
00000000 00000000
(gdb) disas
Dump of assembler code for function nonblock_immune_read:
0x00439fd5 <+0>: save a0-a2,48,ra,s0-s1
0x00439fd9 <+4>: move s1,a0
0x00439fdb <+6>: lw a2,56(sp)
0x00439fdd <+8>: lw a1,52(sp)
0x00439fdf <+10>: jal 0x4086c1
0x00439fe3 <+14>: move a0,s1
0x00439fe5 <+16>: slti v0,0
0x00439fe7 <+18>: move s0,v0
0x00439fe9 <+20>: bteqz 0x43a00d
0x00439feb <+22>: jal 0x448ff1 <__errno_location@mips16plt>
0x00439fef <+26>: nop
=> 0x00439ff1 <+28>: lw v0,0(v0)
0x00439ff3 <+30>: cmpi v0,11
0x00439ff5 <+32>: btnez 0x43a00d
0x00439ff7 <+34>: li v0,1
0x00439ff9 <+36>: li a2,1
0x00439ffb <+38>: move v1,sp
0x00439ffd <+40>: neg a2
0x00439fff <+42>: li a1,1
0x0043a001 <+44>: addiu a0,sp,24
0x0043a003 <+46>: sw s1,24(sp)
0x0043a005 <+48>: jal 0x43a671
0x0043a009 <+52>: sh v0,28(v1)
0x0043a00b <+54>: b 0x439fdb
0x0043a00d <+56>: move v0,s0
0x0043a00f <+58>: restore 48,ra,s0-s1
0x0043a011 <+60>: jrc ra
End of assembler dump.
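
For readers who do not read MIPS16 assembly, the listing above can be read as roughly the following C. This is only a sketch reconstructed from the disassembly, not the BusyBox source: the name `nonblock_immune_read_sketch` is made up, and the two `jal` targets (0x4086c1 and 0x43a671) are assumed to be wrappers around `read()` and `poll()`. The faulting instruction `lw v0,0(v0)` corresponds to reading `errno` through the pointer returned by `__errno_location()`; the register dump shows v0 = 00000000 at that point, which suggests the pointer was NULL (or v0 was clobbered) when it was dereferenced.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical reconstruction of nonblock_immune_read() from the
 * disassembly. The name and the use of plain read()/poll() are
 * assumptions, not the BusyBox source.
 */
ssize_t nonblock_immune_read_sketch(int fd, void *buf, size_t count)
{
	for (;;) {
		/* <+10> jal 0x4086c1: assumed to be a read() wrapper */
		ssize_t n = read(fd, buf, count);

		/* <+16> slti v0,0 and <+20> bteqz: return when n >= 0 */
		if (n >= 0)
			return n;

		/*
		 * errno expands to *__errno_location(). The load at <+28>,
		 * "lw v0,0(v0)", is this dereference; <+30> compares the
		 * loaded value with 11, which is EAGAIN on Linux.
		 */
		if (errno != EAGAIN)
			return n;

		/* <+34>..<+52>: pollfd built on the stack, events = 1 (POLLIN) */
		struct pollfd pfd = { .fd = fd, .events = POLLIN };

		/* <+48> jal 0x43a671: assumed poll() wrapper, nfds 1, timeout -1 */
		poll(&pfd, 1, -1);
	}
}
```

If this reading is right, the crash is not in the read logic itself: the only memory access at the faulting address is the `errno` lookup, so whatever makes `__errno_location()` yield 0 (or corrupts v0 between the call and the load) is the thing to look for.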
|
NeoRaider: It is indeed a Heisenbug: any change to the code to add debug output makes it go away. More weirdness (if the information from the core dump I got is accurate):
I can reproduce the issue fairly easily on a TL-WR1043 v1 by calling "/etc/init.d/network restart" (the crash occurs when dnsmasq is restarted by this), but I haven't seen it on a TL-WR841 v7. Either this is hardware-dependent, or something changed because I cleaned my tree when changing the models; I'll have to check again when I have both devices in the same place. |
NeoRaider: I'm not much closer to the root of this issue, but at least I'm a bit less confused.
I've been unable to test this command in gdb (it just hangs). When run in strace, the command doesn't ever segfault. I'll check with the musl people if they have any idea what is happening. |
NeoRaider: Further increasing severity, as this doesn't only affect init scripts, but all shell scripts using shell expansion ($() or backticks). While testing, I've experienced several crashes of sysupgrade. Further results of my investigation:
I'm currently looking into possible kernel-side causes for this issue. |
None: Adding this as a 'me too'. [ 200.009789] do_page_fault(): sending SIGSEGV to odhcpd for invalid read access from 00000000 Archer c7 v2 - linux 4.4.34 Can't re-create at will, but occurs 2-3 times every reboot. Let me know how I can be of assistance with running tests etc. |
None: Don't know if this is of any help, but I got a 'strace':
epoll_pwait(3, [], 10, 2000, NULL, 16) = 0
clock_gettime(CLOCK_MONOTONIC, {118, 496503244}) = 0
clock_gettime(CLOCK_MONOTONIC, {118, 496956153}) = 0
clock_gettime(CLOCK_MONOTONIC, {118, 497110062}) = 0
clock_gettime(CLOCK_MONOTONIC, {118, 497561396}) = 0
epoll_pwait(3, [{EPOLLIN, {u32=2002960268, u64=8602648846247395328}}], 10, 2000, NULL, 16) = 1
recvmsg(18, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\5\0\23\0\0\0\0\0\0\0P", iov_len=12}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 12
poll([{fd=18, events=POLLIN}], 1, -1) = 1 ([{fd=18, revents=POLLIN}])
recvmsg(18, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\10B\353\226\255\4\0\0\24ubus.object.add\0\7\0\0000"..., iov_len=76}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 76
sendmsg(18, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\1\0\23\0\0\0\0", iov_len=8}, {iov_base="\0\0\0\24\1\0\0\10\0\0\0\0\3\0\0\10B\353\226\255", iov_len=20}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, 0) = 28
recvmsg(18, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {118, 667711369}) = 0
clock_gettime(CLOCK_MONOTONIC, {118, 667914580}) = 0
epoll_pwait(3, [{EPOLLIN, {u32=4313216, u64=18525121660583936}}], 10, 1830, NULL, 16) = 1
recvmsg(13, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000400}, msg_namelen=28->12, msg_iov=[{iov_base=[{{len=116, type=0x18 /* NLMSG_??? */, flags=0, seq=0, pid=0}, "\n\10\0\0\377\3\0\1\0\0\0\0\0\10\0\17\0\0\0\377\0\24\0\1\377\0\0\0\0\0\0\0"...}, {{len=0, type=0x62e3 /* NLMSG_??? */, flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|NLM_F_ECHO|NLM_F_DUMP_INTR|NLM_F_DUMP_FILTERED|0x27c0, seq=4272922192, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 116
recvmsg(13, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000400}, msg_namelen=28->12, msg_iov=[{iov_base=[{{len=116, type=0x18 /* NLMSG_??? */, flags=0, seq=0, pid=0}, "\n@\0\0\376\2\0\1\0\0\0\0\0\10\0\17\0\0\0\376\0\24\0\1\376\200\0\0\0\0\0\0"...}, {{len=0, type=0x62e3 /* NLMSG_??? */, flags=NLM_F_REQUEST|NLM_F_MULTI|NLM_F_ACK|NLM_F_ECHO|NLM_F_DUMP_INTR|NLM_F_DUMP_FILTERED|0x27c0, seq=4272922192, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 116
recvmsg(13, {msg_namelen=28}, MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {118, 707666461}) = 0
clock_gettime(CLOCK_MONOTONIC, {118, 707849995}) = 0
epoll_pwait(3, [{EPOLLIN, {u32=4313216, u64=18525121660583936}}], 10, 1790, NULL, 16) = 1
recvmsg(13, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000100}, msg_namelen=28->12, msg_iov=[{iov_base=[{{len=72, type=0x14 /* NLMSG_??? */, flags=0, seq=0, pid=0}, "\n\200\0\0\0\0\0\n\0\24\0\1*\2\f\177\22 \277+\0\0\0\0\0\0\0\376\0\24\0\6"...}, {{len=2359308, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\5\0\24\0\0\0\0\0\0\0\0"...}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 72
clock_gettime(CLOCK_MONOTONIC, {119, 188109722}) = 0
sendto(7, {{len=24, type=0x16 /* NLMSG_??? */, flags=NLM_F_REQUEST|0x300, seq=1, pid=0}, "\n\0\0\0\0\0\0\n"}, 24, 0, NULL, 0) = 24
recvfrom(7, [{{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1, pid=2601}, "\n\200\200\376\0\0\0\1\0\24\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\24\0\6"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1, pid=2601}, "\n\200\0\0\0\0\0\n\0\24\0\1*\2\f\177\22 \277+\0\0\0\0\0\0\0\376\0\24\0\6"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1, pid=2601}, "\n@\200\375\0\0\0\n\0\24\0\1\376\200\0\0\0\0\0\0\26\314 \377\376\276\2112\0\24\0\6"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1, pid=2601}, "\n@\200\375\0\0\0\22\0\24\0\1\376\200\0\0\0\0\0\0\26\314 \377\376\276\2111\0\24\0\6"...}, {{len=72, type=0x14 /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1, pid=2601}, "\n@\300\375\0\0\0\23\0\24\0\1\376\200\0\0\0\0\0\0\26\314 \377\376\276\2110\0\24\0\6"...}], 8192, 0, NULL, NULL) = 360
recvfrom(7, {{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=1, pid=2601}, "\0\0\0\0"}, 8192, 0, NULL, NULL) = 20
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV +++
|
NeoRaider: What exact options did you use for this strace? It contains lots of syscalls that are not from busybox. |
None: it was an 'strace -p' of odhcpd which is the thing that gets killed on a 'regular' basis. Oh hell, I've just noticed odhcpd was bumped recently.... this might be a red herring. |
NeoRaider: Most likely that is a different bug. All reports in this ticket are about busybox (ash) crashing while running shell scripts. Some strings like "dnsmasq" appear in the logs because those are the names of the scripts (e.g. /etc/init.d/dnsmasq). |
IronicSven: I've been testing the latest trunk (r0+2321) on my TP-Link TL-WR1043N/ND v1 for a few hours and I can't reproduce the SIGSEGV messages anymore :) |
IronicSven: I just flashed r0+2369 and the SIGSEGV messages are back. |
fuzzle: I've built several LEDE images in the last few days, and at least on a TP-Link 841 I never see this (with some days of uptime) .. .. this particular node: |
NeoRaider: fuzzle, that's not even close to the issues reported in this ticket. As mentioned in earlier comments, this ticket doesn't have anything to do with dnsmasq, but is about a segfault in busybox. Also, please don't report Gluon bugs in the LEDE tracker. |
NeoRaider: Small update: While I mostly see this issue on a TL-WR1043 v1, I've also observed it on a TL-WR841 v9 at least once; so it seems the bug is not hardware-specific after all (at least not limited to specific SoCs). Unfortunately, I've been busy with other things last week, so I haven't been able to continue debugging the issue. |
nbd: Please test the latest version |
IronicSven: I've tested a few versions since last weekend and couldn't reproduce this issue on my 1043nd v1 anymore. I think it's fixed. |
NeoRaider: Still reproducible with current master (r2695-c9c68c71776). |
mjw99: Just a "Me too". I am seeing this with a NETGEAR WNR2000v1 on r2449-7c47f43: |
IronicSven: I haven't been able to reproduce this issue for weeks. I've been testing a TL-WR1043ND v1, a TL-WR1043ND v2 and an Archer C7 during this period. Is it possible that your images are self-built and that a make dirclean or make distclean might help? |
nbd: If you're still affected by this bug, please try the latest version |
mamarley: I'm not seeing this on my UAP-LR anymore. |
mjw99: I am no longer seeing this on a NETGEAR WNR2000v1 with 17.01-SNAPSHOT, r3045-e038c60. |
guidosarducci: I've noticed the following several times within the last day or so: I'm running the latest LEDE stable, with all updates applied as of 2017-05-05:
The most recent upgrade in the same time frame was to odhcpd-2017-04-28-9268ca65-1. And DNSSEC is enabled. After a few restart attempts, dnsmasq has continued to run since then. |
guidosarducci: The SIGSEGV crashes continue to happen periodically, and I may have been missing them due to dnsmasq being restarted by procd. To get a little more info, I rebuilt the stable LEDE and dnsmasq-full with a "-g" CFLAG option. After installing this package, I captured the following crash details:
Checking further with gdb yields:
Any similar reports from others? I'll keep monitoring in the meantime... |
NeoRaider: This ticket is specifically about a crash in busybox, often seen while running the dnsmasq init script (but also in other shell scripts). Your issue is a crash in dnsmasq itself, please open a new ticket. |
guidosarducci: Sure, new ticket created. I'd also like to suggest changing the unfortunately misleading title of this ticket if possible, since it matches my own issue. |
ckujau: For the record, this is still an issue with 17.01.2 and Dnsmasq version 2.77:
kernel: [ 2860.890789]
kernel: [ 2860.890789] do_page_fault(): sending SIGSEGV to dnsmasq for invalid write access to 00552000
kernel: [ 2860.899402] epc = 77cd488c in libc.so[77c62000+92000]
kernel: [ 2860.904552] ra = 00406c41 in dnsmasq[400000+21000]
kernel: [ 2860.909537]
I came across this one while playing around with //dnseval// from the [[https://github.com/farrokhi/dnsdiag|dnsdiag]] package. Simply calling //dnseval foo// was enough to make //dnsmasq// crash :-| But, as this crashes the latest git checkout from //dnsmasq// too, I shall report this upstream, of course. |
NeoRaider: ckujau: please open a new ticket for dnsmasq, I believe your issue hasn't been reported yet. As mentioned in an earlier comment, this ticket doesn't have anything to do with dnsmasq at all; it is about a segfault in busybox that just happened to occur while running a shell script called "dnsmasq", leading to a somewhat confusing error message. |
ckujau: I think this has been reported in |
marcin1j: Christian, I reported the issue you mentioned as FS#994. What's your target, and on which device does the problem occur? |
ckujau: I was able to reproduce this on x86 too and [[http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2017q3/011704.html|bisected]] it to upstream commit 0xfa78573778, so it was not LEDE or architecture specific and I should have reported this upstream from the start. But yes, the fix mentioned in FS#994 is the same one [[http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2017q3/011714.html|posted]] to the //dnsmasq-discuss// list. I haven't had a chance to verify it yet (the previous band-aid patch worked), will report back. For completeness' sake: my target is ar71xx (a TP-Link AC1750 Wifi router). Thanks. |
koa:
it's a message that shows up frequently on a fresh install of lede's 4.4.27 on wzr-hp-g300nh -> Linux robokoa 4.4.27 #0 Wed Oct 26 10:37:47 2016 mips GNU/Linux -- steps to reproduce are unknown, i'm unsure of the initial reason for this error
[28082.882471] do_page_fault(): sending SIGSEGV to dnsmasq for invalid read access from 00000000
[28082.891127] epc = 00439ff1 in busybox[400000+4a000]
[28082.896083] ra = 00439fe5 in busybox[400000+4a000]
[28082.901018]
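
A note on reading the `epc`/`ra` lines above: each gives an address followed by `module[base+size]`, so subtracting the base yields the offset inside the mapped image, which can then be looked up in a disassembly. The helper below is a minimal sketch of that arithmetic; `parse_fault_loc` and `image_offset` are made-up names, and the format string is inferred only from the log lines quoted in this ticket. The addresses are odd because, on MIPS, bit 0 of a code address marks MIPS16 mode, so the sketch clears it before subtracting.

```c
#include <stdio.h>
#include <string.h>

/* One "reg = addr in module[base+size]" entry from the kernel log. */
struct fault_loc {
	char module[64];
	unsigned long addr;  /* epc or ra as logged */
	unsigned long base;  /* start of the mapping */
	unsigned long size;  /* length of the mapping */
};

/*
 * Hypothetical helper: find "<reg> = " in a log line and parse what
 * follows. Returns 1 on success, 0 if the line does not match.
 */
int parse_fault_loc(const char *line, const char *reg, struct fault_loc *out)
{
	char key[16];
	const char *p;

	snprintf(key, sizeof key, "%s = ", reg);
	p = strstr(line, key);
	if (p == NULL)
		return 0;
	p += strlen(key);
	return sscanf(p, "%lx in %63[^[][%lx+%lx]",
		      &out->addr, out->module, &out->base, &out->size) == 4;
}

/* Offset into the image, with the MIPS16 ISA-mode bit (bit 0) cleared. */
unsigned long image_offset(const struct fault_loc *loc)
{
	return (loc->addr & ~1UL) - loc->base;
}
```

For the `epc` line in this report, this gives module `busybox` and offset 0x39ff0, which lies inside the 0x4a000-byte mapping. Since busybox is mapped at 0x400000 here, the logged address can also be compared directly against a disassembly of the binary; the offset form is mainly useful for shared objects such as `libc.so`, which are mapped at varying bases.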