FS#391 - dnsmasq stops working properly if the fastest upstream DNS server returns a server failure #5503
Comments
stintel: Something similar happens to me from time to time. I'm a Gentoo user, and one of the nameservers for the gentoo.org domain seems to be unreliable. When it is down, it's nearly impossible for me to resolve anything in that domain. Someone in OpenWrt also had this problem: he was unable to resolve most records of a domain when connected to his OpenWrt router. The problem did not occur when he was directly connected. While searching for possible solutions, I came across the --all-servers option:
By default, when dnsmasq has more than one upstream server available, it will send queries to just one server. Setting this flag forces dnsmasq to send all queries to all available servers. The reply from the server which answers first will be returned to the original requester.
Can you test if enabling it helps? Can be enabled in /etc/config/dhcp:
config dnsmasq
option allservers '1'
...
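A rough model of the quoted description may help when reasoning about this option. The Python sketch below is not dnsmasq code: the `Reply` type, the `first_reply` helper, and the latencies are made up for illustration. It only shows that "first answer wins" returns whichever reply arrives first, including an error reply, which matches the behaviour reported for dnsmasq 2.76 in this thread.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    server: str        # upstream DNS server address
    latency_ms: float  # hypothetical time until the reply arrives
    rcode: str         # DNS response code, e.g. "NOERROR" or "SERVFAIL"


def first_reply(replies):
    """Return the reply that arrives first, whatever its rcode is."""
    return min(replies, key=lambda r: r.latency_ms)


# Hypothetical situation: the faulty server answers faster than the good one.
replies = [
    Reply("8.8.8.8", 20.0, "NOERROR"),
    Reply("192.168.3.2", 1.0, "SERVFAIL"),
]

winner = first_reply(replies)
print(winner.server, winner.rcode)
```

In this model, sending the query to every server does not help by itself, because the fast SERVFAIL still wins the race.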
|
IronicSven: Hi Stijn, the problem still occurs with the allservers parameter. My ISP's DNS servers are working fine right now, so to reproduce the DNS server failure I created an Ubuntu VM in VirtualBox, installed a bind9 DNS server, and assigned it the manual IP address 192.168.3.2 without a gateway address (which means it can't connect to the internet to resolve requests).
sven@sven-VirtualBox:~$ nslookup facebook.com 192.168.3.2
Server: 192.168.3.2
Address: 192.168.3.2#53
For the allservers test, dnsmasq is now using the Google public DNS servers and my faulty DNS server.
root@FlensNet:~# uci set dhcp.@dnsmasq[0].allservers=1
root@FlensNet:~# uci commit dhcp
root@FlensNet:~# /etc/init.d/dnsmasq restart
root@FlensNet:~# nslookup facebook.com
nslookup: can't resolve '(null)': Name does not resolve
Syslog shows that the requests are forwarded to all DNS servers as expected, but the reply messages are still missing. DNS starts to work again after I remove 192.168.3.2 from the DNS server list. |
bjonglez: Did it start happening after the update to dnsmasq 2.76, in May 2016? These commits look relevant: http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=51967f9807665dae403f1497b827165c5fa1084b (introduced in dnsmasq 2.69) So, the first commit introduced the issue you see as a "feature" (but there was a bug in the implementation, so it didn't work), while the second commit made the first commit actually work starting from dnsmasq 2.76. I'm not sure what is the right behaviour, but it indeed sounds strange to treat SERVFAIL as a valid response. |
EricLuehrsen: Simon Kelley took note of this. It might be a necessary though annoying behavior for a stub resolver using DNSSEC. https://www.mail-archive.com/dnsmasq-discuss@lists.thekelleys.org.uk/msg10901.html |
IronicSven: @baptiste: I can't reproduce the issue with OpenWrt Chaos Calmer 15.05.1 r49389 and Dnsmasq version 2.73. My internet outages started with Lede and Dnsmasq 2.76. @eric: I don't use DNSSEC, so treating SERVFAIL as a valid response sounds strange to me. I've spent some time trying to add some logging messages and revert the changes mentioned above. I created a patch file in package/network/services/dnsmasq/patches:
--- a/src/forward.c
+++ b/src/forward.c
@@ -821,9 +821,15 @@ void reply_query(int fd, int family, tim
}
Now it works as I would expect. If the fastest DNS server returns SERVFAIL, the reply from the next DNS server that returns NOERROR is used as the valid response.
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 query[AAAA] bugs.lede-project.org from 127.0.0.1
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 83.169.185.161
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 83.169.185.225
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 8.8.8.8
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 8.8.4.4
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 forwarded bugs.lede-project.org to 192.168.3.2
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: received SERVFAIL
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 24 127.0.0.1/54663 reply bugs.lede-project.org is 148.251.78.235
Tue Jan 17 19:25:09 2017 daemon.info dnsmasq[1593]: 25 127.0.0.1/54663 reply bugs.lede-project.org is 2a01:4f8:202:43ea::3
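The selection rule described in this comment (ignore a fast SERVFAIL and use the next NOERROR reply) can be sketched roughly as follows. This is a Python illustration, not the C patch itself: the `Reply` type, the `select_reply` helper, and the arrival order are hypothetical, and the fallback to the last SERVFAIL when every server fails is an assumption rather than something stated in the thread.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Reply:
    server: str  # upstream DNS server address
    rcode: str   # DNS response code, e.g. "NOERROR" or "SERVFAIL"


def select_reply(replies_in_arrival_order: Iterable[Reply]) -> Optional[Reply]:
    """Pick the first reply that is not SERVFAIL.

    Assumption: if every upstream returns SERVFAIL, the last failure is
    passed on so that the client still gets an answer.
    """
    last_failure = None
    for reply in replies_in_arrival_order:
        if reply.rcode != "SERVFAIL":
            return reply
        last_failure = reply
    return last_failure


# Hypothetical arrival order: the faulty test server answers first.
arrived = [
    Reply("192.168.3.2", "SERVFAIL"),
    Reply("8.8.8.8", "NOERROR"),
    Reply("8.8.4.4", "NOERROR"),
]

chosen = select_reply(arrived)
print(chosen.server, chosen.rcode)
```

With this rule the early SERVFAIL from 192.168.3.2 no longer hides the later NOERROR reply.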
@devs: Please feel free to use the patch. |
dtaht: A patch much like this was folded into lede a day or three back. |
None: For clarity: A patch much like that and updating to dnsmasq 2.77test1 was pulled into jow's staging tree. It is not yet in master, much less backported to 17.01.* |
IronicSven: I just tested a self-built image with https://git.lede-project.org/?p=source.git;a=commit;h=3bef96ef18a6fb20401313dfa6e88057d56b16ad and can't reproduce this issue anymore. I would suggest cherry-picking this commit for Lede v17.01.0 because it will prevent internet outages for users with unreliable DNS servers, like me. PS: Simon Kelley included the fix in dnsmasq-2.77test2: http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=68f6312d4bae30b78daafcd6f51dc441b8685b1e Thanks for your support. |
None: I already have a pull request in for 2.77test2 lede-project/source#794 |
IronicSven:
**- Device problem occurs on**
Reproduced on TP-Link 1043nd v1 and TP-Link Archer C7 v2.
**- Software versions of LEDE release, packages, etc.**
Reboot (SNAPSHOT, r2961-5b089e4)
Dnsmasq version 2.76
**- Steps to reproduce**
Fastest upstream DNS server returns a server failure.
My provider is having some difficulties with its DNS servers this week. I noticed that if the fastest DNS server returns a server failure, dnsmasq stops working properly because it ignores the replies of the slower DNS servers.
In Google Chrome, ERR_NAME_RESOLUTION_FAILED appears, and nslookup returns:
** server can't find google.com: SERVFAIL
I don't use strict-order and it doesn't matter if the faulty upstream DNS server is the first or the last entry in the config as long as it returns the fastest reply.
I had to delete the upstream DNS server which returns the server failure from my config to get dnsmasq working again.
I was able to create a tcpdump and syslog while the DNS server 83.169.185.162 returned a server failure today.
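To find out which upstream server is the one returning the failure, each server can be queried directly and its response code compared. The Python sketch below uses only the standard library and builds a minimal A-record query by hand in the RFC 1035 wire format. The helper names (`build_query`, `parse_rcode`, `probe`) and the fixed query ID are illustrative choices, not part of dnsmasq or any existing tool.

```python
import socket
import struct

# Standard DNS response codes (RFC 1035, section 4.1.1).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record."""
    # Header: ID, flags (only RD set), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def parse_rcode(response: bytes) -> str:
    """Extract the response code from the low 4 bits of the flags field."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def probe(server: str, name: str, timeout: float = 2.0) -> str:
    """Send one UDP query to `server` and return its rcode or "TIMEOUT"."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        try:
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
    return parse_rcode(response)
```

Calling, for example, `probe("83.169.185.162", "google.com")` and `probe("8.8.8.8", "google.com")` in turn would show which of the configured servers answers with SERVFAIL at that moment.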