net/dns: recheck DNS config on SERVFAIL errors (#12547)
Fixes tailscale/corp#20677

Replaces the original attempt to rectify this (by injecting a netMon event), which was both heavy-handed and missed cases where the netMon event was "minor".

On Apple platforms, fetching the interface's nameservers can and does return an empty list in certain situations. Apple's API in particular is very limiting here. The header hints at notifications for DNS changes, which would let us react ahead of time, but it's all private APIs.

To avoid remaining in a state where we have no nameservers but absolutely need them, we'll react to a lack of upstream nameservers by attempting to re-query the OS.

We'll rate limit this to space out the attempts. It seems relatively harmless to attempt a reconfig every 5 seconds (triggered by an incoming query) if the network is in this broken state.

Missing nameservers might possibly be a persistent condition (vs a transient error), but that would also imply that something out of our control is badly misconfigured.

Tested by randomly returning [] for the nameservers. When switching between Wifi networks, or cell->wifi, this will randomly trigger the bug, and we appear to reliably heal the DNS state.

Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
@@ -14,7 +14,6 @@ import (
 	"net/http"
 	"net/netip"
 	"net/url"
-	"runtime"
 	"sort"
 	"strings"
 	"sync"
@@ -212,6 +211,10 @@ type forwarder struct {
 	// /etc/resolv.conf is missing/corrupt, and the peerapi ExitDNS stub
 	// resolver lookup.
 	cloudHostFallback []resolverAndDelay
+
+	// To be called when a SERVFAIL is returned due to missing upstream resolvers.
+	// This should attempt to properly (re)set the upstream resolvers.
+	missingUpstreamRecovery func()
 }
 
 func newForwarder(logf logger.Logf, netMon *netmon.Monitor, linkSel ForwardLinkSelector, dialer *tsdial.Dialer, knobs *controlknobs.Knobs) *forwarder {
@@ -883,22 +886,10 @@ func (f *forwarder) forwardWithDestChan(ctx context.Context, query packet, respo
 		metricDNSFwdErrorNoUpstream.Add(1)
 		f.logf("no upstream resolvers set, returning SERVFAIL")
 
-		if runtime.GOOS == "darwin" || runtime.GOOS == "ios" {
-			// On apple, having no upstream resolvers here is the result a race condition where
-			// we've tried a reconfig after a major link change but the system has not yet set
-			// the resolvers for the new link. We use SystemConfiguration to query nameservers, and
-			// the timing of when that will give us the "right" answer is non-deterministic.
-			//
-			// This will typically happen on sleep-wake cycles with a Wifi interface where
-			// it takes some random amount of time (after telling us that the interface exists)
-			// for the system to configure the dns servers.
-			//
-			// Repolling the network monitor here is a bit odd, but if we're
-			// seeing DNS queries, it's likely that the network is now fully configured, and it's
-			// an ideal time to to requery for the nameservers.
-			f.logf("injecting network monitor event to attempt to refresh the resolvers")
-			f.netMon.InjectEvent()
-		}
+		// Attempt to recompile the DNS configuration
+		// If we are being asked to forward queries and we have no
+		// nameservers, the network is in a bad state.
+		f.missingUpstreamRecovery()
 
 		res, err := servfailResponse(query)
 		if err != nil {