When running production workloads that rely on enterprise services such as DNS, we expect those services to be reliable, available, and responsive. We assume that when a client requests an enterprise service, the request is logged and takes the most efficient, lowest-latency route. We also expect a proper response, whether success or failure. But what happens when that enterprise service cannot handle the number of incoming requests or fails to send a proper response? In this blog, I walk through a DNS situation where Azure Databricks bombards DNS servers while attempting to authenticate with Azure AD to access data over an Azure Data Lake Storage Gen2 datastore mount. I then provide a solution that, hopefully, allows for a more dynamic DNS implementation in your environment.
Note: This blog post complements the article linked below, which was used as a guide to create the init script and implement custom DNS routing.
Configure custom DNS settings using dnsmasq
https://docs.microsoft.com/en-us/azure/databricks/kb/cloud/custom-dns-routing
Note: The Failed Job Message in this blog post is directly related to using Azure Data Lake Storage Gen2 (ADLS) as a persistent datastore mount per the configuration guidance linked below.
Access Azure Data Lake Storage Gen2 or Blob Storage using OAuth 2.0 with an Azure service principal
https://docs.databricks.com/data/data-sources/azure/azure-storage.html#access-azure-data-lake-storage-gen2-or-blob-storage-using-oauth-20-with-an-azure-service-principal
Situation
The situation involves an enterprise DNS implementation that funnels all requests to a pool of DNS servers for public, private, and Azure Private Link domain/hostname resolution. DNS settings are enforced through virtual network custom DNS settings (DHCP) and other configuration management tools and techniques. Azure Service Endpoint traffic is also routed only to authorized virtual network subnets. Due to strict security policies that enforce logging and traceability, DNS traffic from a typical dedicated Azure Databricks subnet can take advantage of certain Azure DNS optimization and performant routing features only through node-level customization. DNS whitelisting is enabled for the domain “login.microsoftonline.com” to reduce the amount of logging data generated and to increase DNS server performance. This domain is used heavily for authentication with Azure Active Directory (AAD) when accessing Azure resources or any Relying Party Trusts that use AAD as an Identity Provider.
Assumptions
We assume a managed Azure Databricks Workspace with virtual network integration (public & private subnets).
Azure Databricks cluster nodes (Ubuntu) receive their DNS configuration directly from the virtual network through DHCP. Only the first three (3) DNS servers in the resolv.conf are utilized, even though more than three (3) can be configured.
The Azure Databricks Workspace uses the ADLS Gen2 persistent mounted datastore with a Service Principal and OAuth 2.0 to process data and generates over ten (10) million DNS requests daily. Jobs fail sporadically with the Failed Job Message below. The failures are not reproducible on-demand since domain/hostname resolution succeeds more often than not using the enterprise DNS servers.
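The three-server limit mentioned in the assumptions above can be illustrated with a minimal sketch. It parses resolv.conf-style text and returns only the nameservers a resolver limited to three entries would consult. The function name `effective_nameservers` and the sample addresses are hypothetical and used only for illustration.

```python
MAX_NAMESERVERS = 3  # the limit described above: only the first three entries are used


def effective_nameservers(resolv_conf_text: str, limit: int = MAX_NAMESERVERS) -> list[str]:
    """Return the nameserver addresses that would actually be consulted."""
    servers = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    # Entries beyond the limit are configured but never queried.
    return servers[:limit]


# Hypothetical resolv.conf content with four configured DNS servers.
sample = """\
# pushed by DHCP from the virtual network (addresses are placeholders)
nameserver 10.0.0.4
nameserver 10.0.0.5
nameserver 10.0.0.6
nameserver 10.0.0.7
search example.internal
"""

print(effective_nameservers(sample))
```

In this sketch the fourth server, 10.0.0.7, is dropped, which is why adding more DNS servers to the virtual network settings does not by itself add resolver capacity on the nodes.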
Failed Job Message
Job aborted.
Caused by: Job aborted due to stage failure.
Caused by: FileReadException: Error while reading file dbfs://.
Caused by: AbfsRestOperationException: HTTP Error -1; url=’https://login.microsoftonline.com//oauth2/token’ AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException : login.microsoftonline.com
Caused by: AzureADAuthenticator.HttpException: HTTP Error -1; url=’https://login.microsoftonline.com//oauth2/token’ AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException : login.microsoftonline.com
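Because the failures are sporadic, one way to get a rough failure rate is to resolve the hostname repeatedly and count the errors. The following is a minimal sketch, not the diagnostic used in this post: `probe_resolution` is a hypothetical helper, and it treats Python's `socket.gaierror` as the rough equivalent of the `java.net.UnknownHostException` shown above.

```python
import socket
import time


def probe_resolution(hostname, attempts=20, delay_seconds=0.5,
                     resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve hostname repeatedly and count how many lookups fail.

    resolver and sleep are injectable so the logic can be exercised
    without touching the network.
    """
    failures = 0
    for attempt in range(attempts):
        try:
            resolver(hostname, 443)
        except socket.gaierror:
            # Name resolution failed for this attempt.
            failures += 1
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return {"attempts": attempts, "failures": failures}
```

Calling `probe_resolution("login.microsoftonline.com")` from a notebook cell on an affected cluster would give a crude count of failed lookups. Results may be skewed by local caching, so treat the numbers as indicative only.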
Troubleshooting
A few corrective actions were attempted without success.
- Utilize the host node VIP (168.63.129.16) as a primary, secondary, or tertiary DNS resolver configured in the Azure Virtual Network custom DNS settings.
a. Outcome: Azure Private Endpoint resolution failure. In this situation, all DNS requests are sent to a pool of DNS servers, as stated above. The Databricks Workspace integrated virtual network is not associated with any Azure Private DNS Zones, so with this private access method the Private Endpoint domain/hostname cannot be resolved (properly routed) by 168.63.129.16. Instead, the public address is returned, and traffic is denied at the nearest firewall.
- Use the Microsoft-provided "Configure custom DNS settings using dnsmasq" content as is (linked above).
a. Firewalls will need to allow access from the Azure Databricks subnet(s) to “archive.ubuntu.com”. This is required to install and update Ubuntu packages (dnsmasq), unless your organization has implemented other init scripts to adjust these default settings.
b. Modify the init script itself to:
Solution
Since DNS requests for “login.microsoftonline.com” are already whitelisted (not logged), the preferred solution is to route this traffic directly to the Azure host node virtual IP address (168.63.129.16), using dnsmasq via a cluster-level init script. The VIP is accessible by all Azure Databricks cluster nodes for certain services and is not subject to network security group rules except by service tag, per the article linked below. With this configuration, DNS requests and responses for “login.microsoftonline.com” no longer depend on enterprise DNS and are no longer subject to the routing and processing rules, delays, and latency introduced by custom or complex enterprise DNS implementations. The sporadic “java.net.UnknownHostException” for “login.microsoftonline.com” no longer occurs on these Azure Databricks clusters.
What is IP address 168.63.129.16?
https://docs.microsoft.com/en-us/azure/virtual-network/what-is-ip-address-168-63-129-16
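The routing rule described in the Solution can be sketched as a small generator for dnsmasq `server=` lines. This is an illustration under stated assumptions rather than the init script used in this post: `dnsmasq_config` is a hypothetical helper, and the enterprise DNS addresses are placeholders. It relies on dnsmasq's `server=/domain/ip` syntax, which sends queries for that domain to the given resolver, while plain `server=ip` lines act as the default upstreams.

```python
AZURE_HOST_VIP = "168.63.129.16"  # Azure host node virtual IP from the Solution


def dnsmasq_config(direct_domains, enterprise_dns, direct_resolver=AZURE_HOST_VIP):
    """Build dnsmasq config text that splits DNS traffic by domain."""
    lines = []
    # Domain-specific rules: these names bypass enterprise DNS.
    for domain in direct_domains:
        lines.append(f"server=/{domain}/{direct_resolver}")
    # Default upstreams: everything else still goes to enterprise DNS.
    for address in enterprise_dns:
        lines.append(f"server={address}")
    return "\n".join(lines) + "\n"


# Placeholder enterprise DNS servers; substitute your own.
config_text = dnsmasq_config(
    ["login.microsoftonline.com"],
    ["10.0.0.4", "10.0.0.5"],
)
print(config_text, end="")
```

An init script could append lines like these to the dnsmasq configuration and point resolv.conf at the local dnsmasq instance, along the lines of the Microsoft article linked earlier. Keeping the domain list as a parameter makes it easy to add other high-volume, non-logged domains later.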