Hello FusionAuth Support and Community,
I'm facing a critical issue with a dedicated FusionAuth instance and would greatly appreciate your expertise. Here's the situation:
Problem Description
Sporadic Unavailability & Downtime: Our dedicated FusionAuth instance intermittently becomes unreachable, but only from within our Google Kubernetes Engine (GKE) cluster. When this happens, the authenticated portion of our site is unavailable. It has happened three times so far: once about a month ago, and twice this week on consecutive days.
Accessibility Contrast: During these outages, the instance remains accessible from our personal computers.
Timeout from Pods: When attempting a curl request from a pod within the GKE cluster, we consistently get a "connect ETIMEDOUT" error for the FusionAuth instance's API endpoint.
Resolves Itself: The issue mysteriously resolves itself within approximately 30 minutes.
Server Logs
Our API pods log the following when the timeout occurs:
preplan-api-7465b86756-dwgnw ClientResponse {
preplan-api-7465b86756-dwgnw exception: FetchError: request to https://[obfuscated-instance-url]/api/identity-provider/login failed, reason: connect ETIMEDOUT [obfuscated-ip]:443
... [Stack Trace]
}
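For anyone reading the log line above, here is a rough sketch of how we interpret the "reason" part of the error. It assumes the message follows the usual Node.js pattern of "reason: <syscall> <code> <target>". The function name classify, the stage descriptions, and the example host and IP (reserved documentation values, not our real ones) are our own illustration, not anything FusionAuth provides.

```python
import re
from typing import Optional

# Common Node.js system error codes and the stage they point at.
# The descriptions are our own shorthand, not official wording.
STAGES = {
    "ENOTFOUND": "DNS lookup failed",
    "EAI_AGAIN": "DNS lookup failed temporarily",
    "ECONNREFUSED": "TCP connect was actively refused",
    "ETIMEDOUT": "TCP connect got no answer before the timeout",
    "ECONNRESET": "an established connection was reset",
}

# Matches e.g. "reason: connect ETIMEDOUT 203.0.113.10:443"
REASON = re.compile(r"reason: (?P<syscall>\w+) (?P<code>E[A-Z_]+)(?: (?P<target>\S+))?")


def classify(message: str) -> Optional[dict]:
    """Extract syscall, error code and target from a fetch error message."""
    match = REASON.search(message)
    if match is None:
        return None
    info = match.groupdict()
    info["stage"] = STAGES.get(info["code"], "unknown")
    return info


if __name__ == "__main__":
    # Hypothetical message shaped like the one in our logs.
    sample = ("request to https://auth.example.invalid/api/identity-provider/login "
              "failed, reason: connect ETIMEDOUT 203.0.113.10:443")
    print(classify(sample))
```

If this reading is right, the name resolves to an IP and it is the TCP connection to port 443 that never gets an answer, which is why we suspect something on the network path rather than DNS.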
Troubleshooting Steps (So Far)
Verified Instance Status: The dedicated instance shows no signs of being down when accessed outside the GKE cluster.
General Connectivity: Our pods have regular internet connectivity otherwise (able to curl google.com).
Whitelisting: We have whitelisted our NGINX Load Balancer IP address in our FusionAuth instance settings.
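To make the checks above repeatable during the next outage, this is a minimal sketch of a probe we could run from a pod and from a personal computer, then compare. It separates DNS resolution from the TCP connect so we can see which stage fails and which IP was targeted. The function name probe, the 5-second timeout, and the IPv4-only lookup are our own assumptions; it uses only the Python standard library.

```python
import socket
import time


def probe(host, port=443, timeout=5.0):
    """Resolve host, then try a TCP connect; report which stage failed."""
    result = {"host": host, "port": port, "ip": None,
              "stage": "dns", "ok": False, "error": None, "seconds": None}
    started = time.monotonic()
    try:
        # IPv4 only: an assumption, based on the IPv4-style address in our logs.
        infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
        family, socktype, proto, _canon, sockaddr = infos[0]
        result["ip"] = sockaddr[0]
        result["stage"] = "connect"
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
        result["ok"] = True
    except OSError as exc:
        # socket.gaierror (DNS) and socket.timeout (connect) are OSError subclasses.
        result["error"] = f"{type(exc).__name__}: {exc}"
    # Total time spent on resolution plus the connect attempt.
    result["seconds"] = round(time.monotonic() - started, 3)
    return result
```

Calling probe("google.com") and probe with our instance's hostname from the same pod, a few seconds apart, should show whether only the FusionAuth address times out and whether the resolved IP matches what we see from outside the cluster.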
Environment Details
FusionAuth Version: 1.47.1
GKE Setup: GKE running a network pool of 4 nodes, with our API running as 10 replicas. We see no other issues with the cluster, and the site is otherwise available.
Request for Guidance
I would sincerely appreciate the community's help in figuring out:
Potential Root Causes: What could explain this temporary, selective unavailability of FusionAuth only from within our GKE cluster?
Network Configuration Issues: Are there specific firewall rules, routing, or DNS settings within GKE to examine?
Troubleshooting Techniques: Any recommended strategies to further diagnose this connectivity problem?
Thank you in advance for your insights and assistance!