Troubleshooting ¶
Tools ¶
Monitoring ¶
We have ping tests set up to notify about availability of each environment. Alerts go to #benefits-notify.
Logs ¶
Azure App Service Logs ¶
Open the Logs for the environment you are interested in. The following tables are likely of interest:
- AppServiceConsoleLogs: stdout and stderr coming from the container
- AppServiceHTTPLogs: requests coming through App Service
- AppServicePlatformLogs: deployment information
For some pre-defined queries, click Queries, then Group by: Query type, and look under Query pack queries.
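If you prefer the command line to the portal, the same tables can probably be queried through the Azure CLI. The following is a minimal sketch, not a documented part of this runbook: the helper name console_log_query and the WORKSPACE_ID variable are hypothetical, and it assumes the console log message is in the ResultDescription column of AppServiceConsoleLogs.

```shell
# Build a KQL query for recent container output.
# $1: look-back window in minutes (defaults to 30)
console_log_query() {
  minutes="${1:-30}"
  printf 'AppServiceConsoleLogs | where TimeGenerated > ago(%sm) | project TimeGenerated, ResultDescription | order by TimeGenerated desc' "$minutes"
}

# Example usage (WORKSPACE_ID is a placeholder for the Log Analytics workspace GUID):
# az monitor log-analytics query \
#   --workspace "$WORKSPACE_ID" \
#   --analytics-query "$(console_log_query 15)" \
#   --output table
```

Keeping the query in a small function makes it easy to swap in AppServiceHTTPLogs or AppServicePlatformLogs, though their column names differ and would need checking first.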
Live tail ¶
After setting up the Azure CLI, you can use the following command to stream live logs:
az webapp log tail --resource-group RG-CDT-PUB-VIP-CALITP-P-001 --name AS-CDT-PUB-VIP-CALITP-P-001 2>&1 | grep -v /healthcheck
SCM ¶
https://as-cdt-pub-vip-calitp-p-001-dev.scm.azurewebsites.net/api/logs/docker
Sentry ¶
Cal-ITP’s Sentry instance collects both errors (“Issues”) and app performance info.
Alerts are sent to #benefits-notify in Slack. Others can be configured.
You can troubleshoot Sentry itself by turning on debug mode and visiting /error/.
Specific issues ¶
This section serves as the runbook for Benefits.
Terraform lock ¶
If Terraform commands fail (locally or in the Pipeline) due to an Error acquiring the state lock:
- Check the Lock Info for the Created timestamp. If it’s in the past ten minutes or so, that probably means Terraform is still running elsewhere, and you should wait (stop here).
- Are any Pipeline runs stuck? If so, cancel that build, and try re-running the Terraform command.
- Do any engineers have a Terraform command running locally? You’ll need to ask them. For example, they may have started an apply that is waiting for them to approve it. They will need to (gracefully) exit for the lock to be released.
- If none of the steps above identified the source of the lock, and especially if the Created time is more than ten minutes ago, that probably means the last Terraform command didn’t release the lock. You’ll need to grab the ID from the Lock Info output and force unlock.
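The ten-minute check on the Created timestamp can be sketched as a small shell helper. This is an illustration under assumptions, not part of the Benefits tooling: the function names are hypothetical, the timestamp is assumed to be in UTC in the form Terraform usually prints (for example "2024-01-01 12:00:00.123456789 +0000 UTC"), and GNU date is assumed for the -d option.

```shell
# Minutes elapsed since a Lock Info "Created" timestamp.
# $1: Created value (assumed UTC)
# $2: current time in epoch seconds (defaults to now)
lock_age_minutes() {
  created="$1"
  now="${2:-$(date -u +%s)}"
  # Keep only "YYYY-MM-DD HH:MM:SS"; drop fractional seconds and the zone suffix.
  stamp=$(printf '%s\n' "$created" | awk '{ sub(/\..*/, "", $2); print $1 " " $2 }')
  created_epoch=$(date -u -d "$stamp" +%s) || return 1
  echo $(( (now - created_epoch) / 60 ))
}

# Suggest the next step based on the ten-minute rule of thumb above.
lock_advice() {
  age=$(lock_age_minutes "$1" "$2") || return 1
  if [ "$age" -lt 10 ]; then
    echo "wait"          # Terraform is probably still running elsewhere
  else
    echo "investigate"   # check Pipeline runs and local commands first
  fi
}

# Only after the other steps have ruled out a live run:
# terraform force-unlock LOCK_ID   # LOCK_ID is the ID from the Lock Info output
```

The helper only covers the first step; whether a Pipeline run or a colleague still holds the lock has to be checked by hand before force unlocking.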
App fails to start ¶
If the container fails to start, you should see a downtime alert. Assuming this app version was working in another environment, the issue is likely due to misconfiguration. Some things you can do:
- Check the logs
- Ensure the environment variables and configuration data are set properly.
- Turn on debugging
- Force-push/revert the environment branch back to the old version to roll back
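For the rollback option, the force-push could look roughly like the sketch below. The helper name rollback_command is hypothetical, as are the branch and commit values in the usage line; the function only prints the git command so it can be reviewed before anyone runs it.

```shell
# Print (do not run) the force-push that points an environment branch
# back at a known-good commit.
# $1: environment branch name (hypothetical example: "test")
# $2: commit SHA or tag of the last working version
rollback_command() {
  branch="$1"
  good_ref="$2"
  if [ -z "$branch" ] || [ -z "$good_ref" ]; then
    echo "usage: rollback_command BRANCH GOOD_REF" >&2
    return 1
  fi
  echo "git push --force origin ${good_ref}:refs/heads/${branch}"
}

# Example usage:
# rollback_command test abc1234
```

Printing the command first is a deliberate choice here: a force-push rewrites the branch, so a second look at the branch name and commit is worth the extra step.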
Littlepay API issue ¶
Littlepay API issues may show up as:
- The monitor failing
- The Connect your card button doesn’t work
A common cause of Littlepay API failures is an expired certificate. To resolve:
- Reach out to support@littlepay.com
- Receive a new certificate
- Put that certificate into the configuration data and/or the GitHub Actions secrets
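Before contacting Littlepay, it may help to confirm that the certificate really has expired, or to check ahead of time how soon it will. A minimal sketch, assuming the certificate is available locally as a PEM-encoded X.509 file; the function name and file path are hypothetical:

```shell
# Succeeds (exit 0) if the certificate expires within the given number of days.
# $1: path to a PEM certificate file
# $2: number of days to look ahead
cert_expires_within_days() {
  cert_file="$1"
  days="$2"
  [ -r "$cert_file" ] || { echo "cannot read $cert_file" >&2; return 2; }
  # openssl's -checkend exits 0 when the certificate will NOT expire
  # within the given number of seconds, so the result is inverted here.
  if openssl x509 -checkend $(( days * 86400 )) -noout -in "$cert_file" >/dev/null; then
    return 1
  fi
  return 0
}

# Example usage (path is a placeholder):
# openssl x509 -enddate -noout -in ./littlepay-cert.pem
# cert_expires_within_days ./littlepay-cert.pem 30 && echo "renew soon"
```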
Eligibility Server ¶
If the Benefits application gets a 403 error when trying to make API calls to the Eligibility Server, it may be because the outbound IP addresses changed, and the Eligibility Server firewall is still restricting access to the old IP ranges.
- Grab the outbound_ip_ranges output values from the most recent Benefits deployment to the relevant environment.
- Update the IP ranges:
  - Go to the Eligibility Server Pipeline
  - Click Edit
  - Click Variables
  - Update the relevant variable with the new list of CIDRs
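To see at a glance whether the firewall list is out of date, the new ranges can be compared with the ones currently allowed. The sketch below is illustrative only: the function name is hypothetical, both lists are assumed to be comma-separated CIDRs with no spaces, and the addresses in the usage line are documentation placeholders rather than real values.

```shell
# Print each CIDR from the current outbound ranges that is missing
# from the allowed list.
# $1: comma-separated CIDRs currently allowed by the Eligibility Server
# $2: comma-separated CIDRs from the latest outbound_ip_ranges output
missing_cidrs() {
  allowed="$1"
  current="$2"
  printf '%s\n' "$current" | tr ',' '\n' | while IFS= read -r cidr; do
    [ -n "$cidr" ] || continue
    case ",$allowed," in
      *",$cidr,"*) ;;                 # already allowed
      *) printf '%s\n' "$cidr" ;;     # needs to be added
    esac
  done
}

# Example usage:
# missing_cidrs "192.0.2.0/24,198.51.100.0/24" "198.51.100.0/24,203.0.113.0/24"
```

An empty result suggests the 403 has a different cause than a changed outbound IP range.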
Note there is nightly downtime as the Eligibility Server restarts and loads new data.