Troubleshooting

Tools

Monitoring

Ping tests check the availability of each environment. Alerts go to #benefits-notify in Slack.

Logs

Azure App Service Logs

Open the Logs for the environment you are interested in. The following tables are likely of interest:

AppServiceConsoleLogs: stdout and stderr coming from the container
AppServiceHTTPLogs: requests coming through App Service
AppServicePlatformLogs: deployment information

For some pre-defined queries, click Queries, then Group by: Query type, and look under Query pack queries.

Live tail

After setting up the Azure CLI, you can use the following command to stream live logs:

az webapp log tail --resource-group RG-CDT-PUB-VIP-CALITP-P-001 --name AS-CDT-PUB-VIP-CALITP-P-001 2>&1 | grep -v /healthcheck

SCM

The container (Docker) logs are also exposed through the App Service SCM endpoint, for example:

https://as-cdt-pub-vip-calitp-p-001-dev.scm.azurewebsites.net/api/logs/docker

Sentry

Cal-ITP’s Sentry instance collects both errors (“Issues”) and app performance info.

Alerts are sent to #benefits-notify in Slack. Others can be configured.

You can troubleshoot Sentry itself by turning on debug mode and visiting /error/.

Specific issues

This section serves as the runbook for Benefits.

Terraform lock

General info

If Terraform commands fail (locally or in the Pipeline) due to an Error acquiring the state lock:

Check the Lock Info for the Created timestamp. If it is within the past ten minutes or so, Terraform is probably still running elsewhere, and you should wait (stop here).
Are any Pipeline runs stuck? If so, cancel that build, and try re-running the Terraform command.
Do any engineers have a Terraform command running locally? You’ll need to ask them. For example, they may have started an apply that is waiting for their approval. They will need to exit (gracefully) for the lock to be released.
If none of the steps above identified the source of the lock, and especially if the Created time is more than ten minutes ago, that probably means the last Terraform command didn’t release the lock. You’ll need to grab the ID from the Lock Info output and force unlock.
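
The Lock Info block that accompanies the error can be picked apart with a small shell helper. The following is a minimal sketch, assuming the failed command's output was saved to a file. The function names `lock_id` and `lock_created` and the file name `plan.log` are hypothetical; `terraform force-unlock` is the standard Terraform command for releasing a stale lock.

```shell
# Print the lock ID from Terraform's "Lock Info" block, read on stdin.
# Fields are searched by name so that lines prefixed with a box-drawing
# character (as newer Terraform versions print them) are handled too.
lock_id() {
  awk '/Lock Info:/ { seen = 1 }
       seen { for (i = 1; i < NF; i++) if ($i == "ID:") { print $(i + 1); exit } }'
}

# Print the "Created" timestamp from the same block.
lock_created() {
  awk '/Lock Info:/ { seen = 1 }
       seen { for (i = 1; i < NF; i++) if ($i == "Created:") {
                out = $(i + 1)
                for (j = i + 2; j <= NF; j++) out = out " " $j
                print out; exit } }'
}

# Usage (hypothetical file name):
#   terraform plan 2>&1 | tee plan.log
#   lock_created < plan.log                          # how old is the lock?
#   terraform force-unlock "$(lock_id < plan.log)"   # only after the checks above
```

Run the force unlock only once you are confident that no other Terraform process still holds the lock.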

App fails to start

If the container fails to start, you should see a downtime alert. Assuming this app version was working in another environment, the issue is likely due to misconfiguration. Some things you can do:

Check the logs
Ensure the environment variables and configuration data are set properly
Turn on debugging
Force-push/revert the environment branch back to the old version to roll back
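
The rollback option can be sketched with plain git. This is an illustration only: the helper name `rollback_branch` is hypothetical, and the remote, branch, and commit in the usage comment are placeholders for the environment branch and its last known-good commit.

```shell
# Point an environment branch on the remote back at a known-good commit.
# Arguments: remote name or URL, branch name, commit (SHA or tag) to roll back to.
rollback_branch() {
  remote=$1
  branch=$2
  good_commit=$3
  # --force is needed because the branch moves backwards (a non-fast-forward update).
  git push --force "$remote" "$good_commit:refs/heads/$branch"
}

# Usage (placeholder values):
#   git log --oneline -5 origin/<environment-branch>   # find the last good commit
#   rollback_branch origin <environment-branch> <good-commit-sha>
```

A force-push rewrites the branch on the remote, so note the commit it currently points at first in case the rollback itself needs to be undone.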

Littlepay API issue

Littlepay API issues may show up as:

The monitor failing
The Connect your card button not working

A common cause of Littlepay API failures is an expired certificate. To resolve:

Reach out to support@littlepay.com
Receive a new certificate
Put that certificate into the configuration data and/or the GitHub Actions secrets
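
To confirm that expiry is the cause before requesting a replacement, the certificate's validity can be checked locally. This is a minimal sketch using the `openssl x509` command, assuming the client certificate is available as a PEM file. The helper names and the `littlepay-cert.pem` path are hypothetical.

```shell
# Print the certificate's expiry date (its notAfter field).
cert_not_after() {
  openssl x509 -noout -enddate -in "$1" | sed 's/^notAfter=//'
}

# Succeed (exit 0) only if the certificate is still valid N days from now.
# A non-zero status means it expires within N days, has already expired,
# or could not be read.
cert_valid_for_days() {
  openssl x509 -noout -checkend "$(( $2 * 86400 ))" -in "$1" >/dev/null 2>&1
}

# Usage (hypothetical path):
#   cert_not_after littlepay-cert.pem
#   cert_valid_for_days littlepay-cert.pem 30 || echo "expired, expiring within 30 days, or unreadable"
```

The same check can be run against the new certificate before it is placed in the configuration data.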

Eligibility Server

If the Benefits application gets a 403 error when trying to make API calls to the Eligibility Server, it may be because the outbound IP addresses changed, and the Eligibility Server firewall is still restricting access to the old IP ranges.

Grab the outbound_ip_ranges output values from the most recent Benefits deployment to the relevant environment.
Update the IP ranges
1. Go to the Eligibility Server Pipeline
2. Click Edit
3. Click Variables
4. Update the relevant variable with the new list of CIDRs

Note that the Eligibility Server has nightly downtime while it restarts and loads new data.