Scope of Incident
Altapay experienced reduced operability and high Latency of payments processing from 8:40 to 9:27
Analysis of incident:
* Load was not distributed evenly across databases
* One database reached 100% load
* Although this should have affected only a small amount of requests, a limited proxy configuration forced processors to wait for the transactions affected by the database to resolve, dropping new connections beyond the number of processors, increasing the impacted customer base
* The system was able to process, but it created 504 and high response times for some payments
Remedial Action
* A study of load was performed on the spot and increased performance in the database affected
* The processor configuration was amended
* Monitoring was in place, but was configured as the database cluster as a whole, and not per active instances, as part of the cluster. The configuration has now been changed, to be per node level instead of accumulated total cluster level usage.