<img src={require('./img/blogmysql1.png').default} alt="MySQL production outage alert received by DevOps team" width="700" height="450"/>

<br/>

# We Got a Notification for Database Downtime — A DevOps Lesson in Handling a MySQL Outage

Like most unexpected production incidents, it began with a notification that made our stomachs drop.

> **“Database not responding.”**

No warning. No gradual slowdown. Just a terse alert that our backend could no longer connect to MySQL.

At first we assumed a straightforward networking problem: a firewall rule might have changed, or perhaps there was a glitch in AWS networking. What we found instead turned into a valuable lesson in connection management, monitoring, and production readiness.

Here’s the full story.

---

## The First Signs of Trouble

<img src={require('./img/mysql22.png').default} alt="MySQL too many connections error" width="700" height="450"/>

All of a sudden, our applications began failing database requests. APIs returned errors. Deployments hung. Internal tools could not retrieve data.

Every attempt to connect to MySQL manually from outside the server failed. Even worse, after SSHing into the server, we tried:

```bash
sudo mysql
```

and saw:

```
ERROR 1040 (HY000): Too many connections
```

<img src={require('./img/blogmysql3.png').default} alt="MySQL error 1040 too many connections" width="700" height="450"/>

At that point we knew this was not a network issue. This was a **database connection overload**.
If you’re building or running containerized applications in production, you might also find insights in our article on [deploying containerized apps](https://nife.io/solutions/deploy_containarized_apps), which covers best practices for resilient deployment architectures.

---

## Investigating the Root Cause

MySQL was our initial area of focus, but it still wouldn’t permit connections even after we confirmed:

- MySQL service was active
- Port was listening correctly
- Security groups were open
- No firewall blocks

The database still refused connections. Checking the status of backend services from other systems revealed hundreds of open TCP connections from those services to MySQL. The server had reached its maximum connection capacity and was accepting no new requests, not even a valid admin login.

This is when we identified the root cause: our backend services were creating an excessive number of connections to MySQL.

If you’re unfamiliar with the `SHOW PROCESSLIST` command or how to interpret its output, the official [MySQL documentation explains it in detail](https://dev.mysql.com/doc/refman/8.0/en/show-processlist.html).

---

## The Immediate Fix

Since we could not log in to MySQL to terminate existing connections manually, the only option was to restart the MySQL service:

```bash
sudo systemctl restart mysql
```

Restarting the service terminated all existing connections and allowed new ones.

Within seconds:

- API services recovered
- Database logins worked
- Production stabilized

But we knew this was only a temporary fix. If we did not address the underlying cause, the same incident would happen again.
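The pattern we later saw in `SHOW PROCESSLIST` (many connections sitting idle) can be summarized with a small tally. A hedged Go sketch: the `countByState` helper and the hardcoded sample rows are illustrative only; a real tool would scan rows from `information_schema.processlist`:

```go
package main

import (
	"fmt"
	"strings"
)

// countByState tallies the Command column of SHOW PROCESSLIST rows
// ("Sleep", "Query", ...). Each input string stands in for one row's
// Command value; a real implementation would read these from the
// information_schema.processlist table.
func countByState(commands []string) map[string]int {
	counts := make(map[string]int)
	for _, c := range commands {
		counts[strings.TrimSpace(c)]++
	}
	return counts
}

func main() {
	// Simplified sample of what an overloaded processlist can look like.
	sample := []string{"Sleep", "Sleep", "Query", "Sleep", "Binlog Dump"}
	counts := countByState(sample)
	fmt.Println("sleeping:", counts["Sleep"]) // sleeping: 3
	fmt.Println("active:", counts["Query"])   // active: 1
}
```

A high ratio of `Sleep` to active commands is exactly the signature of leaked or unpooled connections.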
As we dug into this issue, having a solid application lifecycle management process made debugging and fixing much more effective. We discuss approaches to this in our guide on [application lifecycle management](https://nife.io/solutions/application_lifecycle_management).

---

## What We Found After Recovery

Once back inside MySQL, we checked:

```sql
SHOW PROCESSLIST;
SHOW VARIABLES LIKE 'max_connections';
```

The default `max_connections` was only 151. Under normal operating conditions that would have been manageable, but heavy traffic combined with inefficient connection management in our backend exhausted the limit far faster than expected.

Most of the connections were in the `Sleep` state, meaning they were doing no work yet still occupied a connection slot.

This behavior is documented clearly in MySQL’s official reference for the [`max_connections` system variable](https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_max_connections).

---

## 🔧 The Real Fix: Connection Management

We implemented the following improvements:

### 1️⃣ Increased Connection Limit

We raised MySQL’s capacity to accommodate expected connection bursts:

```ini
max_connections = 300
```

### 2️⃣ Reduced Idle Timeouts

We configured MySQL to close idle connections sooner:

```ini
wait_timeout = 60
interactive_timeout = 60
```

### 3️⃣ Fixed Backend Connection Pooling

Instead of opening a new database connection for every request, we introduced proper pooling in our backend services.

Example (Go backend):

```go
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(10)
db.SetConnMaxLifetime(5 * time.Minute)
```

Connection spikes dropped sharply after this fix. Go’s `database/sql` package already includes built-in connection pooling, but it must be explicitly configured to behave well in production.
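One way to reason about the numbers above: the per-service `SetMaxOpenConns` cap should follow from the server-side `max_connections`, the number of service instances sharing the database, and some headroom reserved for admin logins and monitoring. A rough sketch; the `maxOpenPerInstance` helper, the instance count, and the headroom value are our own illustration, not part of MySQL or `database/sql`:

```go
package main

import "fmt"

// maxOpenPerInstance suggests a per-instance cap on open MySQL
// connections so that all instances together stay under the server's
// max_connections, while reserving some slots for admin access and
// monitoring tools.
func maxOpenPerInstance(maxConnections, instances, adminHeadroom int) int {
	usable := maxConnections - adminHeadroom
	if instances <= 0 || usable <= 0 {
		return 0
	}
	return usable / instances
}

func main() {
	// 300 server-side slots, 10 backend instances (assumed for this
	// example), 50 slots kept free for humans and tooling.
	fmt.Println(maxOpenPerInstance(300, 10, 50)) // 25
}
```

Budgeting this way guarantees that even a full-throttle traffic burst cannot lock operators out of the database again.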
The official [Go `database/sql` documentation](https://pkg.go.dev/database/sql) explains how connection limits, idle connections, and connection lifetimes are managed, and is worth reading to properly configure connection pools and avoid overwhelming MySQL in production.

---

## 🧠 Lessons We Learned

<img src={require('./img/blogmysql5.png').default} alt="MySQL connection pooling solution" width="700" height="450"/>

Incidents like these are stressful, but they also provide a lot of insight into system design. Here are the major takeaways:

- Database outages are usually caused by application behavior rather than infrastructure failure.
- Connection pooling should always be used in production systems.
- Monitoring active DB connections is crucial.
- Most of MySQL’s default settings are not sufficient for production use.
- Idle connections can be just as harmful as active queries.

The most valuable lesson was the importance of calm, thorough debugging rather than jumping to conclusions and piling up assumptions.

---

## Final Thoughts

Getting notified that your database is down is never good news. However, each incident is an opportunity to improve the resilience and architecture of your system.

After fixing the connection handling and tuning the MySQL configuration, our system became much more stable, and DB connection count is now one of our first-class production metrics.

If your team operates backend services that talk to a database, don’t wait for a crash to validate your connection handling. Your future self will thank you.
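Monitoring active DB connections, as recommended above, can start as simply as comparing the server's `Threads_connected` status variable (or `db.Stats().OpenConnections` from Go's `database/sql`) against `max_connections`. A minimal illustrative sketch; the `connectionPressure` helper and the 0.8 alert threshold are assumed example values, not a standard API:

```go
package main

import "fmt"

// connectionPressure returns the fraction of max_connections currently
// in use and whether that fraction crosses an alerting threshold. In
// production, `current` would come from the Threads_connected status
// variable or from db.Stats().OpenConnections.
func connectionPressure(current, max int, threshold float64) (float64, bool) {
	if max <= 0 {
		return 0, false
	}
	ratio := float64(current) / float64(max)
	return ratio, ratio >= threshold
}

func main() {
	// 135 of the default 151 slots in use: well past an 80% threshold.
	ratio, alert := connectionPressure(135, 151, 0.8)
	fmt.Printf("usage: %.2f alert: %v\n", ratio, alert)
}
```

Paging on this metric at 80% usage would have given us a window to act before error 1040 ever appeared.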
When your team spans multiple groups or organizations, it’s important to have shared practices for production reliability. We cover this more in our post on [managing multiple organizations](https://nife.io/solutions/managing_multiple_organizations).