When Your Server Is Struggling
A server pegged at 100% CPU makes everything suffer — slow response times, timeouts, dropped requests, and in severe cases, complete unresponsiveness. Before you reach for the "resize server" button, follow this diagnostic process. In many cases, high CPU usage is caused by a misbehaving process, a misconfiguration, or unexpected traffic — all of which can be fixed without spending more money.
Step 1: Confirm the Problem with top or htop
Start with a real-time view of CPU consumption:
top
# or for a better interface:
htop
Look at the %CPU column. Identify the process(es) consuming the most CPU. Note the PID (Process ID) and the process name. Press P in top to sort by CPU usage.
Also check the load average in the top-right corner; the three numbers are the 1-, 5-, and 15-minute averages. A load average consistently above your number of CPU cores (run nproc to check) indicates sustained overload.
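To make the load-versus-cores comparison concrete, here is a minimal sketch that divides the 1-minute load average by the core count. The function name load_per_core is hypothetical, and the live part assumes a Linux host with /proc/loadavg and nproc available.

```shell
# Hypothetical helper: print a load average divided by the core count.
# Usage: load_per_core LOAD CORES
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live check (assumes Linux: /proc/loadavg and nproc must exist).
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min rest < /proc/loadavg
    echo "1-minute load per core: $(load_per_core "$one_min" "$(nproc)")"
fi
```

A result that stays above 1.00 suggests more work is queued than the cores can run. On Linux the load average also counts tasks waiting on I/O, so treat the number as a prompt to look closer rather than as proof of CPU saturation.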
Step 2: Identify the Culprit Process
Once you have a PID, get more detail:
ps aux --sort=-%cpu | head -20 # Top CPU-consuming processes
ps -p [PID] -o pid,ppid,cmd,%cpu,%mem # Details for specific PID
Common culprits include:
- PHP-FPM workers — usually caused by poorly optimized application code or too many concurrent requests
- MySQL/MariaDB — slow queries, missing indexes, or too many connections
- Web crawlers and bots — excessive scraping can spike CPU on dynamic sites
- Runaway cron jobs — a failed or looping scheduled task
- Malware or crypto miners — look for unknown processes with generic names
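Several of these culprits, PHP-FPM in particular, spread their load across many small processes, so no single PID stands out in top. One way to see the combined cost is to sum %CPU per command name. The sketch below assumes the procps version of ps; cpu_by_command is a hypothetical helper name.

```shell
# Hypothetical helper: sum %CPU per command name.
# Reads lines of "PCPU COMMAND" on stdin and prints "TOTAL COMMAND",
# highest total first.
cpu_by_command() {
    awk '{ cpu = $1; $1 = ""; sub(/^ +/, ""); total[$0] += cpu }
         END { for (name in total) printf "%.1f %s\n", total[name], name }' |
        sort -rn
}

# Live view: the ten command names using the most CPU in total.
ps -eo pcpu= -o comm= | cpu_by_command | head -10
```

Note that ps reports CPU usage averaged over each process's lifetime, so the totals will not match the instantaneous figures in top exactly.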
Step 3: Check for Runaway Processes
Look for processes in Z (zombie) or D (uninterruptible sleep) state:
ps aux | awk '$8 ~ /^[DZ]/'
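A process in D state is typically waiting on I/O, which points to a storage bottleneck rather than pure CPU load; on Linux such processes still count toward the load average. Zombie (Z) processes consume no CPU themselves, but a growing number of them means a parent process is not reaping its children.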
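A quick tally of process states can show whether blocked processes are a one-off or a pattern. This is a small sketch, assuming procps ps; count_states is a hypothetical helper that groups by the first letter of the STAT column.

```shell
# Hypothetical helper: count processes per state letter.
# Reads one STAT value per line on stdin (for example "Ss", "D", "R+", "Z").
count_states() {
    awk '{ n[substr($1, 1, 1)]++ }
         END { for (s in n) printf "%s %d\n", s, n[s] }' | sort
}

# Live snapshot of all processes on the system.
ps -eo stat= | count_states
```

A handful of D entries that change between runs is normal; the same PIDs staying in D across several snapshots is the pattern worth investigating.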
Step 4: Analyze Web Server Logs for Traffic Spikes
If the culprit is your web server, check access logs for unusual patterns:
# Nginx: count requests per IP in the last 1000 lines
tail -1000 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
# Apache equivalent (Debian/Ubuntu path; RHEL-family systems use /var/log/httpd/access_log)
tail -1000 /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
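If a single IP is hammering your server, block it at the firewall. ufw evaluates rules in order and stops at the first match, so insert the deny rule ahead of any existing allow rules for ports 80 and 443:
sudo ufw insert 1 deny from 203.0.113.45 to any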
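Counting by client IP finds a single abusive source, but a spike can also come from many clients hitting one expensive URL. The sketch below generalizes the pipeline above to count by any field. It assumes the default "combined" log format, in which field 1 is the client IP and field 7 is the request path; count_by_field is a hypothetical helper name.

```shell
# Hypothetical helper: count log lines per value of one whitespace-separated field.
# Usage: count_by_field FIELD_NUMBER < access.log
count_by_field() {
    awk -v f="$1" '{ n[$f]++ }
                   END { for (k in n) printf "%d %s\n", n[k], k }' |
        sort -rn
}

# Assumption: default "combined" format (field 1 = client IP, field 7 = path).
log=/var/log/nginx/access.log
if [ -r "$log" ]; then
    tail -1000 "$log" | count_by_field 7 | head -20   # busiest request paths
fi
```

If one path dominates the output, caching or rate-limiting that endpoint may help more than blocking individual addresses.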
Step 5: Investigate Database Queries
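If MySQL is the problem, enable the slow query log to find expensive queries. Settings made with SET GLOBAL are lost on restart, and the new long_query_time applies only to connections opened after the change:
# In MySQL:
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
SET GLOBAL long_query_time = 1; # Log queries taking over 1 second
SET GLOBAL slow_query_log = 'ON';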
Then analyze with mysqldumpslow /var/log/mysql/slow.log or pt-query-digest from Percona Toolkit.
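Before digging into individual queries, it can help to see how much has been logged at all. The following sketch reads the "# Query_time:" header line that MySQL writes for each slow-log entry and reports the entry count, total time, and worst case. The helper name slow_log_summary is hypothetical, and the log path is the one configured above.

```shell
# Hypothetical helper: summarize "# Query_time:" headers from a slow query log.
# Reads the log on stdin; prints entry count, total seconds, and the slowest entry.
slow_log_summary() {
    awk '/^# Query_time:/ { n++; t = $3 + 0; total += t; if (t > max) max = t }
         END { printf "entries=%d total=%.2f max=%.2f\n", n, total, max }'
}

log=/var/log/mysql/slow.log
if [ -r "$log" ]; then
    slow_log_summary < "$log"
    # Top 10 query patterns sorted by total query time, if the tool is installed.
    if command -v mysqldumpslow >/dev/null 2>&1; then
        mysqldumpslow -s t -t 10 "$log"
    fi
fi
```

A large total with a small max usually means one moderately slow query runs very often, which is a different problem from a single multi-second report query.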
Step 6: Check for Malware
Unexpected high CPU can indicate a compromised server running a crypto miner. Miners often disguise themselves under the names of legitimate processes such as kworker or sshd, or use random alphanumeric strings, so the name alone proves nothing. Verify with:
sudo ls -l /proc/[PID]/exe # Check what binary is actually running
sudo ss -tunp | grep 'pid=[PID],' # Check active network connections (netstat -tunp also works; do not add -l, which hides outbound connections)
A process making outbound connections to unknown IPs on unusual ports is a red flag. If compromise is suspected, isolate the server immediately and investigate thoroughly before bringing it back online.
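The /proc/[PID]/exe check can be turned into a rough first-pass filter. The sketch below is a heuristic under stated assumptions, not a malware scanner: classify_exe is a hypothetical helper, and the list of suspicious directories is an illustrative choice.

```shell
# Hypothetical helper: flag executable paths that deserve a closer look.
# Usage: classify_exe "$(readlink /proc/PID/exe 2>/dev/null)"
classify_exe() {
    case "$1" in
        "")
            echo "unknown: no path (kernel thread or permission denied)" ;;
        *" (deleted)")
            echo "suspicious: binary deleted from disk" ;;
        /tmp/*|/var/tmp/*|/dev/shm/*)
            echo "suspicious: runs from a world-writable directory" ;;
        *)
            echo "ok: $1" ;;
    esac
}

# Example using the current shell's own PID as a stand-in for a suspect PID.
classify_exe "$(readlink "/proc/$$/exe" 2>/dev/null)"
```

Genuine kernel threads such as kworker have no executable link, so a "kworker" that resolves to a file on disk is worth investigating. A "(deleted)" result also appears after ordinary package upgrades, so treat it as a lead, not a verdict.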
Step 7: Tune PHP-FPM or Application Workers
If PHP-FPM is overloaded, review your pool configuration. Setting pm = dynamic with appropriate pm.max_children prevents runaway worker spawning:
pm = dynamic
pm.max_children = 20
pm.start_servers = 5
pm.min_spare_servers = 2
pm.max_spare_servers = 8
The right values depend on available RAM and application memory usage per worker.
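A common rule of thumb for pm.max_children is the RAM you can spare for PHP divided by the average memory per worker. The sketch below works through that arithmetic. Both function names are hypothetical, the example figures are illustrative, and the process-name pattern php-fpm is an assumption, since some distributions add a version suffix.

```shell
# Hypothetical helper: average resident memory in MB from RSS values in KB on stdin.
avg_rss_mb() {
    awk '{ sum += $1; n++ } END { if (n) printf "%d\n", sum / n / 1024 }'
}

# Hypothetical helper: (total RAM - RAM reserved for everything else) / RAM per worker.
# Usage: estimate_max_children TOTAL_MB RESERVED_MB PER_WORKER_MB
estimate_max_children() {
    echo $(( ($1 - $2) / $3 ))
}

# Measure running workers (assumes command names starting with "php-fpm").
ps -eo rss= -o comm= | awk '$2 ~ /^php-fpm/ { print $1 }' | avg_rss_mb

# Illustrative figures: 4096 MB total, 1024 MB reserved, 60 MB per worker.
estimate_max_children 4096 1024 60   # prints 51
```

RSS counts shared memory once per process, so the per-worker figure tends to overestimate real usage; that errs on the safe side for this calculation.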
When to Scale vs. When to Optimize
Scaling (adding CPU/RAM) is appropriate when your server is handling its expected load efficiently, but that load has grown beyond its capacity. Optimization is appropriate, and should always come first, when processes are inefficient, misconfigured, or misbehaving. Throwing hardware at a slow query problem just makes the slow query run slightly less slowly on more expensive hardware.
Quick Reference Checklist
- Check top/htop to identify the process
- Review logs for traffic spikes or error floods
- Look for runaway or zombie processes
- Investigate slow database queries
- Check for malware or unauthorized processes
- Tune application worker settings
- Rate-limit or block abusive IPs
- Scale only after optimization is exhausted