Database Monitoring & Debugging
Purpose: Monitor PostgreSQL health, analyze logs, debug queries, troubleshoot issues
Last Updated: 2025-11-26
Overview
Comprehensive observability for PostgreSQL using Grafana, Loki, and Promtail. Enables real-time error detection, performance analysis, and debugging.
Architecture
PostgreSQL → Log Files                → Promtail  → Loki    → Grafana
             (/var/log/postgresql/)     (collect)   (store)   (visualize)
Components:
- PostgreSQL: Generates detailed logs
- Promtail: Collects and ships logs
- Loki: Indexes and stores logs
- Grafana: Visualizes logs and metrics
Quick Start
Start Monitoring Stack
# Start all monitoring services
docker compose up -d grafana loki promtail
# Access Grafana
open http://localhost:3101
# Auto-login enabled (no credentials needed)
View Logs
In Grafana:
- Navigate to Explore
- Select Loki datasource
- Query:
{job="postgresql"}
Common queries:
# All PostgreSQL logs
{job="postgresql"}
# Errors only
{job="postgresql", level="ERROR"}
# Specific user queries
{job="postgresql", user="savvi"}
# Slow queries (>1s)
{job="postgresql"} |= "duration:" |~ "duration: [1-9][0-9]{3,}"
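The last filter keys on a four-or-more digit millisecond value (first digit non-zero, so ≥1000 ms). A standalone Python sketch of the same regex, handy for testing the pattern against sample log lines outside Loki:

```python
import re

# Mirrors the LogQL filter |~ "duration: [1-9][0-9]{3,}":
# a millisecond value of four or more digits not starting with 0,
# i.e. any statement that took 1000 ms or longer.
SLOW = re.compile(r"duration: [1-9][0-9]{3,}")

def is_slow(line: str) -> bool:
    """Return True if the log line reports a duration of >= 1s."""
    return bool(SLOW.search(line))

is_slow("LOG: duration: 2534.123 ms statement: SELECT 1")  # True
is_slow("LOG: duration: 853.002 ms statement: SELECT 1")   # False
```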
PostgreSQL Log Configuration
Log Settings
Configured in docker-compose.yml:
postgres:
  command:
    - "-c"
    - "logging_collector=on"
    - "-c"
    - "log_directory=/var/log/postgresql"
    - "-c"
    - "log_statement=mod"                  # Log DDL/modifications
    - "-c"
    - "log_min_duration_statement=1000"    # Log slow queries (>1s)
    - "-c"
    - "log_error_verbosity=verbose"
Log Format
timestamp [pid] user@database [tx_id] function: LEVEL: code: message
Example:
2025-01-07 10:30:45.123 EST [1234] savvi@savvi_studio [5678]
auth.login: ERROR: 23505: duplicate key value violates unique constraint
Log Labels
Promtail extracts these labels for filtering:
| Label | Description | Example |
|---|---|---|
| level | Log level | ERROR, WARNING, INFO |
| user | Database user | savvi, postgres |
| database | Database name | savvi_studio |
| function | PostgreSQL function | auth.login |
| error_code | PostgreSQL error code | 23505 |
| pid | Process ID | 1234 |
| transaction_id | Transaction ID | 5678 |
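The label extraction above can be sketched in Python with a single regex over the documented log format. This is an illustrative parser, not the actual Promtail pipeline configuration (which uses regex/label stages with possibly different expressions):

```python
import re

# Illustrative parser for the documented log format:
#   timestamp [pid] user@database [tx_id] function: LEVEL: code: message
# The group names match the Promtail labels listed in the table above.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ \S+) "
    r"\[(?P<pid>\d+)\] "
    r"(?P<user>\w+)@(?P<database>\w+) "
    r"\[(?P<transaction_id>\d+)\] "
    r"(?P<function>[\w.]+): "
    r"(?P<level>[A-Z]+): "
    r"(?P<error_code>\w+): "
    r"(?P<message>.*)$"
)

def parse_line(line):
    """Return a dict of labels for a matching line, or None."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None

labels = parse_line(
    "2025-01-07 10:30:45.123 EST [1234] savvi@savvi_studio [5678] "
    "auth.login: ERROR: 23505: duplicate key value violates unique constraint"
)
```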
Error Analysis
Common Error Patterns
Unique Constraint Violations (23505)
{job="postgresql", error_code="23505"}
Example log:
ERROR: 23505: duplicate key value violates unique constraint "users_email_key"
DETAIL: Key (email)=(user@example.com) already exists.
Analysis:
- Concurrent insert attempts
- Application needs retry logic
- Consider UPSERT patterns (INSERT ... ON CONFLICT)
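An UPSERT turns the duplicate-key race into a single atomic statement. A minimal sketch, assuming a hypothetical `users` table with a unique constraint on `email` (adapt names to your schema):

```sql
-- Insert, or update the existing row if the email already exists.
-- EXCLUDED refers to the row that was proposed for insertion.
INSERT INTO users (email, display_name)
VALUES ('user@example.com', 'User')
ON CONFLICT (email)
DO UPDATE SET display_name = EXCLUDED.display_name;
```

Use `DO NOTHING` instead of `DO UPDATE` when a silent skip is acceptable.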
Foreign Key Violations (23503)
{job="postgresql", error_code="23503"}
Indicates:
- Reference to non-existent record
- Data integrity issue
- Check cascade rules
Deadlocks (40P01)
{job="postgresql", error_code="40P01"}
Analysis:
- Transaction ordering issue
- Review locking strategy
- Use SELECT FOR UPDATE with consistent order
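Consistent lock ordering can be sketched as follows, assuming a hypothetical `accounts` table; the key idea is that every transaction acquires row locks in the same (here, `id`) order, so two transactions can never hold locks in opposite orders:

```sql
-- Lock both rows in deterministic id order before modifying either.
-- Concurrent transactions doing the same will queue rather than deadlock.
BEGIN;
SELECT * FROM accounts
WHERE id IN (7, 42)
ORDER BY id
FOR UPDATE;
-- ... perform updates on both rows ...
COMMIT;
```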
Connection Limits
{job="postgresql"} |= "too many connections"
Actions:
- Increase max_connections
- Check for connection leaks
- Review pool configuration
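Current usage against the configured limit can be checked directly from `pg_stat_activity`:

```sql
-- Connections in use vs. the configured maximum
SELECT count(*)                                   AS used,
       current_setting('max_connections')::int    AS max_conn,
       round(count(*) * 100.0
             / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```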
Error Trends
Daily error summary:
sum by (error_code) (
count_over_time({job="postgresql", level="ERROR"}[24h])
)
Error rate spike detection:
rate({job="postgresql", level="ERROR"}[5m]) > 0.1
Top 10 errors:
topk(10, sum by (error_code) (
count_over_time({job="postgresql", level="ERROR"}[1h])
))
Performance Monitoring
Slow Query Detection
Find slow queries:
{job="postgresql"} |= "duration:"
Filter by duration (>5s):
{job="postgresql"} |= "duration:" |~ "duration: ([5-9][0-9]{3}|[1-9][0-9]{4,})"
Example log:
LOG: duration: 2534.123 ms statement:
SELECT * FROM nodes WHERE data @> '{"tag": "expensive"}'
Query Analysis Process
- Identify slow query in logs
- Explain with EXPLAIN ANALYZE
- Check indexes on filter columns
- Optimize query or add indexes
- Verify improvement in logs
Connection Monitoring
Active connections by state:
SELECT count(*), state, application_name
FROM pg_stat_activity
WHERE datname = 'savvi_studio'
GROUP BY state, application_name
ORDER BY count DESC;
Long-running queries:
SELECT
pid,
usename,
application_name,
state,
NOW() - query_start AS duration,
LEFT(query, 100) as query
FROM pg_stat_activity
WHERE datname = 'savvi_studio'
AND state = 'active'
AND NOW() - query_start > INTERVAL '10 seconds'
ORDER BY duration DESC;
Table Statistics
Table sizes and health:
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
n_live_tup as live_rows,
n_dead_tup as dead_rows,
round(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) as dead_pct
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
Unused indexes:
SELECT
schemaname,
tablename,
indexname,
idx_scan as scans,
pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(indexrelid) DESC;
Query Optimization
EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT * FROM graph.nodes
WHERE node_type = 'document'
AND data->>'status' = 'active'
ORDER BY created_at DESC
LIMIT 100;
Key metrics:
- cost: Estimated query cost
- actual time: Real execution time
- rows: Rows returned vs estimated
- Buffers hit: Data from cache
- Buffers read: Data from disk
Optimization checklist:
- Appropriate indexes exist
- Index is being used (not Seq Scan)
- Filter selectivity is good
- Join order is optimal
- No unnecessary sorts
- LIMIT applied early
- Statistics are up to date
Update Statistics
-- Analyze specific table
ANALYZE graph.nodes;
-- Analyze all tables
ANALYZE;
-- Vacuum and analyze
VACUUM ANALYZE;
Grafana Dashboards
PostgreSQL Error Logs Dashboard
Pre-configured dashboard shows:
- Error count over time
- Error distribution by level
- Recent error messages
- Error details with context
- Function/location tracking
Access: Grafana → Dashboards → PostgreSQL Error Logs
Creating Custom Dashboards
- Add panel → Logs or Time series
- Set datasource → Loki
- Write query using LogQL
- Configure visualization
- Save dashboard
Example panel queries:
# Error rate by level
sum by (level) (rate({job="postgresql"}[5m]))
# Connection count estimate
count_over_time({job="postgresql"} |= "connection received"[1m])
# Query duration percentile (p99, ms)
quantile_over_time(0.99,
  {job="postgresql"} |= "duration:"
    | regexp `duration: (?P<duration>\d+)`
    | unwrap duration [5m]
)
Debug Utilities
Built-in Functions
Located in db/998_debug_utils.sql:
-- Connection state
SELECT * FROM util.get_connection_state();
-- Session info
SELECT * FROM util.get_session_info();
-- Table sizes
SELECT * FROM util.get_table_sizes('graph');
-- Unused indexes
SELECT * FROM util.get_unused_indexes();
Enable Verbose Logging
-- For current session
SET log_statement = 'all';
SET log_min_duration_statement = 0;
SET client_min_messages = debug;
-- Your queries here
-- Reset
RESET log_statement;
RESET log_min_duration_statement;
RESET client_min_messages;
Transaction Debugging
BEGIN;
SET LOCAL log_error_verbosity = verbose;
SET LOCAL client_min_messages = debug;
-- Your queries
INSERT INTO nodes ...;
COMMIT;
Common Issues
High Error Rate
Symptoms: Grafana shows error spike
Debug:
- Check error dashboard
- Identify error pattern
- View error details in logs
- Analyze root cause
- Fix application or schema
Slow Queries
Symptoms: High latency, timeouts
Debug:
- Find slow queries in logs
- Run EXPLAIN ANALYZE
- Check for missing indexes
- Add appropriate indexes
- Verify improvement
Connection Pool Exhaustion
Symptoms: "Connection timeout" errors
Debug:
- Check active connections
- Find long-running queries
- Terminate problem queries if needed
- Fix application connection handling
- Increase pool size if necessary
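Terminating problem queries can be done from SQL; prefer cancelling the statement before killing the whole backend. The 5-minute threshold below is just an example:

```sql
-- Gentler: cancel only the running statement
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND NOW() - query_start > INTERVAL '5 minutes';

-- Harder: terminate the backend entirely (never your own session)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND NOW() - query_start > INTERVAL '5 minutes'
  AND pid <> pg_backend_pid();
```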
Disk Space Issues
Symptoms: "No space left on device"
Debug:
- Check database size
- Find large tables
- Check for bloat (dead tuples)
- Run VACUUM FULL ANALYZE
- Consider archiving old data
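Total database and WAL size can be checked quickly before digging into per-table bloat:

```sql
-- Overall database size
SELECT pg_size_pretty(pg_database_size('savvi_studio'));

-- WAL directory size (PostgreSQL 10+); a growing WAL can also fill the disk
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
```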
Best Practices
Regular Monitoring
- Check Grafana daily for error trends
- Review slow queries weekly
- Analyze table growth monthly
- Update statistics after bulk operations
Proactive Maintenance
-- Weekly maintenance (run as separate statements: VACUUM and
-- REINDEX cannot run inside a transaction or DO block)
ANALYZE;                        -- Update planner statistics
VACUUM;                         -- Clean dead tuples
-- Reindex only if index bloat is suspected
REINDEX DATABASE savvi_studio;
Alert Configuration
Set up alerts in Grafana for:
- Error rate > threshold
- Slow query count increase
- Connection pool > 80% capacity
- Disk usage > 80%
- Replication lag (if applicable)
Log Retention
Configured in Loki:
limits_config:
retention_period: 744h # 31 days
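Note that in recent Loki versions `retention_period` is only enforced when the compactor has retention enabled. A sketch (the path and delay values are illustrative, and some deployments require additional compactor settings):

```yaml
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  retention_delete_delay: 2h
```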
Adjust based on:
- Compliance requirements
- Storage capacity
- Query patterns
Query Optimization Guidelines
- Use indexes for common filters
- Limit result sets with LIMIT
- Use cursor pagination for large sets
- Batch operations where possible
- Cache frequently accessed data
- Avoid N+1 queries
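Cursor (keyset) pagination can be sketched as below, assuming `graph.nodes` has an `id` column and an index on `(created_at DESC, id DESC)`; `:last_created_at` and `:last_id` are placeholder parameters taken from the final row of the previous page:

```sql
-- Keyset pagination: fetch the page after the last row already seen.
-- Unlike OFFSET, cost does not grow with page depth.
SELECT id, created_at
FROM graph.nodes
WHERE (created_at, id) < (:last_created_at, :last_id)
ORDER BY created_at DESC, id DESC
LIMIT 100;
```

The `id` tiebreaker keeps the ordering total, so rows sharing a `created_at` are never skipped or repeated across pages.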
Related Documentation
- Local Development - Start monitoring stack
- Best Practices - Query optimization
- Testing - Test performance
Quick Reference
| Task | Query/Command |
|---|---|
| All errors | `{job="postgresql", level="ERROR"}` |
| Slow queries | `{job="postgresql"} \|= "duration:"` |
| By user | `{job="postgresql", user="savvi"}` |
| By function | `{job="postgresql", function="auth.login"}` |
| Error code | `{job="postgresql", error_code="23505"}` |
| Active connections | See Connection Monitoring SQL |
| Table sizes | See Table Statistics SQL |
For setup: See Local Development to start monitoring services
For queries: Use Grafana Explore with LogQL queries above
For debugging: Use built-in util functions from 998_debug_utils.sql