Database Monitoring & Debugging
Purpose: Monitor PostgreSQL health, analyze logs, debug queries, troubleshoot issues
Last Updated: 2025-11-26
Overview
Comprehensive observability for PostgreSQL using Grafana, Loki, and Promtail. Enables real-time error detection, performance analysis, and debugging.
Architecture
PostgreSQL → Log Files                → Promtail  → Loki    → Grafana
             (/var/log/postgresql/)     (collect)   (store)   (visualize)
Components:
- PostgreSQL: Generates detailed logs
- Promtail: Collects and ships logs
- Loki: Indexes and stores logs
- Grafana: Visualizes logs and metrics
Quick Start
Start Monitoring Stack
# Start all monitoring services
docker compose up -d grafana loki promtail
# Access Grafana
open http://localhost:3101
# Auto-login enabled (no credentials needed)
View Logs
In Grafana:
- Navigate to Explore
- Select Loki datasource
- Query:
{job="postgresql"}
Common queries:
# All PostgreSQL logs
{job="postgresql"}
# Errors only
{job="postgresql", level="ERROR"}
# Specific user queries
{job="postgresql", user="savvi"}
# Slow queries (>1s)
{job="postgresql"} |= "duration:" |~ "duration: [1-9][0-9]{3,}"
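The last filter keys on a four-or-more digit millisecond value (first digit non-zero, so ≥1000 ms). A standalone Python sketch of the same regex, handy for testing the pattern against sample log lines outside Loki:

```python
import re

# Mirrors the LogQL filter |~ "duration: [1-9][0-9]{3,}":
# a millisecond value of four or more digits not starting with 0,
# i.e. any statement that took 1000 ms or longer.
SLOW = re.compile(r"duration: [1-9][0-9]{3,}")

def is_slow(line: str) -> bool:
    """Return True if the log line reports a duration of >= 1s."""
    return bool(SLOW.search(line))

is_slow("LOG: duration: 2534.123 ms statement: SELECT 1")  # True
is_slow("LOG: duration: 853.002 ms statement: SELECT 1")   # False
```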
PostgreSQL Log Configuration
Log Settings
Configured in docker-compose.yml:
postgres:
  command:
    - "-c"
    - "logging_collector=on"
    - "-c"
    - "log_directory=/var/log/postgresql"
    - "-c"
    - "log_statement=mod"                  # Log DDL/modifications
    - "-c"
    - "log_min_duration_statement=1000"    # Log slow queries (>1s)
    - "-c"
    - "log_error_verbosity=verbose"
Log Format
timestamp [pid] user@database [tx_id] function: LEVEL: code: message
Example:
2025-01-07 10:30:45.123 EST [1234] savvi@savvi_studio [5678]
auth.login: ERROR: 23505: duplicate key value violates unique constraint
Log Labels
Promtail extracts these labels for filtering:
| Label | Description | Example |
|---|---|---|
| level | Log level | ERROR, WARNING, INFO |
| user | Database user | savvi, postgres |
| database | Database name | savvi_studio |
| function | PostgreSQL function | auth.login |
| error_code | PostgreSQL error code | 23505 |
| pid | Process ID | 1234 |
| transaction_id | Transaction ID | 5678 |
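The label extraction above can be sketched in Python with a single regex over the documented log format. This is an illustrative parser, not the actual Promtail pipeline configuration (which uses regex/label stages with possibly different expressions):

```python
import re

# Illustrative parser for the documented log format:
#   timestamp [pid] user@database [tx_id] function: LEVEL: code: message
# The group names match the Promtail labels listed in the table above.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ \S+) "
    r"\[(?P<pid>\d+)\] "
    r"(?P<user>\w+)@(?P<database>\w+) "
    r"\[(?P<transaction_id>\d+)\] "
    r"(?P<function>[\w.]+): "
    r"(?P<level>[A-Z]+): "
    r"(?P<error_code>\w+): "
    r"(?P<message>.*)$"
)

def parse_line(line):
    """Return a dict of labels for a matching line, or None."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None

labels = parse_line(
    "2025-01-07 10:30:45.123 EST [1234] savvi@savvi_studio [5678] "
    "auth.login: ERROR: 23505: duplicate key value violates unique constraint"
)
```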
Error Analysis
Common Error Patterns
Unique Constraint Violations (23505)
{job="postgresql", error_code="23505"}
Example log:
ERROR: 23505: duplicate key value violates unique constraint "users_email_key"
DETAIL: Key (email)=(user@example.com) already exists.
Analysis:
- Concurrent insert attempts
- Application needs retry logic
- Consider UPSERT patterns (INSERT ... ON CONFLICT)
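An UPSERT turns the duplicate-key race into a single atomic statement. A minimal sketch, assuming a hypothetical `users` table with a unique constraint on `email` (adapt names to your schema):

```sql
-- Insert, or update the existing row if the email already exists.
-- EXCLUDED refers to the row that was proposed for insertion.
INSERT INTO users (email, display_name)
VALUES ('user@example.com', 'User')
ON CONFLICT (email)
DO UPDATE SET display_name = EXCLUDED.display_name;
```

Use `DO NOTHING` instead of `DO UPDATE` when a silent skip is acceptable.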
Foreign Key Violations (23503)
{job="postgresql", error_code="23503"}
Indicates:
- Reference to non-existent record
- Data integrity issue
- Check cascade rules
Deadlocks (40P01)
{job="postgresql", error_code="40P01"}
Analysis:
- Transaction ordering issue
- Review locking strategy
- Use SELECT FOR UPDATE with consistent order
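Consistent lock ordering can be sketched as follows, assuming a hypothetical `accounts` table; the key idea is that every transaction acquires row locks in the same (here, `id`) order, so two transactions can never hold locks in opposite orders:

```sql
-- Lock both rows in deterministic id order before modifying either.
-- Concurrent transactions doing the same will queue rather than deadlock.
BEGIN;
SELECT * FROM accounts
WHERE id IN (7, 42)
ORDER BY id
FOR UPDATE;
-- ... perform updates on both rows ...
COMMIT;
```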
Connection Limits
{job="postgresql"} |= "too many connections"
Actions:
- Increase max_connections
- Check for connection leaks
- Review pool configuration
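Current usage against the configured limit can be checked directly from `pg_stat_activity`:

```sql
-- Connections in use vs. the configured maximum
SELECT count(*)                                   AS used,
       current_setting('max_connections')::int    AS max_conn,
       round(count(*) * 100.0
             / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```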
Error Trends
Daily error summary:
sum by (error_code) (
count_over_time({job="postgresql", level="ERROR"}[24h])
)
Error rate spike detection:
rate({job="postgresql", level="ERROR"}[5m]) > 0.1
Top 10 errors:
topk(10, sum by (error_code) (
count_over_time({job="postgresql", level="ERROR"}[1h])
))
Performance Monitoring
Slow Query Detection
Find slow queries:
{job="postgresql"} |= "duration:"
Filter by duration (>5s):
{job="postgresql"} |= "duration:" |~ "duration: ([5-9][0-9]{3}|[1-9][0-9]{4,})"
Example log:
LOG: duration: 2534.123 ms statement:
SELECT * FROM nodes WHERE data @> '{"tag": "expensive"}'
Query Analysis Process
- Identify slow query in logs
- Explain with EXPLAIN ANALYZE
- Check indexes on filter columns
- Optimize query or add indexes
- Verify improvement in logs
Connection Monitoring
Active connections by state:
SELECT count(*), state, application_name
FROM pg_stat_activity
WHERE datname = 'savvi_studio'
GROUP BY state, application_name
ORDER BY count DESC;
Long-running queries:
SELECT
pid,
usename,
application_name,
state,
NOW() - query_start AS duration,
LEFT(query, 100) as query
FROM pg_stat_activity
WHERE datname = 'savvi_studio'
AND state = 'active'
AND NOW() - query_start > INTERVAL '10 seconds'
ORDER BY duration DESC;
Table Statistics
Table sizes and health:
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
n_live_tup as live_rows,
n_dead_tup as dead_rows,
round(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) as dead_pct
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
Unused indexes:
SELECT
schemaname,
tablename,
indexname,
idx_scan as scans,
pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(indexrelid) DESC;
Query Optimization
EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT * FROM graph.nodes
WHERE node_type = 'document'
AND data->>'status' = 'active'
ORDER BY created_at DESC
LIMIT 100;
Key metrics:
- cost: Estimated query cost
- actual time: Real execution time
- rows: Rows returned vs estimated
- Buffers hit: Data from cache
- Buffers read: Data from disk
Optimization checklist:
- Appropriate indexes exist
- Index is being used (not Seq Scan)
- Filter selectivity is good
- Join order is optimal
- No unnecessary sorts
- LIMIT applied early
- Statistics are up to date
Update Statistics
-- Analyze specific table
ANALYZE graph.nodes;
-- Analyze all tables
ANALYZE;
-- Vacuum and analyze
VACUUM ANALYZE;
Grafana Dashboards
PostgreSQL Error Logs Dashboard
Pre-configured dashboard shows:
- Error count over time
- Error distribution by level
- Recent error messages
- Error details with context
- Function/location tracking
Access: Grafana → Dashboards → PostgreSQL Error Logs
Creating Custom Dashboards
- Add panel → Logs or Time series
- Set datasource → Loki
- Write query using LogQL
- Configure visualization
- Save dashboard
Example panel queries:
# Error rate by level
sum by (level) (rate({job="postgresql"}[5m]))
# Connection count estimate
count_over_time({job="postgresql"} |= "connection received"[1m])
# Query duration percentile (p99, ms)
quantile_over_time(0.99,
  {job="postgresql"} |= "duration:"
    | regexp `duration: (?P<duration>\d+)`
    | unwrap duration [5m]
)
Debug Utilities
Built-in Functions
Located in db/998_debug_utils.sql:
-- Connection state
SELECT * FROM util.get_connection_state();
-- Session info
SELECT * FROM util.get_session_info();
-- Table sizes
SELECT * FROM util.get_table_sizes('graph');
-- Unused indexes
SELECT * FROM util.get_unused_indexes();
Enable Verbose Logging
-- For current session
SET log_statement = 'all';
SET log_min_duration_statement = 0;
SET client_min_messages = debug;
-- Your queries here
-- Reset
RESET log_statement;
RESET log_min_duration_statement;
RESET client_min_messages;
Transaction Debugging
BEGIN;
SET LOCAL log_error_verbosity = verbose;
SET LOCAL client_min_messages = debug;
-- Your queries
INSERT INTO nodes ...;
COMMIT;
Common Issues
High Error Rate
Symptoms: Grafana shows error spike
Debug:
- Check error dashboard
- Identify error pattern
- View error details in logs
- Analyze root cause
- Fix application or schema
Slow Queries
Symptoms: High latency, timeouts
Debug:
- Find slow queries in logs
- Run EXPLAIN ANALYZE
- Check for missing indexes
- Add appropriate indexes
- Verify improvement
Connection Pool Exhaustion
Symptoms: "Connection timeout" errors
Debug:
- Check active connections
- Find long-running queries
- Terminate problem queries if needed
- Fix application connection handling
- Increase pool size if necessary
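Terminating problem queries can be done from SQL; prefer cancelling the statement before killing the whole backend. The 5-minute threshold below is just an example:

```sql
-- Gentler: cancel only the running statement
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND NOW() - query_start > INTERVAL '5 minutes';

-- Harder: terminate the backend entirely (never your own session)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND NOW() - query_start > INTERVAL '5 minutes'
  AND pid <> pg_backend_pid();
```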
Disk Space Issues
Symptoms: "No space left on device"
Debug:
- Check database size
- Find large tables
- Check for bloat (dead tuples)
- Run VACUUM FULL ANALYZE
- Consider archiving old data
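Total database and WAL size can be checked quickly before digging into per-table bloat:

```sql
-- Overall database size
SELECT pg_size_pretty(pg_database_size('savvi_studio'));

-- WAL directory size (PostgreSQL 10+); a growing WAL can also fill the disk
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
```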
Best Practices
Regular Monitoring
- Check Grafana daily for error trends
- Review slow queries weekly
- Analyze table growth monthly
- Update statistics after bulk operations
Proactive Maintenance
-- Weekly maintenance (run as separate statements: VACUUM and
-- REINDEX cannot run inside a transaction or DO block)
ANALYZE;                        -- Update planner statistics
VACUUM;                         -- Clean dead tuples
-- Reindex only if index bloat is suspected
REINDEX DATABASE savvi_studio;
Alert Configuration
Set up alerts in Grafana for:
- Error rate > threshold
- Slow query count increase
- Connection pool > 80% capacity
- Disk usage > 80%
- Replication lag (if applicable)
Log Retention
Configured in Loki:
limits_config:
retention_period: 744h # 31 days
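Note that in recent Loki versions `retention_period` is only enforced when the compactor has retention enabled. A sketch (the path and delay values are illustrative, and some deployments require additional compactor settings):

```yaml
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  retention_delete_delay: 2h
```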
Adjust based on:
- Compliance requirements
- Storage capacity
- Query patterns
Query Optimization Guidelines
- Use indexes for common filters
- Limit result sets with LIMIT
- Use cursor pagination for large sets
- Batch operations where possible
- Cache frequently accessed data
- Avoid N+1 queries
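Cursor (keyset) pagination can be sketched as below, assuming `graph.nodes` has an `id` column and an index on `(created_at DESC, id DESC)`; `:last_created_at` and `:last_id` are placeholder parameters taken from the final row of the previous page:

```sql
-- Keyset pagination: fetch the page after the last row already seen.
-- Unlike OFFSET, cost does not grow with page depth.
SELECT id, created_at
FROM graph.nodes
WHERE (created_at, id) < (:last_created_at, :last_id)
ORDER BY created_at DESC, id DESC
LIMIT 100;
```

The `id` tiebreaker keeps the ordering total, so rows sharing a `created_at` are never skipped or repeated across pages.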
Related Documentation
- Local Development - Start monitoring stack
- Best Practices - Query optimization
- Testing - Test performance
Quick Reference
| Task | Query/Command |
|---|---|
| All errors | `{job="postgresql", level="ERROR"}` |
| Slow queries | `{job="postgresql"} \|= "duration:"` |
| By user | `{job="postgresql", user="savvi"}` |
| By function | `{job="postgresql", function="auth.login"}` |
| Error code | `{job="postgresql", error_code="23505"}` |
| Active connections | See Connection Monitoring SQL |
| Table sizes | See Table Statistics SQL |
For setup: See Local Development to start monitoring services
For queries: Use Grafana Explore with LogQL queries above
For debugging: Use built-in util functions from 998_debug_utils.sql