Savvi Studio

Database Monitoring & Debugging

Purpose: Monitor PostgreSQL health, analyze logs, debug queries, troubleshoot issues
Last Updated: 2025-11-26

Overview

Comprehensive observability for PostgreSQL using Grafana, Loki, and Promtail. Enables real-time error detection, performance analysis, and debugging.

Architecture

PostgreSQL → Log Files → Promtail → Loki → Grafana
             /var/log/    (collect) (store) (visualize)
             postgresql/

Components:

  • PostgreSQL: Generates detailed logs
  • Promtail: Collects and ships logs
  • Loki: Indexes and stores logs
  • Grafana: Visualizes logs and metrics

Quick Start

Start Monitoring Stack

# Start all monitoring services
docker compose up -d grafana loki promtail

# Access Grafana
open http://localhost:3101

# Auto-login enabled (no credentials needed)

View Logs

In Grafana:

  1. Navigate to Explore
  2. Select Loki datasource
  3. Query: {job="postgresql"}

Common queries:

# All PostgreSQL logs
{job="postgresql"}

# Errors only
{job="postgresql", level="ERROR"}

# Specific user queries
{job="postgresql", user="savvi"}

# Slow queries (>1s)
{job="postgresql"} |= "duration:" |~ "duration: [1-9][0-9]{3,}"

PostgreSQL Log Configuration

Log Settings

Configured in docker-compose.yml:

postgres:
  command:
    - "-c" "logging_collector=on"
    - "-c" "log_directory=/var/log/postgresql"
    - "-c" "log_statement=mod"                # Log DDL/modifications
    - "-c" "log_min_duration_statement=1000"  # Log slow queries (>1s)
    - "-c" "log_error_verbosity=verbose"

Log Format

timestamp [pid] user@database [tx_id] function: LEVEL: code: message

Example:

2025-01-07 10:30:45.123 EST [1234] savvi@savvi_studio [5678] 
auth.login: ERROR: 23505: duplicate key value violates unique constraint

Log Labels

Promtail extracts these labels for filtering:

Label           Description            Example
level           Log level              ERROR, WARNING, INFO
user            Database user          savvi, postgres
database        Database name          savvi_studio
function        PostgreSQL function    auth.login
error_code      PostgreSQL error code  23505
pid             Process ID             1234
transaction_id  Transaction ID         5678

Error Analysis

Common Error Patterns

Unique Constraint Violations (23505)

{job="postgresql", error_code="23505"}

Example log:

ERROR: 23505: duplicate key value violates unique constraint "users_email_key"
DETAIL: Key (email)=(user@example.com) already exists.

Analysis:

  • Concurrent insert attempts
  • Application needs retry logic
  • Consider UPSERT patterns (INSERT ... ON CONFLICT)
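
An UPSERT sidesteps the race between concurrent inserts entirely. A minimal sketch, assuming a hypothetical users table with a unique email column:

-- On conflict with the unique email key, update in place
-- instead of failing with 23505.
INSERT INTO users (email, display_name)
VALUES ('user@example.com', 'Example User')
ON CONFLICT (email)
DO UPDATE SET display_name = EXCLUDED.display_name;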

Foreign Key Violations (23503)

{job="postgresql", error_code="23503"}

Indicates:

  • Reference to non-existent record
  • Data integrity issue
  • Check cascade rules
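
The cascade rules in play can be read straight from the constraint catalog. A sketch (action codes: a = no action, r = restrict, c = cascade, n = set null, d = set default):

SELECT conname,
       conrelid::regclass  AS child_table,
       confrelid::regclass AS parent_table,
       confupdtype         AS on_update,
       confdeltype         AS on_delete
FROM pg_constraint
WHERE contype = 'f';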

Deadlocks (40P01)

{job="postgresql", error_code="40P01"}

Analysis:

  • Transaction ordering issue
  • Review locking strategy
  • Use SELECT FOR UPDATE with consistent order
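
Acquiring row locks in a consistent order prevents most deadlocks. A sketch with a hypothetical accounts table:

-- Every transaction touching these rows locks them in the same
-- order (by primary key), so no two can wait on each other.
BEGIN;
SELECT * FROM accounts
WHERE id IN (1, 2)
ORDER BY id
FOR UPDATE;
-- ... perform the updates ...
COMMIT;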

Connection Limits

{job="postgresql"} |= "too many connections"

Actions:

  • Increase max_connections
  • Check for connection leaks
  • Review pool configuration
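
To see how close the server is to its limit, compare the live count against the setting (a sketch):

SELECT count(*)                                AS current_connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;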

Daily error summary:

sum by (error_code) (
  count_over_time({job="postgresql", level="ERROR"}[24h])
)

Error rate spike detection:

rate({job="postgresql", level="ERROR"}[5m]) > 0.1

Top 10 errors:

topk(10, sum by (error_code) (
  count_over_time({job="postgresql", level="ERROR"}[1h])
))

Performance Monitoring

Slow Query Detection

Find slow queries:

{job="postgresql"} |= "duration:"

Filter by duration (>5s; the alternation also catches 5-digit durations, which a plain [5-9] prefix would miss):

{job="postgresql"} |= "duration:" |~ "duration: ([5-9][0-9]{3}|[0-9]{5,})"

Example log:

LOG: duration: 2534.123 ms  statement: 
SELECT * FROM nodes WHERE data @> '{"tag": "expensive"}'

Query Analysis Process

  1. Identify slow query in logs
  2. Explain with EXPLAIN ANALYZE
  3. Check indexes on filter columns
  4. Optimize query or add indexes
  5. Verify improvement in logs

Connection Monitoring

Active connections by state:

SELECT count(*), state, application_name
FROM pg_stat_activity
WHERE datname = 'savvi_studio'
GROUP BY state, application_name
ORDER BY count DESC;

Long-running queries:

SELECT 
  pid,
  usename,
  application_name,
  state,
  NOW() - query_start AS duration,
  LEFT(query, 100) as query
FROM pg_stat_activity
WHERE datname = 'savvi_studio'
  AND state = 'active'
  AND NOW() - query_start > INTERVAL '10 seconds'
ORDER BY duration DESC;

Table Statistics

Table sizes and health:

SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
  n_live_tup as live_rows,
  n_dead_tup as dead_rows,
  round(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) as dead_pct
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

Unused indexes:

SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan as scans,
  pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(indexrelid) DESC;

Query Optimization

EXPLAIN ANALYZE

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT * FROM graph.nodes
WHERE node_type = 'document'
  AND data->>'status' = 'active'
ORDER BY created_at DESC
LIMIT 100;

Key metrics:

  • cost: Estimated query cost
  • actual time: Real execution time
  • rows: Rows returned vs estimated
  • Buffers hit: Data from cache
  • Buffers read: Data from disk

Optimization checklist:

  • Appropriate indexes exist
  • Index is being used (not Seq Scan)
  • Filter selectivity is good
  • Join order is optimal
  • No unnecessary sorts
  • LIMIT applied early
  • Statistics are up to date
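
For the example query above, a partial index matching the filters and sort order could help; the index name and predicate here are illustrative:

-- Indexes only rows matching the query's filters, sorted for the
-- ORDER BY ... LIMIT, so the planner can stop after 100 rows.
CREATE INDEX IF NOT EXISTS idx_nodes_active_documents
  ON graph.nodes (created_at DESC)
  WHERE node_type = 'document'
    AND (data->>'status') = 'active';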

Update Statistics

-- Analyze specific table
ANALYZE graph.nodes;

-- Analyze all tables
ANALYZE;

-- Vacuum and analyze
VACUUM ANALYZE;

Grafana Dashboards

PostgreSQL Error Logs Dashboard

Pre-configured dashboard shows:

  • Error count over time
  • Error distribution by level
  • Recent error messages
  • Error details with context
  • Function/location tracking

Access: Grafana → Dashboards → PostgreSQL Error Logs

Creating Custom Dashboards

  1. Add panel → Logs or Time series
  2. Set datasource → Loki
  3. Write query using LogQL
  4. Configure visualization
  5. Save dashboard

Example panel queries:

# Error rate by level
sum by (level) (rate({job="postgresql"}[5m]))

# Connection count estimate
count_over_time({job="postgresql"} |= "connection received"[1m])

# 99th percentile query duration (ms)
quantile_over_time(0.99,
  {job="postgresql"} |= "duration:"
    | regexp `duration: (?P<duration>\d+)`
    | unwrap duration [5m]
)

Debug Utilities

Built-in Functions

Located in db/998_debug_utils.sql:

-- Connection state
SELECT * FROM util.get_connection_state();

-- Session info
SELECT * FROM util.get_session_info();

-- Table sizes
SELECT * FROM util.get_table_sizes('graph');

-- Unused indexes
SELECT * FROM util.get_unused_indexes();

Enable Verbose Logging

-- For current session
SET log_statement = 'all';
SET log_min_duration_statement = 0;
SET client_min_messages = debug;

-- Your queries here

-- Reset
RESET log_statement;
RESET log_min_duration_statement;
RESET client_min_messages;

Transaction Debugging

BEGIN;
SET LOCAL log_error_verbosity = verbose;
SET LOCAL client_min_messages = debug;

-- Your queries
INSERT INTO nodes ...;

COMMIT;

Common Issues

High Error Rate

Symptoms: Grafana shows error spike

Debug:

  1. Check error dashboard
  2. Identify error pattern
  3. View error details in logs
  4. Analyze root cause
  5. Fix application or schema

Slow Queries

Symptoms: High latency, timeouts

Debug:

  1. Find slow queries in logs
  2. Run EXPLAIN ANALYZE
  3. Check for missing indexes
  4. Add appropriate indexes
  5. Verify improvement

Connection Pool Exhaustion

Symptoms: "Connection timeout" errors

Debug:

  1. Check active connections
  2. Find long-running queries
  3. Terminate problem queries if needed
  4. Fix application connection handling
  5. Increase pool size if necessary
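
Step 3 can be done from SQL. pg_cancel_backend asks the query to stop; pg_terminate_backend kills the whole backend if cancellation is not enough. The 10-minute threshold is illustrative:

-- Cancel long-running queries on this database; swap in
-- pg_terminate_backend(pid) if the query ignores cancellation.
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE datname = 'savvi_studio'
  AND state = 'active'
  AND NOW() - query_start > INTERVAL '10 minutes';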

Disk Space Issues

Symptoms: "No space left on device"

Debug:

  1. Check database size
  2. Find large tables
  3. Check for bloat (dead tuples)
  4. Run VACUUM FULL ANALYZE (takes exclusive locks; schedule downtime)
  5. Consider archiving old data
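
Steps 1-2 as SQL (a sketch; the pg_class listing includes system relations):

-- Overall database size
SELECT pg_size_pretty(pg_database_size('savvi_studio'));

-- Largest tables and indexes
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind IN ('r', 'i')
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;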

Best Practices

Regular Monitoring

  • Check Grafana daily for error trends
  • Review slow queries weekly
  • Analyze table growth monthly
  • Update statistics after bulk operations

Proactive Maintenance

-- Weekly maintenance. VACUUM and REINDEX DATABASE cannot run inside
-- a transaction or DO block, so run each statement separately.
ANALYZE;                        -- Update statistics
VACUUM;                         -- Clean dead tuples

-- Reindex only if index bloat is suspected (takes locks)
REINDEX DATABASE savvi_studio;

Alert Configuration

Set up alerts in Grafana for:

  • Error rate > threshold
  • Slow query count increase
  • Connection pool > 80% capacity
  • Disk usage > 80%
  • Replication lag (if applicable)

Log Retention

Configured in Loki:

limits_config:
  retention_period: 744h  # 31 days

Adjust based on:

  • Compliance requirements
  • Storage capacity
  • Query patterns

Query Optimization Guidelines

  • Use indexes for common filters
  • Limit result sets with LIMIT
  • Use cursor pagination for large sets
  • Batch operations where possible
  • Cache frequently accessed data
  • Avoid N+1 queries
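
Cursor (keyset) pagination can be sketched as below: the last (created_at, id) pair of the previous page is passed back as the cursor, so the database never has to skip over earlier rows the way OFFSET does. Values are illustrative:

SELECT id, created_at
FROM graph.nodes
WHERE (created_at, id) < ('2025-01-07 10:00:00', 12345)
ORDER BY created_at DESC, id DESC
LIMIT 100;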

Quick Reference

Task                Query/Command
All errors          {job="postgresql", level="ERROR"}
Slow queries        {job="postgresql"} |= "duration:"
By user             {job="postgresql", user="savvi"}
By function         {job="postgresql", function="auth.login"}
Error code          {job="postgresql", error_code="23505"}
Active connections  See Connection Monitoring SQL
Table sizes         See Table Statistics SQL

For setup: See Local Development to start monitoring services
For queries: Use Grafana Explore with LogQL queries above
For debugging: Use built-in util functions from 998_debug_utils.sql