Monitoring

This document details the monitoring solutions implemented to ensure the reliability, performance, and security of the TrioSigno application.

Overview

TrioSigno's monitoring system covers several aspects of the application:

  1. Availability - Ensuring the application is accessible to users
  2. Performance - Monitoring response times and resource usage
  3. Errors - Detecting and alerting on application errors
  4. Security - Monitoring unauthorized access attempts and vulnerabilities
  5. Business Metrics - Tracking key performance indicators (KPIs)

Monitoring Architecture

The monitoring architecture is based on the ELK stack (Elasticsearch, Logstash, Kibana) for log management, and Prometheus with Grafana for system and application metrics.

┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  Application  │   │    Server     │   │   Database    │
│    (Logs)     │   │   (Metrics)   │   │   (Metrics)   │
└───────┬───────┘   └───────┬───────┘   └───────┬───────┘
        │                   │                   │
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│   Logstash    │   │  Prometheus   │   │   Exporters   │
│  (Collection) │   │   (Metrics)   │◀──│   (Metrics)   │
└───────┬───────┘   └───────┬───────┘   └───────────────┘
        │                   │
        ▼                   │
┌───────────────┐           │
│ Elasticsearch │           │
│   (Storage)   │           │
└───────┬───────┘           │
        │                   │
        ▼                   ▼
┌───────────────┐   ┌───────────────┐
│    Kibana     │   │    Grafana    │
│(Visualization)│◀─▶│ (Dashboards)  │
└───────┬───────┘   └───────┬───────┘
        │                   │
        ▼                   ▼
┌───────────────────────────────────┐
│             Alerting              │
│     (Email, Slack, PagerDuty)     │
└───────────────────────────────────┘

Log Collection

Logging Configuration

The TrioSigno application uses Winston for structured logging. Logs are generated in JSON format so they can be parsed and indexed easily.

// Example logger configuration
import winston from "winston";

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: { service: "triosigno-app" },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: "error.log", level: "error" }),
    new winston.transports.File({ filename: "combined.log" }),
  ],
});

export default logger;
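
For reference, a typical call site looks like this; the field names (userId, lessonId) are illustrative, not the application's actual schema:

import logger from "./logger";

// Metadata objects are merged into the structured JSON entry
logger.info("Lesson completed", { userId: 42, lessonId: "greetings-1" });
logger.error("Failed to save progress", { userId: 42, error: "timeout" });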

Collection with Logstash

Logstash is configured to collect logs from multiple sources:

  • Application logs via Filebeat
  • System logs via Syslog
  • PostgreSQL database logs

Example Logstash configuration:

input {
  beats {
    port => 5044
  }
  syslog {
    port => 5140
  }
}

filter {
  if [type] == "app" {
    json {
      source => "message"
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "triosigno-logs-%{+YYYY.MM.dd}"
  }
}
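
Filebeat, which ships the application logs to the beats input above, needs matching input and output settings. A minimal filebeat.yml sketch, assuming the Winston log files are written under /var/log/triosigno (the path and the type field are assumptions, not taken from the actual deployment):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/triosigno/*.log
    fields:
      type: app          # matches the conditional in the Logstash filter
    fields_under_root: true

output.logstash:
  hosts: ["logstash:5044"]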

System and Application Metrics

Prometheus

Prometheus is used to collect and store time-series metrics. It is configured to scrape metrics endpoints exposed by:

  • Application services via the Prometheus middleware
  • Node Exporter for system metrics
  • PostgreSQL Exporter for database metrics

Example Prometheus configuration:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "triosigno-app"
    metrics_path: "/metrics"
    static_configs:
      - targets: ["app:3000"]

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "postgres"
    static_configs:
      - targets: ["postgres-exporter:9187"]

Application Instrumentation

The TrioSigno application is instrumented to expose application-specific metrics via a /metrics endpoint:

import express from "express";
import promBundle from "express-prom-bundle";
import promClient from "prom-client"; // needed for the custom counter below

const app = express();

// Add Prometheus middleware
const metricsMiddleware = promBundle({
  includeMethod: true,
  includePath: true,
  includeStatusCode: true,
  includeUp: true,
  customLabels: { app: "triosigno" },
  promClient: {
    collectDefaultMetrics: {},
  },
});

app.use(metricsMiddleware);

// Custom metrics
const signLearningCounter = new promClient.Counter({
  name: "triosigno_sign_learning_total",
  help: "Counter for sign language learning attempts",
  labelNames: ["sign_id", "result"],
});

// Application routes...

export default app;
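
The counter can then be incremented from a route handler. The route and request shape below are purely illustrative, not the actual TrioSigno API:

// Hypothetical route recording the outcome of a learning attempt
app.use(express.json());
app.post("/api/signs/:signId/attempt", (req, res) => {
  const result = req.body.success ? "success" : "failure";
  signLearningCounter.inc({ sign_id: req.params.signId, result });
  res.sendStatus(204);
});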

Visualization and Dashboards

Kibana

Kibana is used to visualize and analyze logs. Several dashboards are configured:

  1. Overview - General view of activity and errors
  2. Errors - Detailed analysis of errors by type and frequency
  3. Security - Tracking of logins and access attempts
  4. Performance - Analysis of API response times

Grafana

Grafana provides dashboards to visualize metrics collected by Prometheus:

  1. Infrastructure - CPU, memory, disk, and network
  2. Application - Response times, requests per second, error rate
  3. Database - PostgreSQL performance, connections, queries
  4. Business Metrics - Active users, completed lessons, progression

Example Grafana dashboard configuration:

{
  "dashboard": {
    "id": null,
    "title": "TrioSigno Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_request_duration_seconds_count{app=\"triosigno\"}[5m])) by (method, path)",
            "legendFormat": "{{method}} {{path}}"
          }
        ]
      },
      {
        "title": "Response Time (95th percentile)",
        "type": "graph",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"triosigno\"}[5m])) by (le, path))",
            "legendFormat": "{{path}}"
          }
        ]
      }
    ],
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "refresh": "5m"
  }
}

Alerting

The alerting system notifies the team when issues are detected.

Alert Rules

Alerts are defined for various conditions:

  • Availability - Service unavailable for more than 2 minutes
  • Latency - Response time greater than 2 seconds for 5 minutes
  • Errors - Error rate greater than 5% for 5 minutes
  • Resources - CPU/memory usage greater than 80% for 10 minutes
  • Database - Query time greater than 1 second, connections near limit

Example Prometheus alert rule:

groups:
  - name: TrioSigno
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m])) / sum(rate(http_request_duration_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for the last 5 minutes"

      - alert: ServiceDown
        expr: up{job="triosigno-app"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "The TrioSigno application has been down for more than 2 minutes"

Notification Channels

Alerts are sent via multiple channels:

  • Email - For general notifications
  • Slack - For team communication
  • PagerDuty - For critical incidents requiring immediate intervention
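
Routing alerts to these channels is typically handled by Alertmanager. Below is a minimal sketch of the routing tree; the receiver names, channel, and placeholder credentials are illustrative, not the project's actual settings:

route:
  receiver: slack-team              # default route
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall    # critical alerts page the on-call engineer

receivers:
  - name: slack-team
    slack_configs:
      - channel: "#triosigno-alerts"
        api_url: "https://hooks.slack.com/services/<webhook-id>"  # placeholder
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: "<pagerduty-integration-key>"                # placeholder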

Real User Monitoring (RUM)

In addition to system metrics, TrioSigno integrates user experience tracking:

Frontend Analytics

The application frontend integrates trackers to measure:

  • Page load times
  • User interactions
  • JavaScript errors
  • Perceived performance

// Example client-side monitoring integration
import { initPerformanceMonitoring, reportErrorToBackend } from "./monitoring";

document.addEventListener("DOMContentLoaded", () => {
  // Initialize performance monitoring
  initPerformanceMonitoring();

  // Capture unhandled errors
  window.addEventListener("error", (event) => {
    reportErrorToBackend({
      message: event.message,
      source: event.filename,
      line: event.lineno,
      stack: event.error?.stack,
    });
  });
});
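
The ./monitoring module itself is not shown in this documentation; a minimal sketch of what it could look like, built on the standard PerformanceObserver and sendBeacon browser APIs (the /api/rum endpoints are assumptions):

// monitoring.ts — minimal sketch, not the actual implementation
export function initPerformanceMonitoring(): void {
  const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      // e.g. navigation timing, first-contentful-paint, ...
      navigator.sendBeacon(
        "/api/rum", // assumed collection endpoint
        JSON.stringify({
          name: entry.name,
          type: entry.entryType,
          duration: entry.duration,
        })
      );
    }
  });
  observer.observe({ entryTypes: ["navigation", "paint"] });
}

export function reportErrorToBackend(payload: object): void {
  navigator.sendBeacon("/api/rum/errors", JSON.stringify(payload)); // assumed endpoint
}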

Session Profiling

The monitoring system tracks user sessions to identify usage issues:

  • User Journey - Analysis of paths taken by users
  • Friction Points - Identification of steps where users drop off
  • Time Spent - Measurement of time spent on each screen or feature
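
Feeding these analyses requires a stream of structured events. One lightweight approach is a beacon per screen transition, sketched below; the endpoint and event shape are assumptions, not the project's actual schema:

// Hypothetical screen-transition event for user-journey analysis
function getSessionId(): string {
  let id = sessionStorage.getItem("sessionId");
  if (!id) {
    id = crypto.randomUUID();
    sessionStorage.setItem("sessionId", id);
  }
  return id;
}

function trackScreen(screen: string): void {
  navigator.sendBeacon(
    "/api/rum/events", // assumed endpoint
    JSON.stringify({ screen, ts: Date.now(), sessionId: getSessionId() })
  );
}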

Database Monitoring

PostgreSQL-specific monitoring includes:

Key Metrics

  • Connections - Number of active and maximum connections
  • Cache - Cache hit/miss rate
  • Query Performance - Execution time and slow queries
  • Locking - Detection of contentions and deadlocks
  • Space Usage - Growth of tables and indexes
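
With postgres-exporter in place, several of these metrics map directly to PromQL. Two examples, using metric names exposed by the exporter (the expressions are illustrative, not taken from the project's dashboards):

# Cache hit ratio per database
sum(rate(pg_stat_database_blks_hit[5m])) by (datname)
  / (sum(rate(pg_stat_database_blks_hit[5m])) by (datname)
     + sum(rate(pg_stat_database_blks_read[5m])) by (datname))

# Fraction of max_connections currently in use
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections)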

Slow Queries

Slow-query logging is configured in PostgreSQL to identify queries that need optimization:

# postgresql.conf — activity and slow-query logging
log_min_duration_statement = 200ms   # log any statement slower than 200 ms
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0                   # log every temporary file, whatever its size
log_autovacuum_min_duration = 0      # log all autovacuum runs

Installation and Configuration

Prerequisites

  • Docker and Docker Compose
  • At least 4GB of RAM for the monitoring stack
  • Network access to services to monitor

Deployment with Docker Compose

Monitoring is deployed via Docker Compose:

version: "3.8"

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.14.0
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:7.14.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"
      - "5140:5140"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:7.14.0
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    depends_on:
      - elasticsearch

  prometheus:
    image: prom/prometheus:v2.30.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.console.libraries=/etc/prometheus/console_libraries"
      - "--web.console.templates=/etc/prometheus/consoles"

  grafana:
    image: grafana/grafana:8.1.2
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    depends_on:
      - prometheus

  node-exporter:
    image: prom/node-exporter:v1.2.2
    ports:
      - "9100:9100"

  postgres-exporter:
    image: wrouesnel/postgres_exporter:v0.9.0
    environment:
      DATA_SOURCE_NAME: "postgresql://postgres:postgres@postgres:5432/triosigno?sslmode=disable"
    ports:
      - "9187:9187"
    # postgres itself runs in the application stack, outside this file

volumes:
  elasticsearch-data:
  prometheus-data:
  grafana-data:
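
Assuming the configuration files referenced in the volume mounts are in place, the stack is brought up with:

docker-compose up -d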

Best Practices

  1. Data Retention - Define appropriate retention policies to avoid excessive data growth (see the example after this list)
  2. Security - Protect access to monitoring interfaces with authentication and TLS
  3. Alert Gradation - Configure different alert levels according to the severity of problems
  4. Documentation - Maintain up-to-date documentation of dashboards and metrics
  5. Regular Testing - Periodically test the alerting system to ensure it works correctly
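
As an example of the first point, Prometheus retention can be capped with an extra flag in the command list of the Compose file above; the 15-day value is illustrative, not the project's actual policy:

--storage.tsdb.retention.time=15d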

Troubleshooting Common Issues

Elasticsearch

  • Issue: JVM heap space errors
    • Solution: Adjust ES_JAVA_OPTS parameters or increase RAM

Prometheus

  • Issue: Storage full
    • Solution: Adjust retention parameters or increase disk space

Alerting

  • Issue: Missing alerts
    • Solution: Check Alertmanager configuration and notification routes