What is Observability?
Observability is the practice of measuring, collecting, and analyzing diagnostic signals from a system. These signals may include metrics, traces, logs, events, profiles, and more. The patterns covered here are:
- Log aggregation
- Application metrics
- Audit logging
- Distributed tracing
- Exception tracking
- Health check API
- Log deployments and changes
Log aggregation
Definition: Use a centralized logging service that aggregates logs from each service instance. Users can search and analyze the logs, and can configure alerts that are triggered when certain messages appear in them.
Issue: Handling a large volume of logs requires substantial infrastructure.
Note: Any solution should have minimal runtime overhead.
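One common way to make logs easy to aggregate and search is to emit each entry as a single JSON object tagged with the service and instance that produced it. The following is a minimal sketch using only the Python standard library. The names `JsonFormatter` and `build_logger`, the field names, and the service and instance values are assumptions made for illustration, not part of any particular logging product.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service, instance):
        super().__init__()
        self.service = service
        self.instance = instance

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,    # lets the aggregator group by service
            "instance": self.instance,  # lets the aggregator group by instance
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name, stream, service, instance):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, instance))
    logger.handlers = [handler]
    return logger


# Usage: an in-memory buffer stands in for stdout or a log shipper's input.
buffer = io.StringIO()
log = build_logger("orders", buffer, service="order-service", instance="orders-1")
log.info("order %s created", "A-17")
```

In a real deployment the stream would typically be standard output or a file that a log shipper forwards to the centralized service.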
Application metrics
Definition: Instrument a service to gather statistics about individual operations. Aggregate the metrics in a centralized metrics service, which provides reporting and alerting. There are two models for aggregating metrics:
- push - the service pushes metrics to the metrics service
- pull - the metrics service pulls metrics from the service
Instrumentation libraries:
- Prometheus client libraries
Metrics aggregation services:
- Prometheus
Benefits: It provides deep insight into application behavior
Drawbacks: Metrics code is intertwined with business logic, making it more complicated
Issues: Aggregating metrics can require significant infrastructure
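To make the instrumentation idea concrete, here is a minimal sketch of an in-process registry that counts calls, errors, and elapsed time per operation. It uses only the standard library and is not the Prometheus client library. `MetricsRegistry`, `timed`, and the metric names are hypothetical, and the `name value` text produced by `render` is a simplified stand-in for a real exposition format.

```python
import threading
import time
from collections import defaultdict


class MetricsRegistry:
    """Hypothetical in-process store of named, monotonically growing totals."""

    def __init__(self):
        self._lock = threading.Lock()
        self._totals = defaultdict(float)

    def increment(self, name, amount=1.0):
        with self._lock:
            self._totals[name] += amount

    def timed(self, operation):
        """Context manager that records calls, errors, and seconds spent."""
        return _Timer(self, operation)

    def snapshot(self):
        with self._lock:
            return dict(self._totals)

    def render(self):
        """Text a pull-based collector could scrape: one `name value` per line."""
        return "\n".join(f"{k} {v}" for k, v in sorted(self.snapshot().items()))


class _Timer:
    def __init__(self, registry, operation):
        self._registry = registry
        self._operation = operation

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        elapsed = time.perf_counter() - self._start
        self._registry.increment(f"{self._operation}_seconds_total", elapsed)
        self._registry.increment(f"{self._operation}_calls_total")
        if exc_type is not None:
            self._registry.increment(f"{self._operation}_errors_total")
        return False  # never swallow the exception


# Usage: wrap an operation, then expose the totals.
registry = MetricsRegistry()
with registry.timed("create_order"):
    pass  # business logic would run here
registry.increment("orders_created_total")
```

In the pull model, the service would serve the output of `render()` on an HTTP endpoint for the metrics service to scrape. In the push model, it would periodically send `snapshot()` to the metrics service instead.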
Audit logging
Definition: Record user activity in a database.
Benefits: Provides a record of user actions
Drawbacks: The auditing code is intertwined with the business logic, which makes the business logic more complicated
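As a rough illustration of recording user activity in a database, the sketch below writes audit entries to SQLite through Python's built-in `sqlite3` module. The table layout and the function names `open_audit_store`, `record_action`, and `actions_by_user` are assumptions chosen for the example. A production system would likely use its own schema and database.

```python
import sqlite3
from datetime import datetime, timezone


def open_audit_store(path=":memory:"):
    """Open (or create) the audit table. ':memory:' keeps the demo self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "occurred_at TEXT NOT NULL, "
        "user_id TEXT NOT NULL, "
        "action TEXT NOT NULL, "
        "target TEXT NOT NULL)"
    )
    return conn


def record_action(conn, user_id, action, target):
    """Append one audit entry. The `with` block commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO audit_log (occurred_at, user_id, action, target) "
            "VALUES (?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), user_id, action, target),
        )


def actions_by_user(conn, user_id):
    """Return (action, target) pairs for a user in the order they occurred."""
    rows = conn.execute(
        "SELECT action, target FROM audit_log WHERE user_id = ? ORDER BY id",
        (user_id,),
    )
    return rows.fetchall()


# Usage: the business operation calls record_action alongside its own work.
conn = open_audit_store()
record_action(conn, "alice", "cancel_order", "order:A-17")
record_action(conn, "bob", "view_order", "order:A-17")
history = actions_by_user(conn, "alice")
```

The explicit `record_action` call inside the business operation is exactly the intertwining listed as a drawback above.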
Distributed tracing
Definition: Instrument services with code that:
- Assigns each external request a unique external request id
- Passes the external request id to all services that are involved in handling the request
- Includes the external request id in all log messages
- Records information (e.g. start time, end time) about the requests and operations performed when handling an external request in a centralized service
Benefits:
- It provides useful insight into the behavior of the system, including the sources of latency
- It enables developers to see how an individual request is handled by searching across aggregated logs for its external request id
Issues: Aggregating and storing traces can require significant infrastructure
Note:
- External monitoring only tells you the overall response time and number of invocations - no insight into the individual operations
- Any solution should have minimal runtime overhead
- Log entries for a request are scattered across numerous logs
Tools:
- Jaeger - open source, end-to-end distributed tracing for monitoring and troubleshooting transactions in complex distributed systems
- OpenZipkin - service for recording and displaying tracing information
- OpenTracing - standardized API for distributed tracing
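The first three steps of the definition (assign an id, pass it on, include it in log messages) can be sketched with the standard library alone. This is not how Jaeger, Zipkin, or OpenTracing are used. The header name `X-Request-Id` and the helpers `begin_request` and `outgoing_headers` are assumptions for illustration, and recording timing data in a centralized service is left out.

```python
import contextvars
import io
import logging
import uuid

REQUEST_ID_HEADER = "X-Request-Id"  # assumed header name, not a standard
_request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current external request id to every log record."""

    def filter(self, record):
        record.request_id = _request_id.get()
        return True


def begin_request(incoming_headers):
    """Reuse the caller's request id, or assign a new one at the system edge."""
    request_id = incoming_headers.get(REQUEST_ID_HEADER) or uuid.uuid4().hex
    _request_id.set(request_id)
    return request_id


def outgoing_headers():
    """Headers to add to calls made to downstream services."""
    return {REQUEST_ID_HEADER: _request_id.get()}


# Usage: a handler logs with the id and forwards it downstream.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
logger = logging.getLogger("tracing-demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

rid = begin_request({})  # no incoming id, so one is generated
logger.info("reserving stock")
downstream = outgoing_headers()
```

Because every log line carries the same id, searching the aggregated logs for that id reconstructs how a single request was handled across services.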
Exception tracking
Definition: Report all exceptions to a centralized exception tracking service that aggregates and tracks exceptions and notifies developers.
Benefits: It is easier to view exceptions and track their resolution
Drawbacks: The exception tracking service is additional infrastructure
Note:
- Exceptions must be de-duplicated, recorded, and investigated by developers, and the underlying issue must be resolved
- Any solution should have minimal runtime overhead
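De-duplication is usually done by grouping exceptions under a fingerprint. The sketch below shows one plausible approach: hash the exception type together with the stack frames it passed through. `ExceptionTracker` is a hypothetical in-memory stand-in for a tracking service, and the fingerprint scheme is an assumption rather than the scheme of any specific product.

```python
import hashlib
import traceback
from collections import Counter


class ExceptionTracker:
    """Hypothetical in-memory tracker that groups repeated exceptions."""

    def __init__(self):
        self.counts = Counter()  # fingerprint -> number of occurrences
        self.samples = {}        # fingerprint -> first formatted traceback

    def fingerprint(self, exc):
        """Hash the exception type plus the frames of its traceback."""
        frames = traceback.extract_tb(exc.__traceback__)
        parts = [type(exc).__name__]
        parts += [f"{f.filename}:{f.name}:{f.lineno}" for f in frames]
        return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]

    def report(self, exc):
        """Record the exception and return (fingerprint, is_new)."""
        fp = self.fingerprint(exc)
        is_new = fp not in self.samples
        if is_new:
            self.samples[fp] = "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            )
        self.counts[fp] += 1
        return fp, is_new


# Usage: the same failure reported three times yields one tracked issue.
tracker = ExceptionTracker()


def risky():
    raise ValueError("bad quantity")


results = []
for _ in range(3):
    try:
        risky()
    except ValueError as exc:
        results.append(tracker.report(exc))
```

A real service would notify developers only when `is_new` is true, so that a recurring failure produces one notification rather than one per occurrence.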
Health check API
Definition: A service has a health check API endpoint (e.g. HTTP /health) that returns the health of the service. The API endpoint handler performs various checks, such as
- the status of the connections to the infrastructure services used by the service instance
- the status of the host, e.g. disk space
- application specific logic
A health check client - a monitoring service, service registry or load balancer - periodically invokes the endpoint to check the health of the service instance.
Benefits: The health check endpoint enables the health of a service instance to be periodically tested
Drawbacks: The health check might not be sufficiently comprehensive, or the service instance might fail between health checks, so requests might still be routed to a failed service instance
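A health check handler of the kind described above might look roughly like the following sketch, built on Python's standard `http.server` and `shutil` modules. The check names, the `UP`/`DOWN` vocabulary, the JSON body shape, and the use of status 503 for an unhealthy instance are assumptions chosen for the example.

```python
import json
import shutil
from http.server import BaseHTTPRequestHandler


def disk_space_check(path=".", min_free_bytes=1):
    """Build a host-level check: is there at least min_free_bytes free?"""
    def check():
        return shutil.disk_usage(path).free >= min_free_bytes
    return check


def run_health_checks(checks):
    """Run every check; a check that raises counts as DOWN."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "UP" if check() else "DOWN"
        except Exception:
            results[name] = "DOWN"
    healthy = all(state == "UP" for state in results.values())
    body = {"status": "UP" if healthy else "DOWN", "checks": results}
    return (200 if healthy else 503), body


# Checks served by the handler below; real services would add their own.
CHECKS = {"disk": disk_space_check()}


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET /health; pass this class to http.server.HTTPServer."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = run_health_checks(CHECKS)
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


# Usage: a failing infrastructure connection marks the instance unhealthy.
def failing_database_check():
    raise ConnectionError("database unreachable")


status, body = run_health_checks(
    {"disk": disk_space_check(), "database": failing_database_check}
)
```

A load balancer or service registry polling this endpoint would stop routing to the instance once it sees the non-200 response.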
Log deployments and changes
Definition: Log every deployment and every change to the (production) environment.
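As a small sketch of what such a change log enables, the code below appends one JSON entry per change and then looks up the changes made shortly before an incident. The record fields, the helper names `record_change` and `changes_before`, and the dates and versions in the usage example are all made up for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def record_change(log, service, version, description, when=None):
    """Append one deployment/change entry as a JSON line."""
    entry = {
        "timestamp": (when or datetime.now(timezone.utc)).isoformat(),
        "service": service,
        "version": version,
        "description": description,
    }
    log.append(json.dumps(entry))
    return entry


def changes_before(log, incident_time, window):
    """Return changes made within `window` before the incident."""
    recent = []
    for line in log:
        entry = json.loads(line)
        ts = datetime.fromisoformat(entry["timestamp"])
        if incident_time - window <= ts <= incident_time:
            recent.append(entry)
    return recent


# Usage with made-up data: which changes preceded the incident by under an hour?
change_log = []  # a list stands in for a file or a central change store
incident = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
record_change(change_log, "order-service", "1.4.0", "deploy",
              when=incident - timedelta(days=2))
record_change(change_log, "order-service", "1.4.1", "deploy",
              when=incident - timedelta(minutes=20))
suspects = changes_before(change_log, incident, timedelta(hours=1))
```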
Benefits: Enables deployments and changes to be easily correlated with issues, leading to faster resolution.
Technology stack for Observability
- https://opentracing.io/ - OpenTracing is not a download or a program. Distributed tracing requires that software developers add instrumentation to the code of an application, or to the frameworks used in the application.