Step-by-step solutions with copy-ready code for every lab in the 2-week study guide
| Term | Definition |
|---|---|
| SLI | Service Level Indicator — a measurable metric (e.g. p99 request latency) |
| SLO | Service Level Objective — target for an SLI (e.g. 99.9% of requests < 200ms) |
| SLA | Service Level Agreement — contractual commitment; breach = financial penalty |
| Error Budget | 1 − SLO availability. 99.9% SLO = 0.1% budget ≈ 43 min/month of allowed downtime |
| Cardinality | Number of unique values in a dimension. High-cardinality: user_id, request_id. Low-cardinality: region, status_code |
| RED Method | Rate, Errors, Duration — for request-driven services |
| USE Method | Utilisation, Saturation, Errors — for resources (CPU, disk, network) |
| Golden Signals | Latency, Traffic, Errors, Saturation (Google SRE) |
| Push vs Pull | Prometheus = pull (scrapes /metrics). OTLP = push (exporter sends data to receiver) |
| Observability | Ability to infer internal system state from external outputs (logs, metrics, traces) |
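The error-budget arithmetic in the table is easy to sanity-check; a quick sketch, assuming a 30-day month:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per window for a given availability SLO."""
    return (1.0 - slo) * days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 99.9% SLO → 43.2 min/month
print(round(error_budget_minutes(0.9999), 2))  # 99.99% SLO → 4.32 min/month
```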
Your diagram should represent this exact data flow:
```
Application Code
      │
      ▼
OTel API ──── language-specific interfaces; ships as a no-op without the SDK
      │
      ▼
OTel SDK ──── TracerProvider / MeterProvider / LoggerProvider
      │
      ├── [SpanProcessor]
      ├── [BatchSpanProcessor] ◄─── always use Batch in production
      │
      ▼
Exporter ──── OTLPSpanExporter / PrometheusExporter / JaegerExporter
      │
      ▼
OTel Collector (optional but strongly recommended)
      [Receiver] ──► [Processor] ──► [Exporter]
      otlp grpc      memory_limiter  otlp/jaeger
      otlp http      batch           prometheus
      filelog        attributes      logging
      │
      ▼
Backend ──── Jaeger / Prometheus / Grafana / Splunk / etc.
```
```shell
git clone https://github.com/open-telemetry/opentelemetry-demo.git
cd opentelemetry-demo
docker compose up --no-build

# Wait ~2 minutes for all services to be healthy
# Frontend UI: http://localhost:8080
# Jaeger UI:   http://localhost:8080/jaeger/ui
# Grafana:     http://localhost:8080/grafana
```
```
# Example header value you will see in the browser:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

# Breakdown:
# 00                               = version (always 00 in the current W3C spec)
# 4bf92f3577b34da6a3ce929d0e0e4736 = trace-id (128-bit, 32 hex chars)
# 00f067aa0ba902b7                 = parent span id (64-bit, 16 hex chars)
# 01                               = trace-flags: 01 = sampled, 00 = not sampled
```
```python
from opentelemetry.context import Context
from opentelemetry.trace import (
    NonRecordingSpan,
    SpanContext,
    TraceFlags,
    get_current_span,
    set_span_in_context,
)


class TraceContextPropagator:
    def inject(self, carrier: dict, context: Context) -> None:
        span = get_current_span(context)
        span_context = span.get_span_context()
        if not span_context.is_valid:
            return
        traceparent = (
            f'00-{span_context.trace_id:032x}'
            f'-{span_context.span_id:016x}'
            f'-{"01" if span_context.trace_flags else "00"}'
        )
        carrier['traceparent'] = traceparent
        if span_context.trace_state:
            carrier['tracestate'] = str(span_context.trace_state)

    def extract(self, carrier: dict) -> Context:
        header = carrier.get('traceparent', '')
        parts = header.split('-')
        if len(parts) != 4 or parts[0] != '00':
            return Context()  # invalid → empty context
        trace_id = int(parts[1], 16)
        span_id = int(parts[2], 16)
        flags = int(parts[3], 16)
        span_ctx = SpanContext(
            trace_id=trace_id,
            span_id=span_id,
            is_remote=True,
            trace_flags=TraceFlags(flags),
        )
        return set_span_in_context(NonRecordingSpan(span_ctx))
```
```shell
pip install opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-grpc \
    opentelemetry-exporter-prometheus \
    opentelemetry-instrumentation-flask \
    prometheus-client \
    flask
```
```python
import time, random

from flask import Flask
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.view import View

exporter = OTLPMetricExporter(endpoint='http://localhost:4317', insecure=True)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=5000)

# View: rename instrument + filter to 2 attributes only
latency_view = View(
    instrument_name='http.server.request.duration',
    name='api.request.duration',
    attribute_keys={'http.request.method', 'http.response.status_code'},
)

provider = MeterProvider(metric_readers=[reader], views=[latency_view])
metrics.set_meter_provider(provider)
meter = metrics.get_meter('my.server', '1.0.0')

# Counter — monotonically increasing
request_counter = meter.create_counter(
    name='http.server.request.total',
    description='Total HTTP requests received',
    unit='1',
)

# Histogram — latency distribution
latency_histogram = meter.create_histogram(
    name='http.server.request.duration',
    description='HTTP request duration in seconds',
    unit='s',
)

app = Flask(__name__)

@app.route('/api/orders', methods=['GET'])
def get_orders():
    start = time.time()
    time.sleep(random.uniform(0.01, 0.3))  # simulate work
    duration = time.time() - start
    attrs = {
        'http.request.method': 'GET',
        'http.response.status_code': 200,
        'service.name': 'order-service',
    }
    request_counter.add(1, attrs)
    latency_histogram.record(duration, attrs)
    return {'orders': []}, 200

if __name__ == '__main__':
    app.run(port=5000)
```
```python
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

start_http_server(port=8000)  # exposes /metrics on :8000
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader], views=[latency_view])

# Verify:
#   curl http://localhost:8000/metrics | grep http_server
#
#   api_request_duration_bucket{http_request_method="GET",le="0.05"} 12.0
#   http_server_request_total_total{http_request_method="GET",...} 42.0
```
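Prometheus derives quantiles from those cumulative `_bucket` series by linear interpolation inside the bucket that crosses the target rank. A stdlib sketch of the core idea behind `histogram_quantile` (the bucket counts below are made-up illustration data, and the real function also handles the `+Inf` bucket and empty histograms):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """buckets: sorted (upper_bound, cumulative_count) pairs, like *_bucket series."""
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # Linear interpolation within the crossing bucket, as Prometheus does
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + frac * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return prev_bound

# le="0.05" → 12 requests, le="0.1" → 30, ... (cumulative)
buckets = [(0.05, 12), (0.1, 30), (0.25, 76), (0.5, 96), (1.0, 100)]
print(round(histogram_quantile(0.99, buckets), 3))  # → 0.875
```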
Verify: on /metrics the histogram now appears as api_request_duration_* and only the http_request_method and http_response_status_code labels remain.

Add the Log4j2 appender and OTLP exporter dependencies to your pom.xml:

```xml
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-log4j-appender-2.17</artifactId>
    <version>2.9.0-alpha</version>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
    <version>1.43.0</version>
</dependency>
```
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <!-- Console: injects trace_id and span_id via MDC -->
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d [%X{trace_id}] [%X{span_id}] %-5level %msg%n"/>
    </Console>
    <!-- OTel appender: ships logs via OTLP to the Collector -->
    <OpenTelemetry name="OpenTelemetryAppender"/>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console"/>
      <AppenderRef ref="OpenTelemetryAppender"/>
    </Root>
  </Loggers>
</Configuration>
```
```yaml
receivers:
  filelog:
    include: ["/var/log/myapp/*.log"]
    start_at: beginning
    operators:
      - type: regex_parser
        regex: '^(?P<timestamp>\S+ \S+) \[(?P<trace_id>[a-f0-9]*)\] \[(?P<span_id>[a-f0-9]*)\] (?P<severity>\S+) (?P<message>.*)$'
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%d %H:%M:%S.%f'
        severity:
          parse_from: attributes.severity

processors:
  batch:

exporters:
  logging:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [batch]
      exporters: [logging]
```
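Before wiring the regex into the collector, you can sanity-check it against a sample log line with plain Python (Go's regexp and Python's re agree on this pattern's named-group syntax; the log line is illustrative):

```python
import re

# Same pattern as the filelog regex_parser operator
LOG_RE = re.compile(
    r'^(?P<timestamp>\S+ \S+) \[(?P<trace_id>[a-f0-9]*)\] '
    r'\[(?P<span_id>[a-f0-9]*)\] (?P<severity>\S+) (?P<message>.*)$'
)

line = ('2024-05-01 12:00:00.123 [4bf92f3577b34da6a3ce929d0e0e4736] '
        '[00f067aa0ba902b7] INFO Order created')
m = LOG_RE.match(line)
print(m.group('trace_id'))  # 4bf92f3577b34da6a3ce929d0e0e4736
print(m.group('severity'))  # INFO
```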
Verify: exported log records now carry trace_id and span_id fields. In Grafana Loki, clicking a trace_id opens the correlated trace in Jaeger/Tempo.

Download the Java agent:

```shell
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
```
```shell
export OTEL_SERVICE_NAME=order-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_LOGS_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=otlp
export OTEL_TRACES_EXPORTER=otlp
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=1.0
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=dev,service.version=1.0.0

java -javaagent:./opentelemetry-javaagent.jar \
  -jar target/order-service-1.0.0.jar
```
```shell
# 10% sampling — 1 in ~10 requests generates a trace
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1

# Drop everything
export OTEL_TRACES_SAMPLER=always_off

# Sample everything (default for dev)
export OTEL_TRACES_SAMPLER=always_on
```
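traceidratio derives its decision deterministically from the trace ID, so independent services computing the same ratio agree on whether a given trace is sampled. A stdlib sketch of the idea (not the SDK's exact algorithm, which compares the low 8 bytes of the trace ID against a precomputed bound):

```python
import random

MAX_ID = 2 ** 64  # the sampler looks only at the low 64 bits of the trace ID

def should_sample(trace_id: int, ratio: float) -> bool:
    # Deterministic: the same trace_id yields the same decision everywhere
    return (trace_id & (MAX_ID - 1)) < int(ratio * MAX_ID)

random.seed(7)
ids = [random.getrandbits(128) for _ in range(10_000)]
sampled = sum(should_sample(t, 0.1) for t in ids)
print(sampled / len(ids))  # deterministic given the seed; lands near 0.1
```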
Because the sampler is parent-based, an incoming traceparent with trace-flags=01 (parent sampled) means this service will sample regardless of the ratio arg.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.trace import StatusCode

# SDK bootstrap
exporter = OTLPSpanExporter(endpoint='http://localhost:4317', insecure=True)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer('com.myapp.payments', '1.0.0')


def process_payment(order_id: str, amount: float) -> dict:
    with tracer.start_as_current_span('payments.process') as span:
        # Semantic convention attributes
        span.set_attribute('order.id', order_id)
        span.set_attribute('payment.amount', amount)
        span.set_attribute('payment.currency', 'USD')

        # Span event — timestamped annotation (not an error)
        span.add_event('payment.validation.started')

        try:
            result = call_payment_gateway(order_id, amount)
            span.add_event('payment.gateway.responded', {
                'gateway.response_code': result['code']
            })
            # Do NOT set OK unless you explicitly want to mark success
            return result
        except Exception as e:
            span.record_exception(e)  # adds exception event with stacktrace
            span.set_status(StatusCode.ERROR, f'Payment declined: {str(e)}')
            raise


def call_payment_gateway(order_id, amount):
    # Child span — CLIENT kind for outbound calls
    with tracer.start_as_current_span(
        'payments.gateway.charge',
        kind=trace.SpanKind.CLIENT
    ) as child:
        child.set_attribute('server.address', 'gateway.example.com')
        child.set_attribute('server.port', 443)
        return {'code': '00', 'auth_code': 'ABC123'}
```
In the Jaeger UI the failed trace carries the tag error=true. The failed span shows Status=ERROR and an exception event with exception.type, exception.message, and exception.stacktrace.

```yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - '16686:16686'   # Jaeger UI
      - '4317:4317'     # OTLP gRPC

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ['--config=/etc/otelcol-contrib/config.yaml']
    volumes:
      - ./collector.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - '4318:4318'   # OTLP HTTP (app → collector)
      - '8888:8888'   # Collector internal metrics
    depends_on: [jaeger]

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'
```
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:     # ← MUST be first in every pipeline
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s
  batch:              # ← MUST be last
    timeout: 10s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  logging:
    verbosity: detailed

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus, logging]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [logging]
```
```shell
# Install (macOS)
brew install equinix-labs/otel-cli/otel-cli

# Send a test span
otel-cli exec \
  --service my-test-service \
  --name 'test.operation' \
  --attrs 'http.request.method=GET,http.response.status_code=200' \
  --endpoint http://localhost:4318 \
  -- echo 'hello from otel-cli'

# Verify the collector health check
curl http://localhost:13133/
# {"status":"Server available"}
```
Tail sampling requires every span of a trace to reach the same collector instance; when scaling horizontally, put a loadbalancingexporter tier first, routing by TraceID hash.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # wait up to 10s for all spans of a trace
    num_traces: 50000
    expected_new_traces_per_sec: 100
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline-5pct
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
      - name: composite-policy
        type: composite
        composite:
          max_total_spans_per_second: 1000
          policy_order: [errors-policy, slow-traces-policy, baseline-5pct]
          composite_sub_policy:
            - name: errors-policy
              type: status_code
              status_code: {status_codes: [ERROR]}
            - name: slow-traces-policy
              type: latency
              latency: {threshold_ms: 500}
            - name: baseline-5pct
              type: probabilistic
              probabilistic: {sampling_percentage: 5}
          rate_allocation:
            - policy: errors-policy
              percent: 50
            - policy: slow-traces-policy
              percent: 30
            - policy: baseline-5pct
              percent: 20
```
```yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]
    dimensions:
      - name: http.request.method
      - name: http.response.status_code
      - name: service.name
    exemplars:
      enabled: true
    metrics_flush_interval: 15s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, spanmetrics]   # spanmetrics acts as an exporter here
    metrics/spanmetrics:
      receivers: [spanmetrics]                # spanmetrics acts as a receiver here
      processors: [memory_limiter, batch]
      exporters: [prometheus]

# Generated Prometheus metrics:
#   calls_total{service_name="...",http_request_method="GET",...} 142
#   duration_milliseconds_bucket{le="50",...} 98
```
```yaml
processors:
  filter/drop-healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["url.path"] == "/health"'
        - 'attributes["url.path"] == "/healthz"'
        - 'attributes["url.path"] == "/ready"'
        - 'attributes["user_agent.original"] == "ELB-HealthChecker/2.0"'

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop-healthchecks, batch]
      exporters: [otlp/jaeger]
```
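The OTTL conditions above are drop-predicates: a span is removed if any one of them matches. The same logic in plain Python terms (keep_span is a hypothetical helper, not collector code):

```python
DROP_PATHS = {'/health', '/healthz', '/ready'}

def keep_span(attrs: dict) -> bool:
    """Mirror of the filter processor: drop if ANY condition matches."""
    if attrs.get('url.path') in DROP_PATHS:
        return False
    if attrs.get('user_agent.original') == 'ELB-HealthChecker/2.0':
        return False
    return True

print(keep_span({'url.path': '/health'}))      # False → dropped
print(keep_span({'url.path': '/api/orders'}))  # True → kept
```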
Verify: a span with url.path=/health will NOT appear in Jaeger; a span with url.path=/api/orders WILL appear.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: jaeger
  - name: Jaeger
    type: jaeger
    uid: jaeger
    url: http://jaeger:16686
```
```python
# HTTP server span (inbound request)
span.set_attribute('http.request.method', 'POST')
span.set_attribute('url.full', 'https://api.example.com/v1/checkout')
span.set_attribute('url.path', '/v1/checkout')
span.set_attribute('url.scheme', 'https')
span.set_attribute('server.address', 'api.example.com')
span.set_attribute('server.port', 443)
span.set_attribute('http.response.status_code', 201)
span.set_attribute('client.address', '203.0.113.5')
span.set_attribute('user_agent.original', 'Mozilla/5.0 ...')
```
```python
# HTTP client span (outbound request)
span.set_attribute('http.request.method', 'GET')
span.set_attribute('url.full', 'https://payment-gateway.example.com/charge')
span.set_attribute('server.address', 'payment-gateway.example.com')
span.set_attribute('server.port', 443)
span.set_attribute('http.response.status_code', 200)
```
```python
# Database client span
span.set_attribute('db.system', 'postgresql')
span.set_attribute('db.name', 'orders')
span.set_attribute('db.statement', 'SELECT * FROM orders WHERE id = $1')
span.set_attribute('db.operation', 'SELECT')
span.set_attribute('server.address', 'postgres.internal')
span.set_attribute('server.port', 5432)
```
```python
# Messaging producer span (Kafka)
span.set_attribute('messaging.system', 'kafka')
span.set_attribute('messaging.operation', 'publish')
span.set_attribute('messaging.destination.name', 'orders.created')
span.set_attribute('messaging.message.id', 'msg-abc-123')
span.set_attribute('server.address', 'kafka.internal')
span.set_attribute('server.port', 9092)
```
```python
# RPC client span (gRPC)
span.set_attribute('rpc.system', 'grpc')
span.set_attribute('rpc.service', 'payments.PaymentService')
span.set_attribute('rpc.method', 'ProcessPayment')
span.set_attribute('rpc.grpc.status_code', 0)  # 0 = OK
span.set_attribute('server.address', 'payment-service.internal')
span.set_attribute('server.port', 9090)
```
| Deprecated (old) | Current (new) | Since |
|---|---|---|
| http.method | http.request.method | semconv 1.23.0 |
| http.status_code | http.response.status_code | semconv 1.23.0 |
| http.url | url.full | semconv 1.23.0 |
| http.target | url.path + url.query | semconv 1.23.0 |
| net.peer.name | server.address | semconv 1.23.0 |
| net.peer.port | server.port | semconv 1.23.0 |
| http.scheme | url.scheme | semconv 1.23.0 |
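Migrating old attribute keys is mechanical for everything except http.target, which splits into two keys. A hypothetical helper sketching the rename (the mapping is taken from the table above; migrate_attributes is illustration, not a library function):

```python
# One-to-one renames from the deprecation table; http.target is excluded
# because it needs a real URL split into url.path + url.query.
SEMCONV_RENAMES = {
    'http.method': 'http.request.method',
    'http.status_code': 'http.response.status_code',
    'http.url': 'url.full',
    'net.peer.name': 'server.address',
    'net.peer.port': 'server.port',
    'http.scheme': 'url.scheme',
}

def migrate_attributes(attrs: dict) -> dict:
    """Rewrite deprecated keys to their current semconv names."""
    return {SEMCONV_RENAMES.get(k, k): v for k, v in attrs.items()}

print(migrate_attributes({'http.method': 'GET', 'url.path': '/x'}))
# {'http.request.method': 'GET', 'url.path': '/x'}
```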