Day 1

Observability Fundamentals

Lab 1 Flashcards — 10 Core Terms
  • SLI — Service Level Indicator: a measurable metric (e.g. p99 latency < 200 ms)
  • SLO — Service Level Objective: a target for an SLI (e.g. 99.9% of requests < 200 ms)
  • SLA — Service Level Agreement: a contractual commitment; a breach carries financial penalties
  • Error Budget — 1 − SLO availability; a 99.9% SLO leaves a 0.1% budget ≈ 43 min/month of allowed downtime
  • Cardinality — the number of unique values in a dimension. High-cardinality: user_id, request_id. Low-cardinality: region, status_code
  • RED Method — Rate, Errors, Duration; for request-driven services
  • USE Method — Utilisation, Saturation, Errors; for resources (CPU, disk, network)
  • Golden Signals — Latency, Traffic, Errors, Saturation (Google SRE)
  • Push vs Pull — Prometheus = pull (scrapes /metrics); OTLP = push (the exporter sends data to a receiver)
  • Observability — the ability to infer internal system state from external outputs (logs, metrics, traces)
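The error-budget arithmetic is worth verifying by hand once; a quick sketch (the ≈43 min/month figure assumes a 30-day month):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per period for a given availability SLO."""
    total_minutes = days * 24 * 60          # 30 days = 43,200 minutes
    return (1 - slo) * total_minutes

# 99.9% SLO → 0.1% budget ≈ 43.2 minutes/month
print(round(error_budget_minutes(0.999), 1))    # 43.2
# 99.99% SLO → ≈ 4.3 minutes/month
print(round(error_budget_minutes(0.9999), 1))   # 4.3
```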

Day 2

OTel Architecture & Core Concepts

Lab 1 OTel Data Flow Diagram

Your diagram should represent this exact data flow:

architecture diagram
Application Code
     │
     ▼
OTel API  ──── language-specific interfaces; ships as no-op without SDK
     │
     ▼
OTel SDK  ──── TracerProvider / MeterProvider / LoggerProvider
     │                  │
     │         [SpanProcessor]
     │         [BatchSpanProcessor]  ◄─── always use Batch in production
     │
     ▼
Exporter  ──── OTLPSpanExporter / PrometheusExporter / JaegerExporter
     │
     ▼
OTel Collector  (optional but strongly recommended)
  [Receiver]  ──► [Processor]  ──► [Exporter]
  otlp grpc       memory_limiter    otlp/jaeger
  otlp http       batch             prometheus
  filelog         attributes        logging
     │
     ▼
Backend  ──── Jaeger / Prometheus / Grafana / Splunk / etc.

Day 3

Traces & Context Propagation

Lab 1 Run OTel Demo + Decode traceparent
1
Run the official OTel Demo
bash
git clone https://github.com/open-telemetry/opentelemetry-demo.git
cd opentelemetry-demo
docker compose up --no-build

# Wait ~2 minutes for all services to be healthy
# Frontend UI: http://localhost:8080
# Jaeger UI:   http://localhost:8080/jaeger/ui
# Grafana:     http://localhost:8080/grafana
2
Decode a traceparent header (DevTools → Network tab)
traceparent format
# Example header value you will see in the browser:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

# Breakdown:
#  00                               = version (always 00 in W3C spec)
#  4bf92f3577b34da6a3ce929d0e0e4736  = traceId  (128-bit, 32 hex chars)
#  00f067aa0ba902b7                  = parentSpanId (64-bit, 16 hex chars)
#  01                               = trace-flags: 01=sampled, 00=not sampled
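The same breakdown can be reproduced with a few lines of standard-library Python (field names in the returned dict are illustrative, not an OTel API):

```python
def decode_traceparent(header: str) -> dict:
    """Split a W3C traceparent header value into its four fields."""
    version, trace_id, parent_span_id, flags = header.split('-')
    return {
        'version': version,                 # '00' in the current spec
        'trace_id': trace_id,               # 128-bit, 32 hex chars
        'parent_span_id': parent_span_id,   # 64-bit, 16 hex chars
        'sampled': int(flags, 16) & 0x01 == 0x01,
    }

fields = decode_traceparent(
    '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
print(fields['sampled'])   # True
```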
3
Minimal propagator — inject and extract pattern (runnable against opentelemetry-api)
python
from opentelemetry.context import Context
from opentelemetry.trace import (
    NonRecordingSpan, SpanContext, TraceFlags,
    get_current_span, set_span_in_context,
)


class TraceContextPropagator:

    def inject(self, carrier: dict, context: Context) -> None:
        span = get_current_span(context)
        span_context = span.get_span_context()
        if not span_context.is_valid:
            return
        traceparent = (
            f'00-{span_context.trace_id:032x}'
            f'-{span_context.span_id:016x}'
            f'-{"01" if span_context.trace_flags.sampled else "00"}'
        )
        carrier['traceparent'] = traceparent
        if span_context.trace_state:
            carrier['tracestate'] = str(span_context.trace_state)

    def extract(self, carrier: dict) -> Context:
        header = carrier.get('traceparent', '')
        parts = header.split('-')
        if len(parts) != 4 or parts[0] != '00':
            return Context()  # invalid → empty context
        if len(parts[1]) != 32 or len(parts[2]) != 16:
            return Context()  # malformed IDs → empty context
        trace_id = int(parts[1], 16)
        span_id  = int(parts[2], 16)
        flags    = TraceFlags(int(parts[3], 16))
        span_ctx = SpanContext(
            trace_id=trace_id, span_id=span_id,
            is_remote=True, trace_flags=flags
        )
        return set_span_in_context(NonRecordingSpan(span_ctx))

Day 4

Metrics in OpenTelemetry

Lab 1 Instrument HTTP Server — Counter, Histogram, View
1
Install dependencies
bash
pip install opentelemetry-sdk \
            opentelemetry-exporter-otlp-proto-grpc \
            opentelemetry-exporter-prometheus \
            opentelemetry-instrumentation-flask \
            prometheus-client \
            flask
2
Instrumented Flask server — Counter + Histogram (app.py)
python — app.py
import time, random
from flask import Flask, request
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.view import View

exporter = OTLPMetricExporter(endpoint='http://localhost:4317', insecure=True)
reader   = PeriodicExportingMetricReader(exporter, export_interval_millis=5000)

# View: rename instrument + filter to 2 attributes only
latency_view = View(
    instrument_name='http.server.request.duration',
    name='api.request.duration',
    attribute_keys={'http.request.method', 'http.response.status_code'}
)

provider = MeterProvider(metric_readers=[reader], views=[latency_view])
metrics.set_meter_provider(provider)
meter = metrics.get_meter('my.server', '1.0.0')

# Counter — monotonically increasing
request_counter = meter.create_counter(
    name='http.server.request.total',
    description='Total HTTP requests received',
    unit='1',
)

# Histogram — latency distribution
latency_histogram = meter.create_histogram(
    name='http.server.request.duration',
    description='HTTP request duration in seconds',
    unit='s',
)

app = Flask(__name__)

@app.route('/api/orders', methods=['GET'])
def get_orders():
    start = time.time()
    time.sleep(random.uniform(0.01, 0.3))   # simulate work
    duration = time.time() - start

    attrs = {
        'http.request.method': 'GET',
        'http.response.status_code': 200,
        'service.name': 'order-service'
    }
    request_counter.add(1, attrs)
    latency_histogram.record(duration, attrs)
    return {'orders': []}, 200

if __name__ == '__main__':
    app.run(port=5000)
3
Switch to Prometheus scrape endpoint
python
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

start_http_server(port=8000)   # exposes /metrics on :8000
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader], views=[latency_view])

# Verify:
# curl http://localhost:8000/metrics | grep http_server
#
# api_request_duration_bucket{http_request_method="GET",le="0.05"} 12.0
# http_server_request_total_total{http_request_method="GET",...}   42.0
After applying the View, the metric name in /metrics changes to api_request_duration_* and only http_request_method + http_response_status_code labels appear.
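The name changes follow from how OTel metric names are mapped for Prometheus exposition: dots become underscores, and counters gain a _total suffix. A rough sketch of that mapping — simplified, and only to illustrate the doubled suffix shown above; the real exporter also handles units and invalid characters:

```python
def to_prometheus_name(otel_name: str, is_counter: bool = False) -> str:
    """Approximate OTel → Prometheus metric-name mapping (simplified sketch)."""
    name = otel_name.replace('.', '_')
    if is_counter:
        name += '_total'   # appended even when the name already ends in 'total'
    return name

# The counter 'http.server.request.total' ends up double-suffixed:
print(to_prometheus_name('http.server.request.total', is_counter=True))
# The renamed histogram keeps its View name:
print(to_prometheus_name('api.request.duration'))
```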

Day 5

Logs in OpenTelemetry

Lab 1 Java log4j2 OTel Appender + Filelog Receiver
1
Maven dependencies (pom.xml)
xml — pom.xml
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-log4j-appender-2.17</artifactId>
    <version>2.9.0-alpha</version>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
    <version>1.43.0</version>
</dependency>
2
log4j2.xml — with OTel appender + trace correlation pattern
xml — log4j2.xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <!-- Console: injects trace_id and span_id via MDC -->
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout
        pattern="%d [%X{trace_id}] [%X{span_id}] %-5level %msg%n"/>
    </Console>

    <!-- OTel appender: ships logs via OTLP to Collector -->
    <OpenTelemetry name="OpenTelemetryAppender"/>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console"/>
      <AppenderRef ref="OpenTelemetryAppender"/>
    </Root>
  </Loggers>
</Configuration>
3
Collector config — filelog receiver with regex parser
yaml — collector.yaml
receivers:
  filelog:
    include: ["/var/log/myapp/*.log"]
    start_at: beginning
    operators:
      - type: regex_parser
        regex: '^(?P<timestamp>\S+ \S+) \[(?P<trace_id>[a-f0-9]*)\] \[(?P<span_id>[a-f0-9]*)\] (?P<severity>\S+) (?P<message>.*)$'
        timestamp:
          parse_from: attributes.timestamp
          layout: '%Y-%m-%d %H:%M:%S.%f'
        severity:
          parse_from: attributes.severity

processors:
  batch:

exporters:
  logging:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [batch]
      exporters: [logging]
Verify: Collector output shows log records with trace_id and span_id fields. In Grafana Loki, clicking a trace_id opens the correlated trace in Jaeger/Tempo.
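Before deploying, the regex_parser pattern can be sanity-checked against a sample line; Python's re module uses the same (?P<name>…) named-group syntax as the Collector's parser (the log line below is illustrative):

```python
import re

# Same pattern as in collector.yaml
PATTERN = (r'^(?P<timestamp>\S+ \S+) \[(?P<trace_id>[a-f0-9]*)\] '
           r'\[(?P<span_id>[a-f0-9]*)\] (?P<severity>\S+) (?P<message>.*)$')

# Example line in the log4j2 pattern layout above (%-5level pads 'INFO')
line = ('2024-05-01 12:00:00.123 [4bf92f3577b34da6a3ce929d0e0e4736] '
        '[00f067aa0ba902b7] INFO  Order 42 created')

m = re.match(PATTERN, line)
print(m.group('severity'), m.group('trace_id'))
```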

Day 6

Automatic Instrumentation

Lab 1 Java Agent — Spring Boot Zero-Code Instrumentation
1
Download the Java agent
bash
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
2
Launch with all env vars configured
bash
export OTEL_SERVICE_NAME=order-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_LOGS_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=otlp
export OTEL_TRACES_EXPORTER=otlp
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=1.0
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=dev,service.version=1.0.0

java -javaagent:./opentelemetry-javaagent.jar \
     -jar target/order-service-1.0.0.jar
3
Toggle sampling and observe the difference
bash
# 10% sampling — 1 in ~10 requests generates a trace
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1

# Drop everything
export OTEL_TRACES_SAMPLER=always_off

# Sample everything (default for dev)
export OTEL_TRACES_SAMPLER=always_on
parentbased_traceidratio respects the incoming sampling decision: if an upstream service sent trace-flags=01, this service samples the trace regardless of the ratio argument. The ratio only applies to traces this service starts itself.
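For root spans, the ratio decision is deterministic on the trace ID rather than random per request. A simplified sketch of the idea behind the SDK's TraceIdRatioBased sampler (the real implementation has more edge-case handling):

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio sampling on the low 64 bits of the trace ID."""
    bound = round(ratio * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound

tid = 0x4bf92f3577b34da6a3ce929d0e0e4736
print(should_sample(tid, 1.0))   # always sampled, like always_on
print(should_sample(tid, 0.0))   # never sampled, like always_off
```

Because the decision is a pure function of the trace ID, every SDK that sees the same trace ID makes the same decision.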

Day 8

Manual Instrumentation

Lab 1 Custom Spans, Exception Recording & Metrics
1
Full working example — spans, events, exception recording
python — manual_trace.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.trace import StatusCode

# SDK bootstrap
exporter = OTLPSpanExporter(endpoint='http://localhost:4317', insecure=True)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer('com.myapp.payments', '1.0.0')


def process_payment(order_id: str, amount: float) -> dict:
    with tracer.start_as_current_span('payments.process') as span:
        # Semantic convention attributes
        span.set_attribute('order.id', order_id)
        span.set_attribute('payment.amount', amount)
        span.set_attribute('payment.currency', 'USD')

        # Span event — timestamped annotation (not an error)
        span.add_event('payment.validation.started')

        try:
            result = call_payment_gateway(order_id, amount)
            span.add_event('payment.gateway.responded', {
                'gateway.response_code': result['code']
            })
            # Do NOT set OK unless you explicitly want to mark success
            return result
        except Exception as e:
            span.record_exception(e)   # adds exception event with stacktrace
            span.set_status(StatusCode.ERROR, f'Payment declined: {str(e)}')
            raise


def call_payment_gateway(order_id, amount):
    # Child span — CLIENT kind for outbound calls
    with tracer.start_as_current_span(
        'payments.gateway.charge',
        kind=trace.SpanKind.CLIENT
    ) as child:
        child.set_attribute('server.address', 'gateway.example.com')
        child.set_attribute('server.port', 443)
        return {'code': '00', 'auth_code': 'ABC123'}
In Jaeger, filter by Tags: error=true. The failed span shows Status=ERROR, an exception event with exception.type, exception.message, and exception.stacktrace.

Day 9

OTel Collector — Basics

Lab 1 Run Collector + Send Test Trace with otel-cli
1
docker-compose.yml — Collector + Jaeger + Prometheus
yaml — docker-compose.yml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - '16686:16686'   # Jaeger UI
      - '4317:4317'     # OTLP gRPC

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ['--config=/etc/otelcol-contrib/config.yaml']
    volumes:
      - ./collector.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - '4318:4318'   # OTLP HTTP (app → collector)
      - '8888:8888'   # Collector internal metrics
    depends_on: [jaeger]

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'
2
collector.yaml — three pipelines, correct processor ordering
yaml — collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:          # ← MUST be first
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s
  batch:                   # ← MUST be last
    timeout: 10s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  logging:
    verbosity: detailed

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [otlp/jaeger, logging]
    metrics:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [prometheus, logging]
    logs:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [logging]
3
Send a test span with otel-cli
bash
# Install (macOS)
brew install equinix-labs/otel-cli/otel-cli

# Send a test span
otel-cli exec \
  --service my-test-service \
  --name 'test.operation' \
  --attrs 'http.request.method=GET,http.response.status_code=200' \
  --endpoint http://localhost:4318 \
  -- echo 'hello from otel-cli'

# Verify health check
curl http://localhost:13133/
# {"status":"Server available"}

Day 10

OTel Collector — Advanced

Lab 1 Tail Sampling — status_code + latency + baseline composite
Tail sampling only works if all spans of a trace reach the same Collector instance. In multi-Collector deployments, add a load-balancing Collector tier in front (the loadbalancing exporter), routing by trace ID hash.
yaml — tail_sampling processor
processors:
  tail_sampling:
    decision_wait: 10s         # wait up to 10s for all spans
    num_traces: 50000
    expected_new_traces_per_sec: 100
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}

      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 500}

      - name: baseline-5pct
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

      - name: composite-policy
        type: composite
        composite:
          max_total_spans_per_second: 1000
          policy_order: [errors-policy, slow-traces-policy, baseline-5pct]
          composite_sub_policy:
            - name: errors-policy
              type: status_code
              status_code: {status_codes: [ERROR]}
            - name: slow-traces-policy
              type: latency
              latency: {threshold_ms: 500}
            - name: baseline-5pct
              type: probabilistic
              probabilistic: {sampling_percentage: 5}
          rate_allocation:
            - policy: errors-policy
              percent: 50
            - policy: slow-traces-policy
              percent: 30
            - policy: baseline-5pct
              percent: 20
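The rate_allocation block splits the composite budget of 1000 spans/s among the sub-policies by percentage; the arithmetic is straightforward (a sketch for intuition, not Collector code):

```python
def allocate(budget_spans_per_sec: int, percents: dict) -> dict:
    """Split a composite span budget across sub-policies by percentage."""
    return {policy: budget_spans_per_sec * pct // 100
            for policy, pct in percents.items()}

budgets = allocate(1000, {'errors-policy': 50,
                          'slow-traces-policy': 30,
                          'baseline-5pct': 20})
print(budgets)
```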
Lab 2 spanmetrics Connector — RED metrics from traces
yaml — collector.yaml additions
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]
    dimensions:
      - name: http.request.method
      - name: http.response.status_code
      - name: service.name
    exemplars:
      enabled: true
    metrics_flush_interval: 15s

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [otlp/jaeger, spanmetrics]  # spanmetrics = exporter here
    metrics/spanmetrics:
      receivers:  [spanmetrics]                  # spanmetrics = receiver here
      processors: [memory_limiter, batch]
      exporters:  [prometheus]

# Generated Prometheus metrics:
# calls_total{service_name="...",http_request_method="GET",...}   142
# duration_milliseconds_bucket{le="50",...}                        98
Lab 3 filter processor — Drop health check spans
yaml — filter processor
processors:
  filter/drop-healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["url.path"] == "/health"'
        - 'attributes["url.path"] == "/healthz"'
        - 'attributes["url.path"] == "/ready"'
        - 'attributes["user_agent.original"] == "ELB-HealthChecker/2.0"'

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [memory_limiter, filter/drop-healthchecks, batch]
      exporters:  [otlp/jaeger]
Send a span with url.path=/health — it will NOT appear in Jaeger. A span with url.path=/api/orders WILL appear.
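The OTTL conditions above are drop rules: a span matching any one of them is discarded. The logic can be mimicked in a few lines of Python for intuition (attribute names follow the config; the function itself is illustrative, not Collector code):

```python
DROP_PATHS = {'/health', '/healthz', '/ready'}
DROP_USER_AGENTS = {'ELB-HealthChecker/2.0'}

def keep_span(attributes: dict) -> bool:
    """Return False for spans the filter processor would drop."""
    if attributes.get('url.path') in DROP_PATHS:
        return False
    if attributes.get('user_agent.original') in DROP_USER_AGENTS:
        return False
    return True

print(keep_span({'url.path': '/health'}))       # False → dropped
print(keep_span({'url.path': '/api/orders'}))   # True  → exported
```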

Day 11

Backends, Exporters & Visualisation

Lab 1 Grafana + Prometheus + Jaeger — Exemplar Linking
1
Grafana datasource provisioning — enables exemplar links
yaml — grafana-datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: jaeger

  - name: Jaeger
    type: jaeger
    uid: jaeger
    url: http://jaeger:16686
In Grafana, open a histogram panel → hover over a data point → you'll see a ◆ (exemplar diamond). Click it to open the linked trace directly in Jaeger.

Day 12

Semantic Conventions

Lab 1 5 Annotated Span Patterns

Span 1 — Inbound HTTP Server (kind: SERVER)

python
span.set_attribute('http.request.method',        'POST')
span.set_attribute('url.full',                   'https://api.example.com/v1/checkout')
span.set_attribute('url.path',                   '/v1/checkout')
span.set_attribute('url.scheme',                  'https')
span.set_attribute('server.address',              'api.example.com')
span.set_attribute('server.port',                 443)
span.set_attribute('http.response.status_code',   201)
span.set_attribute('client.address',              '203.0.113.5')
span.set_attribute('user_agent.original',         'Mozilla/5.0 ...')

Span 2 — Outbound HTTP Client (kind: CLIENT)

python
span.set_attribute('http.request.method',        'GET')
span.set_attribute('url.full',                   'https://payment-gateway.example.com/charge')
span.set_attribute('server.address',              'payment-gateway.example.com')
span.set_attribute('server.port',                 443)
span.set_attribute('http.response.status_code',   200)

Span 3 — Database Query (kind: CLIENT)

python
span.set_attribute('db.system',       'postgresql')
span.set_attribute('db.name',         'orders')
span.set_attribute('db.statement',    'SELECT * FROM orders WHERE id = $1')
span.set_attribute('db.operation',    'SELECT')
span.set_attribute('server.address',  'postgres.internal')
span.set_attribute('server.port',     5432)

Span 4 — Kafka Producer (kind: PRODUCER)

python
span.set_attribute('messaging.system',            'kafka')
span.set_attribute('messaging.operation',          'publish')
span.set_attribute('messaging.destination.name',   'orders.created')
span.set_attribute('messaging.message.id',         'msg-abc-123')
span.set_attribute('server.address',               'kafka.internal')
span.set_attribute('server.port',                  9092)

Span 5 — gRPC Client Call (kind: CLIENT)

python
span.set_attribute('rpc.system',               'grpc')
span.set_attribute('rpc.service',              'payments.PaymentService')
span.set_attribute('rpc.method',               'ProcessPayment')
span.set_attribute('rpc.grpc.status_code',     0)   # 0 = OK
span.set_attribute('server.address',            'payment-service.internal')
span.set_attribute('server.port',               9090)
Lab 2 Deprecated Attribute Migrations
  • http.method → http.request.method (semconv 1.23.0)
  • http.status_code → http.response.status_code (semconv 1.23.0)
  • http.url → url.full (semconv 1.23.0)
  • http.target → url.path + url.query (semconv 1.23.0)
  • net.peer.name → server.address (semconv 1.23.0)
  • net.peer.port → server.port (semconv 1.23.0)
  • http.scheme → url.scheme (semconv 1.23.0)
The OTCA exam may test both old and new attribute names. Know the migration direction: net.peer.* → server.*, http.* → http.request.*/http.response.*/url.*
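For drilling, the one-to-one renames can be turned into a lookup that rewrites old attribute keys (mapping copied from the table above; the helper is a study aid, not an OTel API):

```python
SEMCONV_MIGRATIONS = {
    'http.method': 'http.request.method',
    'http.status_code': 'http.response.status_code',
    'http.url': 'url.full',
    'net.peer.name': 'server.address',
    'net.peer.port': 'server.port',
    'http.scheme': 'url.scheme',
}
# http.target splits into url.path + url.query, so it has no 1:1 entry.

def migrate(attrs: dict) -> dict:
    """Rename deprecated semconv keys, leaving current keys untouched."""
    return {SEMCONV_MIGRATIONS.get(k, k): v for k, v in attrs.items()}

print(migrate({'http.method': 'GET', 'http.status_code': 200}))
```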

Done

Lab Completion Checklist

  • Day 1 — Created 10 flashcards covering SLI/SLO/SLA, RED, USE, Golden Signals, Cardinality
  • Day 2 — Drew full OTel data flow: App → API → SDK → Exporter → Collector → Backend
  • Day 3 — Decoded a traceparent header by hand; ran OTel Demo; viewed traces in Jaeger
  • Day 4 — Instrumented HTTP server with Counter + Histogram; created a View; verified /metrics
  • Day 5 — Configured log4j2 OTel appender; verified TraceID in logs; set up filelog receiver
  • Day 6 — Ran Java agent with zero code changes; toggled sampler args; verified DB spans in Jaeger
  • Day 8 — Added custom spans with attributes + events; recorded exception; confirmed ERROR status
  • Day 9 — Ran Collector with 3-pipeline config; sent test span with otel-cli; verified health_check
  • Day 10 — Configured tail_sampling composite policy; set up spanmetrics connector; filter processor
  • Day 11 — Full Grafana stack with exemplar linking from Prometheus histogram to Jaeger trace
  • Day 12 — Annotated 5 spans with correct semconv; documented deprecated attribute renames