Black box, monitoring system, and observability

A distributed system contains multiple physically separate nodes linked together by a network. Such systems are inherently difficult to monitor because they are composed of “black-box” components: nodes from different, even competing, providers. In such an environment, the major challenges are how to debug quickly, track service processing time, identify performance bottlenecks, and reasonably evaluate service capacity. A monitoring platform plays a crucial role in system feedback, without which improvement is impossible.

Axon is built on a peer-to-peer network. Its real-time monitoring system is composed of components that collect logs, metrics, and system resource usage data from the nodes and display the information as graphs in Grafana, the data visualization platform. This article gives an overview of the architecture of the Axon monitoring system and its key components (II), the deployment of the monitoring platform (III), and how to read the metrics (IV). Sections III and IV include technical details and parameters that will be particularly informative for DevOps; if these technical details are not your focus, feel free to skip them.

Before diving into the technical details, let's refresh our understanding of the concept of “observability”. Introduced in response to the black-box problems in distributed systems, observability refers to the ability to measure the internal states of a system by examining its outputs. While traditional monitoring and alerting focus on system anomalies and failures, observability is about showing the actual behavior of the system itself. For an observable system, the primary concern is the state of the application itself, such as its current throughput and latency, rather than indirect evidence such as the state of the machine or network on which the application runs.

Logs, metrics, and traces are seen as the three pillars of observability.


Logs are primarily used for recording discrete, time-stamped events. Applications write log messages in a specific format to a file, and a logging program then aggregates the logs for analysis. Mature solutions, such as ELK Stack and Grafana Loki, are available on the market.
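
As a minimal sketch of what a discrete, time-stamped event in a specific format can look like, the Python snippet below renders one JSON object per log line. The field names (timestamp, level, message) and the format_event helper are illustrative assumptions, not Axon's actual log schema.

```python
import json
from datetime import datetime, timezone

def format_event(level, message, **fields):
    """Render one log event as a single JSON line (hypothetical schema)."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
    }
    event.update(fields)  # extra structured context, e.g. block height
    return json.dumps(event, sort_keys=True)

line = format_event("INFO", "block committed", height=42)
print(line)
```

Keeping every event on one line and in one machine-readable format is what lets a log shipper parse and index the file without custom rules.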

Despite being comprehensive and information-rich, log files eat up a lot of storage space. Although log events can be ordered by timestamp, it is difficult to show how they relate to one another.


Metrics are mainly aggregated data measured over intervals of time. They are suited to measurable data, such as the number of calls, CPU usage, and query size. Compared to logs, metrics take up much less storage space. Monitoring and alerting are the main uses of metrics, for which Prometheus has become the de facto standard.
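
To illustrate why aggregated metrics are so much smaller than logs, here is a small sketch that collapses individual call timestamps into one count per time window. The function name and the sample timestamps are made up for the example.

```python
from collections import Counter

def calls_per_interval(timestamps, interval_s):
    """Count events per fixed time window, keyed by window start (seconds)."""
    counts = Counter()
    for ts in timestamps:
        window_start = int(ts // interval_s) * interval_s
        counts[window_start] += 1
    return dict(sorted(counts.items()))

# Seven call timestamps (in seconds) collapse into three data points.
calls = [0.5, 1.2, 4.9, 5.1, 7.7, 9.9, 12.0]
print(calls_per_interval(calls, 5))
```

However many calls occur, each window contributes a single number, which is the storage advantage described above.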


A trace is a series of causally related distributed events, providing a wider and continuous view of an application. Tracing follows a program’s data progression path through a system. A trace contains one or more spans; a span is the logical unit of work in Jaeger's terminology. Each span includes the operation name, start time, and duration. Tracing adds critical end-to-end visibility into the health of an application, which helps in understanding system behavior and in debugging and troubleshooting performance issues.
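
The relationship between a trace and its spans can be sketched as a small data structure. This is a simplified illustration: the operation names are hypothetical, and real tracers such as Jaeger link spans by span ID rather than by operation name.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    operation_name: str
    start_time: float             # seconds since the trace began
    duration: float               # seconds
    parent: Optional[str] = None  # operation name of the causing span

@dataclass
class Trace:
    spans: List[Span] = field(default_factory=list)

    def total_duration(self):
        """Wall-clock time from the earliest start to the latest finish."""
        start = min(s.start_time for s in self.spans)
        end = max(s.start_time + s.duration for s in self.spans)
        return end - start

trace = Trace([
    Span("handle_request", 0.0, 0.50),
    Span("check_block", 0.05, 0.20, parent="handle_request"),
    Span("write_storage", 0.30, 0.15, parent="handle_request"),
])
print(trace.total_duration())
```

Because each child span records its own start time and duration, the slowest step of a request can be read directly from the trace.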

Implementing a monitoring system generally follows these steps:

  1. Set up event tracking
  2. Output formatted logs
  3. Collect metrics and logs
  4. Display data in a graphing interface
  5. Establish a response to early warnings

Axon’s Monitoring Platform

Axon's monitoring platform consists of the agent (Axon node), the monitor server, and the monitor dashboard. The diagram below illustrates these components and their interactions:

Figure 1. Overall design of Axon’s monitoring platform

Agent (Axon node)

The agent, or Axon node, collects monitoring metrics and interacts with the monitor server.

  • node-exporter: mainly collects system resource usage data from the target and converts it into a format supported by Prometheus, which runs on the monitor server. Prometheus periodically scrapes metrics from the node exporter. This data is visualized in Grafana.
  • axon-exporter: mainly collects Axon’s metrics, which are periodically scraped by Prometheus and eventually visualized in Grafana.
  • jaeger-agent: listens for and collects the trace data exposed by Axon and forwards it to the configured jaeger-collector.
  • filebeat-agent: a lightweight shipper that collects, centralizes, and forwards log data to Elasticsearch, a distributed search and analytics engine.
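
Both exporters serve metrics in the Prometheus text exposition format. The sketch below renders one gauge in that format to show what Prometheus reads on each scrape. The metric name demo_mempool_tx_count and the render_metric helper are hypothetical, and label-value escaping is omitted for brevity.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render samples in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        suffix = "{" + label_str + "}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"

text = render_metric(
    "demo_mempool_tx_count",  # hypothetical metric name
    "gauge",
    "Transactions currently in the mempool.",
    [({"instance": "node-1"}, 17)],
)
print(text)
```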

Monitor Server

The monitor server is mainly for displaying metric and trace data.

  • prometheus: collects Axon metrics, which are made available to Grafana as a data source.
  • jaeger-collector: receives the data pushed by jaeger-agent, which is made available to Grafana as a data source.
  • elasticsearch: collects and stores Axon's logs, which are made available to Grafana as a data source. Elasticsearch filters error logs and sends alerts through ES-alert (the alerting for Elasticsearch) in accordance with previously defined rules. (Please be aware that Elasticsearch may consume a lot of machine memory.)

Monitor Dashboard

  • Grafana: uses Prometheus, Jaeger, and Elasticsearch as data sources and displays the data in dashboards, monitors, and logs.

Deploy Monitor and Monitor Agent

Axon application performance monitoring (APM) supports one-click deployment of the monitor and the monitor agent.

Monitor Deployment

Step 1 Copy axon-devops to the target machine.

git clone
cd axon-devops/apm/monitor

Step 2 Edit prometheus.yml and roles/monitor/vars/main.yml.

prometheus.yml: the Prometheus configuration file.

scrape_config, one section of this file, specifies a set of targets and parameters describing how to scrape them. In the general case, one scrape configuration specifies a single job.

Below are a few scrape_config parameters set in Axon, and more configuration information is available here:


scrape_interval: 5s
# Scrape targets from a job every 5s. Default = 1m.

evaluation_interval: 5s
# Evaluate rules every 5s. Default = 1m.

scrape_timeout: 5s
# Time out a scrape after 5s. Global default = 10s.
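
Prometheus rejects a configuration in which the scrape timeout is longer than the scrape interval, which is one reason to lower scrape_timeout together with scrape_interval. The sketch below checks that constraint for the values above. The helper names are made up, and the parser handles only simple durations such as 5s or 1m, not compound ones.

```python
import re

_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}

def to_seconds(duration):
    """Convert a simple duration such as '5s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]

def timeout_fits_interval(cfg):
    """True when scrape_timeout does not exceed scrape_interval."""
    return to_seconds(cfg["scrape_timeout"]) <= to_seconds(cfg["scrape_interval"])

axon_cfg = {
    "scrape_interval": "5s",
    "evaluation_interval": "5s",
    "scrape_timeout": "5s",
}
print(timeout_fits_interval(axon_cfg))
```

With a 5s interval, leaving the timeout at the 10s global default would fail this check.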

roles/monitor/vars/main.yml: contains the parameters that Ansible requires when deploying the monitor.

monitor_dir: /home/ckb/axon-monitor

Step 3 Execute monitor deployment command.

cd axon-devops/apm/deploy
make monitor-clean # Stop the monitor service (historical data will be cleared; use with care)
make monitor-deploy # Start the monitor service
docker-compose ps # Check that the services started successfully

Monitor Agent Deployment

Step 1 Copy axon-devops directory to the target machine

git clone
cd axon-devops/apm/agent

Step 2 Edit config files.

  • .env: sets environment variables that are used when running the docker-compose.yml
# Used to push data to the jaeger server.
# Configure the ip port for jaeger-collector.
# Corresponds to the jaeger-collector service in monitor docker-compose.

# Used to interact with Axon.
# Associated with the [apm] tracing_address parameter.

# Used to collect logs for filebeat.
# Associated with the [logger] log_path parameter.

jaeger_agent_port config file is here.

  • filebeat.yml: the Filebeat configuration file, written in YAML as structured lists and dictionaries

- type: log
  enabled: true
  paths:
    - '/usr/share/filebeat/logs/*log'
# Filebeat collects files in /usr/share/filebeat/logs/ whose names end in "log"

fields_under_root: true
# Custom fields will be stored as top-level fields in the output file

keys_under_root: true
# Keys of the decoded JSON object are placed at the top level of the output document

ignore_older: 5m
# Filebeat will ignore files modified before the specified time span

scan_frequency: 1s
# How frequently Filebeat checks new files in the path specified for files collection

# filebeat sends logs to elasticsearch

hosts: ["ES_ADDRESS:9200"]
# elasticsearch host

- index: "axon-%{[agent.version]}-%{+yyyy.MM.dd}"
# Index used when creating the data source in Grafana. Logs can be located with axon-*
  • axon-devops/apm/deploy/roles/agent/vars/main.yaml
monitor_agent_dir: /home/ckb/axon-apm-agent
# Directory on the target machine where the monitor agent files are stored

log_path: \/home\/ckb\/axon\/logs
# Axon's log directory. Must match the log directory of the Axon deployment.

es_address: XXX.XX.XX.XX
# Intranet IP address of the machine where Elasticsearch is deployed.
  • axon-devops/apm/deploy/hosts
# Configure monitor agent deployment

# The monitor agent is deployed alongside the Axon node, so the Axon node IP must be specified here.

# prometheus_server
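
For the index setting in filebeat.yml above, the sketch below mimics how the pattern expands into one index name per agent version and day, and why the wildcard axon-* locates all of them. The version string and date are made-up example values, and expand_index is a hypothetical helper, not part of Filebeat.

```python
from datetime import date
from fnmatch import fnmatch

def expand_index(agent_version, day):
    """Mimic how 'axon-%{[agent.version]}-%{+yyyy.MM.dd}' expands."""
    return f"axon-{agent_version}-{day:%Y.%m.%d}"

# The version and date below are made-up example values.
index_name = expand_index("7.2.0", date(2022, 3, 14))
print(index_name)
print(fnmatch(index_name, "axon-*"))
```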


Step 3 Execute Monitor Agent deployment command

cd axon-devops/apm/deploy
make clean
# Clear axon monitor agent

make deploy
# Deploy axon monitor agent

docker-compose ps
# Check if service is successfully started.

View Monitoring Platform

When all services are up, you can access the corresponding monitoring platforms in a browser at the following addresses:

  • Grafana
  • Jaeger

Deploy High Availability Monitoring

High Availability (HA) monitoring is also supported. However, due to the high cost of operation and maintenance, Axon uses a single-node deployment (as explained in Section II) rather than HA.

If your monitoring platform is suitable for HA, you can refer to the following tutorials:

Monitoring Metrics Explained

This section provides an overview of the monitoring metrics on each Grafana panel.


Resource Overview

Figure 2. Server Resource Overview

| No. | Metrics Browser | Type | Description | Legend | Details | Expression |
|---|---|---|---|---|---|---|
| 1 | Overall total 5m load & average CPU used | CPU | Monitor overall CPU usage | CPU Cores | Number of cores for all CPUs | count(node_cpu_seconds_total{job=~"node_exporter", mode='system'}) |
| | | | | Total 5m load | Load5 for all CPUs | sum(node_load5{job=~"node_exporter"}) |
| | | | | Overall average used% | Average utilization of all CPUs | avg(1 - avg(irate(node_cpu_seconds_total{job=~"node_exporter",mode="idle"}[5m])) by (instance)) * 100 |
| | | | | Load5 Avg | Load5 Avg for all CPUs | sum(node_load5{job=~"node_exporter"}) / count(node_cpu_seconds_total{job=~"node_exporter", mode='system'}) |
| 2 | Overall total memory & average memory used | Memory | Monitor overall memory usage | Total | Total memory | sum(node_memory_MemTotal_bytes{job=~"node_exporter"}) |
| | | | | Total Used | Overall used memory | sum(node_memory_MemTotal_bytes{job=~"node_exporter"} - node_memory_MemAvailable_bytes{job=~"node_exporter"}) |
| | | | | Total Average Used | Utilization of all memory | (sum(node_memory_MemTotal_bytes{job=~"node_exporter"} - node_memory_MemAvailable_bytes{job=~"node_exporter"}) / sum(node_memory_MemTotal_bytes{job=~"node_exporter"}))*100 |
| 3 | Overall total disk & average disk used% | Disk | Monitor overall disk usage | Total | Total disk size | sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"})by(device,instance)) |
| | | | | Total Used | Overall used disk | sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"})by(device,instance)) |
| | | | | Total Average Used% | Utilization of all disk | (sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"})by(device,instance))) *100/(sum(avg(node_filesystem_avail_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"})by(device,instance))+(sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"})by(device,instance)))) |
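
The arithmetic behind two of these legends can be checked by hand. The sketch below reproduces the memory utilization formula, (MemTotal - MemAvailable) / MemTotal * 100, and the Load5 Avg formula, total load5 divided by total cores, on made-up values for two nodes. The function names are illustrative only.

```python
def memory_used_percent(total_bytes, available_bytes):
    """(MemTotal - MemAvailable) / MemTotal * 100, summed over all nodes."""
    total = sum(total_bytes)
    used = sum(t - a for t, a in zip(total_bytes, available_bytes))
    return used / total * 100

def load5_per_core(load5_values, core_counts):
    """Sum of node_load5 divided by the total number of CPU cores."""
    return sum(load5_values) / sum(core_counts)

GIB = 1024 ** 3
# Two nodes: 16 GiB with 4 GiB available, 8 GiB with 2 GiB available.
print(memory_used_percent([16 * GIB, 8 * GIB], [4 * GIB, 2 * GIB]))
# Two nodes: load5 of 2.0 on 4 cores, load5 of 1.0 on 2 cores.
print(load5_per_core([2.0, 1.0], [4, 2]))
```

A Load5 Avg near or above 1 means that, on average, every core has had work queued over the last five minutes.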

Resource Details

Figure 3. Resource details

| No. | Name | Type | Description | Legend | Details | Expression |
|---|---|---|---|---|---|---|
| 1 | Internet traffic per hour | Network | Traffic statistics | Receive | Receive statistics | increase(node_network_receive_bytes_total{instance=~"$node",device=~"$device"}[60m]) |
| | | | | Transmit | Transmit statistics | increase(node_network_transmit_bytes_total{instance=~"$node",device=~"$device"}[60m]) |
| 2 | CPU% Basic | CPU | Node CPU usage | System | Average sy ratio | avg(irate(node_cpu_seconds_total{instance=~"$node",mode="system"}[5m])) by (instance) *100 |
| | | | | User | Average us ratio | avg(irate(node_cpu_seconds_total{instance=~"$node",mode="user"}[5m])) by (instance) *100 |
| | | | | Iowait | Average wa ratio | avg(irate(node_cpu_seconds_total{instance=~"$node",mode="iowait"}[5m])) by (instance) *100 |
| | | | | Total | Average CPU usage | (1 - avg(irate(node_cpu_seconds_total{instance=~"$node",mode="idle"}[5m])) by (instance))*100 |
| 3 | Memory Basic | Memory | Node memory usage | Total | Total memory | node_memory_MemTotal_bytes{instance=~"$node"} |
| | | | | Used | Used memory | node_memory_MemTotal_bytes{instance=~"$node"} - node_memory_MemAvailable_bytes{instance=~"$node"} |
| | | | | Available | Available memory size | node_memory_MemAvailable_bytes{instance=~"$node"} |
| | | | | Used% | Utilization of all memory | (1 - (node_memory_MemAvailable_bytes{instance=~"$node"} / (node_memory_MemTotal_bytes{instance=~"$node"})))* 100 |
| 4 | Network bandwidth usage per second all | Network | Network bandwidth | Receive | Total receive statistics per second | irate(node_network_receive_bytes_total{instance=~'$node',device=~"$device"}[5m])*8 |
| | | | | Transmit | Transmit statistics per second | irate(node_network_transmit_bytes_total{instance=~'$node',device=~"$device"}[5m])*8 |
| 5 | System Load | CPU | System Load | 1m | Load 1 | node_load1{instance=~"$node"} |
| | | | | 5m | Load 5 | node_load5{instance=~"$node"} |
| | | | | 15m | Load 15 | node_load15{instance=~"$node"} |
| | | | | CPU cores | Number of cores for CPU | sum(count(node_cpu_seconds_total{instance=~"$node", mode='system'}) by (cpu,instance)) by(instance) |
| | | | | Load5 Avg | Load5 Avg for CPU | avg(node_load5{instance=~"$node"}) / count(node_cpu_seconds_total{instance=~"$node", mode='system'}) |
| | | | | Load5 Avg-{{instance}} | Not shown, for alert | sum(node_load5) by (instance) / count(node_cpu_seconds_total{job=~"node_exporter", mode='system'}) by (instance) |
| 6 | Disk R/W Data | Disk | Disk throughput | Read bytes | | node_load1{instance=~"$node"} |
| | | | | Written bytes | | node_load5{instance=~"$node"} |
| 7 | Disk Space Used% Basic | Disk | Disk space usage | Mount point | Disk space utilization | (node_filesystem_size_bytes{instance=~'$node',fstype=~"ext.* |
| 8 | Disk IOps Completed (IOPS) | Disk | IOPS | Reads completed | Read IOPS | irate(node_disk_io_time_seconds_total{instance=~"$node"}[5m]) |
| | | | | Writes completed | Write IOPS | irate(node_disk_io_time_seconds_total{instance=~"(.*):9100"}[5m]) |
| 9 | Time Spent Doing I/Os | Disk | I/O Utilization | IO time | I/O Utilization | irate(node_disk_io_time_seconds_total{instance=~"$node"}[5m]) |
| | | | | {{instance}}-%util | Not shown, for alert | irate(node_disk_io_time_seconds_total{instance=~"(.*):9100"}[5m]) |
| 10 | Disk R/W Time (Reference: less than 100ms) (beta) | Disk | Average response time | Read time | | irate(node_disk_read_time_seconds_total{instance=~"$node"}[5m]) / irate(node_disk_reads_completed_total{instance=~"$node"}[5m]) |
| | | | | Write time | | irate(node_disk_write_time_seconds_total{instance=~"$node"}[5m]) / irate(node_disk_writes_completed_total{instance=~"$node"}[5m]) |
| 11 | Network Sockstat | Network | Socket statistics | CurrEstab | Number of ESTABLISHED state connections | node_netstat_Tcp_CurrEstab{instance=~'$node'} |
| | | | | TCP_tw status | Number of time_wait state connections | node_sockstat_TCP_tw{instance=~'$node'} |
| | | | | Sockets_used | Total number of all protocol sockets used | node_sockstat_sockets_used{instance=~'$node'} |
| | | | | UDP_inuse | Number of UDP sockets in use | node_sockstat_UDP_inuse{instance=~'$node'} |
| | | | | TCP_alloc | Number of TCP sockets (ESTABLISHED, sk_buff) | |
| | | | | Tcp_PassiveOpens | Number of passively opened TCP connections | irate(node_netstat_Tcp_PassiveOpens{instance=~'$node'}[5m]) |
| | | | | Tcp_ActiveOpens | Number of actively opened TCP connections | irate(node_netstat_Tcp_ActiveOpens{instance=~'$node'}[5m]) |
| | | | | Tcp_InSegs | Number of TCP messages received | irate(node_netstat_Tcp_InSegs{instance=~'$node'}[5m]) |
| | | | | Tcp_OutSegs | Number of TCP messages transmitted | irate(node_netstat_Tcp_OutSegs{instance=~'$node'}[5m]) |
| | | | | Tcp_RetransSegs | Number of TCP messages retransmitted | irate(node_netstat_Tcp_RetransSegs{instance=~'$node'}[5m]) |
| 12 | Open File Descriptor (left)/Context switches (right) | Disk | I/O Utilization | Used filefd | Number of open file descriptors | node_filefd_allocated{instance=~"$node"} |
| | | | | Switches | Context switches | irate(node_context_switches_total{instance=~"$node"}[5m]) |
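
Many of the expressions above wrap a counter in irate(...[5m]), which computes a per-second rate from the last two samples inside the window. The sketch below applies that idea to made-up samples of node_network_receive_bytes_total and multiplies by 8 to convert bytes to bits, as the bandwidth panel does. It is a simplified illustration, not Prometheus's implementation.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, counter_value) samples."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2:]
    if v_last < v_prev:
        # Counter reset: treat the last value as the increase since the reset.
        delta = v_last
    else:
        delta = v_last - v_prev
    return delta / (t_last - t_prev)

# Made-up counter samples scraped every 5 seconds.
rx_bytes = [(0, 1_000_000), (5, 1_250_000), (10, 1_875_000)]
bits_per_second = irate(rx_bytes) * 8
print(bits_per_second)
```

Because only the last two samples count, irate reacts quickly to spikes, which suits fast-moving panels better than an average over the whole window.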

Actuator Health

| Name | Type | Description | Legend | Details | Expression |
|---|---|---|---|---|---|
| Axon Status | Axon | Axon service status | active | Number of Axon services in up status | count(up{job="axon_exporter"} == 1) |
| | | | down | Number of Axon services in down status | (up{job="axon_exporter"} == 0) |
| Node status | Node_exporter | Node_exporter service status | active | Number of Node_exporter services in up status | count(up{job="node_exporter"} == 1) |
| | | | down | Number of Node_exporter services in down status | count(up{job="node_exporter"} == 0) |
| Prometheus Status | Prometheus | Prometheus service status | active | Number of Prometheus services in up status | count(up{job="prometheus"} == 1) |
| | | | down | Number of Prometheus services in down status | count(up{job="prometheus"} == 0) |
| Jaeger status | Jaeger | Jaeger service status | jaeger-query-active | Number of Jaeger-query services in up status | count(up{instance=~"(.*):16687"} == 1) |
| | | | jaeger-collector-active | Number of Jaeger-collector services in up status | count(up{instance=~"(.*):14269"} == 1) |
| | | | jaeger-query-down | Number of Jaeger-query services in down status | count(up{instance=~"(.*):16687"} == 0) |
| | | | jaeger-collector-down | Number of Jaeger-collector services in down status | count(up{instance=~"(.*):14269"} == 0) |
| Jaeger Agent Status | Jaeger | Jaeger agent status | active | Number of Jaeger-agent services in up status | count(up{job="jaeger_agent"} == 1) |
| | | | down | Number of Jaeger-agent services in down status | count(up{job="jaeger_agent"} == 0) |
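
The active and down legends count targets whose up sample equals 1 or 0. As a rough sketch, the same tally can be computed from a Prometheus instant-query result. The sample below is hand-written in the shape of a /api/v1/query response, with made-up instances and timestamps, and count_by_state is a hypothetical helper.

```python
def count_by_state(response):
    """Count targets whose `up` sample is 1 (active) or 0 (down)."""
    counts = {"active": 0, "down": 0}
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        counts["active" if float(value) == 1 else "down"] += 1
    return counts

# Hand-written sample for the query up{job="node_exporter"}.
sample = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"job": "node_exporter", "instance": "10.0.0.1:9100"},
         "value": [1700000000, "1"]},
        {"metric": {"job": "node_exporter", "instance": "10.0.0.2:9100"},
         "value": [1700000000, "0"]},
        {"metric": {"job": "node_exporter", "instance": "10.0.0.3:9100"},
         "value": [1700000000, "1"]},
    ]},
}
print(count_by_state(sample))
```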


Panel 1

Figure 4. Panel 1

| No. | Name | Description | Legend | Details | Expression |
|---|---|---|---|---|---|
| 1 | TPS | TPS for consensus | | | avg(rate(axon_consensus_committed_tx_total[5m])) |
| 2 | exec_90 | Consensus exec time for P90 | | | avg(histogram_quantile(0.90, sum(rate(axon_consensus_time_cost_seconds_bucket{type="exec"}[5m])) by (le, instance))) |
| 3 | consensus_round_cost | Number of rounds needed to reach consensus | {{instance}} | Number of rounds needed to reach consensus | (axon_consensus_round > 0 ) |
| 4 | consensus_90 | Consensus time for P90 | time_usage(s) | Consensus time for P90 | avg(histogram_quantile(0.90, sum(rate(axon_consensus_duration_seconds_bucket[5m])) by (le, instance))) |
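
The P90 panels rely on histogram_quantile, which estimates a quantile by locating the bucket that contains the target rank and interpolating linearly inside it. The sketch below is a simplified version of that idea on made-up bucket counts. It assumes sorted cumulative buckets ending with +Inf and a non-empty histogram, and it is not Prometheus's exact implementation.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count

# Made-up durations: cumulative counts for le = 0.5, 1, 2, +Inf seconds.
buckets = [(0.5, 20), (1.0, 60), (2.0, 100), (math.inf, 100)]
print(histogram_quantile(0.90, buckets))
```

The result is an estimate whose precision depends on the bucket boundaries, so a P90 that sits inside a wide bucket should be read as approximate.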

Panel 2

Figure 5. Panel 2

| No. | Name | Description | Legend | Details | Expression |
|---|---|---|---|---|---|
| 1 | get_cf_each_block_time_usage | Average time per block for rocksdb running get_cf | | Average time per block for rocksdb running get_cf | avg (sum by (instance) (increase(axon_storage_get_cf_seconds[5m]))) / avg(increase(axon_consensus_height[5m])) |
| 2 | put_cf_each_block_time_usage | Average time per block for rocksdb running put_cf | | Average time per block for rocksdb running put_cf | avg (sum by (instance) (increase(axon_storage_put_cf_seconds[5m]))) / avg(increase(axon_consensus_height[5m])) |
| 3 | current_height | Chain current height | {{instance}} | Node current height | sort_desc(axon_consensus_height) |
| 4 | check_block_cost_p90 | Check block exec time for P90 | | Check block exec time for P90 | avg(histogram_quantile(0.90, sum(rate(axon_consensus_check_block_bucket{type="get_txs_cost"}[5m])) by (le, instance))) |

Panel 3

Figure 6. Panel 3

| No. | Name | Description | Legend | Details | Expression |
|---|---|---|---|---|---|
| 1 | synced_block | Number of blocks synchronized by nodes | {{instance}} | Number of blocks synchronized by nodes | axon_consensus_sync_block_total |
| 2 | liveness | Liveness | | Growth in node height | increase(axon_consensus_height{job="axon_exporter"}[1m]) |
| 3 | mempool_cached_tx | Number of transactions in the current mempool | {{instance}} | Number of transactions in the current mempool | axon_mempool_tx_count |

Panel 4

Figure 7. Panel 4

| No. | Name | Description | Legend | Details | Expression |
|---|---|---|---|---|---|
| 1 | processed_tx_request | Received transaction request count in the last 5 minutes (the unit is count/second) | Total | Total number of transaction requests | sum(rate(axon_api_request_result_total{type="send_transaction"}[5m])) |
| | | | Success Total | Total number of successful transaction requests | sum(rate(axon_api_request_result_total{result="success",type="send_transaction"}[5m])) |
| | | | instance | Processed transaction request count in the last 5 minutes (the unit is count/second) | rate(axon_api_request_result_total{result="success", type="send_transaction"}[5m]) |
| 2 | processed_rpc_request | Estimated total number of successful API requests in the last five minutes | Success Total | Total number of successful API requests | sum(rate(axon_api_request_result_total{result="success"}[5m])) by (type) |

Other Items

| Name | Description | Legend | Details | Expression |
|---|---|---|---|---|
| network_message_arrival_rate | Estimate the network message arrival rate in the last five minutes | | (broadcast_count * (instance_count - 1) + unicast_count) / received_count | (sum(increase(axon_network_message_total{target="all", direction="sent"}[5m])) * (count(count by (instance) (axon_network_message_total)) - 1) + sum(increase(axon_network_message_total{target="single", direction="sent"}[5m]))) / (sum(increase(axon_network_message_total{direction="received"}[5m]))) |
| consensus_round_cost | Number of rounds needed to reach consensus | {{instance}} | Number of rounds needed to reach consensus | (axon_consensus_round > 0 ) |
| Connected Peers (Gauge) | Number of nodes on the current connection | {{instance}} | Number of nodes on the current connection | axon_network_connected_peers |
| Connected Peers (Graph) | Number of nodes on the current connection | Saved peers | Total number of peers | max(axon_network_saved_peer_count) |
| | | Connected Peers | Number of nodes on the current connection | axon_network_connected_peers |
| Consensus Peers (Gauge) | Number of consensus nodes | {{instance}} | Number of consensus nodes | axon_network_tagged_consensus_peers |
| Consensus Peers (Graph) | Number of consensus nodes | Consensus peers | Total number of consensus peers | max(axon_network_tagged_consensus_peers) |
| | | {{instance}}-Connected Consensus Peers (Minus itself) | Number of consensus nodes | axon_network_connected_consensus_peers |
| Saved peers | Number of saved peers | {{instance}} | Number of saved peers | axon_network_saved_peer_count |
| Unidentified Connections | The number of connections in the handshake, requiring verification of the chain id | {{instance}} | The number of connections in the handshake, requiring verification of the chain id | axon_networ_unidentified_connections |
| Connecting Peers | Number of active initiations to establish connections with other machines | {{instance}} | Number of active initiations to establish connections with other machines | axon_network_outbound_connecting_peers |
| Disconnected count (To other peers) | Disconnected count | {{instance}} | Disconnected count | axon_network_ip_disconnected_count |
| Received messages in processing | Number of messages being processed | {{instance}} | Number of messages being processed | axon_network_received_message_in_processing_guage |
| Received messages in processing by ip | Number of messages being processed (based on IP of received messages) | {{instance}} | Number of messages being processed (based on IP of received messages) | axon_network_received_ip_message_in_processing_guage{instance=~"$node"} |
| Ping (ms)_ p90 | p90 for p2p Ping | {{instance}} | p90 for p2p Ping | avg(histogram_quantile(0.90, sum(rate(axon_network_ping_in_ms_bucket[5m])) by (le, instance))) |
| Network bandwidth usage per second all | link to axon-node (Network bandwidth usage per second all) | | link to axon-node (Network bandwidth usage per second all) | |
| Internet traffic per hour | link to axon-node (Internet traffic per hour) | | link axon-benchmark (internet traffic per hour) | |
| mempool_cached_tx | link axon-benchmark (mempool_cached_tx) | | link axon-benchmark (mempool_cached_tx) | |
| consensus_round_cost | link axon-benchmark (consensus_round_cost) | | link axon-benchmark (consensus_round_cost) | |
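
The network_message_arrival_rate expression above combines three counts: broadcasts sent (each expected to reach every other instance), unicasts sent, and messages received. The sketch below reproduces that arithmetic exactly as the expression is written, on made-up five-minute increases. The function and parameter names are illustrative only.

```python
def message_arrival_ratio(broadcast_sent, unicast_sent, received, instance_count):
    """(broadcast_sent * (instance_count - 1) + unicast_sent) / received."""
    expected_deliveries = broadcast_sent * (instance_count - 1) + unicast_sent
    return expected_deliveries / received

# Made-up 5-minute increases for a 4-node network.
ratio = message_arrival_ratio(
    broadcast_sent=100, unicast_sent=50, received=350, instance_count=4
)
print(ratio)
```

With these numbers the expected deliveries equal the received count, so the ratio is 1.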