netdata

Last updated: 2019-08-13

Introduction

real-time performance and health monitoring solution

Designed to:

  • Solve the centralization problem of monitoring (Scales to infinity)
  • Replace the console for performance troubleshooting

HomePage: http://my-netdata.io

Features

  • 1s granularity
  • Zero disk I/O (by default all data is kept in RAM)

Port:

19999/tcp                  # Web

8125/TCP, 8125/UDP  # statsd

How it works:

           Query
             |
Collect -> Store -> Stream -> Archive
             |
           Check

positive / negative values

positive values

read, input, inbound, received

negative values

write, output, outbound, sent

monitoring agent

  • metrics collector
  • time-series database
  • metrics visualizer
  • alarms notification engine

Security Design

  • Netdata daemon runs as a normal system user
  • plugins perform a hard coded data collection job
  • plugins & Netdata slaves unidirectional: from the plugin towards the Netdata daemon
  • dashboards are read-only
  • data do not leave the server where they are collected
  • Netdata servers do not talk to each other
  • your browser connects to all the Netdata servers

Contents

  • Installation
  • Anonymous Statistics
  • Upgrade
  • Usage
  • Configure
  • Database
  • Authentication
  • Netdata registry
  • Web Setting
  • Plugins
  • Disable IPv6
  • Health monitoring
  • Stopping notifications for individual alarms
  • Central Netdata server (streaming)
     + statsd
  • Database Queries
  • Export and import a snapshot
  • Performance Tuning
  • Netdata's own info

Installation

 

There are 4 ways to install netdata

  1. Linux 64bit pre-built static binary
  2. Binary Packages
  3. Run Netdata in a Docker container
  4. Install Netdata on Linux manually

Static Binary

mkdir /usr/src/netdata

cd /usr/src/netdata

wget https://github.com/netdata/netdata/releases/download/v1.16.0/netdata-v1....

chmod 700 ./netdata-v1.16.0.gz.run

./netdata-v1.16.0.gz.run

systemctl enable netdata

systemctl start netdata

Remark

  • Installs into /opt/netdata
  • Creates User: 'netdata', Group: 'netdata'

Checking

netstat -ntlp | grep netdata

tcp        0      0 127.0.0.1:8125          0.0.0.0:*               LISTEN      22515/netdata
tcp        0      0 0.0.0.0:19999           0.0.0.0:*               LISTEN      22515/netdata
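
The same check can be scripted. A minimal sketch, assuming the dashboard listens on 127.0.0.1:19999 as in the netstat output above; the helper name is_listening is hypothetical and not part of netdata.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    print("netdata web UI reachable:", is_listening("127.0.0.1", 19999))
```

This only proves that something accepts TCP connections on the port; it does not confirm that the listener is netdata.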

Manually (the Static Binary is much more convenient)

Source: https://github.com/firehol/netdata.git

# Debian / Ubuntu

apt-get install zlib1g-dev uuid-dev libuv1-dev liblz4-dev libjudy-dev libssl-dev libmnl-dev \
 gcc make git autoconf autoconf-archive autogen automake pkg-config curl python

# CentOS / Red Hat Enterprise Linux

yum install autoconf automake curl gcc git make nc pkgconfig python\
 libmnl-devel libuuid-devel openssl-devel libuv-devel lz4-devel Judy-devel zlib-devel

 


Anonymous Statistics

 

Starting with v1.12 Netdata also collects anonymous statistics on certain events

To opt-out from sending anonymous statistics

touch /opt/netdata/etc/netdata/.opt-out-from-anonymous-statistics

log

/opt/netdata/var/log/netdata/error.log

2019-08-29 17:32:14: netdata INFO  : MAIN :
 /opt/netdata/usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'OK' '-'

/opt/netdata/usr/libexec/netdata/plugins.d/anonymous-statistics.sh

if [ -f "/opt/netdata/etc/netdata/.opt-out-from-anonymous-statistics" ]; then
        exit 0
fi

 


Upgrade

 

chmod 700 ./netdata-latest.gz.run

# The installer stops / starts netdata automatically

./netdata-latest.gz.run --accept

 


Usage

 

Web: http://your.server.ip:19999/

Zoom the current charts: SHIFT + mouse wheel over a chart

Highlight a time-frame: ALT + select an area on a chart

Auto-detection of data collection sources

This auto-detection process happens only once, when Netdata starts.

Exceptions:

containers and VMs are auto-detected forever

 


Configure

 

Get running config:

http://127.0.0.1:19999/netdata.conf

Config File Location:

/opt/netdata/etc/netdata/netdata.conf

# Create the config file

wget -O /opt/netdata/etc/netdata/netdata.conf http://localhost:19999/netdata.conf

# CPU & RAM Usage

[global]
  # Enable KSM to halve Netdata's memory requirement
  history = 3600
  update every = 1

# Memory modes

  • ram
  • alloc
  • save
  • map
  • none
  • dbengine
[global]
  memory mode = save
  cache directory = /var/cache/netdata

ram

data are purely in memory. Data are never saved on disk. (Supports KSM)

alloc

like ram but it uses calloc() and does not support KSM. (fallback)

save (the default)

Data are only in RAM while Netdata runs and are saved to / loaded from disk on Netdata restart.

It also uses mmap() and supports KSM.

map

data are in memory mapped files. This works like the swap. (constant write on your disk)

(does not support KSM)

For each chart, Netdata maps the following files:

  • chart/main.db                    # chart information. Every time data are collected for a chart, this is updated.
  • chart/dimension_name.db   # round robin database

none

without a database (collected metrics can only be streamed to another Netdata)

dbengine

The data are in database files.

Files

ls /opt/netdata/var/cache/netdata/dbengine

datafile-1-0000000001.ndf
journalfile-1-0000000001.njf
datafile-1-0000000002.ndf        # more recent metric data
journalfile-1-0000000002.njf
...

There is some amount of RAM dedicated to data caching and indexing

# Unit: MiB
page cache size = 32

The number of history entries is not fixed (depends on the configured disk space)

"history" configuration option is meaningless for "memory mode = dbengine"

"dbengine" is the only mode that supports changing "update every" without losing the previously stored metrics

This mode is suggested for nodes that also run other applications

Database Engine uses direct I/O to avoid polluting the OS filesystem caches

# Unit: MiB
dbengine disk space = 256

---

The DB engine stores chart metric values in 4k pages in memory.

Each chart dimension gets its own page to store consecutive values generated from the data collectors.

When those pages fill up they are slowly compressed and flushed to disk.

 => i.e. a page is flushed roughly every 17 minutes

    # each chart dimension's page is 4 KiB and each record is 4 bytes, so when collecting once per second the page is full after 1024 seconds

    4096 / 4 = 1024 sec (dimension: 1s)
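
The arithmetic above as a small sketch. The 4096-byte page and 4-byte value come from these notes; the function name and the update_every parameter are only illustrative.

```python
PAGE_BYTES = 4096       # one in-memory page per chart dimension (from the notes)
BYTES_PER_VALUE = 4     # size of one stored value (from the notes)


def seconds_to_fill_page(update_every: int = 1) -> int:
    """Seconds until one dimension's page is full at the given collection interval."""
    values_per_page = PAGE_BYTES // BYTES_PER_VALUE
    return values_per_page * update_every


if __name__ == "__main__":
    secs = seconds_to_fill_page(1)
    print(secs, "s =", round(secs / 60, 1), "min")
```

With a longer collection interval the page fills proportionally more slowly, so flushes are less frequent.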

When the disk quota is exceeded, the oldest values are removed from the DB engine in real time

 * When we query the DB engine for data

    => trigger disk read I/O requests that fill the Page Cache with the requested pages

 * The Database Engine uses direct I/O to avoid polluting the OS filesystem caches.

---

Config

/opt/netdata/etc/netdata/netdata.conf

[global]
    memory mode = dbengine
    # Unit: MiB
    page cache size = 32
    dbengine disk space = 256

 * There is one DB engine instance per Netdata host/node

 * All DB engine instances, for localhost and all other streaming recipient nodes inherit their configuration from netdata.conf

 * There are explicit memory requirements per DB engine instance

File descriptor

The Database Engine may keep a significant amount of files open per instance

(at least 50 file descriptors available per dbengine instance)

systemctl edit netdata

[Service]
LimitNOFILE=65536

Remark: /etc/sysctl.conf: "fs.file-max = 65536"

Performance

OOM Score

[global]
    OOM score = 1000

Netdata runs with OOMScore = 1000

This means Netdata will be the first to be killed when your server runs out of memory.

Scheduling Policy

[global]
  process scheduling policy = idle

By default Netdata runs with the idle process scheduling policy,

so that it uses CPU resources, only when there is idle CPU to spare.

 


Database

 

Ram Usage (for DB)

The default history is 3600 entries,

thus it will need 14.4KB for each chart dimension (4 bytes for the value * the entries of its history)

If you need 1000 dimensions, they will occupy just 14.4MB.

If the data collection frequency is set to 1 second, you will have just one hour of data.
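
A quick sketch of the sizing above, assuming 4 bytes per value and the default history of 3600 as stated in these notes; the helper names are hypothetical.

```python
BYTES_PER_VALUE = 4  # per stored value, as stated in the notes


def ram_bytes(dimensions: int, history: int = 3600) -> int:
    """RAM needed by the round robin database for the given number of dimensions."""
    return dimensions * history * BYTES_PER_VALUE


def retention_seconds(history: int = 3600, update_every: int = 1) -> int:
    """How far back the database reaches: entries times collection interval."""
    return history * update_every


if __name__ == "__main__":
    print(ram_bytes(1) / 1000, "KB per dimension")
    print(ram_bytes(1000) / 1_000_000, "MB for 1000 dimensions")
    print(retention_seconds() / 3600, "hour(s) of data")
```

Raising "update every" stretches the retention without using more RAM, at the cost of coarser granularity.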

KSM

Netdata offers all its round robin database to kernel for deduplication

KSM is a solution that will provide 60+% memory savings to Netdata.

# by default 0; 1 for the kernel to spawn ksmd

echo 1 >/sys/kernel/mm/ksm/run

 


Authentication

 

IP Level ACL

 * best and the suggested way to protect Netdata

   => Expose Netdata only in a private LAN => IP Level

[web]
    bind to = 10.1.1.1:19999 localhost:19999

username & password

Use web server to provide authentication (in front of all your Netdata servers)

Web Server Setting (nginx)

Nginx to forward requests to netdata

HTTP auth file: /etc/nginx/netdata.users

URL: https://your-server/status/    # must match the location used in netdata.tmpl

# Running netdata as a subfolder to an existing virtual host

server {
    ...
    include /etc/nginx/templates/netdata.tmpl;
}

netdata.tmpl

location = /status {
    return 301 /status/;
}

location ~ /status/(?<ndpath>.*) {
    proxy_redirect off;
    proxy_set_header Host $host;

    proxy_set_header X-Forwarded-Host $host;
    proxy_set_header X-Forwarded-Server $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_http_version 1.1;
    proxy_pass_request_headers on;
    proxy_set_header Connection "keep-alive";
    proxy_store off;
    proxy_pass http://127.0.0.1:19999/$ndpath$is_args$args;

    gzip on;
    gzip_proxied any;
    gzip_types *;

    auth_basic "Authentication Required";
    auth_basic_user_file /etc/nginx/netdata.users;
}

 


Netdata registry

 

registry = node menu  (on top left corner of the Netdata dashboards)

Purpose:

  • enables the Netdata cloud features, such as the node view
  • multiple Netdata are integrated into one distributed application (distributed monitoring)

The registry keeps track of 4 entities:

  • machine_guid: a random GUID generated by each Netdata  (first time it starts)
  • person_guid: the web browsers accessing the Netdata installations (first time it sees a new web browser)
  • URLs of Netdata installations
  • accounts: i.e. the information used to sign-in via one of the available sign-in methods.

Default registry: https://registry.my-netdata.io

Who talks to the registry?

Your web browser only!

Flow

Browser --> netdata

 <-URL to Registry-

Browser --> Registry

Run your own registry

# Server (registry)

* Every Netdata can be a registry

[registry]
    enabled = yes
    registry to announce = http://your.registry:19999
    allow from = 192.168.123.*

Remark

(1) The registry has its own database: /var/lib/netdata/registry/*.db

  • registry-log.db, the transaction log
  • registry.db, the database

(2) IPs allowed by [registry].allow from should also be allowed by [web].allow connections from.

# Client (netdata)

Advertise it to registry

[registry]
    enabled = no
    registry to announce = http://your.registry:19999

 

 


Web Setting

 

Disable Web Dashboard

[web]
    mode = none

Threads

# The default number of processor threads is min(cpu cores, 6)

[web]
  web server threads = 4
  web server max sockets = 512

Access lists

Netdata supports access lists in netdata.conf:

[web]
    allow connections from = localhost *
    allow dashboard from = localhost *
    allow badges from = *
    allow streaming from = *
    allow netdata.conf from = localhost
    allow management from = localhost

Notes

allow badges from

checks if the API request is for a badge. Badges are not matched by allow dashboard from.

allow netdata.conf from

checks the IP to allow http://netdata.host:19999/netdata.conf

IPs allowed by allow netdata.conf from should also be allowed by allow connections from
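
A rough model of how such an access list is evaluated. This assumes Netdata's "simple pattern" semantics (space-separated patterns, * as wildcard, a leading ! for a negative match, first match wins); fnmatchcase is only a stand-in that also treats ? and [] specially, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def allowed(client: str, pattern_list: str) -> bool:
    """Approximate a space-separated allow list; the first matching pattern decides."""
    for pattern in pattern_list.split():
        negative = pattern.startswith("!")
        if negative:
            pattern = pattern[1:]
        if fnmatchcase(client, pattern):
            return not negative
    return False  # nothing matched: deny


if __name__ == "__main__":
    print(allowed("10.1.1.5", "localhost 10.*"))
    print(allowed("192.168.1.1", "localhost 10.*"))
```

Because the first match wins, negative patterns have to be listed before the broader positive ones.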

 


Plugins

 

Internal, External, Modular plugins

  • Internal data collection plugins (running inside the Netdata daemon)
  • External data collection plugins (independent processes, sending data to Netdata over pipes)
  • Modular plugin orchestrators (external plugins that have multiple data collection modules)

netdata.conf

Disable a plug-in

(see the node.d, python.d, .. folders in the config folder)

[plugins]
    proc = yes
    diskspace = yes
    ...
    node.d = yes

Per plug-in setting

[plugin:python.d]
        # update every = 5

 


Disable IPv6

 

# per plugin configuration
[plugin:proc]
  /proc/net/sockstat6 = no
  /proc/net/snmp6 = no

[plugin:proc:/proc/net/snmp6]
 ...

[plugin:proc:/proc/net/sockstat6]
 ipv6 TCP sockets = no
 ipv6 UDP sockets = no
 ipv6 UDPLITE sockets = no
 ipv6 RAW sockets = no
 ipv6 FRAG sockets = no
# filename to monitor = /proc/net/sockstat6

 


Disks Setting

 

Disabling performance metrics for individual device

[plugin:proc:/proc/diskstats:sda]
    enable performance metrics = no

# Other items in the Disks category

[plugin:proc:diskspace]
    exclude space metrics on paths = /tmp /dev/* /run/* /var/*
    exclude space metrics on filesystems = *sshfs fusectl autofs

 


Health monitoring

 

# Enable monitoring

/opt/netdata/etc/netdata/netdata.conf

[health]
    enabled = yes

/opt/netdata/etc/netdata/edit-config health_alarm_notify.conf

SEND_EMAIL="YES"

 * Default alarms shipped with Netdata.

Alarm to multiple mailboxes

health_alarm_notify.conf

# to receive only critical alarms, set it to "root|critical"
# who receives the mail when there is no "to:"
DEFAULT_RECIPIENT_EMAIL="email1@example.org email2@example.org|critical"

...

# who receives the mail when the alert has "to: sysadmin"
role_recipients_email[sysadmin]="${DEFAULT_RECIPIENT_EMAIL}"

Testing Notifications

# become user netdata

su -s /bin/bash netdata

# enable debugging info on the console

export NETDATA_ALARM_NOTIFY_DEBUG=1

# send test alarms to sysadmin

/opt/netdata/usr/libexec/netdata/plugins.d/alarm-notify.sh test

...
--- BEGIN sendmail command ---
/usr/sbin/sendmail -t
--- END sendmail command ---
2021-01-07 18:23:17: alarm-notify.sh: 
 INFO: sent email notification for: 
  hypervisor.datahunter.org test.chart.test_alarm is CLEAR to 'tim@datahunter.org'
# OK

Stop notifications for individual alarms (silencing the alarm)

Step1: Find the alarm configuration file

e.g.

/opt/netdata/usr/lib/netdata/conf.d/health.d/net.conf

Step2: Edit the file to enable silencing

to: sysadmin

change to

to: silent

Example

# NIC Full Loading

    alarm: 5m_sent_traffic_overflow
       on: net.ens192
       os: linux
    hosts: *
 families: *
   #lookup: average -5m unaligned absolute of received
   lookup: average -5m unaligned absolute of sent
     calc: ($interface_speed > 0) ? ($this * 100 / (100 * 1000)) : ( nan )
    units: %
    every: 60s
     warn: $this > (($status >= $WARNING)  ? (80) : (85))
     crit: $this > (($status == $CRITICAL) ? (85) : (90))
    delay: down 1m multiplier 1.5 max 1h
     info: interface sent bandwidth usage over net device speed max
       to: sysadmin

Values in the alarm mail

[ $this = 94.079845 ] [ $status = 1 ] [ $CRITICAL = 4 ]

# CPU Usage

template: 10min_cpu_usage
      on: system.cpu
      os: linux
   hosts: *
  lookup: average -10m unaligned of user,system,softirq,irq,guest
   units: %
   every: 1m
    warn: $this > (($status >= $WARNING)  ? (75) : (85))
    crit: $this > (($status == $CRITICAL) ? (85) : (95))
   delay: down 15m multiplier 1.5 max 1h
    info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
      to: sysadmin

# Using a variable

 template: 1m_received_traffic_overflow
       on: net.net
       os: linux
    hosts: *
 families: *
   lookup: average -1m unaligned absolute of received
     # $interface_speed is defined by an earlier template
     calc: ($interface_speed > 0) ? ($this * 100 / ($interface_speed * 1000)) : ( nan )
    units: %
    every: 10s
     warn: $this > (($status >= $WARNING)  ? (80) : (85))
     crit: $this > (($status == $CRITICAL) ? (85) : (90))
    delay: down 1m multiplier 1.5 max 1h
     info: interface received bandwidth usage over net device speed max
       to: sysadmin
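
What the calc: line in the template above computes, as a sketch. It assumes the lookup value is in kilobits/s and $interface_speed is in Mbit/s (which is what the "* 1000" suggests); the function name is hypothetical.

```python
import math


def bandwidth_usage_percent(kbps: float, interface_speed_mbit: float) -> float:
    """Mirror of: ($interface_speed > 0) ? ($this * 100 / ($interface_speed * 1000)) : (nan)"""
    if interface_speed_mbit > 0:
        return kbps * 100 / (interface_speed_mbit * 1000)
    return math.nan  # speed unknown: the alarm value is undefined


if __name__ == "__main__":
    # 94,000 kbit/s on a 100 Mbit/s interface
    print(bandwidth_usage_percent(94_000, 100), "%")
```

The hard-coded "100 * 1000" in the first NIC example is the same formula with the speed fixed at 100 Mbit/s.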

alarm vs template

Alarms

Alarms are attached to specific charts and use the alarm label (e.g. net.eth0).

Alarms have higher precedence and will override templates.

If an alarm and template entity have the same name and attach to the same chart, Netdata will use the alarm.

Need to find the context? Hover over the date on any given chart and look at the tooltip.

Templates

define rules that apply to all charts of a specific context(net.net), and use the template label.

Templates help you apply one entity to all disks, all network interfaces, all MySQL databases, and so on.

解說

on:

Which chart the entity listens to

lookup:

This line makes a database lookup to find a value. The result of this lookup is available as $this

lookup: METHOD AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS] [foreach DIMENSIONS]

METHOD

  one of average, min, max, sum, incremental-sum

average:

    Calculate the average of all the metrics collected.

percentage:

    Clarify that we're calculating a percentage of RAM usage.

    of used: Specify which dimension (used) on the system.ram chart you want to monitor with this entity.

AFTER

a relative number of seconds, but it also accepts a single letter for changing the units,

like -1s = 1 second in the past, -1m = 1 minute in the past, -1h = 1 hour in the past

OPTIONS

space separated list of percentage, absolute, min2max, unaligned, match-ids, match-names

e.g.

lookup: average -10m unaligned of user,system,softirq,irq,guest

units:

The units of the value returned by "calc:"

every:

How often to perform the lookup calculation to decide whether or not to trigger this alarm.

warn/crit:

The value at which Netdata should trigger a warning or critical alarm.

warn: EXPRESSION

e.g.

  warn: $this > 80
  crit: $this >= 90

conditional evaluation operator "?"

The conditional evaluation operator ? is supported too.

Using this operator IF-THEN-ELSE conditional statements can be specified.

The format is: (condition) ? (true expression) : (false expression).

hysteresis

warn: $this > (($status >= $WARNING)  ? (75) : (85))
crit: $this > (($status == $CRITICAL) ? (85) : (95))

If the value is constantly varying between 80 and 90,
then it will trigger a warning the first time it goes above 85,
but will remain a warning until it goes below 75 (or goes above 95, when it becomes critical).

If the value is constantly varying between 90 and 100,
then it will trigger a critical alert the first time it goes above 95,
but will remain a critical alert until it goes below 85
(at which point it will return to being a warning).
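
The hysteresis can be traced with a small simulation of the two expressions above. This is a simplified model, not Netdata's evaluator: it assumes critical takes precedence over warning, and the numeric status values are placeholders where only the ordering CLEAR < WARNING < CRITICAL matters.

```python
CLEAR, WARNING, CRITICAL = 1, 3, 4  # placeholder values; only the ordering matters


def next_status(value: float, status: int) -> int:
    """Evaluate the warn/crit expressions against the current status."""
    warn = value > (75 if status >= WARNING else 85)
    crit = value > (85 if status == CRITICAL else 95)
    if crit:
        return CRITICAL
    if warn:
        return WARNING
    return CLEAR


def run(values):
    """Feed a series of values through the alarm, starting from CLEAR."""
    status, history = CLEAR, []
    for value in values:
        status = next_status(value, status)
        history.append(status)
    return history


if __name__ == "__main__":
    names = {CLEAR: "CLEAR", WARNING: "WARNING", CRITICAL: "CRITICAL"}
    for value, status in zip([80, 86, 80, 90, 74], run([80, 86, 80, 90, 74])):
        print(value, names[status])
```

A value of 80 is CLEAR before the alarm was raised but WARNING afterwards, which is the point of the two thresholds.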

percentage

instead of returning the value, calculate the percentage of the sum of the selected dimensions,
versus the sum of all the dimensions of the chart. This also sets the units to %.

absolute or abs, turn all values positive and then sum them.

min2max, when multiple dimensions are given, do not sum them, but take their max - min

special variables

$this, which is resolved to the value of the current alarm.

$status, which is resolved to the current status of the alarm

  This value can be compared with

  $REMOVED, $UNINITIALIZED, $UNDEFINED, $CLEAR, $WARNING, $CRITICAL.

  These values are incremental, ie. $status > $CLEAR works as expected.

$now, which is resolved to current unix timestamp.

unaligned

when data are reduced / aggregated (e.g. the request is about the average of the last minute, or hour),
Netdata by default aligns them so that the charts will have a constant shape
(so average per minute returns always XX:XX:00 - XX:XX:59).
Setting the unaligned option, Netdata will aggregate data without any alignment,
so if the request is for 60 seconds, it will aggregate the latest 60 seconds of collected data.

calc:

A calculation to apply to the value found via lookup or another variable.

green/red:

Set the green and red thresholds of a chart.

Both are available as $green and $red in expressions.

These will eventually be visualized on the dashboard.

exec:

The script to execute when the alarm changes status.

repeat:

Format: repeat: [off] [warning DURATION] [critical DURATION]

The interval for sending notifications when an alarm is in WARNING or CRITICAL mode.
This will override the default interval settings inherited from health settings in netdata.conf
(default repeat warning = DURATION and default repeat critical = DURATION)
Use 0s to turn off the repeating notification for WARNING / CRITICAL mode.

e.g.

repeat: warning 600s critical 600s

delay:

delay: [[[up U] [down D] multiplier M] max X]

up U

defines the delay to be applied to a notification for an alarm that raised its status (i.e. CLEAR to WARNING, CLEAR to CRITICAL, WARNING to CRITICAL). For example, up 10s, the notification for this event will be sent 10 seconds after the actual event. This is used in hope the alarm will get back to its previous state within the duration given. The default U is zero.

multiplier M

multiplies U and D every time an alarm changes state while a notification is delayed.

The default multiplier is 1.0.

e.g.

delay: down 15m multiplier 1.5 max 1h
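
How the delay grows for the line above, as a simplified model: it assumes the delay is multiplied once per state change while a notification is pending and is capped at max. Netdata's real bookkeeping may differ, and the function name is hypothetical.

```python
def delay_sequence(base_s: float, multiplier: float, max_s: float, changes: int):
    """Delay (in seconds) applied at each successive state change, capped at max_s."""
    delays, current = [], base_s
    for _ in range(changes):
        delays.append(min(current, max_s))
        current *= multiplier
    return delays


if __name__ == "__main__":
    # delay: down 15m multiplier 1.5 max 1h
    for seconds in delay_sequence(15 * 60, 1.5, 60 * 60, 5):
        print(seconds / 60, "min")
```

With these numbers the cap of 1h is reached at the fifth consecutive state change.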

info:

A description of the alarm, which will appear in the dashboard and notifications.

Reload health configuration

To make any changes to your health configuration live, you must reload Netdata's health monitoring system.

To do that without restarting all of Netdata, run the following:

killall -USR2 netdata

OR

netdatacli reload-health

 


Stopping notifications for individual alarms

 

Configuration

cd /opt/netdata/etc/netdata

./edit-config health.d/btrfs.conf        # call nano to create  health.d/btrfs.conf

# To silence this alarm, change sysadmin to silent.

to: silent

# reload

killall -USR2 netdata

 


Central Netdata server (streaming)

 

Netdata slaves stream metrics to upstream Netdata servers (statsd),

use exactly the same protocol local plugins use.

 

statsd

Port: 8125/TCP, 8125/UDP

statsd is a system to collect data from any application.

Applications are sending metrics to it, usually via non-blocking UDP communication,

and statsd servers collect these metrics,

perform a few simple calculations on them and push them to backend time-series databases.

 * Netdata is a fully featured statsd server.

 * Netdata statsd is inside Netdata (an internal plugin, running inside the Netdata daemon)
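
A minimal sketch of an application sending a metric over UDP, using the common statsd line format name:value|type. The metric name myapp.requests and both helper names are made up for illustration; host and port assume the local statsd listener on 8125.

```python
import socket


def format_metric(name: str, value, metric_type: str) -> bytes:
    """Build one statsd line, e.g. b"myapp.requests:1|c" (c = counter, g = gauge, ms = timer)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; no reply is expected from the statsd server."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    send_metric(format_metric("myapp.requests", 1, "c"))
```

UDP keeps the application non-blocking: if nothing is listening, the packet is simply dropped.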

Disable statsd

[statsd]
  enabled = no

 


Database Queries

 

API: /api/v1/data and /api/v1/badge.svg

after and before define a time-frame, accepting:

formatter

  • html
  • csv

curl -Ss 'http://192.168.28.49:19999/api/v1/data?chart=system.cpu&format=csv&after=-600&points=1&options=percentage'

points

The number of points to be returned.

If not given, the result will have the same granularity as the database

group

The grouping method to use when reducing the points the database has.

If not given, it defaults to average.

options

Only 2 options are used by the query engine: unaligned and percentage.

All the other options are used by the output formatters.

The default is to return aligned data.
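
The curl query above can also be built and fetched from a script. A sketch using only the parameters already shown in these notes; data_url and fetch_rows are hypothetical helper names, and the host is the example address from the curl line.

```python
import csv
from urllib.parse import urlencode
from urllib.request import urlopen


def data_url(host: str, chart: str, **params) -> str:
    """Build an /api/v1/data URL for the given chart and query parameters."""
    query = urlencode({"chart": chart, **params})
    return f"http://{host}:19999/api/v1/data?{query}"


def fetch_rows(url: str):
    """Download a format=csv response and return it as a list of rows."""
    with urlopen(url, timeout=5) as resp:
        return list(csv.reader(resp.read().decode().splitlines()))


if __name__ == "__main__":
    print(data_url("192.168.28.49", "system.cpu",
                   format="csv", after=-600, points=1, options="percentage"))
```

fetch_rows is not called here because it needs a reachable Netdata server.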
 


Export and import a snapshot

 

Snapshots can be incredibly useful for diagnosing anomalies after they've already happened.

Let's say Netdata triggered an alarm while you were sleeping.

The generated snapshot will include all charts of this dashboard, for the visible timeframe

To export a snapshot, click on the export icon.

The snapshot will be downloaded as a file, to your computer,

that can be imported back into any netdata dashboard (no need to import it back on this server).

 


Performance Tuning

 

# CPU

[web]
    enable gzip compression = no

# Disable logs

If you're not actively auditing Netdata's logs, disable them in netdata.conf

[global]
    # debug log = /opt/netdata/var/log/netdata/debug.log
    # error log = /opt/netdata/var/log/netdata/error.log
    # access log = /opt/netdata/var/log/netdata/access.log
    debug log = none
    error log = none
    access log = none

 


Netdata's own info

 

How many metrics, on average, do your Agents collect?

Shown in the bottom-right corner of the dashboard:

Every 5 seconds, Netdata collects 3,387 metrics on home, 
presents them in 654 charts and monitors them with 268 alarms.
 
netdata
v1.28.0

What is your compression savings ratio?

Search "dbengine_compression_ratio" on dashboard (Netdata Monitoring / dbengine)

Typical compression ratio of 80%

Disk Usage (1 day)

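Assumptions

How often metrics are collected (update every) = 5
Metrics collected each time = 2000
Space used per metric = 4 bytes
Without compression
 => (2000 * 3600 * 24 * 4 / 5)/(1024^2) = 132 MB (with update every = 1 it would be 659 MB)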
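
The same estimate as a sketch, using the assumptions listed above (2000 metrics, 4 bytes each, no compression). It prints the result for "update every" 1 and 5 so the effect of the interval is visible; the function name is hypothetical.

```python
def daily_bytes(metrics: int, update_every: int, bytes_per_value: int = 4) -> float:
    """Uncompressed bytes written per day for the given collection interval."""
    samples_per_day = 24 * 3600 / update_every
    return metrics * samples_per_day * bytes_per_value


if __name__ == "__main__":
    for every in (1, 5):
        mib = daily_bytes(2000, every) / 1024**2
        print(f"update every = {every}: {round(mib)} MiB/day")
```

Multiply by the compression savings ratio seen on the dashboard to approximate the real on-disk size.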