Configuration

Prometheus Configuration

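Prometheus is configured in two ways. Command-line flags set parameters that cannot change while the process runs, mainly runtime parameters such as the data storage location. The configuration file sets application-level parameters such as scrape targets and alerting integration. Changes to the configuration file can be applied without restarting the process, also in two ways:

Send the SIGHUP signal to the process: kill -HUP <pid>
Send an HTTP POST request, which requires the --web.enable-lifecycle flag: curl -X POST http://192.168.66.112:9091/-/reload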

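Command-line configuration

Storage

Prometheus has several flags that configure local storage. The most important are:

--storage.tsdb.path: where Prometheus writes its data. Defaults to data/.
--storage.tsdb.retention.time: when to remove old data. Defaults to 15d. If set to anything other than the default, it overrides storage.tsdb.retention.
--storage.tsdb.retention.size: [EXPERIMENTAL] the maximum number of bytes that storage blocks may use (this does not include the WAL, which can be substantial). The oldest data is removed first. Defaults to 0, meaning disabled. This flag is experimental and may change in future releases. Supported units: KB, MB, GB, TB, PB. Example: 512MB.
--storage.tsdb.wal-compression: enables compression of the write-ahead log (WAL). Depending on your data, you can expect the WAL size to be roughly halved at little extra CPU cost. Note that if you enable this flag and then downgrade Prometheus to a version below 2.11.0, you will need to delete the WAL because it will be unreadable.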
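As a rough illustration of how these flags might be passed in practice, here is a minimal sketch that assumes Prometheus runs from the prom/prometheus image under Docker Compose. The Compose file, service name, container paths, and the 30d retention value are assumptions for illustration, not part of the original setup.

```yaml
# docker-compose.yml (hypothetical file; names and paths are assumptions)
services:
  prometheus:
    image: prom/prometheus
    command:
      # Overriding the command replaces the image defaults,
      # so the config file flag is restated here.
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.wal-compression'
      - '--web.enable-lifecycle'
    ports:
      - '9090:9090'
```

With --web.enable-lifecycle set this way, the HTTP reload endpoint described earlier becomes available.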

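Configuration file basics

The configuration file is specified with the --config.file flag and is written in YAML. Its basic structure is:

# Global configuration
global:

# Rule files, mainly alerting rules
rule_files:

# Scrape configuration: what to scrape and how
scrape_configs:

# Alerting configuration
alerting:

# Remote write configuration
remote_write:

# Remote read configuration
remote_read:

Generic placeholders used in the configuration file: <boolean> is a boolean that takes the value true or false; <scheme> is a protocol scheme, either http or https. The default configuration file prometheus.yml looks like this: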

/etc/prometheus $ cat prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

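The default configuration file has four main blocks:

global block: global Prometheus settings. For example, scrape_interval sets how often Prometheus scrapes targets, and evaluation_interval sets how often it evaluates rules.
alerting block: the Alertmanager configuration.
rule_files block: the rule files to load, mainly alerting rules.
scrape_configs block: the targets Prometheus scrapes. A job named prometheus is configured by default because Prometheus exposes its own metrics over HTTP once it starts, so Prometheus monitors itself. This is of limited use in a real deployment, but it is a convenient example for learning how Prometheus works. Visit http://localhost:9090/metrics to see which metrics Prometheus exposes.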

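global fields

# scrape_interval: the global default interval between scrapes
[ scrape_interval: <duration> | default = 1m ]

# scrape_timeout: the global default timeout for a single scrape. If scrapes fail with "context deadline exceeded", set this field in the affected job
[ scrape_timeout: <duration> | default = 10s ]

# evaluation_interval: the global default interval between rule evaluations (mainly alerting rules)
[ evaluation_interval: <duration> | default = 1m ]

# external_labels: labels added to time series and alerts when this server communicates with external systems (federation, remote storage, Alertmanager)
external_labels:
  [ <labelname>: <labelvalue> ... ]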
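To make the field list concrete, a filled-in global block might look like the sketch below. The intervals and both label names and values are assumptions chosen for illustration.

```yaml
global:
  scrape_interval: 30s
  scrape_timeout: 10s        # must not exceed scrape_interval
  evaluation_interval: 30s
  external_labels:
    cluster: demo            # hypothetical label
    replica: prometheus-0    # hypothetical label
```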

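rule_files

Prometheus supports two types of rules: recording rules and alerting rules. To include rules in Prometheus, create a file containing the rule statements and have Prometheus load it through the rule_files field in the configuration.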
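A minimal sketch of the rule_files field follows. Entries are file paths and may be globs; both paths here are hypothetical.

```yaml
rule_files:
  - "rules/recording.yml"    # hypothetical path
  - "rules/alerts/*.yml"     # hypothetical glob
```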

Rule groups (rule_group)

Both recording rules and alerting rules must belong to a group:

groups:
  # Name of the group
  - name: example
    # Rules in this group
    rules: [- <rule> ...]

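Defining recording rules

Some queries are expensive to run, and some require long, cumbersome expressions. Recording rules precompute such queries and store the results as new time series, so the data can be retrieved quickly when it is needed.

# Name of the time series to output to
record: <string>

# PromQL expression to evaluate, for example:
# expr: sum(http_inprogress_requests) by (job)
expr: <string>

# Labels to add or overwrite before storing the result
labels: [<labelname>: <labelvalue>]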

groups:
  - name: example
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum(http_inprogress_requests) by (job)
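Alerting rules, the other rule type mentioned earlier, use the same group structure. The sketch below builds on the recorded series from the example above; the group name, alert name, threshold, duration, and severity value are all assumptions.

```yaml
groups:
  - name: example-alerts                  # hypothetical group name
    rules:
      - alert: HighInProgressRequests     # hypothetical alert name
        # Fires when the recorded series stays above the (assumed) threshold.
        expr: job:http_inprogress_requests:sum > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} has many in-progress requests"
```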

Checking rules

# Dockerfile for a promtool image, used to check rule files
FROM golang:1.10

RUN GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go get -u github.com/prometheus/prometheus/cmd/promtool

FROM alpine:latest

COPY --from=0 /go/bin/promtool /bin
ENTRYPOINT ["/bin/promtool"]

# Build the image
docker build -t promtool:0.1 .
# Run the check
docker run --rm -v /root/test/prom:/opt promtool:0.1 check rules /opt/rule.yml
# Output
Checking /opt/rule.yml
  SUCCESS: 1 rules found

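scrape_configs

The scrape configuration defines what to scrape: targets, jobs, and instances.

job_name

The job name identifies a scrape unit. Every job inherits these defaults:

scrape_interval falls back to the global setting
scrape_timeout falls back to the global setting
metrics_path defaults to '/metrics'
scheme defaults to 'http'

Each of these can be overridden in an individual job: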

[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]
[ metrics_path: <path> | default = /metrics ]
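For instance, a job that needs a longer timeout and a non-default path could override the inherited values as sketched here. The job name, path, and target are hypothetical.

```yaml
scrape_configs:
  - job_name: slow-exporter              # hypothetical job
    scrape_interval: 1m                  # overrides global.scrape_interval
    scrape_timeout: 30s                  # overrides global.scrape_timeout
    metrics_path: /custom/metrics        # hypothetical path
    static_configs:
      - targets: ['localhost:9100']      # hypothetical target
```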

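honor_labels

Scraped data carries its own labels, and Prometheus attaches labels on the server side as well (such as job and instance, or labels set in the configuration), so the two can conflict:

true: keep the label values from the scraped data and ignore the conflicting server-side labels
false: rename the conflicting labels in the scraped data to exported_<original-label> (for example exported_job), then attach the server-side labels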

[ honor_labels: <boolean> | default = false ]

scheme

The protocol scheme used for scrape requests.

[ scheme: <scheme> | default = http ]

params

Optional URL query parameters:

params:
  [ <string>: [<string>, ...] ]
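Federation is one common setup that combines honor_labels, scheme, and params. The sketch below assumes a second Prometheus server reachable at a hypothetical host name; the match[] selector is likewise only an example.

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true          # keep the labels of the federated series
    scheme: http
    metrics_path: /federate
    params:
      'match[]':
        - '{job="prometheus"}'  # example selector
    static_configs:
      - targets: ['source-prometheus:9090']   # hypothetical host
```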

Scrape authentication

Credentials sent with every scrape request.

basic_auth

password and password_file are mutually exclusive:

basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

bearer_token

bearer_token and bearer_token_file are mutually exclusive:

[ bearer_token: <secret> ]
[ bearer_token_file: /path/to/bearer/token/file ]

tls_config

TLS settings for scraping HTTPS targets:

tls_config:
  [ ca_file: <filename> ]
  [ cert_file: <filename> ]
  [ key_file: <filename> ]
  [ server_name: <string> ]
  # Disable validation of the server certificate
  [ insecure_skip_verify: <boolean> ]
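Put together, a job scraping an HTTPS target with basic authentication might look like this sketch. Every name, path, and host in it is hypothetical.

```yaml
scrape_configs:
  - job_name: secure-app                         # hypothetical job
    scheme: https
    basic_auth:
      username: prometheus                       # hypothetical user
      # Read the password from a file instead of inlining it.
      password_file: /etc/prometheus/secrets/app-password
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt      # hypothetical path
      server_name: app.example.internal          # hypothetical name
    static_configs:
      - targets: ['app.example.internal:8443']   # hypothetical target
```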

proxy_url

Scrape targets through a proxy:

[ proxy_url: <string> ]

Service discovery

Prometheus supports many service discovery mechanisms. Their detailed configuration is not covered here:

# sd is short for service discovery
azure_sd_configs:
consul_sd_configs:
dns_sd_configs:
ec2_sd_configs:
openstack_sd_configs:
file_sd_configs:
gce_sd_configs:
kubernetes_sd_configs:
marathon_sd_configs:
nerve_sd_configs:
serverset_sd_configs:
triton_sd_configs:
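As one small example from this list, file-based discovery reads targets from files on disk. The sketch assumes a hypothetical directory of JSON target files.

```yaml
scrape_configs:
  - job_name: file-discovered                   # hypothetical job
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json      # hypothetical path
        # How often to re-read the files.
        refresh_interval: 5m
```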

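static_configs

Service discovery supplies scrape targets dynamically, whereas static_configs lists them statically. A static config is a plain list of targets, and labels can be attached directly in this field: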

- targets: [- '<host>']
  labels: [<labelname>: <labelvalue> ...]
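A filled-in version could look like the following sketch, where each target group gets its own label. Addresses and label values are hypothetical.

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['10.0.0.1:9100', '10.0.0.2:9100']   # hypothetical hosts
        labels:
          env: production
      - targets: ['10.0.1.1:9100']                    # hypothetical host
        labels:
          env: staging
```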

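Every target carries labels. With service discovery, the discovery mechanism attaches metadata labels that are available during relabeling. Consul, for example, provides:

__meta_consul_address: the address of the target
__meta_consul_dc: the datacenter of the target
__meta_consul_metadata_<key>: each node metadata key value of the target
__meta_consul_node: the node name defined for the target
__meta_consul_service_address: the service address of the target
__meta_consul_service_id: the service ID of the target
__meta_consul_service_port: the service port of the target
__meta_consul_service: the name of the service the target belongs to
__meta_consul_tags: the tags of the target, joined by the tag separator

These labels are the basis for filtering and aggregating data. Labels beginning with __ are removed once relabeling finishes, so copy the ones you need into regular labels.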
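A minimal sketch of using these metadata labels: the default replace action copies a source label into a regular target label so it survives relabeling. The Consul address and the target label names are assumptions.

```yaml
scrape_configs:
  - job_name: consul-services
    consul_sd_configs:
      - server: 'localhost:8500'    # hypothetical Consul agent address
    relabel_configs:
      # Copy the Consul service name into a regular label.
      - source_labels: [__meta_consul_service]
        target_label: service
      # Copy the datacenter into a regular label.
      - source_labels: [__meta_consul_dc]
        target_label: dc
```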

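Filtering

Scraped data is voluminous, especially from targets added through service discovery, so filtering matters. Scraping a target means collecting a series of metrics from it, and Prometheus filters by operating on labels. This is configured in the relabel_configs and metric_relabel_configs fields, both of which take a list of relabel_config entries. relabel_configs is applied to targets before the scrape; metric_relabel_configs is applied to scraped samples before they are stored. A relabel_config has these fields: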

[ source_labels: '[' <labelname> [, ...] ']' ]

[ separator: <string> | default = ; ]

[ target_label: <labelname> ]

[ regex: <regex> | default = (.*) ]

[ modulus: <uint64> ]

[ replacement: <string> | default = $1 ]

# Besides the default replace, the available actions are keep, drop, hashmod, labelmap, labeldrop, and labelkeep
[ action: <relabel_action> | default = replace ]

Target relabeling example

relabel_configs:
  - source_labels: [job]
    regex: (.*)some-[regex]
    action: drop
  - source_labels: [__address__]
    modulus: 8
    target_label: __tmp_hash
    action: hashmod
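The hashmod action on its own only writes a bucket number into a label. A common follow-up, sketched here as an assumption about how the example might be completed, is a keep rule so that each Prometheus server scrapes just one bucket of targets.

```yaml
relabel_configs:
  # Hash each target address into one of 8 buckets (0 to 7).
  - source_labels: [__address__]
    modulus: 8
    target_label: __tmp_hash
    action: hashmod
  # Keep only bucket 0 on this server; another server would keep 1, and so on.
  - source_labels: [__tmp_hash]
    regex: '0'
    action: keep
```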

Metric relabeling example for a target

- job_name: cadvisor
  ...
  metric_relabel_configs:
  - source_labels: [id]
    regex: '/system.slice/var-lib-docker-containers.*-shm.mount'
    action: drop
  - source_labels: [container_label_JenkinsId]
    regex: '.+'
    action: drop

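alerting fields

This block configures the integration with Alertmanager.

alert_relabel_configs

Works like relabel_configs in scrape_configs, but is applied to alerts before they are sent to Alertmanager, so alerts can be filtered or have their labels rewritten.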
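One possible use, sketched under the assumption that two Prometheus replicas differ only in a replica external label, is dropping that label so Alertmanager can deduplicate their alerts. The label name is hypothetical.

```yaml
alerting:
  alert_relabel_configs:
    # Remove the per-replica label before alerts are sent.
    - regex: replica        # hypothetical label name
      action: labeldrop
```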

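alertmanagers

Configures the Alertmanager instances that Prometheus sends alerts to and the connection parameters for reaching them. Alertmanager instances can be configured statically or found through service discovery. Prometheus pushes alerts to Alertmanager.

An alertmanager entry is configured much like a scrape target. The available fields are: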

[ timeout: <duration> | default = 10s ]
[ path_prefix: <path> | default = / ]
[ scheme: <scheme> | default = http ]
basic_auth:
  [ username: <string> ]
  [ password: <string> ]
  [ password_file: <string> ]
[ bearer_token: <string> ]
[ bearer_token_file: /path/to/bearer/token/file ]
tls_config:
  [ <tls_config> ]
[ proxy_url: <string> ]
azure_sd_configs:
  [ - <azure_sd_config> ... ]
consul_sd_configs:
  [ - <consul_sd_config> ... ]
dns_sd_configs:
  [ - <dns_sd_config> ... ]
ec2_sd_configs:
  [ - <ec2_sd_config> ... ]
file_sd_configs:
  [ - <file_sd_config> ... ]
gce_sd_configs:
  [ - <gce_sd_config> ... ]
kubernetes_sd_configs:
  [ - <kubernetes_sd_config> ... ]
marathon_sd_configs:
  [ - <marathon_sd_config> ... ]
nerve_sd_configs:
  [ - <nerve_sd_config> ... ]
serverset_sd_configs:
  [ - <serverset_sd_config> ... ]
triton_sd_configs:
  [ - <triton_sd_config> ... ]
static_configs:
  [ - <static_config> ... ]
relabel_configs:
  [ - <relabel_config> ... ]
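A small static setup using a few of these fields might look like the sketch below. Both Alertmanager host names are hypothetical.

```yaml
alerting:
  alertmanagers:
    - scheme: http
      path_prefix: /
      timeout: 10s
      static_configs:
        - targets:
            - 'alertmanager-1:9093'   # hypothetical host
            - 'alertmanager-2:9093'   # hypothetical host
```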

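Remote read and write

Prometheus can read from and write to remote storage, configured in the remote_read and remote_write fields.

remote_read

# URL of the endpoint to read from
url: <string>

# Equality matchers that must be present in a selector for the remote endpoint to be queried
required_matchers:
  [ <labelname>: <labelvalue> ... ]

[ remote_timeout: <duration> | default = 1m ]

# Whether to also query the remote endpoint for time ranges that local storage should fully cover
[ read_recent: <boolean> | default = false ]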

basic_auth:
  [ username: <string> ]
  [ password: <string> ]
  [ password_file: <string> ]
[ bearer_token: <string> ]
[ bearer_token_file: /path/to/bearer/token/file ]
tls_config:
  [ <tls_config> ]
[ proxy_url: <string> ]
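As a sketch, a remote_read entry that is only consulted for selectors containing a given label could be written as follows. The endpoint URL and the label are hypothetical.

```yaml
remote_read:
  - url: http://remote-storage.example.internal:9201/read   # hypothetical endpoint
    remote_timeout: 1m
    read_recent: false
    required_matchers:
      env: production     # hypothetical label
```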

remote_write

url: <string>

[ remote_timeout: <duration> | default = 30s ]

# Relabeling applied to samples before they are sent, used to filter what is written
write_relabel_configs:
  [ - <relabel_config> ... ]

basic_auth:
  [ username: <string> ]
  [ password: <string> ]
  [ password_file: <string> ]

[ bearer_token: <string> ]

[ bearer_token_file: /path/to/bearer/token/file ]

tls_config:
  [ <tls_config> ]

[ proxy_url: <string> ]

# Fine-grained tuning of the remote write queue; only the official comments are listed here
queue_config:
  # Number of samples to buffer per shard before we start dropping them.
  [ capacity: <int> | default = 10000 ]
  # Maximum number of shards, i.e. amount of concurrency.
  [ max_shards: <int> | default = 1000 ]
  # Maximum number of samples per send.
  [ max_samples_per_send: <int> | default = 100]
  # Maximum time a sample will wait in buffer.
  [ batch_send_deadline: <duration> | default = 5s ]
  # Maximum number of times to retry a batch on recoverable errors.
  [ max_retries: <int> | default = 3 ]
  # Initial retry delay. Gets doubled for every retry.
  [ min_backoff: <duration> | default = 30ms ]
  # Maximum retry delay.
  [ max_backoff: <duration> | default = 100ms ]
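A remote_write entry combining filtering and queue tuning might look like this sketch. The endpoint URL, the metric name pattern, and the max_shards value are assumptions; the remaining queue values repeat the defaults listed above.

```yaml
remote_write:
  - url: http://remote-storage.example.internal:9201/write   # hypothetical endpoint
    remote_timeout: 30s
    write_relabel_configs:
      # Do not forward series whose metric name starts with go_ (assumed pattern).
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
    queue_config:
      capacity: 10000
      max_shards: 100
      max_samples_per_send: 100
      batch_send_deadline: 5s
```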
