Application Monitoring
To put Prometheus to real use in a production environment, we need to monitor more than the Prometheus server itself: both the host machines and the applications running on them should expose metrics.
Host Node Monitoring
First, let's collect metrics from the servers themselves. This requires installing node_exporter, which exposes machine-level metrics on systems with a *NIX kernel such as Linux and FreeBSD. (If your servers run Windows, a separate exporter such as windows_exporter fills the same role.) Once node_exporter is running on a host, you can verify that it is exposing metrics:
$ curl http://localhost:9100/metrics
Then add a scrape job for it in Prometheus's configuration:
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["192.168.0.107:9090"]
  - job_name: "server"
    static_configs:
      - targets: ["192.168.0.107:9100"]
In general, node_exporter should run directly on the host rather than in a container, because it needs access to host-level resources. If you do run it under Docker, the host's network and PID namespaces and root filesystem have to be mapped in:
$ docker run -d \
    --net="host" \
    --pid="host" \
    -v "/:/host:ro,rslave" \
    quay.io/prometheus/node-exporter \
    --path.rootfs /host
Python Application Monitoring
Here we take a simple Python HTTP server as an example. The prometheus_client library's start_http_server starts a metrics endpoint in a background thread, on a port separate from the application itself:
import http.server
from prometheus_client import start_http_server

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()
Visiting http://localhost:8000/ then returns the current metrics in the Prometheus text exposition format:
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 372.0
python_gc_objects_collected_total{generation="1"} 0.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 34.0
python_gc_collections_total{generation="1"} 3.0
python_gc_collections_total{generation="2"} 0.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="8",patchlevel="0",version="3.8.0"} 1.0
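This exposition format is plain text and deliberately simple to parse. As a rough illustration only (for real use, prometheus_client ships a proper parser in prometheus_client.parser), the sample lines above can be read like this; parse_exposition is a hypothetical helper name, not part of the client API:

```python
# Minimal sketch: parse Prometheus text exposition lines into a
# {'name{labels}': value} dict, skipping # HELP / # TYPE comments.
# Illustration only; use prometheus_client.parser for real parsing.
def parse_exposition(text):
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # comment or blank line
            continue
        name_part, value = line.rsplit(" ", 1)
        samples[name_part] = float(value)
    return samples

metrics = parse_exposition("""\
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 34.0
python_gc_collections_total{generation="1"} 3.0
""")
print(metrics['python_gc_collections_total{generation="0"}'])  # 34.0
```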
Then add a scrape job for the application in prometheus.yml:
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: example
    static_configs:
      - targets:
          - localhost:8000
These metrics can now be queried in the Prometheus expression browser. The Python client offers four primary metric types: Counter, Gauge, Summary, and Histogram.

Counter
A counter tracks things that only ever go up, such as the number of requests served:

import http.server
from prometheus_client import Counter

REQUESTS = Counter('hello_worlds_total',
        'Hello Worlds requested.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.inc()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
We can then use an expression such as rate(hello_worlds_total[1m]) to see the per-second rate at which the endpoint is being requested:
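Conceptually, rate() estimates the per-second increase of a counter over the given window. A simplified model of that calculation (simple_rate is a hypothetical name; the real PromQL function additionally handles counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value)
    samples in a window. Ignores counter resets and boundary
    extrapolation, which the real PromQL rate() handles."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Counter went from 100 to 160 over 60 seconds -> 1 request/second.
print(simple_rate([(0, 100), (30, 131), (60, 160)]))  # 1.0
```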

Another common scenario is tracking the error ratio:
import random
from prometheus_client import Counter

REQUESTS = Counter('hello_worlds_total',
        'Hello Worlds requested.')
EXCEPTIONS = Counter('hello_world_exceptions_total',
        'Exceptions serving Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.inc()
        with EXCEPTIONS.count_exceptions():
            if random.random() < 0.2:
                raise Exception
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
The expression rate(hello_world_exceptions_total[1m]) / rate(hello_worlds_total[1m]) then gives the ratio of requests that raised an exception. count_exceptions can also be used as a decorator:
EXCEPTIONS = Counter('hello_world_exceptions_total',
        'Exceptions serving Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    @EXCEPTIONS.count_exceptions()
    def do_GET(self):
        # ...
A counter can also be incremented by an arbitrary non-negative amount, for example to track how many euros each request brings in:

import random
from prometheus_client import Counter

REQUESTS = Counter('hello_worlds_total',
        'Hello Worlds requested.')
SALES = Counter('hello_world_sales_euro_total',
        'Euros made serving Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.inc()
        euros = random.random()
        SALES.inc(euros)
        self.send_response(200)
        self.end_headers()
        self.wfile.write("Hello World for {} euros.".format(euros).encode())
Gauge
A gauge is a snapshot of some current state, and can go both up and down. For example, the number of requests currently in progress and the time of the last request:

import time
from prometheus_client import Gauge

INPROGRESS = Gauge('hello_worlds_inprogress',
        'Number of Hello Worlds in progress.')
LAST = Gauge('hello_world_last_time_seconds',
        'The last time a Hello World was served.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        INPROGRESS.inc()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
        LAST.set(time.time())
        INPROGRESS.dec()
These metrics can be used directly in the expression browser with no further processing. For example, time() - hello_world_last_time_seconds tells you how many seconds have elapsed since the last request.
Gauge also provides convenience helpers for exactly these patterns: track_inprogress as a decorator, and set_to_current_time:

from prometheus_client import Gauge

INPROGRESS = Gauge('hello_worlds_inprogress',
        'Number of Hello Worlds in progress.')
LAST = Gauge('hello_world_last_time_seconds',
        'The last time a Hello World was served.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    @INPROGRESS.track_inprogress()
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
        LAST.set_to_current_time()
Summary
When you are trying to understand a system's performance, the time your application takes to respond to requests, or the latency of a backend, is a vital metric. Other instrumentation systems offer some form of Timer for this; in Prometheus, the Summary metric tracks both the number of observed events and the sum of the observed values:
import time
from prometheus_client import Summary

LATENCY = Summary('hello_world_latency_seconds',
        'Time for a request Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        start = time.time()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
        LATENCY.observe(time.time() - start)
rate(hello_world_latency_seconds_count[1m]) returns the per-second rate of requests, while rate(hello_world_latency_seconds_sum[1m]) is the amount of time spent responding to requests per second. Dividing the two gives the average latency over the last minute. The full expression for average latency is:
rate(hello_world_latency_seconds_sum[1m]) / rate(hello_world_latency_seconds_count[1m])
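The same arithmetic can be traced by hand from snapshots of the _count and _sum series. A toy illustration of what the division computes (summary_average is a hypothetical name, and the boundary extrapolation of the real rate() is ignored):

```python
def summary_average(window):
    """Average latency over a window of (count, sum) snapshots, mirroring
    rate(..._sum[1m]) / rate(..._count[1m]): seconds spent per second,
    divided by requests per second, yields seconds per request."""
    (c0, s0), (c1, s1) = window[0], window[-1]
    return (s1 - s0) / (c1 - c0)

# 120 requests took a combined 6 seconds during the window -> 50 ms average.
print(summary_average([(1000, 40.0), (1120, 46.0)]))  # 0.05
```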
Histogram
A summary gives you the average latency, but for quantiles you need a Histogram, which records each observation into one of a set of configurable buckets. Its time() helper also works as a decorator:

from prometheus_client import Histogram

LATENCY = Histogram('hello_world_latency_seconds',
        'Time for a request Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    @LATENCY.time()
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
This produces a family of time series named hello_world_latency_seconds_bucket, one cumulative counter per bucket. To compute the 0.95 quantile (the latency that 95% of requests fall within) over the last minute, use histogram_quantile:
histogram_quantile(0.95, rate(hello_world_latency_seconds_bucket[1m]))
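histogram_quantile estimates the quantile by finding the bucket where the target rank falls and interpolating linearly inside it. A simplified sketch of that logic (simple_histogram_quantile is a hypothetical name; the real PromQL implementation also handles edge cases such as empty histograms and non-monotonic bucket counts):

```python
import math

def simple_histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound,
    ending with (inf, total). Linear interpolation inside the target
    bucket, in the spirit of PromQL's histogram_quantile."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf; fall back to the
                # last finite bucket boundary.
                return prev_bound
            return prev_bound + (bound - prev_bound) * \
                (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 90 of 100 observations were <= 0.5s, so the 95th percentile
# lands in the +Inf bucket and is reported as 0.5.
print(simple_histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (math.inf, 100)]))  # 0.5
```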
Buckets
The default buckets cover a range of latencies from a few milliseconds up to 10 seconds, which suits typical web applications; you can override them when declaring the metric:
LATENCY = Histogram('hello_world_latency_seconds',
        'Time for a request Hello World.',
        buckets=[0.0001, 0.0002, 0.0005, 0.001, 0.01, 0.1])
If you want linear or exponential buckets, a Python list comprehension does the job:
buckets=[0.1 * x for x in range(1, 10)] # Linear
buckets=[0.1 * 2**x for x in range(1, 10)] # Exponential
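For illustration, these comprehensions produce the following bucket bounds (linear and exponential are just local variable names here, not part of the client API; round() is added only to keep the printed floats tidy):

```python
# Nine linear buckets: 0.1, 0.2, ..., 0.9 seconds.
linear = [round(0.1 * x, 4) for x in range(1, 10)]
# Nine exponential buckets, doubling each step: 0.2, 0.4, ..., 51.2 seconds.
exponential = [round(0.1 * 2**x, 4) for x in range(1, 10)]
print(linear)
print(exponential)
```

Either list can be passed directly as the buckets= argument of Histogram above.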