==========================
OpenStack Rally monitoring
==========================

Rally is a testing framework for OpenStack. It is mainly used to detect
performance and concurrency issues.

At SOME_COMPANY_NAME, a dedicated VM in our cloud periodically runs a Rally
task against all our public clouds. This task collects performance metrics
for key OpenStack projects (e.g. Nova, Neutron, Glance, Keystone).

The periodic Rally job exports all metrics to the `ElasticSearch cluster`_,
and the historical results are shown as graphs in a dedicated
`Grafana dashboard`_.

Problem description
===================

What do the numbers on the graphs mean, and what should we do with them?

Proposed change
===============

Set up proper SLAs
------------------

Rally supports an SLA (success criteria) mechanism that marks a workload as
failed if an individual action (e.g. create network, boot VM) does not meet
the specified criteria. A criterion can be
``average duration of action X should not be bigger than N seconds``
or, more strictly,
``max duration of action X should not be bigger than N seconds``.

The existing historical data (about 2 months' worth) makes it possible to
identify critical points and set up proper SLAs for each action.
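
One possible way to turn the historical durations into SLA values is sketched
below. This is only an illustration under stated assumptions: the helper
``suggest_thresholds``, the 20% safety margin, the use of the 95th percentile
for the "max" limit, and the sample numbers are all hypothetical. The
``max_avg_duration_per_atomic`` key is assumed to match the SLA criterion
name in the installed Rally version and should be checked against it.

.. code-block:: python

    import statistics


    def suggest_thresholds(durations, margin=1.2):
        """Suggest (avg_limit, max_limit) in seconds from past durations.

        Hypothetical helper: the margin and the percentile are assumptions,
        not values taken from the real dashboards.
        """
        if len(durations) < 2:
            raise ValueError("need at least two historical samples")
        avg_limit = statistics.mean(durations) * margin
        # quantiles(n=20) returns 19 cut points; the last one is the 95th
        # percentile. method="inclusive" keeps it within the observed range.
        p95 = statistics.quantiles(durations, n=20, method="inclusive")[-1]
        max_limit = p95 * margin
        return round(avg_limit, 2), round(max_limit, 2)


    # Illustrative durations in seconds per atomic action (made-up numbers).
    history = {
        "neutron.create_network": [1.8, 2.1, 2.0, 2.4, 1.9, 2.2, 6.5, 2.0],
        "nova.boot_server": [14.0, 15.5, 13.8, 16.1, 15.0, 14.7, 15.2, 14.9],
    }

    # An "sla" section shaped like the one in a Rally task file (assumed key).
    sla = {
        "max_avg_duration_per_atomic": {
            action: suggest_thresholds(samples)[0]
            for action, samples in history.items()
        }
    }

Using a percentile rather than the raw maximum keeps a single outlier (such
as the ``6.5`` sample above) from inflating the "max" limit too much.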

Add alerts in case of failures
------------------------------

Even an ideal employee can forget to check the `Grafana dashboard`_ for
errors. Relying on manual checks is inconvenient and makes it possible to
miss something important.
Setting up alerts is therefore a critical part of the periodic Rally job.

There are two options: Sensu or MWS.
Since we already have a configured job that launches Rally, have our own SLA
mechanism, and do not want to flood anything except the
`ElasticSearch cluster`_, Sensu looks like overkill.
MWS (the approach suggested by the monitoring team) provides a simple REST
API (whereas Sensu supports only RabbitMQ) that can insert alerts into Moog
and then into ServiceNow.

The periodic Rally job should take care of:

* sending alerts about failures
* sending "Ok" notifications if the error is not persistent and the next
  operation succeeds.
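
The two responsibilities above can be sketched as a small decision function.
This is one possible reading of the requirement, not the actual job code: the
function name ``notification_for`` and the ``send`` callback are hypothetical,
and the real MWS REST call is deliberately left out because its API is not
described here.

.. code-block:: python

    def notification_for(previous_failed, current_failed):
        """Decide what to send after one periodic run.

        Returns "alert" while the run fails, "ok" on the first successful
        run after a failure, and None when there is nothing to report.
        """
        if current_failed:
            return "alert"
        if previous_failed:
            return "ok"
        return None


    def process_runs(run_results, send):
        """Walk through run results (True means failed) and notify via send.

        ``send`` is a placeholder for whatever delivers the message to the
        alerting system.
        """
        previous_failed = False
        for current_failed in run_results:
            kind = notification_for(previous_failed, current_failed)
            if kind is not None:
                send(kind)
            previous_failed = current_failed


    sent = []
    process_runs([False, True, True, False, False], sent.append)
    # sent is now ["alert", "alert", "ok"]

In a real periodic job the previous state would have to be persisted between
runs (for example in a file or in the metrics store), since each run is a
separate process.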

Extend periodic job with "smart SLA"
------------------------------------

Rally itself does not provide SLAs based on historical data.
The periodic job can be extended to analyze SLA failures (those related to
max or average durations, not to errors) and send them to the alerting system
only if the failure repeats at least two times in a row.
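
A minimal sketch of the "two times in a row" rule is shown below. The class
name ``RepeatedFailureFilter`` and its in-memory counters are assumptions made
for illustration. A real job would need to keep the counters between runs.

.. code-block:: python

    from collections import defaultdict


    class RepeatedFailureFilter:
        """Let an SLA failure through only after N consecutive failures."""

        def __init__(self, min_repeats=2):
            self.min_repeats = min_repeats
            # Consecutive duration-SLA failures seen so far, per action name.
            self._streaks = defaultdict(int)

        def observe(self, action, sla_failed):
            """Record one run; return True if an alert should be sent."""
            if not sla_failed:
                # A passing run resets the streak for this action.
                self._streaks[action] = 0
                return False
            self._streaks[action] += 1
            return self._streaks[action] >= self.min_repeats


    smart_sla = RepeatedFailureFilter(min_repeats=2)
    runs = [True, False, True, True, True]
    decisions = [smart_sla.observe("nova.boot_server", r) for r in runs]
    # decisions is [False, False, False, True, True]

The first failure and the isolated failure after a success are suppressed.
Only the second consecutive failure, and every one after it, is reported.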

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  c_akurilin

Other contributors:
  n/a