==========================
OpenStack Rally monitoring
==========================

Rally is a test framework for OpenStack. It is mainly used to check for
performance and concurrency issues.

At SOME_COMPANY_NAME, a dedicated VM in our cloud periodically runs a Rally
task against all our public clouds. This task collects performance metrics for
key OpenStack projects (e.g. Nova, Neutron, Glance, Keystone). The periodic
Rally job exports all metrics to the `ElasticSearch cluster`_, and the
historical results are shown as graphs in a dedicated `Grafana dashboard`_.

Problem description
===================

What do the numbers on the graphs mean, and what should we do with them?

Proposed change
===============

Setup proper SLAs
-----------------

Rally supports an SLA (success criterion) mechanism that marks a workload as
failed if an individual action (e.g. create network, boot VM) does not meet the
specified criteria. A criterion can be
``average duration of action X should not be bigger than N seconds`` or the
stricter ``max duration of action X should not be bigger than N seconds``.

The historical data collected so far (about 2 months) makes it possible to
identify critical points and set up proper SLAs for each action.

Add alerts in case of failures
------------------------------

Even an ideal employee can forget to check the `Grafana dashboard`_ for errors.
Relying on manual checks is inconvenient and makes it possible to miss
something important. Setting up alerts is therefore a critical part of the
periodic Rally job.

There are two options: Sensu or MWS. Since we already have a configured job
that launches Rally, have our own SLA mechanism, and do not want to flood
anything except the `ElasticSearch cluster`_, Sensu looks like overkill. MWS
(the way suggested by the monitoring team) provides a simple REST API (while
Sensu supports only RabbitMQ) which can insert alerts into Moog and then into
ServiceNow.


The periodic Rally job should take care of:

* sending alerts about failures
* sending "Ok" notifications if the error is not constant and the next
  operation succeeded.

Extend periodic job with "smart SLA"
------------------------------------

Rally itself does not provide SLAs based on historical data. The periodic job
can be extended to analyze SLA failures (those related to max or avg durations,
not to errors) and send them to the alerting system only if the failure repeats
at least two times in a row.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  c_akurilin

Other contributors:
  n/a
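
Sketch of the "smart SLA" filter
--------------------------------

The "smart SLA" rule proposed above (alert only when a duration-based SLA
failure repeats at least two times in a row, and send "Ok" once the action
recovers) could look roughly like the following. This is a minimal sketch, not
the actual periodic job: the class ``SmartSLAFilter``, the helper
``avg_sla_passed``, the event names ``"ALERT"`` and ``"OK"``, and the choice to
send one alert per failure streak are all assumptions made for illustration.
Delivery to MWS is deliberately left out because its API is not described here.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional, Set


def avg_sla_passed(durations: List[float], max_avg_seconds: float) -> bool:
    """Return True if the average duration stays within the limit.

    Hypothetical helper mirroring the "average duration of action X
    should not be bigger than N seconds" criterion.
    """
    return mean(durations) <= max_avg_seconds


@dataclass
class SmartSLAFilter:
    """Decide which SLA results are worth forwarding to the alerting system."""

    # Number of consecutive failed runs required before alerting
    # ("at least two times in a row" in the text).
    threshold: int = 2
    # Action name -> current count of consecutive failed runs.
    streaks: Dict[str, int] = field(default_factory=dict)
    # Actions that currently have an open alert.
    alerting: Set[str] = field(default_factory=set)

    def process(self, action: str, sla_passed: bool) -> Optional[str]:
        """Return "ALERT", "OK", or None for one run of one action."""
        if sla_passed:
            self.streaks[action] = 0
            if action in self.alerting:
                # The failure was not constant: close the open alert.
                self.alerting.discard(action)
                return "OK"
            return None

        self.streaks[action] = self.streaks.get(action, 0) + 1
        if self.streaks[action] >= self.threshold and action not in self.alerting:
            # Assumption: alert once per streak, not on every failed run.
            self.alerting.add(action)
            return "ALERT"
        return None


# Example: results of seven consecutive periodic runs for one action.
sla_filter = SmartSLAFilter()
runs = [True, False, True, False, False, False, True]
events = [sla_filter.process("create network", ok) for ok in runs]
print(events)
```

A single failed run between two successful ones produces no event, which is
the noise reduction the proposal aims for. Streaks are tracked per action, so
a slow ``boot VM`` does not affect the counter for ``create network``.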