Tuesday, August 21, 2007

Queue size monitoring

Before moving on to new features and improvements, we would like to build monitoring into the server. As a first step, we will measure how the queues grow.
Monitoring is always a trade-off between overhead and precision. Our monitoring class, QueueSizeMonitor, has a parameter called granularity, expressed in milliseconds, which determines how often measurements are taken. What does "taking a measurement" mean here?
Our monitoring of queue sizes is based on counting. The messaging server calls the count(long delta, String key) method on the monitoring object every time the size of any queue changes. The delta argument is the size of the change. The key argument is normally an event name (for example, "queue X grows" or "queue X shrinks"). The method accumulates values in thread-local storage, i.e. every thread has its own copy of the counters. Periodically, at an interval set by granularity, the data in thread-local storage is promoted to common storage. This operation involves some locking, which is why the smaller the granularity, the greater the monitoring overhead.
A background thread periodically flushes the collected data to a file. The frequency of flushing is determined by the window parameter: flushing interval = granularity * window. The file is written in an XML format that looks like the following:
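The counting scheme described above could be sketched roughly as follows. This is not the actual QueueSizeMonitor code: the class name CountingSketch, the Local holder, the snapshot() method and the overload that takes an explicit clock value are all assumptions made for illustration, and the sketch uses Java 8 library calls for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of thread-local counting with periodic promotion.
class CountingSketch {
    private final long granularityMillis;
    // Common storage shared by all threads; guarded by synchronized (common).
    private final Map<String, Long> common = new HashMap<>();

    // Per-thread state: accumulated deltas and the time of the last promotion.
    private static final class Local {
        final Map<String, Long> counts = new HashMap<>();
        long lastPromoted = -1; // -1 means "no call seen yet on this thread"
    }

    private final ThreadLocal<Local> local = ThreadLocal.withInitial(Local::new);

    CountingSketch(long granularityMillis) {
        this.granularityMillis = granularityMillis;
    }

    void count(long delta, String key) {
        count(delta, key, System.currentTimeMillis());
    }

    // The clock value is passed in so the behaviour can be checked deterministically.
    void count(long delta, String key, long nowMillis) {
        Local l = local.get();
        if (l.lastPromoted < 0) {
            l.lastPromoted = nowMillis;
        }
        // No locking here: only the current thread touches its own map.
        l.counts.merge(key, delta, Long::sum);
        if (nowMillis - l.lastPromoted >= granularityMillis) {
            promote(l, nowMillis);
        }
    }

    // Moves this thread's counters into common storage; the only step that locks.
    private void promote(Local l, long nowMillis) {
        synchronized (common) {
            for (Map.Entry<String, Long> e : l.counts.entrySet()) {
                common.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        l.counts.clear();
        l.lastPromoted = nowMillis;
    }

    // Returns a copy of the common storage.
    Map<String, Long> snapshot() {
        synchronized (common) {
            return new HashMap<>(common);
        }
    }
}
```

In this sketch a thread promotes its data only when it calls count again after the granularity interval has passed, so values from a thread that goes idle stay in its local storage. A real monitor would need some way to handle that case.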


<series>
<serie timestamp="1187734787600">
<item key="10|poll" value="54"/>
<item key="1|poll" value="165"/>
<item key="2|poll" value="4"/>
<item key="Distribution|poll" value="112"/>
<item key="Distribution|put" value="109"/>
</serie>
<serie timestamp="1187734787700">
<item key="1|poll" value="193"/>
<item key="2|poll" value="258"/>
<item key="4|poll" value="18"/>
<item key="5|poll" value="430"/>
<item key="7|poll" value="4"/>
<item key="8|poll" value="2"/>
<item key="Distribution|poll" value="452"/>
<item key="Distribution|put" value="454"/>
</serie>
</series>

Although the queue size monitor appends data incrementally, the XML file it produces is well-formed after every flush. On each flush it moves the file position back by 9 bytes (the size of the </series> tag), writes several <serie> elements, and then writes </series> again at the end. As a result, the XML will in most cases be well-formed even if the server crashes.
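One way to implement this kind of append is with java.io.RandomAccessFile. The sketch below is only an illustration: the class name SeriesAppender and its append method are hypothetical, and it assumes the file is either empty or ends with exactly </series> with no trailing newline.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical sketch: keeps the file well-formed by overwriting the closing tag.
class SeriesAppender {
    private static final byte[] OPEN = "<series>\n".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "</series>".getBytes(StandardCharsets.US_ASCII); // 9 bytes

    // serieXml is one or more complete <serie> elements, already serialized.
    static void append(Path file, String serieXml) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            if (raf.length() == 0) {
                // New file: start the document.
                raf.write(OPEN);
            } else {
                // Existing file: step back over the old closing tag.
                raf.seek(raf.length() - CLOSE.length);
            }
            raf.write(serieXml.getBytes(StandardCharsets.UTF_8));
            // Re-add the closing tag so the document is complete again.
            raf.write(CLOSE);
        }
    }
}
```

The file is incomplete only between the moment the old closing tag starts being overwritten and the moment the new one is written, which is why a crash leaves it well-formed in most cases rather than in all of them.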

I haven't yet worked out the best set of values to monitor or the best way of representing them. However, as a start, I added a simple class, Chart, that parses the XML produced by the queue size monitor and shows a line chart. It doesn't look very useful at the moment.
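As a rough illustration of the parsing half of such a class, the sketch below reads the XML format shown earlier into one time series per key, using the standard DOM parser. It is not the actual Chart code: the class name SeriesReader and its read method are hypothetical, and drawing the chart is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: turns the monitor's XML into key -> (timestamp -> value).
class SeriesReader {
    static Map<String, Map<Long, Long>> read(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<String, Map<Long, Long>> lines = new TreeMap<>();
        NodeList series = doc.getElementsByTagName("serie");
        for (int i = 0; i < series.getLength(); i++) {
            Element serie = (Element) series.item(i);
            long timestamp = Long.parseLong(serie.getAttribute("timestamp"));

            NodeList items = serie.getElementsByTagName("item");
            for (int j = 0; j < items.getLength(); j++) {
                Element item = (Element) items.item(j);
                // Each key becomes one line on the chart, ordered by timestamp.
                lines.computeIfAbsent(item.getAttribute("key"), k -> new TreeMap<>())
                     .put(timestamp, Long.parseLong(item.getAttribute("value")));
            }
        }
        return lines;
    }
}
```

Keys that are absent from a given <serie> element simply have no point at that timestamp, so a charting layer would have to decide whether to treat the gap as zero or skip it.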
