Correlator status log lines

UPDATE (17/10/2019): there is now a community-developed tool that can help with analyzing correlator log files – see https://github.com/ApamaCommunity/apama-log-analyzer for more information.

Correlator status lines let you monitor the runtime operational status of a running correlator. By default, the correlator logs a status line every 5 seconds while it is running.

You can specify the interval at which the correlator logs status lines by providing the --logQueueSizePeriod option when you start the correlator.

Where to look for status log lines

The default behavior when a correlator is run from the command line is for the correlator log (which includes the status information) to be written to standard output. When run from Designer, status lines are shown in Console view.

Format of correlator status lines

When logging at INFO level, the status lines are in the format shown below. The meaning of each field is described in the Meaning of fields section.

Correlator Status: sm=11 nctx=1 ls=60 rq=0 eq=0 iq=0 oq=0 icq=0 lcn="<none>" lcq=0 lct=0.0 rx=5 tx=17 rt=31 nc=2 vm=325556 pm=81068 runq=0 si=915.3 so=0.0 srn="<none>" srq=0
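Because each field is a space-separated key=value pair, status lines are straightforward to parse mechanically. As an illustrative sketch (plain Python, not part of Apama), a helper along these lines turns a status line into a dictionary:

```python
import re

# Matches key=value pairs in a correlator status line; values are either
# bare tokens (numbers, "<none>") or quoted strings such as lcn="<none>".
PAIR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

def parse_status_line(line):
    """Return a dict of field name -> value for a 'Correlator Status:' line."""
    fields = {}
    for key, quoted, bare in PAIR_RE.findall(line.split("Correlator Status:", 1)[1]):
        value = quoted if quoted else bare
        # Convert numeric fields so they can be compared or plotted directly.
        try:
            value = float(value) if "." in value else int(value)
        except ValueError:
            pass  # non-numeric values (e.g. "<none>") stay as strings
        fields[key] = value
    return fields

line = ('Correlator Status: sm=11 nctx=1 ls=60 rq=0 eq=0 iq=0 oq=0 icq=0 '
        'lcn="<none>" lcq=0 lct=0.0 rx=5 tx=17 rt=31 nc=2 vm=325556 '
        'pm=81068 runq=0 si=915.3 so=0.0 srn="<none>" srq=0')
status = parse_status_line(line)
print(status["sm"], status["vm"], status["lcn"])  # → 11 325556 <none>
```

Numeric fields are converted so they can be tracked over time; quoted fields such as lcn keep their string values.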

How to redirect logs to file

To send correlator logs to a file, specify the -f file or --logfile file option, where file is the name of the file that you want to be the log file.

If you specify the name of an existing file, the correlator overwrites it on Windows or appends to it on UNIX. To avoid overwriting the log, include ${START_TIME} in the log file name when you start the correlator.

For example: --logfile correlator_${START_TIME}.log

This produces a log file with a name such as correlator_2016-10-06_16:51:33.log

In UNIX shells, you need to escape the $ so the shell does not expand the tag itself, for example: correlator -l license --logfile correlator_\${START_TIME}.log when starting the correlator.
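To see why the escaping matters, here is a small illustrative shell experiment (it only manipulates strings and does not run the correlator):

```shell
# Without the backslash, the shell expands ${START_TIME} itself (usually to
# an empty string) before the correlator ever sees the argument.
START_TIME=""   # simulate the variable being unset/empty in the shell
unescaped="correlator_${START_TIME}.log"
escaped="correlator_\${START_TIME}.log"
echo "$unescaped"   # → correlator_.log  (tag lost)
echo "$escaped"     # → correlator_${START_TIME}.log  (tag preserved for the correlator)
```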

How status log lines can be useful

Status lines are useful for diagnosing correlator-busy scenarios when connected to external applications, and for identifying problems such as memory leaks and listener leaks. For example:

  • If vm increases continuously, it may be an indication of a memory leak.
  • An increase in ls may indicate a listener leak.
  • If the runq parameter is consistently more than 0 and latency is a problem, this indicates a performance issue: the application is not keeping up. You should reduce the amount of work being done in the correlator, or add more or faster CPUs to your configuration.

In this way, you can assess the performance of applications based on the values of the fields shown in status lines.
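To make such trend checks concrete, here is an illustrative Python sketch (the growth check and the sample values are assumptions for the example, not rules from Apama itself) that flags a field that rises on every sample:

```python
def is_steadily_increasing(samples, min_growth=0):
    """True if every consecutive sample grows by more than min_growth.

    samples: values of one status field (e.g. vm or ls) taken from
    successive status lines, oldest first.
    """
    return len(samples) >= 2 and all(
        later - earlier > min_growth
        for earlier, later in zip(samples, samples[1:])
    )

# vm and ls values sampled from successive status lines (illustrative numbers):
vm_samples = [325556, 330120, 336980, 345770, 356200]
ls_samples = [60, 60, 61, 60, 60]

if is_steadily_increasing(vm_samples):
    print("vm is rising on every sample - possible memory leak")
if not is_steadily_increasing(ls_samples):
    print("ls is steady - no sign of a listener leak")
```

In practice you would feed this from parsed log lines and look at trends over minutes or hours, not a handful of samples.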

Meaning of fields

The following describes the meaning of each field in correlator status lines, together with notes on how issues can be diagnosed from the values:

sm – Number of sub-monitors (monitor instances). This number should usually be steady. A consistent increase indicates that monitors are being spawned but not terminating, which will eventually lead to an out-of-memory condition.

nctx – Number of contexts in the correlator, including the main context plus any user-defined contexts. This number should usually be steady.

ls – Number of listeners, summed across all contexts; this includes each on statement and each stream source template. This number should usually be steady. If "ls" is increasing but "sm" isn't, the culprit is probably custom MonitorScript or JMon code that is creating listeners that are never satisfied and never quit.

rq – Route queue: the sum of the number of routed events on the input queues of all contexts. This number is typically 0 or near 0. If it is largish (>50), there is probably a very lengthy (possibly infinite) loop routing events.

eq – Enqueue queue: the number of events on the enqueued events queue. This number is typically 0 or near 0.

iq – Input queue: the sum of the number of entries on the input queues of all contexts. Every context has an input queue with a maximum size of 20,000 events, and it is perfectly normal to see this number in the hundreds. If it is steadily increasing, however, the contexts are not keeping up with incoming events and the correlator may eventually block senders: either the correlator simply cannot process the events at the rate they arrive, or there is back pressure from the output queue. This number is typically 0 or near 0.

oq – Output queue: the number of events waiting on the correlator's output queue to be dispatched to any registered receivers. The output queue is fed by the "emit" and "send...to" keywords, and there may be multiple processes attached as consumers. If any one of them cannot consume events at the rate the correlator is emitting them, the queue will back up and this number will start getting large. By default there is a limit of 10,000 events per receiver; if this limit is reached, the correlator will, by default, stop processing events and wait for the output queue to drain. This number is typically 0 or near 0; if it is growing, one or more receivers are slow.

icq – The total number of enqueued events on the input queues of all public contexts. This is a subset of the events included in the "iq" count. This number is typically 0 or near 0; if it is growing, contexts are not keeping up with incoming events.

lcn – Name of the slowest context.

lcq – For the context identified by "lcn", the number of entries on its input queue.

lct – For the context identified by "lcn", the time difference, in seconds, between its current logical time and the most recent time tick added to its input queue.

rx – Number of events the correlator has received from external sources since it was started. This number increases as events are received; it only increases and can never go down.

tx – Number of events the correlator has sent to receivers since it was started, including duplicate events sent to multiple receivers. For example, if you inject EPL code that emits an event and five receivers are subscribed to channels that publish that event, the "tx" count goes up by five: one event, sent five times. This number normally increases as events are emitted.

rt – Route total: the total number of events that have been routed across all contexts since the correlator was started. This number normally increases as events are routed.

nc – Number of consumers: the number of components connected to receive emitted events. This number should usually be steady.

vm – Virtual memory used by the correlator, in kilobytes. This number should usually be steady, though it may vary slightly.

pm – Physical memory used by the correlator, in kilobytes. This number should usually be steady, though it may vary slightly.

runq – Run queue: the number of contexts that have work to do but are not currently running. This number is typically 0 or near 0. If it increases, it indicates a performance issue: the application is not keeping up.

si – The rate at which pages are being read from swap space, in pages per second. This number is typically 0 or near 0. If "si" is non-zero, the host is running out of memory and will not perform well; this severely affects latency and may even prevent status lines from being logged regularly every 5 seconds.

so – The rate at which pages are being written to swap space, in pages per second. This number is typically 0 or near 0. As with "si", a non-zero value means the host is running out of memory and will not perform well.

srn – Slowest receiver name: the receiver with the largest number of incoming events waiting to be processed.

srq – Slowest receiver queue size: for the receiver identified by "srn", the number of events that are waiting.

Apama implements flow control for its queues. The input queue (one per context) has a maximum size of 20,000 events and the output queue (one per receiver) has a size of 10,000 events. If a receiver is slow, the output queue will grow until it is full and then block contexts that are sending to it. If contexts are not keeping up (or are blocked), the input queue will grow until it reaches 20,000 events and then block senders.
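The blocking behaviour can be illustrated with a toy bounded queue in Python. This is only an analogy for the flow control described above, not Apama's implementation; a tiny queue size is used so the effect shows quickly:

```python
import queue
import threading
import time

# Toy model: a bounded "output queue" and a deliberately slow receiver.
# The real correlator limits are 20,000 (input) and 10,000 (output);
# a size of 5 is used here so the blocking is quick to observe.
out_q = queue.Queue(maxsize=5)

def slow_receiver():
    while True:
        event = out_q.get()
        if event is None:          # sentinel: stop consuming
            break
        time.sleep(0.01)           # receiver is slower than the sender

t = threading.Thread(target=slow_receiver)
t.start()

blocked = 0
for i in range(20):
    try:
        out_q.put_nowait(i)        # sender does not wait...
    except queue.Full:
        blocked += 1               # ...so a full queue is back pressure
        out_q.put(i)               # now block, as the correlator does by default

out_q.put(None)
t.join()
print(f"sender was blocked {blocked} times by the full queue")
```

The sender quickly outpaces the receiver, fills the queue, and then has to wait for the receiver to drain it, which is exactly the "block senders" behaviour described above.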

Use the --qdisconnect command-line option to prevent the correlator from consuming an unbounded amount of memory and failing with an out-of-memory error when a receiver is slow and events continue to arrive.

Conclusion

The correlator status log lines are the first place you should look if you are experiencing problems (throughput, latency, memory use, high CPU, etc.) with Apama, and they give great insight into the internal status of the correlator. Please use the forum if you have any questions.

– Kusuma