Microsoft Idea

Proactively monitor batch servers for high CPU usage / Batch jobs executing too long should send an alert

Sven Van Gils on 4/6/2023 11:12:38 AM

We have found that when a batch job has an issue, it can consume up to 100% CPU on the batch servers and occupy all 12 threads. This does not show up in any proactive monitoring: the other batch jobs are blocked and no alert is sent. The issue only becomes visible when we open the LCS environment telemetry, or when we notice batch jobs that remain in the Executing status.

As a result, all batch-related business processes are blocked until the batch job causing the issue is cancelled or aborted manually. In a 24/5 business this leads to a business disruption, because nothing detects the problem proactively.

Expected behaviour: if a batch job consumes up to 100% of resources over a longer period of time (a threshold that can be set by the sysadmin), the batch job should be cancelled automatically, or at least an alert should be sent out so that we can cancel the job.

Alternative: if a batch job takes too long to execute (a threshold to be set on the batch job), an alert is sent out.
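
To make the two thresholds above concrete, here is a rough sketch of the check we have in mind. It is only an illustration in Python, not an existing Dynamics 365 or LCS API: the names BatchJobSample, evaluate_job, max_runtime, max_high_cpu and cancel_on_high_cpu are all hypothetical, and it assumes the platform can already tell since when a job has been saturating the CPU.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BatchJobSample:
    """Hypothetical snapshot of one executing batch job (not a real D365 type)."""
    job_name: str
    started_at: datetime
    # Time since which the job has been saturating the CPU; None if it is not.
    high_cpu_since: Optional[datetime] = None


def evaluate_job(
    sample: BatchJobSample,
    now: datetime,
    max_runtime: timedelta,
    max_high_cpu: timedelta,
    cancel_on_high_cpu: bool = False,
) -> str:
    """Return 'cancel', 'alert' or 'ok' for a single executing batch job."""
    # Expected behaviour: sustained high CPU beyond the sysadmin threshold.
    if (
        sample.high_cpu_since is not None
        and now - sample.high_cpu_since >= max_high_cpu
    ):
        return "cancel" if cancel_on_high_cpu else "alert"
    # Alternative: the job has simply been executing for too long.
    if now - sample.started_at >= max_runtime:
        return "alert"
    return "ok"


now = datetime(2023, 4, 6, 12, 0, 0)
stuck = BatchJobSample(
    job_name="StuckJob",  # hypothetical job name
    started_at=now - timedelta(hours=3),
    high_cpu_since=now - timedelta(minutes=45),
)
print(evaluate_job(stuck, now, timedelta(hours=2), timedelta(minutes=30)))
```

The high CPU check runs first so that a job blocking the batch servers is handled even when it has not yet exceeded its own runtime threshold. Whether the outcome is an alert or an automatic cancellation is left as a setting, since both options are acceptable to us.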

In AX 2012 we had this kind of basic monitoring set up through SQL monitoring, but in SaaS that is no longer possible.

Note: Batch priority scheduling is enabled.

STATUS DETAILS

New