How to troubleshoot a stalling job

Hello there,

Sharing some useful info and tips on stalled jobs/builds. The most common reason for a stalled build is a process that refuses to shut down properly. The culprit is usually either a debug statement or a cleanup procedure in a catch statement. Reproducing this can be hard, so these are the steps we recommend:

  1. Start a build on a branch and let it stall.
  2. Attach to the running job: sem attach [job-id].
  3. You should now be inside the job’s virtual machine.

In the running instance, you can:

  • List the running processes with ps aux or top. Is there any suspicious process running?
  • Run strace on the suspicious process: sudo strace -p [pid] shows the last system call the process is waiting on. For example, select(1, ... can mean the process is waiting for user input.
  • Look into the system metrics at /tmp/system-metrics. This tracks memory and disk usage. A lack of disk space or free memory can introduce unwanted stalling into jobs.
  • Look into the Agent logs at /tmp/agent_logs. The logs could indicate waiting for some conditions.
  • Look into the Job logs at /tmp/job_logs.json. The logs could also indicate waiting for some conditions.
  • Check the syslog, as it can also be a valuable source of information: tail /var/log/syslog. It can indicate ‘Out of memory’ conditions.
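
To make the first and last checks quicker, here is a small sketch of two shell helpers you could paste into the attached session. The function names long_running and oom_lines are made up for this example, and the 600-second threshold is only an illustration. It assumes a Linux ps that supports the etimes column (elapsed seconds) and that awk and grep are available in the VM.

```shell
# Hypothetical helpers for the checks above; not part of any Semaphore tooling.

# Keep only the rows whose second column (elapsed seconds) is at least
# the given threshold. Expects "PID ELAPSED_SECONDS COMMAND..." rows on stdin.
long_running() {
  awk -v min="$1" '$2 + 0 >= min { print }'
}

# Print lines that mention an out-of-memory condition.
# Defaults to /var/log/syslog when no file is given.
oom_lines() {
  grep -i 'out of memory' "${1:-/var/log/syslog}"
}

# Example usage inside the attached VM:
#   ps -e -o pid= -o etimes= -o args= | long_running 600
#   oom_lines
```

Any PID that long_running prints is a reasonable candidate for the strace step, since it has been alive longer than you would expect for a normal command in the job.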

While this issue is ongoing, you might consider using a shorter execution_time_limit in your pipelines. This prevents stalled builds from running for a full hour, so they fail sooner.
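
As a rough sketch, a shorter limit could look like the fragment below. It assumes the limit is set at the top level of the pipeline YAML file with a minutes property, and the value of 15 is only an illustration, so please check the pipeline YAML reference for the exact placement that fits your setup.

```yaml
# Illustrative fragment of a pipeline YAML file (assumed placement: top level).
execution_time_limit:
  minutes: 15   # example value; pick something slightly above your normal build time
```

Choosing a value a little above your usual build duration keeps healthy builds unaffected while stalled ones fail quickly.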
