Sharing some useful info & tips on stalled jobs/builds. The most common reason for a stalled build is a process that refuses to shut down properly, typically because of either a debug statement or a cleanup procedure in a catch statement. Reproducing this can be hard, so these are the steps we recommend:
- Start a build on a branch and let it stall.
- Attach to the running job: `sem attach [job-id]`.
- You should now be inside the job’s virtual machine.
In the running instance, you can:
- List the running processes with `top`. Is there any suspicious process running?
- Run `strace` on the running process with `sudo strace -p <pid>` (where `<pid>` is its process ID) to see the last system call it is waiting on. For example, `select(1, ...` can mean the process is waiting for the user’s input.
- Look into the system metrics at `/tmp/system-metrics`. This tracks memory and disk usage. A lack of disk space or free memory can introduce unwanted stalling into jobs.
- Look into the Agent logs at `/tmp/agent_logs`. The logs can show that something is waiting for a condition.
- Look into the Job logs at `/tmp/job_logs.json`. These logs can also show that something is waiting for a condition.
- Check the syslog, which can also be a valuable source of information: `tail /var/log/syslog`. It can indicate ‘Out of memory’ conditions.
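Some of the checks above can be wrapped in small shell helpers to speed up a debugging session. The following is only a sketch: the function names are hypothetical, the stdin heuristic covers just the `select(1, ...` case mentioned above plus a blocking `read(0, ...`, and it assumes standard `grep`, `df`, and `awk` are available in the job's virtual machine.

```shell
#!/bin/sh
# Illustrative helpers for a session opened with `sem attach`.
# Function names are hypothetical; the syslog path comes from the steps above.

# Succeeds if a syscall line printed by strace looks like a wait on stdin.
# select() with nfds=1 only watches fd 0, and read(0, ...) reads from fd 0.
looks_like_stdin_wait() {
    case "$1" in
        'select(1,'*|'read(0,'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print how many lines of a log file mention "out of memory" (case-insensitive).
# grep -c still prints 0 when nothing matches but exits with 1, so ignore the status.
count_oom_lines() {
    grep -ci 'out of memory' "$1" || true
}

# Print the used-space percentage (digits only) of the filesystem holding a path.
# Defaults to the root filesystem.
disk_used_percent() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

Inside the attached session, `count_oom_lines /var/log/syslog` printing a non-zero number would point to memory pressure, and a high value from `disk_used_percent /` would point to a full disk. These helpers only summarize; the raw logs remain the place to confirm what actually happened.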
While this issue is ongoing, you might consider using a shorter `execution_time_limit` in your pipelines. This prevents stalled builds from running for a full hour, so they fail sooner.
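As a rough sketch, a shorter limit in a pipeline YAML file might look like the fragment below. The 15-minute value is only an illustration, the rest of the pipeline is omitted, and the exact placement and accepted units should be checked against the pipeline reference for your setup.

```yaml
# Pipeline fragment (illustrative value; the rest of the pipeline is omitted)
execution_time_limit:
  minutes: 15
```

Pick a value comfortably above the duration of a healthy build, so that only genuinely stalled jobs hit the limit.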