Trace reality first
I start with observed behavior: packet flow, logs, counters, workload shape, timing, configuration, topology, and user impact.
These are the rules I use when signals are noisy, ownership boundaries are unclear, and the issue has to be resolved with evidence rather than assumption.
A system can be correct at idle and still fail under concurrency, fan-out, queue depth, metadata pressure, retry storms, or scheduler pressure.
Client, kernel, hypervisor, container runtime, fabric, storage service, scheduler, and telemetry layers all influence the final workload outcome.
Good telemetry reduces uncertainty. Bad telemetry creates confidence without proof. I care whether the signal explains behavior.
Useful diagnostics should be reproducible, documented, scriptable where practical, and understandable by the next engineer.
The work often requires turning low-level evidence into language that engineering, support, product, customer success, and leadership teams can act on.
A practical habit from years of customer escalation and formal issue-resolution coursework.
My diagnostic approach has been shaped by customer issue-resolution training, including Kepner-Tregoe coursework, but I do not present it as a rigid script. The useful part is the discipline it reinforces: slow the problem down, separate facts from assumptions, and keep the customer impact visible.
In practice, that means defining the problem clearly, asking where it is and is not seen, checking what changed, testing which evidence is strong enough to act on, and choosing the next safe step without pretending certainty.
That structure fits infrastructure work because complex failures rarely stay in one layer. The method supports judgment; it does not replace it.
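As a rough illustration of the "where it is and is not seen" step, the comparison can be sketched in a few lines of Python. This is a loose sketch under stated assumptions, not the Kepner-Tregoe procedure itself: the function name `is_is_not`, the field names, and the sample observations are all hypothetical and invented for the example.

```python
from collections import defaultdict


def is_is_not(observations):
    """Group attribute values by whether they appear only on failing
    observations, only on healthy ones, or on both.

    Each observation is a dict with a boolean "failed" key plus any
    descriptive attributes (all names here are hypothetical).
    """
    seen = defaultdict(lambda: {"fail": set(), "ok": set()})
    for obs in observations:
        bucket = "fail" if obs["failed"] else "ok"
        for key, value in obs.items():
            if key != "failed":
                seen[key][bucket].add(value)

    report = {}
    for key, buckets in seen.items():
        report[key] = {
            "only_failing": sorted(buckets["fail"] - buckets["ok"]),
            "only_healthy": sorted(buckets["ok"] - buckets["fail"]),
            "both": sorted(buckets["fail"] & buckets["ok"]),
        }
    return report


# Invented sample data, for illustration only.
observations = [
    {"host": "node-a", "kernel": "5.15", "fabric": "leaf-1", "failed": True},
    {"host": "node-b", "kernel": "5.15", "fabric": "leaf-1", "failed": True},
    {"host": "node-c", "kernel": "5.15", "fabric": "leaf-2", "failed": False},
    {"host": "node-d", "kernel": "6.1", "fabric": "leaf-2", "failed": False},
]

report = is_is_not(observations)
for attribute, split in report.items():
    print(attribute, split)
```

In this made-up data the kernel version 5.15 shows up on both failing and healthy hosts, so it is a weak lead, while the fabric attribute separates the two groups cleanly. That kind of split only suggests where to look next; it does not prove cause.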