Distributed Infrastructure · AI/HPC Infrastructure · Storage Platforms · Virtualization · Reliability Engineering
I work across distributed infrastructure where compute, storage, networking, and workload behavior have to be understood together.
My work follows the full data path: client behavior, compute, network fabric, storage, virtualization, orchestration, telemetry, and the assumptions teams make between those layers.
I am most useful on ambiguous platform issues, where the system is technically available but workload behavior, latency, throughput, or reliability shows that something deeper is wrong.
I solve problems at the boundaries between systems, teams, and assumptions.
I do not reduce infrastructure work to a product list. The hard problems usually live in the interaction between layers: a workload pattern that exposes a storage constraint, a fabric behavior that looks like an application problem, a telemetry gap that hides the true failure domain, or a platform decision that only fails under pressure.
That is the work I am built for: tracing reality, narrowing ambiguity, explaining the path clearly, and helping teams move from scattered symptoms to a practical operating picture.
Compute, storage, networking, and workload behavior.
GPU workload pressure, Slurm/Kubernetes scheduler behavior, host behavior, firmware, drivers, Linux, and resource contention.
Scale-out NAS, S3/object, metadata behavior, NVMe/NVMe-oF, data-path latency, and platform constraints.
RDMA, RoCE, InfiniBand, Ethernet, MTU, PMTUD, congestion, retransmits, and fabric-level evidence.
Data access patterns, throughput, pipeline reliability, workload readiness, GPU starvation, and infrastructure behavior under load.
Practical work, not buzzwords.
I help determine why infrastructure that appears healthy is still failing the workload, customer, or operational objective.
I translate ambiguous requirements into practical platform designs, validation paths, runbooks, and implementation guidance.
I reason through NAS, S3/object, metadata, NVMe/NVMe-oF, RDMA, latency, throughput, and topology constraints.
I look at GPU utilization, data access patterns, scheduling, fabric behavior, and storage pressure as one system.
I turn scattered symptoms into an operating picture that engineering, support, product, and customer teams can act on.
I build repeatable diagnostics, evidence collection, automation-backed validation, runbooks, and workflows that reduce time-to-understanding.