Principal Systems · Reliability · Platform Engineering
I work across distributed infrastructure where compute, storage, networking, and workload behavior intersect.
My work follows the full data path: client behavior, compute, network fabric, storage, virtualization, orchestration, telemetry, and the assumptions teams make between those layers.
I am most useful on ambiguous platform issues, where the system is technically available but workload behavior, latency, throughput, or reliability shows that something deeper is wrong.
Why me
I do not reduce infrastructure work to a product list. The hard problems usually live in the interaction between layers: a workload pattern that exposes a storage constraint, a fabric behavior that looks like an application problem, a telemetry gap that hides the true failure domain, or a platform decision that only fails under pressure.
That is the work I am built for: tracing reality, narrowing ambiguity, explaining the path clearly, and helping teams move from scattered symptoms to a practical operating picture.
Core domains
GPU workload pressure, scheduler placement, host behavior, firmware, drivers, Linux, and resource contention.
Scale-out NAS, S3/object, metadata behavior, NVMe/NVMe-oF, data-path latency, and platform constraints.
RDMA, RoCE, InfiniBand, Ethernet, MTU, PMTUD, congestion, retransmits, and fabric-level evidence.
Data access patterns, throughput, pipeline reliability, GPU starvation, and infrastructure behavior under load.
Proof signal
What I actually do
I help determine why infrastructure that appears healthy is still failing the workload, customer, or operational objective.
I trace behavior across client, compute, network, storage, virtualization, telemetry, and orchestration layers.
I reason through NAS, S3/object, metadata, NVMe/NVMe-oF, RDMA, latency, throughput, and topology constraints.
I look at GPU utilization, data access patterns, scheduling, fabric behavior, and storage pressure as one system.
I turn scattered symptoms into an operating picture that engineering, support, product, and customer teams can act on.
I build repeatable diagnostics, evidence collection, runbooks, and workflows that reduce time-to-understanding.