Building Data Centers for Ultra‑High‑Density AI: A Practical Checklist for DevOps and SREs


Alex Mercer
2026-04-08
8 min read

A field-guide checklist that turns power, liquid-cooling, and rack-density needs into procurement, deployment, and SRE runbook actions for multi-megawatt AI clusters.


This field guide translates power, liquid-cooling, and rack-density requirements for GPU and accelerator clusters into actionable procurement, deployment, and runbook items. It's written for technology professionals, developers, DevOps engineers, and SREs running or planning multi-megawatt AI data center deployments.

Why this checklist matters

Traditional colocation and enterprise data centers were designed around racks drawing a few kilowatts. Modern AI workloads push racks into tens or hundreds of kilowatts. That changes assumptions across procurement, contracts, cabling, cooling, monitoring, and operational runbooks. Use this guide to translate capacity planning and vendor datasheets into concrete acceptance tests and on-call procedures.

Quick glossary

  • Rack density: power or heat per rack, usually expressed in kW/rack.
  • Multi-megawatt: sites or pods delivering multiple megawatts (MW) of IT power.
  • Liquid cooling: heat removal approaches using liquid—includes RDHx (rear-door heat exchangers) and direct-to-chip cold plates.
  • Direct-to-chip (D2C): coolant delivered to a cold plate mounted on the processor/accelerator for high-efficiency heat transfer.
  • SRE runbook: documented operational procedures for incidents and maintenance.

Procurement checklist: lock the right guarantees

Procurement is where you translate power and cooling needs into contractual guarantees. Ask vendors and colocation providers for the following minimum assurances:

  1. Power capacity and delivery
    • Guaranteed dedicated power per rack in kW (not just campus-level MW). Example acceptance: "50 kW/rack guaranteed for 20 racks".
    • Breaker and feeder sizes documented (e.g., 3-phase 400 A or 800 A per row) and single-point-of-failure details; a feeder sanity check follows this list.
    • Metering by cabinet or pod (billing by kW rather than flat space).
    • Right to increase density and dates/costs for upgrades.
  2. Cooling guarantees
    • Support for your chosen method: RDHx (rear-door heat exchangers), in-row chillers, or direct-to-chip. The provider must list compatible hookups and fittings.
    • Minimum delta-T and flow rates for required rack kW. Example: "RDHx sized for 35 kW/rack with a 10°C delta-T at roughly 50 L/min per rack".
    • Redundancy SLAs for chillers and pumps (N+1 or better) and MTTR targets.
  3. Multi-megawatt readiness
    • Right to install multi-megawatt capacity within specified timelines (e.g., 3 months to provision 1 MW).
    • Transformer and substation capacity and site-level diversity information.
  4. Safety, compliance, and insurance
    • Leak detection and fluid compatibilities documented.
    • Compliance with local electrical and fire codes for liquid-cooled cabinets.

Capacity planning: translate kW and MW into racks and spares

Capacity planning turns abstract numbers into rack counts, spare parts, and procurement horizons; a sizing sketch follows the list below.

  1. Define density targets
    • Conservative baseline: 10–20 kW/rack (traditional high-performance racks).
    • Ultra-high-density: 40–80+ kW/rack (RDHx common) and 80–200+ kW/rack (direct-to-chip/D2C in specialized pods).
    • Plan with a buffer (e.g., provision 10–20% extra capacity for headroom and unexpected load spikes).
  2. Pod sizing
    • Define a pod as a logical unit (e.g., 10 racks at 50 kW = 0.5 MW). This simplifies scaling and colocation discussions.
    • Map pod power draw to breaker and transformer sizing up to the multi-megawatt level.
  3. Procurement lead times
    • GPUs and accelerators: 8–24+ weeks depending on market—order ahead of capacity installations.
    • Liquid-cooling components and custom manifolds: 6–12 weeks; keep vendor-approved spares on hand.
  4. Spare parts and hot spares
    • Keep spare pumps, valves, quick-disconnects, and at least one spare RDHx or cold-plate assembly per X racks (define X per your risk tolerance).
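
The arithmetic is simple but worth codifying. Here is a minimal sizing sketch using this guide's pod convention and a headroom buffer; all inputs are illustrative.

    def plan_pods(target_mw: float, kw_per_rack: float, racks_per_pod: int = 10,
                  headroom: float = 0.15) -> dict:
        """Turn a target IT load plus headroom into rack and pod counts."""
        required_kw = target_mw * 1000 * (1 + headroom)
        racks = int(-(-required_kw // kw_per_rack))  # ceiling division
        pods = -(-racks // racks_per_pod)
        return {"racks": racks, "pods": pods,
                "provisioned_mw": pods * racks_per_pod * kw_per_rack / 1000}

    print(plan_pods(target_mw=2.0, kw_per_rack=50))
    # -> {'racks': 46, 'pods': 5, 'provisioned_mw': 2.5}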

Liquid cooling & RDHx: deployment checklist

Liquid cooling choices change the physical layout and operational safety model. This checklist focuses on RDHx and direct-to-chip patterns; a flow-rate sketch follows the list.

  1. Choose the right approach
    • RDHx: lower integration effort, removes a large fraction of rack heat, typical for 20–60 kW/rack.
    • Direct-to-chip (D2C): highest heat-removal efficiency and smallest thermal resistance, required for 80+ kW/rack deployments.
    • Hybrid designs can use RDHx for some racks and D2C for denser nodes—document compatibility.
  2. Hydraulics and plumbing
    • Use industry-standard quick-disconnects rated for data-center use; test for dry-break performance.
    • Document line lengths, pressure drops, and pump curves for each pod. Acceptance test: measured flow within ±10% of design under full load.
  3. Coolant selection and safety
    • Choose coolant compatible with seals and cold plates. For water-glycol blends, specify corrosion inhibitors and conductivity limits.
    • Plan for leak containment: tray routing, drip pans, and leak sensors at every valve and rack manifold.
  4. Thermal commissioning
    • Run heat-load tests using load banks or representative GPU loads. Acceptance test: inlet/outlet temps meet design delta-T at full load.
    • Thermal imaging pass: no hot-spots beyond spec thresholds.

Rack, power distribution and cabling checklist

  1. Rack planning
    • Confirm mechanical capacity (weight and center-of-gravity) for equipment and coolant plumbing.
    • Plan vertical PDUs with redundancy and sufficient breaker ratings. Use hot-aisle/cold-aisle containment where air cooling remains.
  2. Electrical distribution
    • Use per-rack or per-row metering; avoid metering only at the campus level, which hampers chargeback and troubleshooting.
    • Ensure UPS and generator capacity are sized for your MW and test generator failover under load.
  3. Cabling and connectors
    • Plan for high-density fiber and power cabling. Label all connections end-to-end and include patching diagrams in the runbook.

SRE runbook: operationalizing incidents and maintenance

Translate infrastructure behavior into clear, testable runbook actions. Each item should include alert thresholds, first-responder steps, escalation, and postmortem triggers. A threshold-evaluation sketch follows the playbook below.

  1. Monitoring & alerts
    • Monitor inlet/outlet temps, coolant flow, pressure, pump status, breaker loading, per-rack kW, and leak sensors.
    • Set alerts with multi-stage thresholds: warning (5–10% above normal), critical (sustained >15% above normal or falling flow).
  2. Primary incident flows
    • Cooling loss (flow drop): immediate actions — throttle workloads, shift jobs off affected racks, disable affected rack PDUs if needed, and dispatch facilities.
    • Power trip/UPS failover: follow power failover checklist—identify failed component, transfer loads to alternate feed, engage generator if needed.
    • Leak detection: isolate affected manifolds, enable containment, and run sampling tests for fluid compatibility. Evacuate affected racks per safety SOPs.
  3. Maintenance and rolling upgrades
    • Define maintenance windows and a clear rolling plan to avoid simultaneous service-impacting work on redundant systems.
    • Practice failover drills quarterly: power failover, cooling pump swap, and forced throttling tests.
  4. On-call playbooks and runbook snippets
    Alert: Rack X coolant flow < 70% of setpoint
    1) Page the on-call infra SRE
    2) Verify telemetry: flow sensor, pump status, inlet temp
    3) If flow is degraded, throttle GPU jobs on Rack X and live-migrate critical tasks
    4) Contact facilities to inspect the pump manifold; enable the N+1 pump if available
    5) If a leak is detected, power down the rack after migrating live workloads
    6) Open an incident; tag for postmortem if service was degraded > 30 min
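The same thresholds can gate auto-triggered runbook tasks. Here is a minimal sketch of the multi-stage classification above; the setpoint, readings, and paging hook are hypothetical, so wire them to your own telemetry and incident pipeline.

    def classify_flow(measured_lpm: float, setpoint_lpm: float) -> str:
        """Map coolant flow vs. setpoint to the alert stages used in this guide."""
        ratio = measured_lpm / setpoint_lpm
        if ratio < 0.70:
            return "critical"  # playbook above: page, throttle, engage facilities
        if ratio < 0.90:
            return "warning"   # watch the trend; pre-stage the N+1 pump
        return "ok"

    stage = classify_flow(measured_lpm=33.0, setpoint_lpm=50.0)
    if stage == "critical":
        # hypothetical hook into your incident pipeline
        print("PAGE: rack coolant flow < 70% of setpoint; begin workload throttling")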

Testing and acceptance criteria

Before handing over a pod to operations, require the provider and vendors to demonstrate the following (a soak-verification sketch follows the list):

  • Full-load thermal test at the promised kW/rack for at least 24 hours.
  • Power failover test under load (UPS to generator), with failover and recovery times measured.
  • Hydraulic acceptance: flow, leak, and pressure tests; dry-break exercise for quick-disconnects.
  • Network integration: ensure BMS and telemetry endpoints are available to your monitoring stack.
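
Acceptance evidence should be checkable from logs, not slideware. Here is a minimal sketch for verifying the full-load soak, assuming one (per-rack kW, delta-T) sample per minute over the test window; the tolerances are illustrative, not standards.

    def thermal_soak_passed(samples: list[tuple[float, float]],
                            promised_kw: float = 50.0, design_dt_c: float = 10.0,
                            min_hours: float = 24.0,
                            sample_period_min: float = 1.0) -> bool:
        """True if load held near the promised kW and delta-T stayed in spec."""
        hours = len(samples) * sample_period_min / 60
        if hours < min_hours:
            return False  # soak too short to accept
        return all(kw >= promised_kw * 0.98 and dt <= design_dt_c * 1.05
                   for kw, dt in samples)

    # e.g., thermal_soak_passed(readings_from_bms) as part of pod handover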

Colocation contract red flags

Watch for ambiguous language that can cause capacity or cost surprises:

  • "Up to" power promises without guaranteed per-rack allocations.
  • Billing models that charge for headroom or peak usage without a clear cap.
  • Lack of documented upgrade paths to achieve multi-megawatt capacity within a fixed timeline.
  • No clause for fluid compatibility, leak liability, or remediation responsibility.

Operational tips from the field

  • Instrument everything: per-rack kW metering and flow sensors pay for themselves in reduced debugging time.
  • Treat liquid-cooling manifolds as first-class change-controlled assets; require two-person approvals for isolations.
  • Practice capacity planning conversations with procurement 6–12 months before expected expansions—GPU supply chains and civil work lead times are long.
  • Link infrastructure telemetry to your incident management pipeline and create canned runbook tasks that can be auto-triggered for known-failure states.

Further reading and where to apply this guide

Use this checklist as a modular template. For teams building internal platforms, connect acceptance tests to your CI/CD pipelines for infra changes (see how AI supports deployment reliability). For budgeting and cloud vs colo decisions, pair this with pricing models and capacity forecasts (cloud service pricing guide).

Final checklist snapshot

  1. Contract: guaranteed per-rack kW, metering, upgrade paths, leak and coolant clauses
  2. Capacity plan: target kW/rack, pod MW sizing, spares and lead times
  3. Procure: GPUs, RDHx/D2C hardware, pumps, quick-disconnects, spare PDUs
  4. Deploy: mechanical, power, hydraulic tests, thermal commissioning
  5. Runbook: monitoring thresholds, incident flows, on-call playbooks, quarterly drills
  6. Acceptance: full-load test, failover test, leak test, BMS integration

Building AI data centers for ultra-high-density workloads requires coordination across facilities, procurement, and platform teams. Use this checklist to convert power, liquid-cooling, and rack density requirements into measurable procurement items, deployment acceptance tests, and SRE runbook actions. When in doubt, require the vendor to demonstrate the behavior under full load and codify it into your SLA and runbooks.
