Revisiting Bare Metal Server Security in the Age of AI

By: Chase Snyder

July 23, 2025

The adoption of bare metal cloud services for AI workloads has accelerated significantly, driven by performance requirements that virtualized environments struggle to meet. Neocloud AI data center companies like CoreWeave, Crusoe, Lambda, Voltage Park, and Corvex have heavily promoted the performance benefits of running GenAI workloads on bare metal. However, this performance advantage introduces unique security challenges that organizations must carefully evaluate.

Hardware-level attack vectors create persistent vulnerabilities

Bare metal environments face security risks that persist across tenant transitions, fundamentally different from virtualized cloud security models. The most significant threats operate at the hardware level, where traditional security measures prove insufficient.

For example, the Cloudborne attacks discovered by Eclypsium in 2019 exploit firmware vulnerabilities in server management systems. Eclypsium researchers demonstrated how attackers could implant malicious code in Baseboard Management Controller (BMC) firmware through simple bit-flips or additional IPMI user accounts. The critical vulnerability occurs during server reclamation processes between tenants. Cloud providers may fail to reflash BMC firmware when transitioning servers between customers, allowing malicious code to persist and affect subsequent tenants.

The attack was assigned a CVSS score of 9.3 (Critical) because it provides persistent hardware-level access that survives OS reinstallation, hypervisor resets, and standard security measures. This persistence makes Cloudborne particularly dangerous for bare metal environments, where hardware isolation is the primary security boundary.
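The reclamation gap described above can be illustrated with a minimal sketch of a pre-reassignment check. It assumes the provider already has a dumped BMC firmware image and a listing of BMC user accounts; the function names, the sample image bytes, and the account names are all hypothetical and do not correspond to any vendor tool or IPMI API. A check like this complements reflashing from a trusted image rather than replacing it, since a compromised BMC could misreport its own state.

```python
import hashlib


def firmware_matches(image: bytes, known_good_sha256: str) -> bool:
    """Return True if the dumped firmware image hashes to the known-good value."""
    return hashlib.sha256(image).hexdigest() == known_good_sha256.lower()


def unexpected_bmc_users(current_users, baseline_users):
    """Return BMC accounts present now that are not in the provider baseline."""
    return sorted(set(current_users) - set(baseline_users))


def safe_to_reassign(image, known_good_sha256, current_users, baseline_users):
    """Reassign only if both the firmware hash and the account list match the baseline."""
    return firmware_matches(image, known_good_sha256) and not unexpected_bmc_users(
        current_users, baseline_users
    )


# Hypothetical data standing in for a real firmware dump and user listing.
golden_image = b"vendor-bmc-firmware-v1"
golden_hash = hashlib.sha256(golden_image).hexdigest()
tampered_image = golden_image + b"\x01"  # one extra byte changes the hash

print(safe_to_reassign(golden_image, golden_hash, ["admin"], ["admin"]))
print(safe_to_reassign(tampered_image, golden_hash, ["admin"], ["admin"]))
print(unexpected_bmc_users(["admin", "svc_backup"], ["admin"]))
```

The point of the sketch is that both conditions named in the Cloudborne research, modified firmware and extra accounts, are checked before a server returns to the pool.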

More recently, another BMC vulnerability discovered by Eclypsium became the first BMC vulnerability added to CISA’s Known Exploited Vulnerabilities (KEV) catalog. CVE-2024-54085, disclosed by Eclypsium in March 2025, is a critical (CVSS 10.0), remotely exploitable authentication bypass affecting AMI MegaRAC BMC firmware, which is widely deployed in data centers and likely in AI infrastructure worldwide. BMCs and other low-level hardware are appealing targets for

Also in 2025, the GPUHammer research demonstrated that RowHammer attacks can be mounted against GPU memory, specifically NVIDIA GPUs with GDDR6 memory. Researchers induced bit-flips in NVIDIA A6000 GPUs that can reduce AI model accuracy from 80% to less than 1% through strategic corruption of floating-point model weights. The attack overcomes GDDR6 memory protections through parallel hammering techniques and bypasses existing mitigations like Target Row Refresh.
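To see why a single flipped bit can be so damaging to a model weight, the following sketch flips individual bits in the IEEE 754 encoding of a number. It uses 32-bit floats purely as an illustrative assumption; the weight value and the `flip_bit` helper are hypothetical, and nothing here touches GPU memory or reproduces the attack itself.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) in the float32 encoding of value."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5  # hypothetical model weight

# Flipping the lowest mantissa bit barely changes the value.
print(flip_bit(weight, 0))

# Flipping the most significant exponent bit turns 0.5 into 2**127.
print(flip_bit(weight, 30))
```

A flip in the mantissa is noise, while a flip in the top exponent bit changes the weight by dozens of orders of magnitude, which is why a small number of well-placed flips can degrade a model so sharply.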

NVIDIA’s recommendation for mitigating GPUHammer risk was to turn on System-level Error Correction Codes (ECC), but this can introduce performance overhead for workloads running on affected GPUs.

These examples represent a broader trend of vulnerabilities and attacker activity affecting low-level compute infrastructure, including hardware and firmware, which tends to be more difficult to monitor and secure than the operating systems and applications running on top of it.

The tug of war between performance and security isn’t new, but the increased desirability of bare metal for GenAI workloads reopens a front that had previously receded into the background.

Case Study: Learn How Eclypsium Helps One Neocloud Secure Bare Metal Between Tenants

Performance advantages may justify security trade-offs for specific workloads

The performance difference between bare metal and virtualized environments varies significantly by workload type. Modern virtualization has largely eliminated performance gaps for general computing tasks, but AI workloads still benefit substantially from bare metal deployment.

VMware’s research shows that virtualized environments can achieve 94-105% of bare metal performance in MLPerf benchmarks using advanced GPU virtualization techniques. However, these results require sophisticated configuration and may not reflect real-world deployment scenarios.

For large language model training, bare metal environments provide measurable advantages including faster distributed training and improved throughput with custom CUDA kernels. For real-time inference applications, bare metal infrastructure can achieve sub-10ms latency compared to higher latency in traditional virtualized cloud environments.

The networking performance differential proves particularly significant for distributed AI workloads. Bare metal environments typically provide dedicated high-bandwidth network connections, while virtualized environments often involve shared networking resources that can introduce bottlenecks during peak usage periods.

Any enterprise deploying in-house AI infrastructure, or using a cloud provider for GenAI workloads, should examine the tradeoffs between performance and security when choosing between bare metal and virtualized options.

Multi-tenancy security requires comprehensive isolation mechanisms

Bare metal environments eliminate hypervisor-based isolation, creating both security benefits and challenges. While this removes hypervisor attack vectors, it places greater responsibility on physical and logical isolation mechanisms.

Leading bare metal providers implement hardware-level isolation through single-tenant dedicated hardware with comprehensive physical separation. This includes network isolation through VLAN/VRF configurations, storage isolation with tenant-specific volumes, and dedicated infrastructure components.

The ClusterMAX rating system developed by SemiAnalysis evaluates isolation effectiveness across providers using 9 categories with over 50 requirements. This framework provides organizations with objective criteria for assessing bare metal security implementations.

Firmware security is a complex challenge in bare metal isolation

Enterprise deployment of bare metal cloud services demands security frameworks specifically designed for hardware-level threats. The shared responsibility model shifts greater security obligations to enterprises compared to virtualized environments.

Hardware-level rootkits can persist across tenant transitions if providers fail to implement comprehensive firmware sanitization procedures. Effective mitigation requires hardware attestation capabilities, secure boot implementation, and continuous firmware monitoring systems.

Hardware attestation forms the foundation of bare metal security architecture. This requires TPM 2.0 implementation, secure boot configuration, and continuous firmware security monitoring capabilities. Organizations must implement measured boot processes that record component measurements in Platform Configuration Registers, enabling remote attestation and hardware integrity verification.
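The way measured boot records component measurements in a Platform Configuration Register (PCR) can be modeled as a hash chain: each new value is the hash of the previous value concatenated with the digest of the next component. The sketch below simulates that extend operation in software for a SHA-256 PCR bank. The component names are hypothetical, the all-zero starting value is an assumption about the register's reset state, and a real deployment would verify a TPM-signed quote rather than recompute hashes locally.

```python
import hashlib

PCR_SIZE = 32  # bytes in a SHA-256 bank PCR


def extend(pcr: bytes, component: bytes) -> bytes:
    """Model a PCR extend: new = SHA-256(old || SHA-256(component))."""
    return hashlib.sha256(pcr + hashlib.sha256(component).digest()).digest()


def replay(components) -> bytes:
    """Replay an ordered boot chain, starting from an all-zero PCR."""
    pcr = bytes(PCR_SIZE)
    for component in components:
        pcr = extend(pcr, component)
    return pcr


# Hypothetical boot chains: a known-good one and one with a modified bootloader.
expected = replay([b"uefi-firmware-v1", b"bootloader-v1", b"kernel-v1"])
observed = replay([b"uefi-firmware-v1", b"bootloader-TAMPERED", b"kernel-v1"])

print(expected.hex())
print(observed == expected)
```

Because each value depends on everything measured before it, a change to any component, or to the order of components, yields a different final PCR value that a remote verifier can detect.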

Compliance frameworks including SOC 2 Type II, ISO 27001, and FedRAMP require adapted controls for hardware-level security in bare metal environments. Standard cloud security frameworks must be enhanced to address firmware security, hardware attestation, and incident response procedures specific to bare metal infrastructure.

Incident response in bare metal environments presents unique challenges due to limited access to hypervisor logs and the ephemeral nature of cloud infrastructure. Organizations require comprehensive logging strategies, cloud-native forensics capabilities, and established vendor communication protocols for effective incident management.

Hyperscaler cloud providers implement different security approaches

The three major cloud providers have developed distinct approaches to bare metal security, reflecting different architectural philosophies and market positioning strategies. Additionally, the proliferation of hardware and firmware combinations from the numerous vendors that make up the GenAI stack at any cloud provider makes it exceedingly difficult to identify foundational infrastructure risks.

AWS offers EC2 bare metal instances through their Nitro System, providing hardware-level security isolation through custom Nitro cards and dedicated hardware allocation. This approach integrates bare metal capabilities with existing AWS security services while maintaining hardware-level isolation guarantees.

Google Cloud provides Bare Metal Solution (BMS) with emphasis on sole-tenant nodes and regional extensions with high-performance connections to Google Cloud services. Their approach includes Titan security chips for hardware root of trust and hypervisor control capabilities.

Azure’s BareMetal Infrastructure provides single-tenant physical servers with extensive enterprise integration, offering over 30 SKUs ranging from 2-socket to 24-socket servers. Microsoft’s approach emphasizes custom silicon development including Maia AI accelerators and Cobalt CPUs for purpose-built AI infrastructure.

The sheer variety in hardware and security approaches across hyperscalers, neoclouds, and in-house GenAI infrastructure increases the risk of vulnerabilities and misconfigurations being introduced through the supply chain, or even while the hardware is in production. Continuous firmware and hardware inventory and integrity monitoring is increasingly important for assuring the reliability of AI models, training data, and inference outputs.

Webinar: GenAI Compute Infrastructure Under Siege: Defending the Foundation of AI

Strategic considerations for enterprise adoption

Organizations evaluating bare metal cloud services should implement comprehensive risk assessment frameworks that consider workload characteristics, security requirements, and operational complexity. Performance-critical AI workloads with predictable resource requirements often benefit most from bare metal deployment.

Security architecture must address hardware-level threats through comprehensive controls including firmware integrity verification, hardware attestation, continuous monitoring capabilities, and specialized incident response procedures. Organizations should prioritize providers with established security certifications, proven isolation mechanisms, and transparent security practices.

Hybrid deployment strategies often provide the best balance between performance and risk management. Organizations can use bare metal infrastructure for performance-critical production workloads while maintaining virtualized environments for development, testing, and variable workloads. This approach captures the performance benefits while limiting security exposure and preserving operational flexibility.

Market analysts note that the AI infrastructure market continues to experience significant growth, with demand often exceeding original projections despite various market uncertainties. ABI Research forecasts that neocloud providers will generate over $65 billion in GPU-as-a-Service revenues by 2030, indicating sustained market demand for specialized AI infrastructure.

Take Responsibility For Your GenAI Workload Security

Bare metal cloud services provide compelling performance advantages for AI workloads, particularly for large-scale training and latency-sensitive inference applications. However, these benefits require careful evaluation against significant security trade-offs including hardware persistence attacks, firmware vulnerabilities, and complex isolation challenges.

Furthermore, AI hardware in general is earlier in its security maturity journey than general-purpose compute hardware. The RowHammer attacks that were disclosed years ago on general compute hardware were only recently applied to GPUs. We predict many more such cases. AI hardware is different enough to have its own set of security challenges, but not different enough to dodge all the security risks faced by previous generations of computer hardware.

To take advantage of the bare metal performance boost for GenAI workloads, organizations must understand the specific risks involved, implement appropriate security measures, and maintain specialized expertise to successfully deploy bare metal infrastructure. Success depends on matching deployment strategies to specific workload requirements while maintaining comprehensive security practices throughout the infrastructure lifecycle.

Download the Eclypsium Ultimate Guide to AI Data Center Security for help evaluating the security practices, performance tradeoffs, and other critical considerations in securing GenAI workloads across private and public clouds.
