
BTS #52 - Securing the Future of AI Infrastructure
In this episode, the hosts discuss the evolving landscape of AI infrastructure security, focusing on the complexities of building and maintaining AI data centers. They explore the role of Baseboard Management Controllers (BMCs) as a critical attack surface, the importance of supply chain security, and best practices for hardware procurement. The conversation underscores why organizations must validate hardware and firmware integrity, and examines the significant security risks associated with AI workloads. As AI data centers continue to grow, understanding these challenges and implementing robust security measures will be essential.
Below the Surface Episode 52: AI Data Center Security and BMC Vulnerabilities
Host: Paul Asadoorian
Guests: Chase Snyder, Wes Dobry
Technical Setup and Introduction
Paul Asadoorian: This week, my coworkers Wes and Chase joined me to discuss how to protect AI data centers and BMCs, that soft underbelly of your IT infrastructure. Stay tuned—Below the Surface, coming up next.
Welcome to Below the Surface. It’s episode number 52 being recorded on Wednesday, June 18th, 2025. I’m Paul Asadoorian, joined by Chase Snyder. Chase, welcome. And Wes, you’ve decided to hang out with us. I thank you for that. You’re always welcome. Wes Dobry is here with us. Wes, welcome.
Chase Snyder: Hey guys.
Wes Dobry: Thank you, thank you, joining you from Florida there.
Paul: Just a quick announcement before we get into it—Below the Surface listeners can learn more about Eclypsium by visiting eclypsium.com/go. You’ll find stuff there like the Ultimate Guide to Supply Chain Security and an on-demand webinar I presented called “Unraveling Digital Supply Chain Threats and Risk,” a paper on the relationship between ransomware and the supply chain, and a customer case study with DigitalOcean.
The Rise of AI Infrastructure
Discussion about why AI infrastructure has become a distinct category and the economics driving its adoption.
Paul: Let’s talk about AI infrastructure security. I think in a previous episode I had Josh Marpet on talking about why AI infrastructure is a thing. Maybe we could just kind of recap that for our audience.
Wes: Computing is a very cyclical process where things move from a core, monolithic architecture out to the edge and then generally come back. A lot of that follows technology adoption curves: new infrastructure is incredibly expensive at first, so only players with large amounts of capital can actually make those large infrastructure purchases.
As AI has come to the forefront, purpose-built data centers for AI infrastructure have started to centralize into neoclouds and organizations designed specifically for that. The result is that this technology, while it's flowing out to the market, still tends to be expensive and specialized, which has organizations asking: do we have the capability, specialization, and interest to do this at scale for ourselves?
Paul: You said that so eloquently, Wes. I was just going to pitch it as: you moved a bunch of stuff to the cloud, so you have room in your data center. And to do the AI workloads in the cloud is super expensive, so you’re putting it back in those empty racks in the data center. You’re now getting filled back up with some of your own infrastructure, right?
Chase: Much like the cloud providers initially did a really awesome job of selling, “Hey, you should move your stuff into the cloud. It’ll make it so much easier.” It’s repeating itself with the AI data center stuff where it’s genuinely complicated to build out AI infrastructure internally. Even just to buy the GPU clusters and deploy them, it’s different. It super prioritizes performance. You’re working with bare metal. It’s complicated in the same way that running your own data center is complicated, except more so in various ways.
Economic Drivers and Technology Turnover
Paul: But what I think forced it was the large cloud providers putting a hefty price tag on running these GPU workloads, because largely this stuff runs on specialized GPUs built for AI. Amazon recently said they're slashing prices — I think they realized their market share might be shrinking because they had basically priced themselves out of the market.
Wes: The whole AI workload conversation is really just a conversation about efficiency and converting power to tokens. That’s what all of this really comes down to. One of the primary concerns is when you look at it from the lifecycle of something like a GPU—it’s very specialized equipment. That in itself is causing an adoption curve that’s accelerated and ultimately a shortened lifecycle.
When I talk to AI providers out there, most of the time they’re saying, “In a year from now, this is no longer going to be the efficient system to do this.”
Paul: So there’s a huge turnover now because as the technology gets better, it’s worth the investment to go to the newer technology if I can put the same workload and get it done in half the time with the latest hardware.
Wes: Yeah, and we’re actually going to start seeing quite a bit of fracturing between training and inference as well. When we look at the training side, you’re building the models with just hard compute. You’re looking at completely different infrastructure than you would for inference. Now we have things like GROKs and TPUs that are very good at the inference side—they can just turn out tokens like crazy.
Understanding AI Hardware and Performance Metrics
Technical explanation of AI hardware components and performance measurements.
Paul: Explain to me what a TPU is and a token in this context, Wes.
Wes: I'll summarize tokens as the model's thought process and output. When you go to ChatGPT and you say, "Define the color blue," every word that it spits out is effectively a token. So increasing that performance gives you the opportunity to spit out more tokens concurrently within a set of infrastructure.
Compare the hardware available today to the hardware available a year ago: one setup may run at three to four tokens per second, whereas the latest bleeding-edge hardware, at ten times the price, may give you 10 to 15 tokens per second. All of these are abstract numbers that could vary widely depending on the infrastructure, architecture, and models.
The core of this is that you are effectively using this hardware to take power and generate output and heat. That’s why we’re seeing them pop up in locations like North Dakota and Wyoming. I think you’re going to see Canada become an AI hotbed in the next couple of years with hydropower.
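To make the power-to-tokens framing concrete, here is a back-of-envelope sketch. All of the numbers (power draw, electricity price, throughput) are hypothetical, in line with Wes's caveat that real figures vary widely:

```python
# Back-of-envelope economics of "converting power to tokens".
# Every number here is illustrative, not a vendor benchmark.

def cost_per_million_tokens(power_kw: float, price_per_kwh: float,
                            tokens_per_second: float) -> float:
    """Electricity cost (USD) to generate one million tokens."""
    seconds = 1_000_000 / tokens_per_second
    kwh = power_kw * seconds / 3600
    return kwh * price_per_kwh

# Hypothetical older hardware: 4 tokens/s at 0.7 kW.
# Hypothetical newer hardware: 15 tokens/s at 1.0 kW.
old = cost_per_million_tokens(0.7, 0.10, 4)
new = cost_per_million_tokens(1.0, 0.10, 15)
```

Even though the newer setup draws more power, its higher throughput drives the cost per token down — which is why the turnover Wes describes is worth the capital expense.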
Training vs. Inference Infrastructure
Paul: Right, it’s more costly compute-wise to train the model than it is when you’re running the inference aspect of it.
Wes: Absolutely. The hardware to do the training generally requires things like more memory so that it can do larger model sets and training, then paring down or customizing those models. That’s where you’re seeing things like Nvidia’s GB200s coming into play for doing these huge scalable datasets, or the open compute style architectures that Facebook and Meta are using for doing their model training.
Security Threats to AI Data Centers
Discussion of unique security challenges facing AI infrastructure, including performance-based attacks.
Chase: I feel like there was a huge time period where people’s personal computers were getting roped into Monero mining botnets. If a hacker broke into a data center, one of these AI data centers somehow, and just carved off a little bit of the capacity to do some crypto mining—Wes, you’re talking about these organizations having like six months to a year to squeeze profit out of these GPU servers before they’re essentially obsolete.
Think about the impact a security incident could have on an AI data center just by affecting performance. If you exact a one or two percent performance hit across a whole AI data center — because you break in there and start mining Monero with their GB200s — that could tank the entire operation, because the per-hour price of GPU capacity is so competitive right now.
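Chase's margin math can be sketched quickly. The revenue and cost figures below are hypothetical, but they show how a small stolen slice of capacity compounds against thin margins:

```python
# Sketch: how a small stolen slice of GPU capacity erodes thin margins.
# Revenue and cost figures are hypothetical.

def margin_after_theft(revenue_per_hour: float, cost_per_hour: float,
                       stolen_fraction: float) -> float:
    """Profit margin when a fraction of sellable capacity is siphoned off."""
    effective_revenue = revenue_per_hour * (1 - stolen_fraction)
    return (effective_revenue - cost_per_hour) / effective_revenue

base = margin_after_theft(2.00, 1.90, 0.00)  # healthy case: 5% margin
hit = margin_after_theft(2.00, 1.90, 0.02)   # 2% of capacity cryptojacked
```

With these illustrative numbers, a 2% capacity theft cuts the margin from 5% to roughly 3% — almost 40% of the profit gone.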
Paul: It’s interesting to think about threat actors going after some of this infrastructure because of the high compute capacity. I tell the story a lot on the podcast about when I was at the university—someone had broken into a Cray supercomputer. The only reason they knew someone broke in is because they patched the supercomputer to close out the flaws that they had used to maintain their access. But it could be a target for threat actors just based on their computing capabilities.
Chase: Yeah, well-known attacker goal is to just steal capacity.
Firmware Security Challenges in GPUs
Wes: I just recently bought a new server at my house that I’m doing some GPU playing around with. One of the things I found was for me to actually unlock all the functionality in the GPU that I purchased, I actually have to use third-party firmware on it. They took Nvidia’s firmware and then someone out there has modified it to unlock additional features so they can actually use its full capacity.
That drives me crazy, but now I have to go out and download firmware from a less-than-good site probably, and then throw it on my piece of hardware. That gives me questions about anything that would ever run on that GPU after that.
Paul: I think a lot of folks don't necessarily focus on, or even know, that these GPUs run multiple different types of firmware for multiple different subsystems. A GPU is essentially a computer inside your computer. If you've ever installed a huge graphics card, you know there are tons of power cables going to it, and you need the little support brace because, with gravity working against you, it's so heavy it would snap the PCIe connector.
These are crazy and they have multiple firmware components. Our product does some enumeration of this firmware, which is how I was able to see there are certain subsystems that use different chips with different firmware. Because again, like your motherboard with tons of chips and firmware, your GPU is basically just a multiplier now on top of that.
Wes: When you look at Nvidia, they actually can allow you to do measurements and attestation of the GPU and the code running on it, as well as verification of that versus the driver within the operating system or kernel. It’s actually quite amazing—they built a framework where it’s very similar to what we’re used to on x86 devices with secure boot, where you want to be monitoring the integrity of the firmware and the driver.
You’re now building these layers upon layers to actually validate the integrity that when you are doing inference or training, you actually can trust everything bottom to top.
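The "layers upon layers" idea resembles a TPM-style measurement chain, where each layer's hash is extended into a running digest so a change anywhere alters the final value. This is a conceptual sketch, not NVIDIA's actual attestation protocol:

```python
# Conceptual sketch of a measurement chain: each component's hash is
# "extended" into a running digest, TPM-PCR style, so tampering with any
# layer (firmware, driver, model) changes the final measurement.
import hashlib

def extend(pcr: bytes, component: bytes) -> bytes:
    """PCR-style extend: new = SHA-256(old || SHA-256(component))."""
    digest = hashlib.sha256(component).digest()
    return hashlib.sha256(pcr + digest).digest()

def measure_stack(layers: list[bytes]) -> bytes:
    pcr = b"\x00" * 32  # start from an all-zero register
    for layer in layers:
        pcr = extend(pcr, layer)
    return pcr

# Layer names are illustrative placeholders.
good = measure_stack([b"gpu-firmware-v1", b"kernel-driver-550", b"model-runtime"])
bad = measure_stack([b"gpu-firmware-TAMPERED", b"kernel-driver-550", b"model-runtime"])
```

A verifier that knows the expected final digest can detect a change at any layer without inspecting each one individually — that is the "bottom to top" trust Wes describes.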
Bare Metal Cloud Security Risks
Analysis of security challenges when AI infrastructure is provided as a service through bare metal cloud offerings.
Chase: This feels like a good moment to segue into the potential challenges involved when an AI data center — one of the big neoclouds — is renting out capacity to its customers. Because of the performance demands, the customer needs bare metal, so they're getting pretty deep access to hardware inside the AI data center. They can do whatever they want on there contractually, and then the neocloud AI provider is going to take that hardware back when they're done and rent it out to someone else.
Put on your black hat right now—if you’re a nefarious actor and you want to rent out some capacity from an AI data center, bare metal style, what would you do if you were trying to introduce risk?
Paul: Well, it’s a nice attack because it carries over to multiple customers. So I just got to get some code running on one GPU. Every time that gets recycled to a new customer, now potentially I have access to their data.
Wes: There’s absolutely that attack vector. When someone’s provisioning a bare metal system to get the highest level of performance, you’re basically giving access directly to hardware APIs. When you get that level of access, anything that you could do around updating firmware or settings is also acceptable for not only read but write in most cases.
If I were an attack researcher, the first thing I would try to do is embed myself either in the host operating system or in the GPU. The other thing I would do is probe for other potential attack vectors, like looking for BMCs (Baseboard Management Controllers). If I have an IPMI driver available to me, the first thing I'm going to do is look at that and see if I can get any data out of the host system's infrastructure itself.
If I do manage to get into that BMC, now I can implant in there. Now I not only own the physical hardware, I can get anything out of the operating system, anything out of the hardware. I can also use that to move laterally within a management segment to other hosts within the data center too.
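From the defender's side, one cheap check a bare-metal tenant (or the provider) can run is whether BMC management services are even reachable from the host network. The host and port list below are illustrative; note that IPMI's RMCP protocol runs over UDP 623 and is not covered by this TCP probe:

```python
# Defensive sketch: check whether common TCP-based BMC management
# services are reachable from a given vantage point. Reachable
# management interfaces from a tenant OS are a red flag.
# Ports are typical defaults and may differ per vendor.
import socket

BMC_TCP_PORTS = {443: "Redfish/HTTPS", 5900: "KVM-over-IP", 22: "BMC SSH"}

def reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def audit_bmc_exposure(host: str) -> dict[str, bool]:
    """Map service name -> reachability for the common BMC TCP ports."""
    return {name: reachable(host, port) for port, name in BMC_TCP_PORTS.items()}
```

In a mature deployment, every one of these should come back unreachable from the tenant-facing network; the BMCs belong on an isolated, monitored management segment.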
BMC: The Soft Underbelly of IT Infrastructure
Detailed discussion about Baseboard Management Controllers as critical attack vectors in data centers.
Paul: And that’s super dangerous. It’s always been that way since we’ve had BMCs—if you put all your BMCs on your management network, it only takes one and then they can gain access to all of it. I remember years ago having conversations with a security researcher who did security research on BMCs, saying the exact same thing.
We are actually publishing a paper today that we produced in conjunction with AMI to analyze this space. Long story short: we gained insight into how AMI manages their BMC code, and they're doing an amazing job. Then you look at how other people are doing it, and it's just not as good, because they're basing it largely off OpenBMC and implementing their own APIs and code based on the Redfish specification.
If you look at our previous research we’ve published at Eclypsium, we have three different rounds of vulnerabilities that primarily our researcher Vlad has spearheaded. So this is a ripe attack surface.
Cloudborne: Real-World BMC Attack Precedent
Chase: There is extremely real precedent for this too. Before I worked at Eclypsium, back in 2019, the research team here discovered an attack vector they named Cloudborne, which is this exact scenario: bare metal cloud infrastructure getting leased out. They described it as a problem with the reclamation process, where the firmware was not reflashed between customers.
So the cloud provider reclaims the hardware but doesn't reflash the firmware. In the Cloudborne scenario, a vulnerability in Supermicro BMCs was used to overwrite the firmware of the baseboard management controller, allowing the adversary to implant a backdoor. That was a real one that happened in 2019, before the major neocloud AI data center buildout really took off.
BMC Architecture and Attack Surface
Paul: I described it when I was on another podcast: a BMC is very simple at a high level. You like computers so much, I put a computer inside your computer. Your BMC is typically an ARM-based computer that runs Linux inside your server. Whatever operating system you run on the host, there is basically a Linux-on-ARM server sitting inside your computer — and that's a ripe attack surface, because it's just another Linux box.
Chase: I’m looking at the Nvidia docs page right now and it says Nvidia BMC is based on the OpenBMC open software framework which builds a complete Linux image for a board management controller. You got a whole Linux system here.
Paul: Yeah, and OpenBMC isn’t bad. So I’ll give you some insights from the paper that we analyzed. OpenBMC is great. They use the Yocto build system, and they do a good job of trying to keep their base images up to date. The build system is actually great—I used it, I built an image, it worked.
It didn’t scan horribly, but you can customize that however you like. The moment you bring in a proprietary component, now you’re stuck on a certain Linux kernel version and that software starts getting crusty. So it depends on how well-maintained your OpenBMC implementation is.
Attestation and Transparency Challenges
Discussion about the lack of verification capabilities for BMC firmware and the implications for security.
Wes: I'm going to get on my soapbox about transparency here. When we look at all the BMCs out there — whether you're using Supermicro BMCs, iLOs, iDRACs, or others — none of them give you a capability to do attestation on the device, or provide any sort of reference integrity manifest, or even use newer technologies like SPDM, where I can verify that the BMC is what it says it is.
On some of the older ASPEED stuff, there was actually a way of extracting it so that you could do independent integrity verification. Nowadays, they’ve all built it so that it’s kind of a one-way API. You can write a BMC update through the BMC APIs, but you can’t verify it in any way, shape, or form.
If I take a BMC and I grab a copy of OpenBMC and modify it slightly to have some kind of implant and throw it on a device in a data center where they may not be as mature as others, do you think they’re going to notice that I’ve now put a compromised OpenBMC image inside of their environment when they don’t have the tools to do it?
Paul: So many places I want to go from here, Wes, because what you’re reminding me of—I believe Supermicro had a vulnerability (and they fixed it) that allowed an attacker to bypass the signature verification of the image on the BMC. Supermicro is not alone in this. That means what Wes said: I can put my own customized backdoored firmware inside of your server.
Also, it is super frustrating: with the SPI flash that holds UEFI, in most cases we can take an image of it through software — basically a dd copy of the storage that holds your firmware. In the case of BMCs, the technology is designed and implemented such that we can't go to the SPI flash chip that holds the BMC image, pull it off, and analyze it.
So when I say we analyze firmware for the AMI paper, we analyze the firmware before it goes on the chip. There are differences—what’s running on your BMC could be different from the image that was installed.
Wes: This will always echo my call for transparency from the OEMs and ODMs. If they just gave me something as simple as a PCR or a hash of what that firmware is so that I could independently verify it and validate it and track it for change, that’s all I care about.
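The verification Wes is asking for amounts to something like this: hash the firmware image you can extract and compare it against a vendor-published reference. The manifest format here is hypothetical — today most BMC vendors publish no such reference hashes, which is exactly his complaint:

```python
# Sketch of independent firmware verification: compare an extracted
# image's SHA-256 against a vendor-published reference manifest.
# The JSON manifest format here is a hypothetical stand-in; no such
# standard vendor artifact exists today for most BMCs.
import hashlib
import json

def verify_firmware(image: bytes, manifest_json: str, component: str) -> bool:
    """True if the image's SHA-256 matches the manifest entry for component."""
    manifest = json.loads(manifest_json)
    return manifest.get(component) == hashlib.sha256(image).hexdigest()

# Build a toy "vendor manifest" for demonstration.
image = b"\x55\xaa" + b"toy BMC firmware image"
manifest = json.dumps({"bmc": hashlib.sha256(image).hexdigest()})
```

With a published hash like this, any defender could track firmware for change over time — the "PCR or a hash" Wes says is all he cares about.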
Industry Response and Research Efforts
Discussion of recent industry initiatives to address BMC security challenges.
Chase: This is a topic that has been getting attention. Nvidia published this great article recently called “Analyzing Baseboard Management Controllers to Secure Data Center Infrastructure.” They’re connecting the same dots that we are—that BMCs are an attack surface for these AI data centers.
Their offensive security research team analyzed BMC firmware used in data center environments, identified 18 vulnerabilities, and developed nine working exploits, and they published a whole paper about this.
Paul: It’s an awesome paper. I recognize the name Alex Tereshkin—he has published numerous research articles and discovered vulnerabilities, so much so that I recognize his name in this context of UEFI and BMCs. They brought their A-Team to look at the BMCs, and they did a great job.
Chase: There’s this whole global narrative happening where tons of new AI data centers are getting built. AI is getting treated as critical infrastructure in a more general sense—it’s getting used for defense applications and power grid management. It is critical infrastructure and it’s getting built at breakneck pace all over the world.
When you’re critical infrastructure like that, you have a target on your back. The most sophisticated nation-state APTs are increasingly incentivized to target these. We’ve already seen a huge swath of APT groups targeting telecommunications worldwide using sophisticated techniques and targeting the IT supply chain specifically using software vulnerabilities and hardware supply chain vulnerabilities.
Supply Chain Security and Validation
Analysis of supply chain risks and validation requirements for AI data center infrastructure.
Wes: What you just said goes into the supply chain security aspect of things. That’s twofold. One is software supply chain—validating vulnerabilities and integrity and threats of software coming into my infrastructure. But the hardware supply chain is another critical piece, especially in these scenarios where we’re talking about huge data center buildouts with huge amounts of infrastructure coming into it.
Building a program to do the proper amounts of validation of that infrastructure—I have some customers that go all the way to having a pipeline that X-rays equipment as it comes in to verify against known good chip layouts, all the way to others that are using our technology to validate firmware, and others that are using things like SPDM and DICE and building a C-SCRM (Cyber Supply Chain Risk Management) program using SBOMs for software and HBOMs for hardware and my favorite, FBOMs for firmware.
We’re now at a point where if organizations are not looking at every layer of the stack from every attack vector, that attack vector that they’re not looking at will be the area that APTs concentrate on. If you don’t protect your supply chain, that could be something like dropping USB keys in the parking lot to intercepting things from value-added resellers and strategic integrators to implant something in your organization.
Paul: It’s interesting to think about threat actors going, “If only I could put a small computer inside of all the servers so that I could maintain access, and if that little computer had access to all of the hardware”—and it’s like, wait, that’s already built in. And it has a legitimate use case for it. But if it falls into the wrong hands, then it’s game over.
Wes: And not only is it already built in, it has a driver to directly interface with the running operating system. It has hardware interfaces to communicate on SPI to the processor. And it also has direct memory access in most cases too, where it can directly exfiltrate information if you’re talented enough.
Building Comprehensive Security for AI Infrastructure
Recommendations for implementing security validation and monitoring for AI data center deployments.
Chase: If you guys were going to build out the attestation and security validation from supply chain—from receiving hardware onwards—for an AI data center or any high-performance computing data center where you’re onboarding hardware into a situation where it’s going to handle sensitive data and you don’t want it to be compromised, what are the basic requirements that you would put in place?
Wes: The first one includes Eclypsium helping on the hardware validation, firmware validation on most of the components in the system, and then also the runtime and operational monitoring of that device for change, risk, vulnerability, and those types of things.
If I wasn’t using Eclypsium and I had a Linux-based architecture—and most of this stuff these days is built upon open source and Linux—Nvidia has released an attestation suite of products that validates the measurements and reference integrity of the GPUs and the code running on them, as well as the driver to validate that layer.
Below that layer, you're stuck with things like TPM-based measurements in most situations. You can use IMA (Integrity Measurement Architecture), which is maintained in the mainline Linux kernel. You can also use things like LKIM, a Linux kernel integrity measurer that allows you to validate that your kernel matches what you're expecting.
A lot of this gets to a very high level of maintenance because unless you are maintaining exactly what you know your kernel to be that you’re rolling out and not doing anything like DKMS (Dynamic Kernel Module Support), you’ve got a scenario where you’re having to understand exactly every piece in your infrastructure.
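As one concrete piece of that maintenance burden, IMA exposes its measurement log at /sys/kernel/security/ima/ascii_runtime_measurements. A minimal parser for the common ima-ng template might look like this (the sample line below is synthetic, not real measurement data):

```python
# Sketch: parse one entry from IMA's ascii_runtime_measurements log in
# the common "ima-ng" template:
#   <pcr> <template-hash> <template-name> <algo>:<file-hash> <path>
# The sample line is synthetic.

def parse_ima_ng(line: str) -> dict:
    # maxsplit=4 keeps paths containing spaces intact in the final field.
    pcr, template_hash, template, filedata_hash, path = line.split(maxsplit=4)
    algo, file_hash = filedata_hash.split(":", 1)
    return {"pcr": int(pcr), "template": template,
            "algo": algo, "hash": file_hash, "path": path}

sample = "10 deadbeefcafe ima-ng sha256:0123abcd /usr/bin/python3"
entry = parse_ima_ng(sample)
```

Comparing the parsed hashes against a known-good allowlist is where the real work — and the maintenance Wes warns about — begins.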
Practical Security Implementation
Paul: When you’re deploying new infrastructure such as for AI research, what do I have available to me to do the attestation activities that Wes was mentioning? Whether that be secure boot, whether that be some other attestation, all the way down to your Linux kernel and kernel modules and drivers. You should definitely be doing that because all of that is going to help you have a more resilient system.
Then there’s the management end of it—locking down your management interfaces, monitoring your management interfaces. It varies per manufacturer in terms of authentication to your BMCs. Can you enable multi-factor authentication? How is the API secured? It varies per vendor, but those need to be discussed and have a proper implementation as well.
Wes: Many of them will leverage an external authentication mechanism. I typically see things like LDAP being leveraged. If you can use LDAP or OAuth, you move that MFA outside the BMC. The BMC itself is still going to be basic, and many of them will have hard-coded credentials that may be able to be derived if you have access to the hardware.
When you buy an HP server, you get a little card that you pull out of the server with your iLO password on it. If you don't change that password and I'm an attacker with access to the system, I'm pretty confident I could recover that password. Then I always have a persistent access mechanism with full admin on the BMC.
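A basic hygiene check along these lines is to sweep your BMC inventory for factory defaults and weak passwords. The default list below covers a couple of well-known examples (Supermicro's legacy ADMIN/ADMIN, Dell iDRAC's root/calvin) but is illustrative, not exhaustive — and a real audit should pull credentials from a vault, not keep plaintext passwords in an inventory file:

```python
# Sketch: flag BMCs still using factory default or weak credentials.
# KNOWN_DEFAULTS is illustrative, not exhaustive; a real audit should
# source credentials from a secrets vault, never a plaintext inventory.

KNOWN_DEFAULTS = {("ADMIN", "ADMIN"), ("root", "calvin"), ("admin", "admin")}

def credential_findings(bmc_inventory: list[dict]) -> list[str]:
    """Return one finding string per BMC with default or weak credentials."""
    findings = []
    for bmc in bmc_inventory:
        if (bmc["user"], bmc["password"]) in KNOWN_DEFAULTS:
            findings.append(f"{bmc['host']}: factory default credentials")
        elif len(bmc["password"]) < 12:
            findings.append(f"{bmc['host']}: weak password")
    return findings
```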
Monitoring and Incident Response Challenges
Discussion about the limitations of traditional security tools when dealing with BMC environments.
Paul: The BMCs that run Linux inside your computers don't give you the capability to do forensics or incident response, or to put EDR software on that Linux subsystem for monitoring — much like VPN appliances and network security appliances with their locked-down environments. So it's next to impossible to do any kind of monitoring.
There’s this open source Linux subsystem that no one’s able to monitor on their system. So that may be where threat actors want to live, much like VPN and security appliances.
Wes: You hit the nail on the head. BMC architecture in general is a black box. We are getting to a point where we’re getting some management interfaces like SSH to a BMC, where you can then start doing some active real-time monitoring for IOCs (Indicators of Compromise), as well as getting some high-level logging from them.
Simply put, having BMCs send audit logs or events and alerts out to centralized SOC solutions like SIEMs is a base level of what needs to be accomplished to monitor for indicators of compromise and indicators of attack. I think we’re going to see an expansion of the toolset capabilities.
I can tell you I absolutely have a project where I'm looking at instrumenting BMCs in real time so that we can actively monitor them. We might see more tools like that in the future.
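The base level Wes describes — getting BMC events into a SIEM — might look like normalizing Redfish-style log entries into CEF lines. The field mapping below is an assumption modeled loosely on Redfish LogEntry resources, not a published schema:

```python
# Sketch: normalize a Redfish-style BMC event into a CEF line a SIEM can
# ingest. The event fields and severity mapping are assumptions modeled
# loosely on Redfish LogEntry resources, not a published schema.

def to_cef(event: dict, host: str) -> str:
    """Render a BMC event as a CEF:0 line with a numeric severity."""
    severity = {"OK": 2, "Warning": 6, "Critical": 9}.get(event.get("Severity"), 5)
    return ("CEF:0|BMC|Redfish|1.0|{id}|{msg}|{sev}|dvchost={host}"
            .format(id=event.get("EventId", "unknown"),
                    msg=event.get("Message", ""),
                    sev=severity, host=host))

event = {"EventId": "FW1001", "Severity": "Critical",
         "Message": "BMC firmware image changed outside maintenance window"}
line = to_cef(event, "bmc-rack12.example.internal")
```

Even this minimal level of forwarding gives the SOC a timeline of BMC-level changes — the indicators of compromise and attack Wes wants centralized.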
Procurement and Organizational Security Maturity
Analysis of how security considerations are integrated into IT procurement processes.
Chase: I’m curious about how much security gets into these various stages of the device lifecycle. Is the security team getting consulted about asking those kinds of questions—like are we going to have SSH access to this, or what level of visibility? Is the security team getting consulted for what level of security requirements should be in place before new IT stuff gets onboarded?
Wes: It heavily depends upon the maturity of the organization. Some organizations think, “Hey, as long as you’re buying from a trusted vendor, we really don’t care about the hardware.” As long as it meets their minimum controls—which in some organizations is as simple as running their enterprise-managed operating system with their enterprise-managed security tools—that’s it.
Places simply do not think about the hardware and firmware built into that hardware from a security perspective. That is an area that I am heavily pushing for organizations to change, where validation and capabilities to verify everything coming in from the supply chain perspective and then ongoing verifications against it is going to become critical and paramount.
Some of our more bleeding-edge customers are building strategies toward that, even moving away from things like Secure Boot into situations like Measured Boot, where instead of having a root of trust managed by the OEM or by Microsoft, they’re looking at managing it themselves so that they can be in control of that verification.
Trust vs. Control in Hardware Security
Wes: A huge part of this always goes back to the conversation about the root of trust of devices—some organizations are not comfortable with taking the operational risk of a device not booting because of some change. So they let the OEM manage it and trust the OEM.
I’ve done plenty of purchases over my career where I’ve said, “Nobody gets fired for buying HP, IBM, Cisco,” because they’re a trusted vendor and I have limits of liability and insurance contracts to protect my organization. Nowadays, I talk to customers that say, “I don’t care about any of those people because they’re not doing the things that we need to validate the trust of it.”
So either they're going to build their own systems, because then they have full transparency into what goes into them, or they're going to use open-source firmware like OpenBMC and open UEFI implementations to apply the principles they need on these devices.
Current Attack Trends and Future Implications
Discussion of evolving attack vectors and their implications for AI infrastructure security.
Chase: We're seeing trends where historically credential abuse has been one of the major initial intrusion vectors. According to this year's Verizon Data Breach Investigations Report, credential abuse is on the decline and exploitation of vulnerabilities is catching up to it. They've almost met in the middle at this point.
Specifically, exploitation of vulnerabilities on network edge devices—VPN gateways especially, but routers, switches, firewalls, load balancers—is being targeted with these vulnerabilities. It’s because that’s the low-hanging fruit for attackers now. Organizations have done a much better job of enforcing MFA and doing baseline protections.
They’re now discovering that these opaque network devices, including even firewalls and VPNs which you install for their security benefits, turn out to be the door through which the attacker enters the environment because of these vulnerabilities. In many cases, because of how these appliances are built, you don’t have that visibility into where the vulnerability is.
AI Infrastructure as the Next Target
Chase: In the AI data center context, we see so much conversation about preventing the models from getting jailbroken or model poisoning during training. We don’t see a lot of discussion about the infrastructure level—the hardware and firmware and GPU level security—but that’s going to be where the attackers target because it’s under-attended to and it’s just a lower priority.
As they build out these AI data centers, the concerns are power, cooling, and GPUs—being able to buy them, the constraint on the supply chain for the hardware itself, and then the ability to power and cool it to keep it operational so that you can sell as many of those hours of utilization as possible. Security of that hardware and infrastructure is not among those top priorities.
But I think that once we see an AI data center get compromised through a BMC firmware vulnerability, there’s going to be more attention paid to the security at that level.
Underwater Data Centers and Operational Dependencies
Discussion about Microsoft’s underwater data center project and its implications for BMC importance.
Paul: It reminds me of the Microsoft data center that was deployed underwater. For some of the reasons you talked about—obviously cooling, plenty of seawater to help cool it. But you also don’t have ready access to that data center, which is the point they were making about why BMCs are important operationally.
Now if they’re important operationally and they’re kind of low-hanging fruit security-wise, if we’ve established that, that is going to be where attackers go. You’re likely not going to see people turn off the BMC functionality altogether because they need it to manage their data centers. That’s where attackers are going to gravitate to.
Conclusion and Resources
Final thoughts and references to additional resources for listeners.
Paul: I'll make sure to put those links in the show notes for this episode. There is a paper being published today that we did in conjunction with AMI — it's amazing, and you should absolutely read it to gain more insight into the BMC threat and vulnerability landscape. There's also a recent blog post on our blog about our support for AI data center infrastructure, to help you secure that environment.
Chase, Wes, thank you for joining me today.
Wes: Happy to be here.
Chase: It’s been a delight.
Paul: Thank you everyone for listening and watching this edition of Below the Surface. That concludes this episode. We’ll see you next time.
Key Takeaways
- AI Infrastructure Economics: The high cost of cloud-based AI compute is driving organizations to build their own AI data centers, creating new security challenges.
- Performance-Based Attacks: AI data centers face unique risks where even small performance impacts from crypto-mining malware could be economically devastating due to tight profit margins.
- BMC Vulnerabilities: Baseboard Management Controllers represent a critical attack surface in AI data centers, providing attackers with persistent, privileged access that’s difficult to detect.
- Supply Chain Risks: The rapid buildout of AI infrastructure often prioritizes performance and availability over security, creating opportunities for supply chain attacks.
- Attestation Gaps: Current BMC implementations lack proper attestation capabilities, making it nearly impossible to verify firmware integrity.
- Cloudborne Precedent: Real-world attacks like Cloudborne demonstrate how firmware implants can persist across customer transitions in bare metal cloud environments.
- Industry Response: Major players like Nvidia are beginning to address BMC security, but widespread adoption of security best practices remains inconsistent.
- Monitoring Challenges: Traditional security tools cannot effectively monitor BMC environments, creating blind spots in security operations.