Learnings from the XZ Utils Supply Chain Compromise
The XZ Utils backdoor (CVE-2024-3094), discovered in March 2024, is an example of a software supply chain attack. It would have allowed attackers in possession of a specific private key to connect to a backdoored system and run their own commands as an administrator. XZ Utils is a widely used set of data compression tools for Unix-like operating systems, pre-installed on most Linux distributions (including Debian, Fedora, Ubuntu, and Arch) and also found on macOS. Affected devices primarily include servers, workstations, and network infrastructure such as firewalls, VPN concentrators, and routers. The backdoor was so well hidden that traditional security scanning tools failed to detect it. A Microsoft engineer found it by chance after noticing unexpected system performance issues, not through automated detection. Here’s a great overview of it to refresh your memory: The Internet Was Weeks Away From Disaster and No One Knew
This incident exposed a real problem in cybersecurity: conventional tools rely on known signatures and obvious patterns. When attackers hide malicious code in the build process rather than the source code, or design it to activate only under specific conditions, traditional defenses simply don’t work.
Eclypsium addresses this gap by enhancing static analysis with machine learning. By analyzing the behavior and structure of software with Automata (the core detection technology behind Eclypsium’s platform) and automatically analyzing our large dataset of compiled binaries, our AI-driven technology can identify threats that signature-based tools miss entirely.
What Made XZ Hard to Detect
The XZ backdoor used several techniques specifically designed to avoid detection, including:
- Hidden in the Build Process: The malicious code wasn’t in the source repository that developers could review. Instead, it was injected during compilation and activated only on specific Linux distributions with specific build configurations.
- IFUNC Hijacking: The backdoor abused IFUNC (GNU indirect functions), a legitimate glibc feature for selecting a function implementation at load time, to intercept SSH authentication calls before they executed.
- Selective Activation: The backdoor only activated on Debian- and RPM-based systems running on x86-64. Testing the code on other systems would show nothing suspicious.
- Entropy Manipulation: The malicious code increased the randomness (entropy) of certain binary sections, mimicking how legitimate compressed data looks.
- Custom Obfuscation: The actual malicious instructions were deobfuscated during the build process using common Unix tools, leaving no obvious traces.
None of these characteristics would trigger alerts in traditional vulnerability scanners or signature-based detection tools. They’re looking for known attack patterns, not for the subtle structural differences that distinguish malicious code from legitimate code.
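To make the entropy point above concrete, here is a minimal sketch of how byte-level entropy can be measured across a binary. It is an illustration only, not Eclypsium's implementation: the function names `shannon_entropy` and `entropy_profile` and the 256-byte window size are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_profile(data: bytes, window: int = 256) -> list[float]:
    """Entropy of each consecutive, non-overlapping window of `data`.

    The window size is an arbitrary choice for this sketch.
    """
    return [
        shannon_entropy(data[i:i + window])
        for i in range(0, len(data), window)
    ]
```

A region of repeated bytes scores near 0, while compressed or encrypted data scores close to 8 bits per byte. A high value is not suspicious in itself; the useful question is whether that value is expected for the section it appears in.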
Anomaly Detection Driven by Static Analysis
Eclypsium’s Static Analysis For Building Patterns
Eclypsium’s Advanced Research team has spent years dissecting firmware, bootloaders, system libraries, and supply chain malware. From that work, we’ve learned what doesn’t look normal in a compiled binary. However, that knowledge doesn’t scale on its own.
Automata encodes that expertise into formalized binary-level behaviors and structural indicators. These are not vague heuristics but precise, testable signals derived from real-world attacks such as the XZ backdoor. Examples include:
- Abnormal use of dynamic dispatch mechanisms such as IFUNC
- Structural inconsistencies between sections, symbols, and relocation data
- Entropy distributions that are statistically valid but contextually wrong
- Instruction-level patterns that emerge only after compilation
- Import/export relationships that violate expected build semantics
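As a rough illustration of the first indicator, the sketch below counts IFUNC symbols in an ELF file using only the Python standard library. It assumes a little-endian ELF64 image with intact section headers and is not how Automata works internally; `count_ifunc_symbols` is a hypothetical name. The constants come from the ELF format and its GNU extension.

```python
import struct

SHT_SYMTAB = 2       # static symbol table section type
SHT_DYNSYM = 11      # dynamic symbol table section type
STT_GNU_IFUNC = 10   # GNU indirect function symbol type

ELF_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")   # Elf64_Ehdr, 64 bytes
SECTION_HEADER = struct.Struct("<IIQQQQIIQQ")     # Elf64_Shdr, 64 bytes
SYMBOL = struct.Struct("<IBBHQQ")                 # Elf64_Sym, 24 bytes


def count_ifunc_symbols(elf: bytes) -> int:
    """Count STT_GNU_IFUNC symbols in a little-endian ELF64 image."""
    if len(elf) < ELF_HEADER.size or elf[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if elf[4] != 2 or elf[5] != 1:
        raise ValueError("only little-endian ELF64 is handled in this sketch")

    header = ELF_HEADER.unpack_from(elf, 0)
    e_shoff, e_shentsize, e_shnum = header[6], header[11], header[12]

    count = 0
    for index in range(e_shnum):
        section = SECTION_HEADER.unpack_from(elf, e_shoff + index * e_shentsize)
        sh_type, sh_offset, sh_size, sh_entsize = (
            section[1], section[4], section[5], section[9]
        )
        # Only symbol tables are of interest; skip everything else.
        if sh_type not in (SHT_SYMTAB, SHT_DYNSYM) or sh_entsize == 0:
            continue
        for offset in range(sh_offset, sh_offset + sh_size, sh_entsize):
            st_info = SYMBOL.unpack_from(elf, offset)[1]
            # The low four bits of st_info hold the symbol type.
            if st_info & 0xF == STT_GNU_IFUNC:
                count += 1
    return count
```

The count alone says little, since some libraries use IFUNC legitimately. It becomes a signal only when compared with what is typical for similar binaries, for example other builds of the same library.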
These features aren’t randomly selected. They’re based on actual technical indicators of how the XZ backdoor worked. Once these research-driven signals are extracted, machine learning takes over, not to “guess” but to measure deviation at scale.
Unsupervised Learning for Anomaly Detection: Recognizing Off Patterns
Eclypsium maintains a dataset of over 40 million distinct legitimate binaries across firmware, system libraries, and applications. This dataset serves as the reference baseline for an unsupervised anomaly detection system that learns “normal” patterns from the features Automata extracts from these binaries.
When a suspicious binary is analyzed, Automata extracts its features, and our anomaly detection systems look for off patterns: feature values that deviate from the reference dataset. These anomalies are flagged for detailed study to assess possible security threats. Several complementary anomaly detection models, each measuring a different dimension of the Automata features, are applied to increase coverage and reduce the likelihood of false positives.
The logic is straightforward: malware is rare and structurally distinct from legitimate code. A system trained on millions of normal binaries will recognize when something is fundamentally wrong, even if it’s a new attack the system has never seen before.
We use three complementary outlier detection techniques that each approach the problem differently:
- Isolation Forest identifies binaries that are structurally unlike the majority of known-good samples
- ECOD flags individual characteristics like file size or entropy that fall outside normal ranges
- COPOD detects when expected relationships between features break down
By combining all three, we catch anomalies that any single method might miss.
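To give a feel for how a distribution-based detector such as ECOD scores a sample, here is a simplified sketch in plain Python. It keeps only the core idea (per-feature empirical tail probabilities, summed as negative logs) and leaves out the skewness-corrected aggregation that the full ECOD method also uses. The names `fit_reference` and `tail_score` and the add-one smoothing are assumptions made for this example, not part of any library or of Eclypsium's system.

```python
import math
from bisect import bisect_left, bisect_right


def fit_reference(rows: list[list[float]]) -> list[list[float]]:
    """Store each feature column of the known-good set in sorted order."""
    return [sorted(column) for column in zip(*rows)]


def tail_score(columns: list[list[float]], sample: list[float]) -> float:
    """Sum of negative log tail probabilities; larger means more unusual."""
    left_total = 0.0
    right_total = 0.0
    for column, value in zip(columns, sample):
        n = len(column)
        at_or_below = bisect_right(column, value)     # reference values <= value
        at_or_above = n - bisect_left(column, value)  # reference values >= value
        # Add-one smoothing keeps the probability above zero for unseen extremes.
        left_total += -math.log((at_or_below + 1) / (n + 1))
        right_total += -math.log((at_or_above + 1) / (n + 1))
    # Report whichever tail makes the sample look more extreme overall.
    return max(left_total, right_total)
```

A sample whose features sit in the middle of the reference distribution gets a low score, while one with a value beyond anything in the reference set gets a much higher one. No labeled malware is needed at any point, which matches the unsupervised setup described above.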
These three algorithms are combined in an ensemble that weighs each algorithm’s decision to produce the final score, which keeps them properly balanced across the features used in training. None of these methods requires knowledge of the XZ backdoor specifically. They work by learning what legitimate code looks like and flagging deviations. Our platform combines these machine learning techniques into a practical system that runs 24/7 with the following features:
- Analyzes 100% of new binaries in an environment, not just random samples
- Maintains a database of over 40 million legitimate binaries for comparison
- Uses multiple detection methods simultaneously, reducing false positives
- Provides explainability about why a binary was flagged
- Integrates findings into existing security tools and incident response workflows
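The ensemble weighting mentioned earlier can be pictured with a short sketch. One common approach, assumed here for illustration and not confirmed as Eclypsium's method, is to rank-normalize each detector's raw scores so they share a scale and then take a weighted average. The function names and the weights in the usage note are hypothetical.

```python
def rank_normalize(scores: list[float]) -> list[float]:
    """Map raw scores to [0, 1] by rank so detectors share a common scale.

    Ties are broken by input order, which is acceptable for a sketch.
    """
    n = len(scores)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    for rank, index in enumerate(order):
        ranks[index] = rank / (n - 1)
    return ranks


def ensemble_scores(detector_scores: dict[str, list[float]],
                    weights: dict[str, float]) -> list[float]:
    """Weighted average of rank-normalized scores, one value per binary."""
    if not detector_scores:
        return []
    total_weight = sum(weights[name] for name in detector_scores)
    normalized = {
        name: rank_normalize(scores) for name, scores in detector_scores.items()
    }
    count = len(next(iter(normalized.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in normalized) / total_weight
        for i in range(count)
    ]
```

For example, calling `ensemble_scores` with three score lists keyed `"isolation_forest"`, `"ecod"`, and `"copod"` and weights such as 0.4, 0.3, and 0.3 returns one combined score per binary. A binary ranked highest by all three detectors ends up with the highest combined score.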
Why This Matters for Supply Chain Security
Supply chain attacks are difficult because they hide in plain sight within software used by millions of organizations. The XZ backdoor almost succeeded, not because it was technically superior, but because it was inserted at a point in the software development process where traditional tools weren’t looking.
Machine learning-based detection changes the game because:
- No Signatures Required: You don’t need to know about an attack to detect it. You only need to know what legitimate code looks like.
- Behavioral Anomaly Detection: Rather than matching against a database of known bad code, the system measures whether a binary’s characteristics fall within the expected range for legitimate software.
- Continuous Monitoring: Unlike manual code reviews or periodic audits, machine learning systems can analyze 100% of new binaries as they’re built and distributed.
- Scalability: Analyzing millions of binaries per day is impossible to do manually, but machine learning can handle it routinely.
The Bigger Picture
The XZ backdoor represents a category of attack that traditional tools are structurally unable to stop. It wasn’t a zero-day in the traditional sense, where a software bug is exploited. It was deliberate code inserted at the right moment in the supply chain, designed to hide in the compiled output while being absent from the visible source code.
Detecting this type of attack requires a different approach. Rather than asking “Is this code exploiting a known vulnerability?”, we ask “Does this binary look normal?” Machine learning provides a systematic way to answer that question across millions of samples, continuously, at scale.
As supply chain attacks become more sophisticated, organizations need defenses equally advanced. AI-powered analysis of binaries and firmware offers that capability. You can learn more about Eclypsium’s Automata feature with the following resources:
