Zscaler CEO Jay Chaudhry set off a firestorm when he boasted that the company harnesses more than half a trillion daily transactions, including full URLs and proprietary logs, to train its AI models, prompting security researchers and enterprise customers to ask exactly what is being fed into the firm’s machine-learning pipelines. The controversy erupted in August 2025 after independent outlets paraphrased Chaudhry’s remarks from a Cloud Security Alliance virtual summit and earlier earnings calls, describing a practice of using “complete logs” from the Zscaler Zero Trust Exchange to power threat detection, risk modeling, and AI copilots. Almost immediately, reports and social-media reactions framed the statements as an admission that raw customer data, potentially containing access tokens, internal hostnames, and sensitive query strings, was being used to train shared AI models without explicit consent.
What the CEO Actually Said
During the CSA summit, Chaudhry highlighted Zscaler’s massive telemetry: “We have over 500 billion transactions per day and hundreds of trillions of signals every day. We can use that technology for wonderful threat detection, cyber detection.” The same session saw him reference structured and unstructured log data drawn from deep packet inspection—precisely the kind of traffic enterprises assume remains encrypted and protected. Security manager Steven Swift amplified the remarks in a widely shared post, noting that Zscaler encourages customers to enable SSL inspection, effectively giving the platform visibility into the payloads of encrypted sessions.
An SDxCentral report and other coverage quoted the CEO’s references to “trillions of customer logs” and a “massive data lake” built from transaction-level logs. Researchers reacted strongly because such logs routinely contain URLs with embedded secrets, cloud storage object names, file paths, and personal identifiers. The perception that Zscaler was training generative AI models on this high-fidelity data raised immediate privacy, contractual, and regulatory red flags.
Zscaler’s Official Response: Data Containment
Within days, Zscaler published a corporate blog post titled “Zscaler’s Commitment to Responsible AI” to clarify its practices. The company stated unequivocally that it does not use customer proprietary or personal data to train shared AI models. Instead, it employs only aggregated, non-identifying metadata and platform signals, such as traffic patterns, risk scores, and anonymized telemetry, to improve global detection models. This approach, which the firm calls “data containment,” is meant to ensure that sensitive information never leaves a tenant’s logical boundary for model training purposes.
The blog explicitly assures customers that “raw logs are not used for model training,” and that all customer data remains isolated within the multi-tenant architecture. Zscaler further noted that customers can choose log storage regions and that data in transit is encrypted. However, the company’s follow-up messaging did not fully address the apparent gap between the CEO’s language and the written policy.
The Technical Reality of Transaction Logs
To understand the furor, one must appreciate what a “transaction log” contains in a zero-trust security service. Zscaler’s platform inspects all web and cloud traffic, applying threat detection, data loss prevention, and access controls. Full SSL inspection means the service can see request headers, URLs, query strings, and often the initial bytes of response bodies. A typical log entry can include:
- Full URL, including path and query parameters
- HTTP method and status code
- IP addresses (source and destination)
- User-agent strings and device fingerprints
- Authentication tokens, session cookies, or API keys (if present in headers or URLs)
- File names and cloud-object identifiers
For data scientists, the difference between “metadata” and “full URL” is enormous. A URL like https://healthcare.example.com/patient/1234?token=abc carries both a patient identifier and a secret. If such a URL is treated as “metadata” for training without sanitization, the model could memorize and later reproduce it. Researchers noted that even aggregated, de-identified signals can be re-identified when combined with other data sources, and generative models are notorious for regurgitating training data.
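To make the gap between a raw URL and a sanitized one concrete, here is a minimal Python sketch of the kind of scrubbing a pipeline would need before a URL could reasonably be called metadata. It is an illustration under stated assumptions, not a description of Zscaler's actual processing: the `SENSITIVE_PARAMS` list, the `{id}` placeholder, and the `sanitize_url` function are all hypothetical names chosen for this example.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query-parameter names treated as secrets.
# A real deployment would need a far broader, reviewed list.
SENSITIVE_PARAMS = {
    "token", "access_token", "api_key", "apikey",
    "key", "password", "signature", "session",
}

# Path segments made only of digits are assumed to be record identifiers.
NUMERIC_SEGMENT = re.compile(r"^\d+$")


def sanitize_url(url: str) -> str:
    """Return the URL with numeric path IDs and secret-looking query values masked."""
    parts = urlsplit(url)

    # Replace numeric path segments (e.g. a patient ID) with a placeholder.
    path = "/".join(
        "{id}" if NUMERIC_SEGMENT.match(segment) else segment
        for segment in parts.path.split("/")
    )

    # Keep parameter names but mask the values of sensitive ones.
    query = [
        (name, "REDACTED" if name.lower() in SENSITIVE_PARAMS else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]

    # The fragment is dropped. Credentials embedded in the host part
    # (user:pass@host) are NOT handled by this sketch.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))


print(sanitize_url("https://healthcare.example.com/patient/1234?token=abc"))
# https://healthcare.example.com/patient/{id}?token=REDACTED
```

Even this simple version shows why the definition matters: name-based masking misses secrets carried under unexpected parameter names or in non-numeric path segments, so a sanitized URL is still not guaranteed to be free of identifying data.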
Why the Distinction Matters for Privacy
The controversy centers on whether Zscaler uses:
- (A) Raw, tenant-linked logs with full URLs and content, or
- (B) Aggregated, de-identified signals that cannot be mapped back to a specific customer
Zscaler insists on (B). The CEO’s public comments, as reported, seemed to describe (A). Enterprise security teams must reconcile these accounts because the legal and contractual stakes are high. If (A) were true, customers in regulated industries (healthcare, finance, defense) could face compliance violations, data breach notification obligations, and breach-of-contract claims. Even under (B), the technical safeguards for de-identification must be rigorous and independently verified.
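What option (B) could look like in practice can be sketched in a few lines of Python. The example below is purely illustrative and rests on assumptions: the `LogEvent` fields, the `aggregate_signals` function, and the `min_tenants` threshold are hypothetical and are not drawn from any Zscaler documentation. It drops tenant identity and releases a per-domain count only when enough distinct tenants contributed to it, a crude k-anonymity-style rule.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class LogEvent(NamedTuple):
    """Hypothetical, heavily reduced view of one transaction log entry."""
    tenant_id: str
    domain: str
    blocked: bool


def aggregate_signals(
    events: Iterable[LogEvent], min_tenants: int = 3
) -> dict[str, dict[str, int]]:
    """Return per-domain counts, omitting domains seen by fewer than min_tenants tenants."""
    tenants_seen = defaultdict(set)
    requests = defaultdict(int)
    blocked = defaultdict(int)

    for event in events:
        tenants_seen[event.domain].add(event.tenant_id)
        requests[event.domain] += 1
        blocked[event.domain] += int(event.blocked)

    # Tenant IDs are used only for the threshold check and never emitted.
    return {
        domain: {"requests": requests[domain], "blocked": blocked[domain]}
        for domain in requests
        if len(tenants_seen[domain]) >= min_tenants
    }


events = [
    LogEvent("t1", "bad.example", True),
    LogEvent("t2", "bad.example", True),
    LogEvent("t3", "bad.example", False),
    LogEvent("t1", "intranet.corp.example", False),
]
print(aggregate_signals(events))
# {'bad.example': {'requests': 3, 'blocked': 2}}
```

The internal hostname seen by a single tenant is suppressed, while the widely observed domain survives as a cross-tenant signal. A threshold like this reduces, but does not eliminate, re-identification risk, which is why independent verification of the real safeguards still matters.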
A deeper risk is model memorization. Modern transformer-based models can reproduce training examples, and if a model is later exposed via an API or copilot feature, it might inadvertently leak sensitive patterns. Zscaler’s own AI agents and copilots would then become potential exfiltration vectors, contradicting the very zero-trust principles the platform is supposed to enforce.
Governance and Regulatory Implications
Regulatory obligations add another layer of urgency:
- GDPR: If logs contained personal data (IP addresses, usernames, patient IDs), training AI on them without a lawful basis and without informing data subjects would violate the regulation. Data Protection Authorities could impose significant fines.
- HIPAA / GLBA / DFARS: Sector-specific rules require business associate agreements (BAAs, under HIPAA) or comparable contractual restrictions on secondary use of protected information. A healthcare provider using Zscaler could be in breach if patient data were fed into shared models.
- U.S. State Privacy Laws: State laws such as California’s CPRA give consumers rights concerning automated decision-making and data reuse, and carry their own transparency requirements.
Customers now face a due-diligence crunch: they must verify that Zscaler’s contractual Data Processing Agreements (DPAs) and technical measures align with these obligations. Many will demand revisions to explicitly prohibit the use of customer-identifiable data for model training and require audit rights.
What IT Leaders Should Do Now
For organizations running Zscaler—or any inline cloud security service—the immediate steps are both contractual and technical:
- Review Master Services Agreements and DPAs: Look for language covering secondary data use, model training, and telemetry. If it is vague or missing, request an addendum that explicitly forbids training on raw logs.
- Demand Technical Transparency: Ask Zscaler for a precise definition of “metadata” used in training. Which fields are redacted from URLs? Are query strings stripped of tokens? How are sanitization rules applied at scale?
- Verify Data Residency and Encryption Controls: Use customer-managed keys (BYOK) where available and confine log storage to approved geographic regions. This reduces the provider’s ability to access plaintext data.
- Implement Defensive Data Hygiene: Even before traffic hits the inspection service, apply token masking or URL rewriting at the proxy or endpoint to strip high-risk secrets. Enforce DLP policies that block sensitive data from appearing in URLs or headers.
- Negotiate Audit and Attestation Rights: Push for an independent third-party attestation (e.g., SOC 2 report explicitly covering AI training pipelines and dataset provenance) to verify that only aggregated, de-identified signals are used.
- Monitor AI Outputs: If your environment uses Zscaler copilot or similar generative features, log and review outputs for signs of data leakage, just as you would audit any other export path.
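As one way to approach the last step above, reviewing generative outputs for leakage, a team could run captured copilot responses through a simple pattern scan. The Python sketch below is a starting point under stated assumptions: the `PATTERNS` table and the `find_leaks` function are hypothetical, the regular expressions are illustrative rather than exhaustive, and nothing here reflects a Zscaler API.

```python
import re

# Illustrative patterns only; real secret scanners use much larger rule sets.
PATTERNS = {
    # Three base64url segments starting with "eyJ", the usual shape of a JWT.
    "jwt": re.compile(
        r"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}"
    ),
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Secret-looking query parameters inside URLs.
    "url_query_secret": re.compile(
        r"[?&](?:token|api_key|access_token|password)=[^&\s]+",
        re.IGNORECASE,
    ),
}


def find_leaks(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs for every suspicious match."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


sample = "See https://x.example/a?token=abc123 and key AKIAIOSFODNN7EXAMPLE"
for name, matched in find_leaks(sample):
    print(name, matched)
```

In a real review workflow it would be safer to record only the pattern name and location rather than the matched secret itself, so that the audit log does not become a second copy of the leaked data.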
For the most security-conscious organizations—defense contractors, critical infrastructure, large financials—it may be wise to explore single-tenant or on-premises deployment options that physically isolate training pipelines from multi-tenant data.
The Bottom Line
Zscaler’s half-trillion-transaction daily volume is both a competitive advantage and a privacy landmine. The security benefits of cross-tenant threat intelligence are real, but they depend on airtight separation between raw customer logs and shared AI models. The gap between the CEO’s earnings-call bravado and the company’s formal assurance is more than a public-relations hiccup—it is a trust deficit that only verifiable controls and contractual clarity can close.
The episode serves as a warning for the entire cloud security industry: when every executive talks about “AI-driven defense,” the market will demand precision about what data fuels those models and what promises are being made to customers. For now, Zscaler customers should assume that protecting their data is their own responsibility: verify the vendor’s claims, and lock down contract language governing how the logs flowing through the platform may be used for AI.