Private AI Inference

Serve LLMs without exposing prompts or weights

Why Private Inference Matters

Centralized inference can log prompts and leak IP. Phala enclaves ensure no operator—cloud or vendor—can peek.

Data security

Prompts can expose sensitive queries

Traditional cloud infrastructure exposes sensitive information to operators and administrators. More: Confidential computing.

Model weights are valuable IP

Hardware-enforced isolation prevents unauthorized access while maintaining computational efficiency. More: Zero-trust architecture.

Inference logs reveal business patterns

End-to-end encryption protects data in transit, at rest, and, critically, during computation. More: Attestation.

No operator access to runtime memory

Cryptographic verification ensures code integrity and proves execution in genuine TEE hardware.


GPU TEE Protection · Zero-Trust Inference · Confidential Serving

OpenAI-Compatible API with Hardware Encryption

GPU TEEs, paired with Intel TDX and AMD SEV CPU enclaves, provide hardware-level memory encryption: your model weights, user prompts, and inference outputs stay encrypted in use, inside attested GPU enclaves. Not even cloud admins or hypervisors can inspect runtime state.

Privacy as a human right, by design: requests are routed over mTLS into the enclave, usage receipts are emitted, and plaintext is never stored. OpenAI-compatible endpoints come with verifiable attestation and zero-logging guarantees; a minimal verify-then-query sketch follows the feature list below.

GPU memory encryption
OpenAI-compatible API
Zero-logging architecture
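
As a minimal sketch of that flow, the snippet below checks that an attestation report is available before sending a prompt, reusing the endpoints from the steps further down. It is illustrative only; Step 2 below shows the full NVIDIA attestation check.

verify_then_query.py
import requests
from openai import OpenAI

api_key = "<API_KEY>"

# Verify the enclave is attestable before trusting it with a prompt
# (see Step 2 below for full NVIDIA attestation verification).
report = requests.get(
    "https://api.redpill.ai/v1/attestation/report?model=phala/deepseek-v3",
    headers={"Authorization": f"Bearer {api_key}"},
)
report.raise_for_status()

# Only after attestation succeeds, query the OpenAI-compatible endpoint.
client = OpenAI(api_key=api_key, base_url="https://api.redpill.ai/v1")
reply = client.chat.completions.create(
    model="phala/deepseek-chat-v3-0324",
    messages=[{"role": "user", "content": "Summarize my private document."}],
)
print(reply.choices[0].message.content)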

Real-World Success Stories

Discover how leading companies leverage Phala's confidential AI to build exceptional digital experiences while maintaining complete data privacy and regulatory compliance.

On-prem Privacy, Cloud Simplicity

Deploy confidential AI inference with the flexibility of cloud and the security of on-premise infrastructure.

Your private request → end-to-end encrypted → Phala Cloud (hardware-attested routing) → DeepSeek V3 by DeepSeek, running in a GPU TEE

Powerful Features, Simple Integration

OpenAI-compatible APIs with advanced capabilities running in TEE

Step 1

Make Secure API Requests

Use an OpenAI-compatible SDK to access 200+ models with hardware-enforced privacy. A drop-in replacement with zero code changes.

secure_request.py
from openai import OpenAI

# Point the standard OpenAI SDK at Phala's attested endpoint.
client = OpenAI(
    api_key="<API_KEY>",
    base_url="https://api.redpill.ai/api/v1"
)

# Stream the completion; prompts and outputs stay inside the GPU TEE.
response = client.chat.completions.create(
    model="phala/deepseek-chat-v3-0324",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is your model name?"},
    ],
    stream=True
)

# With stream=True the response is an iterator of chunks, not a single message.
for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Step 2

Verify TEE Execution

Every response includes cryptographic proof from NVIDIA and Intel TEE hardware. Verify attestation to ensure secure execution.

verify_attestation.py
import requests
import jwt

api_key = "<API_KEY>"

# Fetch the attestation report for the model's enclave
response = requests.get(
    "https://api.redpill.ai/v1/attestation/report?model=phala/deepseek-v3",
    headers={"Authorization": f"Bearer {api_key}"}
)
report = response.json()

# Forward the GPU evidence to NVIDIA's remote attestation service (NRAS)
gpu_response = requests.post(
    "https://nras.attestation.nvidia.com/v3/attest/gpu",
    headers={"Content-Type": "application/json"},
    data=report["nvidia_payload"]
)

# NRAS responds with a list whose second element maps GPU IDs to signed JWTs
gpu_tokens = gpu_response.json()[1]
for gpu_id, token in gpu_tokens.items():
    # Decode the JWT claims and check the measurement result
    decoded = jwt.decode(token, options={"verify_signature": False})
    assert decoded.get("measres") == "success"
    print(f"{gpu_id}: Verified ✓")

Solutions for Every User

Choose the perfect privacy-first AI solution tailored to your needs

Personal

Individual

Private AI assistants for individuals who value data sovereignty and zero-logging guarantees.

What's included:

  • Private chat with zero data retention
  • Encrypted journal & notes
  • Personal data analysis

Developer

API

OpenAI-compatible APIs with TEE protection—drop-in replacement with hardware-enforced privacy.

What's included:

  • One-line API integration
  • Same SDKs & libraries
  • Verifiable attestation

Enterprise

Enterprise

Scalable confidential AI infrastructure with compliance, auditability, and flexible deployment options.

What's included:

  • Private RAG & AI copilots
  • Confidential fine-tuning
  • On-prem or cloud deployment
  • HIPAA/SOC2 compliance

Industry-Leading Enterprise Compliance

Meeting the highest compliance requirements for your business

AICPA SOC 2 · ISO 27001 · CCPA · GDPR

Frequently Asked Questions

Everything you need to know about Private AI Inference

1. How does Phala ensure truly private inference?

Phala runs inference inside GPU Trusted Execution Environments (TEEs), paired with Intel TDX and AMD SEV CPU enclaves, encrypting all prompts, outputs, and model weights during inference. Not even cloud providers or system administrators can access data in use; only the attested enclave can decrypt your inputs.

2. Can cloud providers or Phala operators see my prompts?

No. Hardware-level memory encryption (Intel TDX/AMD SEV) prevents any operator—including Phala, cloud providers, or root users—from reading runtime memory. Data is encrypted from the moment it enters the TEE until it leaves.

3. How are GPU model weights protected?

Model weights are loaded directly into encrypted GPU memory inside TEEs. They never touch disk or CPU in plaintext. Each deployment is sealed with cryptographic measurements (mrenclave) you can verify before sending data.
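
As an illustrative sketch, the check below compares the measurement in an attestation report against a value you expect before sending any data. The report field name ("mrenclave") is an assumption based on this answer's wording, not a documented schema.

check_measurement.py
import requests

api_key = "<API_KEY>"
EXPECTED_MEASUREMENT = "<MRENCLAVE_HEX>"  # published out-of-band by the model provider

report = requests.get(
    "https://api.redpill.ai/v1/attestation/report?model=phala/deepseek-v3",
    headers={"Authorization": f"Bearer {api_key}"},
).json()

# "mrenclave" as a top-level report field is assumed here for illustration.
assert report.get("mrenclave") == EXPECTED_MEASUREMENT, "measurement mismatch -- do not send data"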

4. What are attestation proofs, and how do I verify them?

Attestation proofs are cryptographic signatures from the CPU/GPU proving the exact code and environment running inside the TEE. Verify them via /v1/attestation endpoints before sending prompts—ensuring no tampering or backdoors exist.

5. How long does it take to deploy a private inference endpoint?

Under 5 minutes. Use Docker containers with pre-configured TEE images, or deploy via Phala Cloud's one-click interface. No custom firmware or low-level TEE programming required.

6. Do I need to modify my existing OpenAI-compatible code?

No. Phala provides drop-in OpenAI-compatible API endpoints (base_url = https://api.redpill.ai/v1). Use the same SDKs (openai-python, openai-node) and just point to Phala's attested endpoints.
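
For example, an existing openai-python client needs only a different key and base_url; everything else stays unchanged:

drop_in.py
from openai import OpenAI

# The only change from stock OpenAI code: the API key and base_url.
client = OpenAI(
    api_key="<API_KEY>",
    base_url="https://api.redpill.ai/v1",
)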

7. Can I bring my own fine-tuned model weights?

Yes. Upload weights encrypted with your key; Phala loads them into TEE memory, and they are never decrypted in transit or outside the enclave. Use /v1/attestation to verify the deployment before sending prompts.
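
A rough sketch of the client side, assuming symmetric encryption with a key only you hold. The upload step itself is deployment-specific and deliberately not shown, since no public upload API is documented here.

encrypt_weights.py
from cryptography.fernet import Fernet

# Keep this key private; Phala never sees plaintext weights.
key = Fernet.generate_key()

with open("model.safetensors", "rb") as f:
    ciphertext = Fernet(key).encrypt(f.read())

with open("model.safetensors.enc", "wb") as f:
    f.write(ciphertext)

# Ship model.safetensors.enc through your onboarding channel; the exact
# upload API is deployment-specific and not part of this sketch.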

8. What's the latency overhead of TEE inference?

Typically 5-15% compared to bare-metal GPUs. Memory encryption happens at hardware speed with Intel TDX/AMD SEV, so most workloads see negligible impact. Batching and caching reduce overhead further.

9. Can I use this for healthcare/law firm AI assistants?

Yes. Private inference is ideal for HIPAA/GDPR-regulated industries. Patient records or legal documents never leave the TEE in plaintext, and attestation proofs provide audit trails for compliance.

10. How does this help with internal document Q&A chatbots?

Embed your internal docs (HR policies, financial reports) into TEE-protected RAG pipelines. Employees query via private endpoints, and neither Phala nor cloud providers can read the documents or queries.
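
A minimal sketch of such a pipeline against the OpenAI-compatible endpoint. The embedding model name is a placeholder assumption; substitute whatever embedding model your deployment exposes.

private_rag.py
import numpy as np
from openai import OpenAI

client = OpenAI(api_key="<API_KEY>", base_url="https://api.redpill.ai/v1")

docs = [
    "HR policy: employees accrue 20 vacation days per year.",
    "Expense policy: flights over $500 require pre-approval.",
]

def embed(texts):
    # "<EMBEDDING_MODEL>" is a placeholder; use the model your endpoint serves.
    out = client.embeddings.create(model="<EMBEDDING_MODEL>", input=texts)
    return np.array([d.embedding for d in out.data])

doc_vecs = embed(docs)
question = "How many vacation days do I get?"
q_vec = embed([question])[0]

# Cosine-similarity retrieval runs client-side; documents and queries are
# only ever decrypted inside the attested endpoint.
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
context = docs[int(np.argmax(scores))]

answer = client.chat.completions.create(
    model="phala/deepseek-chat-v3-0324",
    messages=[
        {"role": "system", "content": f"Answer only from this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)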

11. What's the maximum prompt/document size I can send?

Up to 128k tokens for most models (e.g., Qwen2.5-72B, DeepSeek-V3). For longer documents, use chunking strategies or contact us for enterprise deployments with extended context windows.
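
One simple chunking approach is character-based splitting with overlap, using a rough 4-characters-per-token heuristic (an assumption, not an exact tokenizer):

chunk_document.py
def chunk_text(text: str, max_tokens: int = 100_000, overlap_tokens: int = 500) -> list[str]:
    # Heuristic: ~4 characters per token; use a real tokenizer for exact budgets.
    max_chars = max_tokens * 4
    overlap_chars = overlap_tokens * 4
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap_chars
    return chunks

# Each chunk then fits comfortably under a 128k-token context window.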

12. Are there production deployments using this today?

Yes. OpenRouter uses Phala for confidential enterprise routes, NEAR AI for verifiable ML inference, and OODA AI (NASDAQ-listed) for decentralized GPU TEE deployments. See case studies above.

Start Private AI Inference Today

Deploy confidential LLM endpoints with hardware-enforced encryption and zero-logging guarantees.

Deploy on Phala
  • Intel TDX & AMD SEV support
  • Remote attestation built-in
  • Zero-trust architecture
  • Enterprise-ready compliance
  • 24/7 technical support