Migração 100% grátis + 1 mês grátis com cupom MIGRAR1MES · novos clientes em planos até R$ 200/mês Migrar agora
Technical comparison

Ollama, Together.ai or RunPod: which to choose for open-source LLM?

Hosting Llama 3, Mistral, Qwen and other open-source models became common strategy in 2026 — be it for cost, privacy or customization. The three main options: run Ollama on VPS/CPU, use a managed API like Together.ai, or rent GPU on-demand at RunPod.

TL;DR

Ollama on Rollin VPS (CPU) runs small quantized models (Llama 3 8B, Phi-3, Qwen 2.5 7B) with 30-80ms latency for Brazil and fixed cost of R$ 89.90-199.90/month. Together.ai is a managed API with fast inference and per-token pricing, no ops. RunPod offers GPU on-demand for large models (70B+) or fine-tuning.

Comparison table

FeatureOllama on Rollin VPSTogether.aiRunPod
Service typeSelf-hosted CPUManaged API (serverless)GPU on-demand (IaaS)
HardwareAMD EPYC + NVMe (CPU)Managed GPUsRTX 4090 / A100 / H100
Supported modelsLlama 3 8B, Mistral 7B, Qwen 2.5, Phi-3Llama 3 70B, Mixtral, DeepSeek, +100Any open-source model
Brazil latency30-80ms (SP datacenter)200-400ms (US/EU)150-350ms (varies)
Entry costR$ 89.90/mo (Pro 10)Pay-per-tokenUS$ 0.30-3.50/hr GPU
Cost per million tokensAmortized in fixed~US$ 0.20-0.90Calculated per GPU hour
Large 70B+ modelsUnfeasible on CPUNatively supportedYes, adequate GPU
Fine-tuningLimited (slow CPU)Yes, managedYes, full GPU control
Privacy (where is data?)Your VPS in Brazil (LGPD)Together servers (US, +DPA)RunPod servers (US/EU)
Cold startZero~1-5s (serverless)30-120s (GPU boot)
Throughput tokens/sec20-60 tok/s (8B on CPU)50-200 tok/s100-500 tok/s
BillingFixed in BRL (R$)Per use in USDPer hour in USD
Vendor lock-inZero (open-source)Medium (proprietary API)Low
OperationYou manage OllamaZero — just APIYou set up Docker + container
Portuguese human supportYes, 24/7 via RollinEnglish onlyEnglish only

Ollama on Rollin VPS Pros

  • Predictable fixed BRL cost
  • Minimum latency for Brazil
  • Data never leaves your VPS
  • Runs on same machine as n8n, EvolutionAPI, Qdrant
  • Zero cold start
  • 24/7 Portuguese human support
  • Open-source: Llama, Mistral, Qwen, Phi

Together.ai Pros

  • Immediate access to 100+ models
  • Fast inference with managed GPU
  • Pay-per-token
  • No ops
  • Excellent for variable loads
  • Good documentation and SDKs
  • Supports managed fine-tuning

RunPod Pros

  • GPU on-demand: RTX 4090, A100, H100
  • Flexible hourly pricing
  • Runs any model
  • Ideal for fine-tuning
  • Serverless mode also available
  • Active community, Docker templates

Ollama on Rollin VPS Cons

  • CPU limits large models
  • Lower throughput than GPU
  • No auto-scaling
  • Serious fine-tuning needs GPU
  • You manage Ollama updates

Together.ai Cons

  • 200-400ms latency from Brazil
  • USD billing with IOF
  • Data on Together servers — DPA for LGPD
  • At high stable volume, more expensive than self-hosted
  • No Portuguese support
  • Rate limits on popular models

RunPod Cons

  • 30-120s cold start
  • USD hourly billing
  • You are responsible for Docker
  • Variable GPU availability
  • No Portuguese support
  • No BR region

When to choose each

Use Ollama on Rollin VPS if:

You run chat or RAG with models up to 13B in Portuguese, predictable volume. Critical privacy. Want minimum latency for Brazilian users.

Use Together.ai if:

You need Llama 3 70B without investing in GPU. Variable loads — prototypes, peaks. Small team without DevOps.

Use RunPod if:

You will do fine-tuning. Need GPU for specific workloads. Want full environment control.

Use combination if:

Ollama on Rollin VPS for production + Together.ai for fallback on large models + RunPod for fine-tuning.

Verdict

For most Brazilian cases, Ollama on Rollin VPS delivers the best cost-benefit with total privacy and minimum latency. Honestly, if you need 70B+ model, Together.ai is clearly superior. RunPod is the right tool for fine-tuning. Rollin Host does not offer dedicated GPU in 2026, so if your case is serious fine-tuning, use RunPod without guilt.

Frequently asked questions

Can I run Llama 3 8B on CPU?

Yes. With Q4 or Q5 quantization (GGUF), Llama 3 8B runs on a VPS with 8-16 GB RAM and AMD EPYC delivers 20-40 tokens/second.

How much does Together.ai cost in 2026?

Together.ai charges per token. Llama 3 8B costs ~US$ 0.20/M tokens, Llama 3 70B around US$ 0.90/M tokens.

Does RunPod have a Brazilian datacenter?

In 2026, RunPod has no Brazilian region. Most-used regions are US-East, US-West and EU.

Does Ollama support function calling?

Yes, since version 0.3+ Ollama supports tool/function calling with compatible models.

Can I fine-tune on Ollama?

Technically yes, but impractical on CPU. For serious fine-tuning, use RunPod with GPU.

Is Together.ai LGPD compliant?

Together.ai offers signable DPA. Since data passes through US servers, review the case with your DPO.

Which Rollin VPS for Ollama?

Pro 10 (R$ 89.90/month) runs Llama 3 8B Q4. For Mistral 7B + simultaneous RAG, Pro 20.

What is the throughput of Llama 3 70B on RunPod?

On an A100 80GB, Llama 3 70B FP8 delivers ~80-150 tokens/second. On H100, rises to ~200-400 tok/s.

Can I embed with Ollama?

Yes. Ollama supports embedding models like nomic-embed-text and mxbai-embed-large.

Does Together.ai have chat playground?

Yes, Together.ai has web playground to test models before integrating via API.

How to set up Ollama on a Rollin VPS?

curl -fsSL https://ollama.com/install.sh | sh — then ollama pull llama3.1:8b-instruct-q4_K_M.

RunPod vs Vast.ai, which is best?

RunPod has more polished UX and official templates. Vast.ai is cheaper but with more operational friction. For teams without DevOps, RunPod.

Ready to host your open-source LLM with privacy?

VPS Cloud AMD EPYC + NVMe from R$ 89.90/month. Ollama, Llama 3, Mistral in minutes.

See VPS for LLM