Technical comparison

Ollama, Together.ai or RunPod: which to choose for open-source LLM?

Hosting Llama 3, Mistral, Qwen and other open-source models became common strategy in 2026 — be it for cost, privacy or customization. The three main options: run Ollama on VPS/CPU, use a managed API like Together.ai, or rent GPU on-demand at RunPod.

TL;DR

Ollama on Rollin VPS (CPU) runs small quantized models (Llama 3 8B, Phi-3, Qwen 2.5 7B) with stable latency, no API queue, and fixed cost of R$ 169.90-439.00/month. Together.ai is a managed API with fast inference and per-token pricing, no ops. RunPod offers GPU on-demand for large models (70B+) or fine-tuning.

Comparison table

Feature	Ollama on Rollin VPS	Together.ai	RunPod
Service type	Self-hosted CPU	Managed API (serverless)	GPU on-demand (IaaS)
Hardware	AMD EPYC + NVMe (CPU)	Managed GPUs	RTX 4090 / A100 / H100
Supported models	Llama 3 8B, Mistral 7B, Qwen 2.5, Phi-3	Llama 3 70B, Mixtral, DeepSeek, +100	Any open-source model
Latency	Stable — no shared API queue	200-400ms (US/EU)	150-350ms (varies)
Entry cost	R$ 169.90/mo (Pro 4)	Pay-per-token	US$ 0.30-3.50/hr GPU
Cost per million tokens	Amortized in fixed	~US$ 0.20-0.90	Calculated per GPU hour
Large 70B+ models	Unfeasible on CPU	Natively supported	Yes, adequate GPU
Fine-tuning	Limited (slow CPU)	Yes, managed	Yes, full GPU control
Privacy (where is data?)	Your dedicated VPS — under your control (LGPD w/ safeguards)	Together servers (US, +DPA)	RunPod servers (US/EU)
Cold start	Zero	~1-5s (serverless)	30-120s (GPU boot)
Throughput tokens/sec	20-60 tok/s (8B on CPU)	50-200 tok/s	100-500 tok/s
Billing	Fixed in BRL (R$)	Per use in USD	Per hour in USD
Vendor lock-in	Zero (open-source)	Medium (proprietary API)	Low
Operation	You manage Ollama	Zero — just API	You set up Docker + container
Portuguese human support	Yes, 24/7 via Rollin	English only	English only

Ollama on Rollin VPS Pros

Predictable fixed BRL cost
Stable latency, no shared API queue
Data never leaves your VPS
Runs on same machine as n8n, EvolutionAPI, Qdrant
Zero cold start
24/7 Portuguese human support
Open-source: Llama, Mistral, Qwen, Phi

Together.ai Pros

Immediate access to 100+ models
Fast inference with managed GPU
Pay-per-token
No ops
Excellent for variable loads
Good documentation and SDKs
Supports managed fine-tuning

RunPod Pros

GPU on-demand: RTX 4090, A100, H100
Flexible hourly pricing
Runs any model
Ideal for fine-tuning
Serverless mode also available
Active community, Docker templates

Ollama on Rollin VPS Cons

CPU limits large models
Lower throughput than GPU
No auto-scaling
Serious fine-tuning needs GPU
You manage Ollama updates

Together.ai Cons

200-400ms latency from Brazil
USD billing with IOF
Data on Together servers — DPA for LGPD
At high stable volume, more expensive than self-hosted
No Portuguese support
Rate limits on popular models

RunPod Cons

30-120s cold start
USD hourly billing
You are responsible for Docker
Variable GPU availability
No Portuguese support
No BR region

When to choose each

Use Ollama on Rollin VPS if:

You run chat or RAG with models up to 13B in Portuguese, predictable volume. Critical privacy. Want stable latency with no shared API queue.

Use Together.ai if:

You need Llama 3 70B without investing in GPU. Variable loads — prototypes, peaks. Small team without DevOps.

Use RunPod if:

You will do fine-tuning. Need GPU for specific workloads. Want full environment control.

Use combination if:

Ollama on Rollin VPS for production + Together.ai for fallback on large models + RunPod for fine-tuning.

Verdict

For most Brazilian cases, Ollama on Rollin VPS delivers the best cost-benefit with total privacy and stable latency. Honestly, if you need 70B+ model, Together.ai is clearly superior. RunPod is the right tool for fine-tuning. Rollin Host does not offer dedicated GPU in 2026, so if your case is serious fine-tuning, use RunPod without guilt.

Frequently asked questions

Can I run Llama 3 8B on CPU?

Yes. With Q4 or Q5 quantization (GGUF), Llama 3 8B runs on a VPS with 8-16 GB RAM and AMD EPYC delivers 20-40 tokens/second.

How much does Together.ai cost in 2026?

Together.ai charges per token. Llama 3 8B costs ~US$ 0.20/M tokens, Llama 3 70B around US$ 0.90/M tokens.

Does RunPod have a Brazilian datacenter?

In 2026, RunPod has no Brazilian region. Most-used regions are US-East, US-West and EU.

Does Ollama support function calling?

Yes, since version 0.3+ Ollama supports tool/function calling with compatible models.

Can I fine-tune on Ollama?

Technically yes, but impractical on CPU. For serious fine-tuning, use RunPod with GPU.

Is Together.ai LGPD compliant?

Together.ai offers signable DPA. Since data passes through US servers, review the case with your DPO.

Which Rollin VPS for Ollama?

Pro 4 (R$ 169.90/month) runs Llama 3 8B Q4. For Mistral 7B + simultaneous RAG, Pro 6.

What is the throughput of Llama 3 70B on RunPod?

On an A100 80GB, Llama 3 70B FP8 delivers ~80-150 tokens/second. On H100, rises to ~200-400 tok/s.

Can I embed with Ollama?

Yes. Ollama supports embedding models like nomic-embed-text and mxbai-embed-large.

Does Together.ai have chat playground?

Yes, Together.ai has web playground to test models before integrating via API.

How to set up Ollama on a Rollin VPS?

curl -fsSL https://ollama.com/install.sh | sh — then ollama pull llama3.1:8b-instruct-q4_K_M.

RunPod vs Vast.ai, which is best?

RunPod has more polished UX and official templates. Vast.ai is cheaper but with more operational friction. For teams without DevOps, RunPod.

Ready to host your open-source LLM with privacy?

VPS Cloud AMD EPYC + NVMe from R$ 169.90/month. Ollama, Llama 3, Mistral in minutes.

See VPS for LLM