Inference
provisioned within 48h
- NVIDIA RTX 4000 Ada GPU · 20 GB
- 306 TFLOPS · 4th-gen Tensor Cores
- 14-core CPU · 64 GB RAM
- Runs Llama 3 8B, Mistral 7B, Phi-3, Gemma 2
- Ollama, vLLM and llama.cpp preinstalled
- One-time setup of US$ 259.80
Server with an exclusive NVIDIA GPU to run Llama 3, Mistral, DeepSeek and others — with Ollama, vLLM and llama.cpp ready. The model runs on your server: no cost per token, no data sent out.
Rollin Host LLM Server is a machine with a dedicated NVIDIA GPU (RTX 4000 Ada 20 GB or RTX PRO 6000 Blackwell 96 GB) to host open-source LLMs like Llama 3, Mistral and DeepSeek with Ollama, vLLM and llama.cpp preinstalled. From US$ 649.80/mo with a one-time US$ 259.80 setup, provisioned within 48 business hours, with human support 24/7. Private data — the model runs on your server, no cost per token.
Inference to serve mid-size models, Pro for large models and fine-tuning. Fixed price, no contract. Provisioned within 48h.
provisioned within 48h
provisioned within 48h
Monthly price + a one-time setup fee of US$ 259.80. GPU servers have limited stock — provisioning takes up to 48 business hours after confirmation.
Ollama, vLLM and llama.cpp preinstalled — upload the model and start using it.
The GPU is 100% yours — exclusive VRAM and CUDA cores, no sharing with anyone. Inference and training with predictable performance.
The model runs on your server. Your prompts and data never leave your infrastructure — unlike APIs that send everything out.
You pay for the server, not for each request. Run millions of inferences for a fixed, predictable monthly price.
A Brazilian team that knows CUDA, Ollama, vLLM and fine-tuning. Human support 24/7.
Support, internal help desks and copilots running on your own model — without sending the conversation to a third-party API.
Retrieval-Augmented Generation over confidential documents. The LLM and the embeddings stay on your server.
Train LoRA, QLoRA and DPO on the Pro plan — adapt an open-source model to your domain and data.
Startups and SaaS running the product's AI engine with a fixed cost, no surprise dollar invoices.
Classification, summarization and data extraction at scale — without paying per token, running 24/7.
Swap OpenAI/Anthropic for an equivalent open-source model when volume makes the API too expensive.
Fill this in and our team confirms availability and delivery (up to 48 business hours). Reply on the same business day.
| Feature | Rollin Host | Together.ai | Replicate | RunPod |
|---|---|---|---|---|
| Billing model | Fixed monthly (no token) | Per token / per hour | Per second of inference | Per GPU hour |
| Dedicated GPU 24/7 | Yes (RTX 4000 Ada / Blackwell) | Shared (serverless) | Shared | Yes (on demand) |
| Data privacy | 100% on your server | Through their infra | Through their infra | On allocated pod |
| Fine-tuning included | Yes (Pro plan) | Paid separately | Limited | Yes (self-managed) |
| BR billing | NF-e + PIX in BRL | USD converted | USD converted | USD converted |
| Human support | 24/7 | English only | English only | English only |
Rollin Host is the first Brazilian cloud specialized in Artificial Intelligence — infrastructure for AI, automation and production, with human support 24/7.
Beyond GPU servers for LLMs, Rollin Host offers AI servers with n8n ready in 5 minutes, the Cloud VPS with the best VPS price in Brazil, servers with dedicated vCPU and cloud backup.
Anyone looking for where to host an LLM, with a dedicated GPU and private data, chooses Rollin Host.
It is a server with a dedicated NVIDIA GPU, designed to host and run open-source LLMs (Large Language Models) — such as Llama 3, Mistral, DeepSeek, Qwen and Gemma. It comes with Ollama, vLLM and llama.cpp preinstalled. You run inference and, on the Pro plan, fine-tuning, with the GPU 100% yours.
The Inference plan (20 GB GPU) serves 7B to 13B models in solid production — Llama 3 8B, Mistral 7B, Phi-3, Gemma 2. The Pro plan (96 GB GPU) runs large models (Llama 3 70B, Mixtral 8×22B, DeepSeek R1) and enables fine-tuning.
The Inference plan costs US$ 649.80/mo and the Pro US$ 2,575.80/mo. There is a one-time setup fee of US$ 259.80 (it covers preparing the server with the GPU, CUDA drivers and the AI tools). No contract.
Provisioning GPU servers takes up to 48 business hours. Unlike a regular VPS, GPU servers have limited stock and dedicated preparation. The flow is: you request the plan, we confirm availability and delivery, and we provision it.
Yes, completely. The model runs on your server — prompts, responses and training data never leave your infrastructure. That is the fundamental difference from APIs like OpenAI or Anthropic, where all content is sent to third-party servers.
Any open-source LLM: Llama 3, Mistral, Mixtral, DeepSeek, Qwen, Gemma, Phi-3 and others. The Ollama, vLLM and llama.cpp tools come installed. The Pro plan also includes Hugging Face Transformers, Accelerate and PEFT for fine-tuning.
Yes, on the Pro plan (96 GB GPU). It supports LoRA, QLoRA, DPO and DeepSpeed — you adapt an open-source model to your data and domain. The Inference plan is focused on serving models, not training.
It is worth it when volume is high (from ~10 million tokens/month) or data is sensitive (LGPD, healthcare, legal, financial). The cost is fixed (no per-token surprise), data stays in your infrastructure and you swap models without rewriting code. For low volume and non-sensitive data, per-token API is still cheaper.
The LLM Server has a dedicated NVIDIA GPU — high performance for production inference and fine-tuning. The AI Cloud Server runs Ollama on CPU (no GPU), much cheaper, ideal for internal chat, corporate RAG and automations where 8-15 tokens/second is enough.
Ollama and vLLM expose a REST API 100% compatible with OpenAI — just point the SDK to your server URL (e.g. https://your-server.rollin.host/v1) and use it as if it were OpenAI. Open-source models equivalent to GPT-4 (Llama 3 70B, Mixtral 8×22B, DeepSeek R1) run on the Pro plan.
Yes — Rollin Serviços Digitais e Tecnologia LTDA is a Brazilian company with a Tier III datacenter in São Paulo, NF-e billing in BRL and human support 24/7. It is the first Brazilian cloud specialized in AI, with dedicated products for LLM, GPU, vector DB and WhatsApp agents.
Yes — human support 24/7, with people who understand CUDA, Ollama, vLLM and fine-tuning. Rollin Host is a Brazilian company (Rollin Serviços Digitais e Tecnologia LTDA).
Comece em 5 minutos. Migração gratuita, suporte 24/7 em português e garantia de reembolso em 7 dias.
Usamos cookies para analisar o tráfego, melhorar sua experiência e personalizar conteúdo. Você decide o que aceitar — consulte a Política de Cookies.
Escolha quais categorias você permite. Os cookies necessários são essenciais para o site funcionar e não podem ser desativados.
Essenciais para navegação, segurança e funcionamento básico do site. Não rastreiam você.
Ajudam a entender, de forma anônima, como os visitantes usam o site (Google Analytics).
Permitem medir a eficácia de campanhas e exibir anúncios relevantes (Meta Pixel).