
Self-Hosting Ollama: Run AI Models Locally

Ollama lets you run large language models such as Llama, Mistral, and Gemma locally, so you can use AI without sending data to cloud APIs.

Tags: ollama, ai, llm, privacy

What Is Ollama?


Ollama makes it easy to run large language models locally. Download a model, run it, and interact via API or CLI.


Available Models


  • Llama 3 (8B, 70B) — Meta's open model
  • Mistral (7B) and Mixtral (8x7B)
  • Gemma (2B, 7B) — Google's open model
  • Phi-3 — Microsoft's small model
  • CodeLlama — code-focused
  • Qwen — Alibaba's model
  • And many more community models
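
Each of these can be pulled by name from the Ollama model library. A quick sketch (tags such as gemma:2b pick a specific size; check the library for current names):

bash

# pull a few models by name; a tag after ":" selects a particular size or variant
ollama pull llama3
ollama pull mistral
ollama pull gemma:2b
ollama pull phi3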

Getting Started


bash

ollama run llama3


That's it. Ollama downloads the model and starts a chat session.
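
Beyond a one-off chat, a few CLI commands cover day-to-day model management; a minimal sketch (model names are just examples):

bash

# download a model without starting a chat session
ollama pull mistral

# list models installed locally
ollama list

# remove a model you no longer need
ollama rm mistral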


API


Ollama provides an OpenAI-compatible API:

bash

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}'


Use Cases


  • Private AI assistant (no data leaves your server)
  • Code completion and review
  • Document summarization
  • Data extraction
  • Chatbots for internal tools
  • RAG (Retrieval-Augmented Generation)
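
For RAG specifically, Ollama can also generate embeddings locally. A minimal sketch using the /api/embeddings endpoint, assuming a dedicated embedding model such as nomic-embed-text from the Ollama library:

bash

# pull an embedding model (name is an example from the Ollama library)
ollama pull nomic-embed-text

# request an embedding vector for a piece of text
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "TinyPod runs Ollama locally"}'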

Hardware Requirements


  • 7B models: 8 GB RAM, runs on CPU
  • 13B models: 16 GB RAM
  • 70B models: 64 GB RAM or GPU
  • GPU recommended for speed (NVIDIA CUDA, Apple Metal)
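
If you are unsure whether a model fits your hardware, Ollama can report what is loaded and what a model contains; a small sketch:

bash

# show currently loaded models and how much memory each is using
ollama ps

# inspect a model's parameter count and quantization before committing to it
ollama show llama3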

Ollama + Lobe Chat / Open WebUI


Combine Ollama with a web UI:

1. Ollama runs the model

2. Lobe Chat or Open WebUI provides the ChatGPT-like interface

3. Everything runs locally (see the Docker sketch below)
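
A minimal sketch for running Open WebUI in Docker alongside a locally installed Ollama, using the image and port mapping from Open WebUI's own documentation (adjust ports and volume names to taste):

bash

# run Open WebUI on port 3000 and let it reach the host's Ollama at host.docker.internal:11434
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main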


Deployment


Deploy on TinyPod with sufficient RAM. CPU-only is fine for smaller models.
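
If other services (such as a web UI on another machine) need to reach Ollama, it has to listen beyond localhost. A minimal sketch using the OLLAMA_HOST environment variable, which the server reads at startup:

bash

# bind the Ollama server to all interfaces instead of 127.0.0.1 only
# keep this behind a firewall or reverse proxy; the API has no authentication of its own
OLLAMA_HOST=0.0.0.0:11434 ollama serve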


Ollama democratizes AI. Run capable models on commodity hardware with complete privacy.