#ollama

4 posts

Tesla P40 in a Homelab: 24GB of Inference on a Budget

Running a Tesla P40 for LLM inference. Why I ditched GPU passthrough for host-level drivers to stop the constant Proxmox crashes.

Privacy-Routed LLM Inference: Keeping Sensitive Data Out of the Cloud

How to build a routing layer for AI agents that ensures sensitive data stays on local hardware while leveraging cloud LLMs for non-private tasks.

Three-Layer Safety for Autonomous Agents: Stopping the Infinite Loop

Moving beyond prompt engineering to implement token-level schema enforcement, pre-execution gates, and shell-safe execution pipelines for AI agents.

Ollama on Kubernetes: Recreate Strategy and Single-GPU Deadlock

Updating an Ollama Deployment on a single-GPU Kubernetes node can deadlock. Here's how the Recreate strategy avoids it.
