Using llama-cpp on Spaces

How can I use llama-cpp-python in a basic (free) Hugging Face Space? Just adding it to requirements.txt leads to an endless build until it hits the time limit. I know the official repo has a Docker image, but it’s an old version.


The quickest way is to call subprocess.run from app.py, but I think the proper method is to build your own CPU wheel for your Space.


You are hitting the default failure mode for llama-cpp-python on HF “cpu-basic” Spaces: pip falls back to a source build (CMake + C++ compile of llama.cpp), and the build can run long enough to hit the Space build/start timeout. The project also does not consistently publish “normal” manylinux CPU wheels on PyPI, so “just add it to requirements.txt” often means “compile from source”. (GitHub)

The reliable fix is: do not compile inside the Space. Install a wheel.

Below are the practical options, in the order that usually works best on a free CPU Space.


Option 1 (fastest to try): Use the project’s CPU wheel index, but be aware of musl issues

llama-cpp-python documents a CPU wheel index here: https://abetlen.github.io/llama-cpp-python/whl/cpu. (PyPI)

requirements.txt (try)

--prefer-binary
--only-binary llama-cpp-python
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

llama-cpp-python
gradio
huggingface_hub

Why this helps:

  • --only-binary llama-cpp-python prevents pip from choosing the sdist and compiling.
  • --extra-index-url …/cpu gives pip a place to find wheels. (PyPI)

Big caveat:

  • There is a known problem where the CPU wheels depend on musl (libc.musl-x86_64.so.1), which is usually absent on Ubuntu/Debian (HF Gradio Spaces are typically glibc-based). That shows up exactly as an import/runtime error about libc.musl-x86_64.so.1. (GitHub)

If you see the musl error, stop here and switch to Option 2.
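
A quick way to tell which case you are in is a minimal import check (a sketch; drop it at the top of app.py or run it in the Space’s terminal):

# Confirms the installed wheel actually loads on this image.
# A musl-linked CPU wheel fails at this import with an OSError
# mentioning libc.musl-x86_64.so.1; a glibc/manylinux wheel imports fine.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)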


Option 2 (recommended): Use a glibc “manylinux” wheel you control, then install by URL in requirements.txt

This avoids:

  • long C++ builds on the Space
  • musl-linked wheels that fail on glibc
  • surprise dependency drift

Step A: Build a wheel on GitHub Actions (manylinux)

Use cibuildwheel to build Linux wheels inside a manylinux container. cibuildwheel runs auditwheel repair by default on Linux. (cibuildwheel.pypa.io)
If auditwheel repair causes problems for you, you can disable it (acceptable for a wheel used only on your own Space) or make it verbose. (cibuildwheel.pypa.io)

Minimal HF-focused workflow (CPU-only, avoid BLAS, disable repair to keep it simple):

name: build-cpu-wheel-hf

on:
  workflow_dispatch:
  push:
    tags: ["v*"]

jobs:
  build:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"   # runs cibuildwheel; targets can still be py3.10

      - run: |
          python -m pip install -U pip
          python -m pip install -U cibuildwheel

      - name: Build wheel (CPU only, no auditwheel repair)
        env:
          CIBW_BUILD: "cp310-manylinux_x86_64"
          CIBW_SKIP: "*-musllinux_*"
          CIBW_REPAIR_WHEEL_COMMAND_LINUX: ""   # disables default auditwheel repair
          CIBW_ENVIRONMENT_LINUX: >-
            CMAKE_ARGS="-DGGML_BLAS=OFF -DGGML_NATIVE=OFF"
        run: |
          python -m cibuildwheel --output-dir wheelhouse

      - uses: actions/upload-artifact@v4
        with:
          name: wheelhouse
          path: wheelhouse/*.whl

Notes:

  • GGML_NATIVE=OFF is important for portability: it stops the build from being tuned to the CI runner’s CPU, which can cause illegal-instruction crashes on other machines.
  • Start with GGML_BLAS=OFF to avoid extra runtime .so dependencies. If you later want OpenBLAS speed, see Option 3.

Step B: Publish the wheel (GitHub Releases)

The easiest route is to push a v… tag and let your workflow attach the wheel to the release (this needs the contents: write permission). If you prefer to do it manually, download the .whl from the Actions artifact and upload it to a Release yourself.
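
As a sketch of the automated route, you could append a release job like this to the jobs: section of the workflow above (softprops/action-gh-release is a common third-party action for this; swap in whatever release mechanism you prefer):

  release:
    needs: build
    # Only runs for tag pushes; needs permission to write releases.
    if: startsWith(github.ref, 'refs/tags/')
    permissions:
      contents: write
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: wheelhouse
          path: wheelhouse
      - uses: softprops/action-gh-release@v2
        with:
          files: wheelhouse/*.whl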

Step C: Install that wheel in your Space via URL

requirements.txt in the Space:

llama-cpp-python @ https://github.com/<you>/<repo>/releases/download/<tag>/<wheel>.whl

gradio
huggingface_hub

This makes the Space install instant because it skips compilation.

Step D: Pin the Python version in the README.md YAML block to match the wheel

HF Spaces default is Python 3.10. You can and should set it explicitly. (Hugging Face)

README.md top block:

---
sdk: gradio
python_version: 3.10
---

Match cp310 wheels with Python 3.10, cp311 wheels with 3.11, etc. (Hugging Face)


Option 3 (CPU performance): Enable OpenBLAS, and install the OS library with packages.txt

If your wheel is linked against OpenBLAS, you can get runtime errors like libopenblas.so.0: cannot open shared object file. That means the OS library is missing.

HF Spaces supports Debian packages via packages.txt. Each line is installed via apt-get install. (Hugging Face)

packages.txt

libopenblas0-pthread

Then build your wheel with BLAS enabled:

  • CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS …" (documented by the project). (PyPI)

Tradeoff:

  • Faster inference
  • More moving parts (OS libs must exist)
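
If you go this route, a small runtime check at the top of app.py makes a missing OS library fail fast with a clear message (a sketch using ctypes; the soname is the one from the error message above):

# Fail fast if the shared library the BLAS-enabled wheel expects is absent.
import ctypes

try:
    ctypes.CDLL("libopenblas.so.0")  # provided via packages.txt (libopenblas0-pthread)
except OSError as err:
    raise SystemExit("libopenblas.so.0 is missing - add it via packages.txt") from err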

Reduce “first start” pain: preload the GGUF repo and increase startup timeout

Even with a wheel, first run can be slow if you download a GGUF from the Hub during startup. Two Space features help:

  1. preload_from_hub downloads repos/files at build time. (Hugging Face)
  2. startup_duration_timeout can be raised from the default 30 minutes. (Hugging Face)

Example README.md YAML:

---
sdk: gradio
python_version: 3.10
startup_duration_timeout: 1h
preload_from_hub:
  - Qwen/Qwen2.5-0.5B-Instruct-GGUF
---
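
For completeness, here is a minimal app.py sketch that ties this together: the prebuilt wheel from Option 2 plus a GGUF that preload_from_hub has already placed in the build-time cache. The filename below is an assumption; check the actual file list in the Qwen GGUF repo, and tune n_ctx/n_threads for the cpu-basic tier.

# app.py - minimal Gradio chat app on a free CPU Space.
# Assumes llama-cpp-python is installed from a prebuilt wheel and the
# GGUF repo is preloaded via preload_from_hub, so the download call
# below resolves from the local cache instead of the network.
import gradio as gr
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

MODEL_REPO = "Qwen/Qwen2.5-0.5B-Instruct-GGUF"
MODEL_FILE = "qwen2.5-0.5b-instruct-q4_k_m.gguf"  # assumed filename; check the repo

model_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILE)

# Small context and two threads to stay within cpu-basic limits.
llm = Llama(model_path=model_path, n_ctx=2048, n_threads=2, verbose=False)


def respond(message, history):
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": message}],
        max_tokens=256,
    )
    return out["choices"][0]["message"]["content"]


gr.ChatInterface(respond, title="llama.cpp on a free CPU Space").launch()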

Avoid this (works sometimes, breaks often): pip-install inside app.py at runtime

You can call subprocess.run([sys.executable, "-m", "pip", ...]) at app startup, but it is fragile on Spaces. Users report “manual installs work until sleep/restart, then revert”. (Hugging Face Forums)
It also slows cold starts and does nothing about missing OS .so dependencies. (Hugging Face)

Use it only as a temporary debug hack, not as your deployment strategy.
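
If you do fall back to it while debugging, the pattern the first reply alludes to looks roughly like this (a sketch, not a recommendation; it re-runs on every cold start, and anything it installs can vanish after the Space sleeps):

# Fragile fallback: install the package at app startup.
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "--extra-index-url", "https://abetlen.github.io/llama-cpp-python/whl/cpu",
     "llama-cpp-python"],
    check=True,
)

import llama_cpp  # noqa: E402  (import only after the runtime install)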


What I would do for a free CPU Space

  1. Build a glibc wheel with GitHub Actions (Option 2).
  2. Install it by URL in requirements.txt.
  3. Keep GGML_BLAS=OFF initially. Add OpenBLAS later only if you need speed.
  4. Use preload_from_hub and startup_duration_timeout to avoid first-start failures. (Hugging Face)

High-quality references

  • HF docs on requirements.txt, pre-requirements.txt, and packages.txt. (Hugging Face)
  • HF Spaces config: python_version, startup_duration_timeout, preload_from_hub. (Hugging Face)
  • llama-cpp-python PyPI page showing the CPU wheel index usage. (PyPI)
  • Known musl-linked CPU wheel breakage (libc.musl-x86_64.so.1). (GitHub)
  • cibuildwheel default Linux repair command and how repair works. (cibuildwheel.pypa.io)
  • HF forum report that manual installs do not persist after sleep. (Hugging Face Forums)

Summary

  • Your build times out because pip compiles llama-cpp-python from source on the Space. (GitHub)
  • The clean fix is “install a wheel”, not “make the Space build faster”.
  • The project’s CPU wheel index can work, but can fail on glibc due to musl dependency. (GitHub)
  • Best practice on free CPU Spaces: build a manylinux wheel in GitHub Actions and install it by URL. (cibuildwheel.pypa.io)

Wheels (most are for CUDA environments; CPU ones often need to be built yourself)