how can I use llama-cpp-python in a basic (free) Hugging Face Space? Just adding it to requirements.txt leads to infinite building until it reaches the time limit. I know the official repo has a docker image, but it’s an old version.
The quickest way is to pip-install it with subprocess.run from app.py, but the proper method is to build your own CPU wheel for your Space.
You are hitting the default failure mode for llama-cpp-python on HF “cpu-basic” Spaces: pip falls back to a source build (CMake + C++ compile of llama.cpp), and the build can run long enough to hit the Space build/start timeout. The project also does not consistently publish “normal” manylinux CPU wheels on PyPI, so “just add it to requirements.txt” often means “compile from source”. (GitHub)
The reliable fix is: do not compile inside the Space. Install a wheel.
Below are the practical options, in the order that usually works best on a free CPU Space.
Option 1 (fastest to try): Use the project’s CPU wheel index, but be aware of musl issues
llama-cpp-python documents a CPU wheel index here: https://abetlen.github.io/llama-cpp-python/whl/cpu. (PyPI)
requirements.txt (try)
--prefer-binary
--only-binary llama-cpp-python
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
llama-cpp-python
gradio
huggingface_hub
Why this helps:
- --only-binary llama-cpp-python prevents pip from choosing the sdist and compiling.
- --extra-index-url …/cpu gives pip a place to find wheels. (PyPI)
Big caveat:
- There is a known problem where the CPU wheels depend on musl (libc.musl-x86_64.so.1), which is usually absent on Ubuntu/Debian (HF Gradio Spaces are typically glibc-based). That shows up exactly as an import/runtime error about libc.musl-x86_64.so.1. (GitHub)
If you see the musl error, stop here and switch to Option 2.
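To tell up front whether your container is glibc-based (and therefore incompatible with musl-linked wheels), a small stdlib probe helps; this is a diagnostic sketch, and llama_cpp is only imported if it happens to be installed:

```python
import platform

# platform.libc_ver() reports the C library the running Python binary is
# linked against; on the Debian-based images Spaces use, this is glibc.
libc, libc_version = platform.libc_ver()
print(f"libc: {libc or 'unknown'} {libc_version}")

# A musl-linked wheel fails at import time on glibc with an error
# mentioning libc.musl-x86_64.so.1; surfacing it makes the diagnosis obvious.
try:
    import llama_cpp
    print(f"llama_cpp imported OK ({llama_cpp.__version__})")
except (ImportError, OSError) as e:
    # Covers both "not installed" and the musl shared-library failure.
    print(f"llama_cpp import failed: {e}")
```

Run this once in the Space's app.py (or a scratch Space) before debugging anything else: if the first line says glibc and the import error mentions musl, you have confirmed the Option 1 caveat.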
Option 2 (recommended): Use a glibc “manylinux” wheel you control, then install by URL in requirements.txt
This avoids:
- long C++ builds on the Space
- musl-linked wheels that fail on glibc
- surprise dependency drift
Step A: Build a wheel on GitHub Actions (manylinux)
Use cibuildwheel to build Linux wheels inside a manylinux container. cibuildwheel runs auditwheel repair by default on Linux. (cibuildwheel.pypa.io)
If auditwheel repair causes problems for you, you can disable repair (HF-only wheel) or make it verbose. (cibuildwheel.pypa.io)
Minimal HF-focused workflow (CPU-only, avoid BLAS, disable repair to keep it simple):
name: build-cpu-wheel-hf
on:
  workflow_dispatch:
  push:
    tags: ["v*"]
jobs:
  build:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"  # runs cibuildwheel; targets can still be py3.10
      - run: |
          python -m pip install -U pip
          python -m pip install -U cibuildwheel
      - name: Build wheel (CPU only, no auditwheel repair)
        env:
          CIBW_BUILD: "cp310-manylinux_x86_64"
          CIBW_SKIP: "*-musllinux_*"
          CIBW_REPAIR_WHEEL_COMMAND_LINUX: ""  # disables default auditwheel repair
          CIBW_ENVIRONMENT_LINUX: >-
            CMAKE_ARGS="-DGGML_BLAS=OFF -DGGML_NATIVE=OFF"
        run: |
          python -m cibuildwheel --output-dir wheelhouse
      - uses: actions/upload-artifact@v4
        with:
          name: wheelhouse
          path: wheelhouse/*.whl
Notes:
- GGML_NATIVE=OFF is important for portability. It avoids "built for the CI CPU only" behavior.
- Start with GGML_BLAS=OFF to avoid extra runtime .so dependencies. If you later want OpenBLAS speed, see Option 3.
Step B: Publish the wheel (GitHub Releases)
The easiest route is to push a tag v… and let your workflow attach the wheel to the release (this needs contents: write permissions). If you prefer manual, you can upload the .whl from the Actions artifact to a Release.
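One way to automate the attach step is shown below. This is a sketch: the third-party softprops/action-gh-release action is an assumption, and `gh release upload` from a plain run step or a fully manual upload work just as well.

```yaml
# Appended to the build job above, plus job-level permissions.
permissions:
  contents: write   # lets the workflow create/update the release
steps:
  # ... checkout / build steps from above ...
  - name: Attach wheel to the tagged release
    uses: softprops/action-gh-release@v2
    if: startsWith(github.ref, 'refs/tags/')
    with:
      files: wheelhouse/*.whl
```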
Step C: Install that wheel in your Space via URL
requirements.txt in the Space:
llama-cpp-python @ https://github.com/<you>/<repo>/releases/download/<tag>/<wheel>.whl
gradio
huggingface_hub
This makes the Space install instant because it skips compilation.
Step D: Pin the Python version in the README.md YAML block to match the wheel
HF Spaces default is Python 3.10. You can and should set it explicitly. (Hugging Face)
README.md top block:
---
sdk: gradio
python_version: 3.10
---
Match cp310 wheels with Python 3.10, cp311 wheels with 3.11, etc. (Hugging Face)
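If you are unsure which cpXY tag your runtime needs, printing it from the interpreter itself removes the guesswork:

```python
import sys

# The "cpXY" part of a wheel filename encodes the CPython minor version,
# e.g. cp310 for Python 3.10; this must match the Space's python_version.
wheel_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
print(wheel_tag)
```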
Option 3 (CPU performance): Enable OpenBLAS, and install the OS library with packages.txt
If your wheel is linked against OpenBLAS, you can get runtime errors like libopenblas.so.0: cannot open shared object file. That means the OS library is missing.
HF Spaces supports Debian packages via packages.txt. Each line is installed via apt-get install. (Hugging Face)
packages.txt
libopenblas0-pthread
Then build your wheel with BLAS enabled:
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS …" (documented by the project). (PyPI)
Tradeoff:
- Faster inference
- More moving parts (OS libs must exist)
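To confirm the library from packages.txt is actually visible to the dynamic linker before your BLAS-enabled wheel tries to load it, a quick ctypes check works; a diagnostic sketch:

```python
import ctypes
import ctypes.util

# find_library consults the dynamic linker (ldconfig on Linux), so it
# reflects what the packages.txt apt-get install actually made available.
name = ctypes.util.find_library("openblas")
if name is None:
    print("libopenblas not found - add it to packages.txt")
else:
    ctypes.CDLL(name)  # raises OSError if the library exists but cannot load
    print(f"OpenBLAS available as {name}")
```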
Reduce “first start” pain: preload the GGUF repo and increase startup timeout
Even with a wheel, first run can be slow if you download a GGUF from the Hub during startup. Two Space features help:
- preload_from_hub downloads repos/files at build time. (Hugging Face)
- startup_duration_timeout can be raised from the default 30 minutes. (Hugging Face)
Example README.md YAML:
---
sdk: gradio
python_version: 3.10
startup_duration_timeout: 1h
preload_from_hub:
- Qwen/Qwen2.5-0.5B-Instruct-GGUF
---
Avoid this (works sometimes, breaks often): pip-install inside app.py at runtime
You can call subprocess.run([sys.executable, "-m", "pip", ...]) at app startup, but it is fragile on Spaces. Users report “manual installs work until sleep/restart, then revert”. (Hugging Face Forums)
It also slows cold starts and does not solve missing OS .so dependencies. (Hugging Face)
Use it only as a temporary debug hack, not as your deployment strategy.
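If you do resort to it while debugging, at least make the failure loud instead of letting a later import mysteriously fail. A minimal sketch (pip_install is a hypothetical helper name):

```python
import subprocess
import sys

def pip_install(package: str) -> bool:
    """Debug-only runtime install. On Spaces this does not persist across
    sleep/restart, so treat any use of it as temporary scaffolding."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "--prefer-binary", package],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface the real pip error instead of a later ImportError.
        print(result.stderr, file=sys.stderr)
    return result.returncode == 0
```

Call it once at the top of app.py, before importing llama_cpp, and bail out with a clear message if it returns False.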
What I would do for a free CPU Space
- Build a glibc wheel with GitHub Actions (Option 2).
- Install it by URL in requirements.txt.
- Keep GGML_BLAS=OFF initially. Add OpenBLAS later only if you need speed.
- Use preload_from_hub and startup_duration_timeout to avoid first-start failures. (Hugging Face)
High-quality references
- HF docs on requirements.txt, pre-requirements.txt, and packages.txt. (Hugging Face)
- HF Spaces config: python_version, startup_duration_timeout, preload_from_hub. (Hugging Face)
- llama-cpp-python PyPI page showing the CPU wheel index usage. (PyPI)
- Known musl-linked CPU wheel breakage (libc.musl-x86_64.so.1). (GitHub)
- cibuildwheel default Linux repair command and how repair works. (cibuildwheel.pypa.io)
- HF forum report that manual installs do not persist after sleep. (Hugging Face Forums)
Summary
- Your build times out because pip compiles llama-cpp-python from source on the Space. (GitHub)
- The clean fix is "install a wheel", not "make the Space build faster".
- The project's CPU wheel index can work, but can fail on glibc due to a musl dependency. (GitHub)
- Best practice on free CPU Spaces: build a manylinux wheel in GitHub Actions and install it by URL. (cibuildwheel.pypa.io)