
flash-attention pre-build wheels


This repository provides pre-built wheels for flash-attention.

Building flash-attention takes a very long time and is resource-intensive, so I also build and provide wheels for combinations of CUDA and PyTorch that are not officially distributed.

The GitHub Actions workflow that builds the wheels can be found here.
The built packages are available on the release page.

This repository uses a self-hosted runner and AWS CodeBuild for building the wheels. If you find this project helpful, please consider sponsoring to help maintain the infrastructure!


Special thanks to @KiralyCraft for providing the computing resources used to build wheels. Thank you!!

Install

  1. Select the versions for Python, CUDA, PyTorch, and flash_attn.
flash_attn-[flash_attn Version]+cu[CUDA Version]torch[PyTorch Version]-cp[Python Version]-cp[Python Version]-linux_x86_64.whl

# Example: Python 3.12, CUDA 12.4, PyTorch 2.5, and flash_attn 2.6.3
flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl
  2. Find the matching wheel on the Packages page or the releases page.

  3. Install it directly from the URL, or download it and install it locally.

# Direct Install
pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.0/flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl

# Download and Local Install
wget https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.0/flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl
pip install ./flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl
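
The naming pattern from step 1 can also be assembled programmatically. The following is a minimal sketch, not part of this repository: the helper names wheel_filename and wheel_url are hypothetical, and it assumes that the CUDA and Python versions are written in the filename with the dot removed, as in the example above.

```python
def wheel_filename(flash_attn: str, cuda: str, torch: str, python: str,
                   platform: str = "linux_x86_64") -> str:
    """Build a wheel filename that follows the pattern shown in step 1."""
    cuda_tag = cuda.replace(".", "")            # "12.4" -> "124"
    py_tag = "cp" + python.replace(".", "")     # "3.12" -> "cp312"
    return (
        f"flash_attn-{flash_attn}+cu{cuda_tag}torch{torch}"
        f"-{py_tag}-{py_tag}-{platform}.whl"
    )


def wheel_url(release_tag: str, filename: str) -> str:
    """Join a release tag and a wheel filename into a download URL."""
    base = "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download"
    return f"{base}/{release_tag}/{filename}"


name = wheel_filename("2.6.3", "12.4", "2.5", "3.12")
print(name)
print(wheel_url("v0.0.0", name))
```

The release tag (v0.0.0 here) is the placeholder used in the install commands above; replace it with the tag of the release that actually contains the wheel.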

Packages

Note

Since v0.5.0, wheels are built with a local version label that indicates the CUDA and PyTorch versions they were built against.
Example: pip list used to show flash_attn==2.8.3 and now shows flash_attn==2.8.3+cu130torch2.9

See ./docs/packages.md for the full list of available packages.
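
The local version label described in the note can be read back at runtime, for example to check which build is installed. This is a rough sketch under the assumption that the label always has the form +cu[digits]torch[version]; the function names are hypothetical and not provided by flash_attn or this repository.

```python
import re
from importlib import metadata

# Assumed label shape: "<base version>+cu<digits>torch<dotted version>"
LOCAL_LABEL = re.compile(
    r"^(?P<base>[^+]+)\+cu(?P<cuda>\d+)torch(?P<torch>\d+(?:\.\d+)*)$"
)


def split_local_version(version: str):
    """Return (base, cuda_digits, torch_version), or None if there is no label."""
    match = LOCAL_LABEL.match(version)
    if match is None:
        return None
    return match.group("base"), match.group("cuda"), match.group("torch")


def installed_flash_attn_version():
    """Return the installed flash_attn version string, or None if not installed."""
    try:
        return metadata.version("flash_attn")
    except metadata.PackageNotFoundError:
        return None


print(split_local_version("2.8.3+cu130torch2.9"))
print(split_local_version("2.8.3"))
```

Wheels built before v0.5.0 carry no label, so the function returns None for them rather than guessing.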

History

History of this repository is available here.

Citation

If you use this repository in your research and find it helpful, please cite this repository!

@misc{flash-attention-prebuild-wheels,
 author = {Morioka, Junya},
 year = {2025},
 title = {mjun0812/flash-attention-prebuild-wheels},
 url = {https://github.com/mjun0812/flash-attention-prebuild-wheels},
 howpublished = {https://github.com/mjun0812/flash-attention-prebuild-wheels},
}

Acknowledgments

Star History and Download Statistics

Star History Chart

Original Repository

flash-attention (https://github.com/Dao-AILab/flash-attention)

@inproceedings{dao2022flashattention,
  title={Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
  author={Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022}
}
@inproceedings{dao2023flashattention2,
  title={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
  author={Dao, Tri},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}

Self build

If you cannot find the version you are looking for, you can fork this repository and create a wheel on GitHub Actions.

  1. Fork this repository.
  2. Edit the Python script create_matrix.py to set the versions you want to build.
  3. Add a tag of the form v*.*.* (for example, v1.0.0) and push it to trigger the build workflow: git tag v1.0.0 && git push --tags

Note that some combinations of versions may not be buildable.
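
To illustrate what choosing versions for a build matrix involves, here is a hypothetical sketch. It is not the contents of create_matrix.py, whose actual structure is not shown in this README; the function name, the matrix keys, and the version lists are all assumptions made for illustration. It expands lists of versions into a GitHub Actions style matrix and skips pairs marked as unsupported.

```python
import itertools
import json


def build_matrix(flash_attn_versions, python_versions, torch_versions,
                 cuda_versions, unsupported=frozenset()):
    """Expand version lists into a matrix, skipping (torch, cuda) pairs to exclude."""
    include = []
    for fa, py, torch, cuda in itertools.product(
        flash_attn_versions, python_versions, torch_versions, cuda_versions
    ):
        if (torch, cuda) in unsupported:
            continue  # combination assumed not to build; skip it
        include.append({
            "flash-attn-version": fa,
            "python-version": py,
            "torch-version": torch,
            "cuda-version": cuda,
        })
    return {"include": include}


matrix = build_matrix(
    flash_attn_versions=["2.8.3"],
    python_versions=["3.12", "3.13"],
    torch_versions=["2.9"],
    cuda_versions=["12.8", "13.0"],
    # Illustrative exclusion only; not a statement about real compatibility.
    unsupported={("2.9", "12.8")},
)
print(json.dumps(matrix, indent=2))
```

Keeping exclusions explicit makes it easier to record which combinations were found not to build.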

Self-Hosted Runner Build

Some version combinations cannot be built on GitHub-hosted runners because of job time limits. To build wheels for these versions, you can use self-hosted runners.

Setup x86_64 Runner

Clone the repository and navigate to the self-hosted-runner directory.

git clone https://github.com/mjun0812/flash-attention-prebuild-wheels.git
cd flash-attention-prebuild-wheels/self-hosted-runner

Create the environment file from the template. The file must be named .env, because that is the name compose.yml references under env_file.

cp env.template .env

Edit the .env file to set the environment variables.

# Set either a GitHub Personal Access Token...
PERSONAL_ACCESS_TOKEN=[GitHub Personal Access Token]
# ...or a one-time registration token for the GitHub Actions runner
REGISTRY_TOKEN=[Runner Registry Token]

# Optional
RUNNER_LABELS=Linux,self-hosted
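
Before starting the container, it can help to confirm that the environment file contains at least one real token. The sketch below is a hypothetical helper, not part of this repository. It assumes plain KEY=VALUE lines and treats values still wrapped in square brackets as unfilled template placeholders; Docker Compose itself supports more syntax (such as quoting) than this parser handles.

```python
TOKEN_KEYS = ("PERSONAL_ACCESS_TOKEN", "REGISTRY_TOKEN")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def is_placeholder(value: str) -> bool:
    """Treat empty values and "[...]" template text as not filled in."""
    return not value or (value.startswith("[") and value.endswith("]"))


def has_token(values: dict) -> bool:
    """Return True if at least one token variable holds a real value."""
    return any(not is_placeholder(values.get(key, "")) for key in TOKEN_KEYS)


sample = """\
# Optional
RUNNER_LABELS=Linux,self-hosted
PERSONAL_ACCESS_TOKEN=[Github Personal Access Token]
REGISTRY_TOKEN=abc123
"""
print(has_token(parse_env(sample)))
```

To check a real file, pass its contents to the parser, for example parse_env(open(path).read()), where path is wherever the environment file was created.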

If you are using a fork of this repository, edit the compose.yml file so that REPOSITORY_URL points to your fork.

services:
  runner:
    privileged: true
    restart: always
    env_file:
      - .env
    environment:
      REPOSITORY_URL: https://github.com/[OWNER]/[REPOSITORY]
      RUNNER_NAME: self-hosted-runner
      RUNNER_GROUP: default
      TARGET_ARCH: x64
    build:
      context: .
      dockerfile: Dockerfile
      args:
        GH_RUNNER_VERSION: 2.329.0
        TARGET_ARCH: x64

Build and run the Docker container.

# Build and run
docker compose build runner
docker compose up -d runner

(Optional) Setup ARM64 Runner

If you also want to build wheels for ARM64 architecture, follow these additional steps.

Install qemu-user-static for ARM64 support.

sudo apt install qemu-user-static

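Create the environment file for the ARM64 runner. The file must be named .env.arm, because that is the name the runner-arm service references under env_file.

cp env.template .env.arm

Edit the .env.arm file with the same settings as the environment file for the x86_64 runner.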

Add the ARM64 runner service to your compose.yml file.

services:
  runner:
    # ... (existing x86_64 runner configuration)

  runner-arm:
    privileged: true
    restart: always
    env_file:
      - .env.arm
    environment:
      REPOSITORY_URL: https://github.com/[OWNER]/[REPOSITORY]
      RUNNER_NAME: self-hosted-runner-arm
      RUNNER_GROUP: default
      TARGET_ARCH: arm64
    build:
      context: .
      dockerfile: Dockerfile
      args:
        GH_RUNNER_VERSION: 2.329.0
        TARGET_ARCH: arm64
        PLATFORM: linux/arm64

Build and run the ARM64 runner container.

# Build and run the ARM64 runner
docker compose build runner-arm
docker compose up -d runner-arm

Getting a One-Time Registry Token for the GitHub Actions Runner

gh api \
  -X POST \
  /repos/[OWNER]/[REPOSITORY]/actions/runners/registration-token
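
The same endpoint can be called without the gh CLI. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes a Personal Access Token with permission to manage runners on the repository; the token field in the response is what goes into REGISTRY_TOKEN.

```python
import json
import urllib.request


def registration_token_request(owner: str, repo: str, pat: str) -> urllib.request.Request:
    """Build the POST request for the runner registration-token endpoint."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        "/actions/runners/registration-token"
    )
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {pat}",
        },
    )


def fetch_registration_token(owner: str, repo: str, pat: str) -> str:
    """Send the request and return the one-time token from the JSON response."""
    request = registration_token_request(owner, repo, pat)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["token"]


# Building the request does not touch the network; only fetch_* does.
req = registration_token_request("OWNER", "REPOSITORY", "dummy-token")
print(req.get_method(), req.full_url)
```

The token is short-lived, so request it right before starting the runner container.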