yindo/flash-attention-prebuild-wheels-rocm

Fork 0

mirror of https://github.com/BillyOutlast/flash-attention-prebuild-wheels-rocm.git synced 2026-07-01 01:27:54 -04:00

Files

T

Junya Morioka 10dd6d0baf update README.md

2025-05-18 02:45:41 +09:00

7.6 KiB

Raw Blame History

flash-attention pre-build wheels

This repository provides wheels for the pre-built flash-attention.

Since building flash-attention takes a very long time and is resource-intensive, I also build and provide combinations of CUDA and PyTorch that are not officially distributed.

The building Github Actions Workflow can be found here.
The built packages are available on the release page.

This repository uses a self-hosted runner for building the wheels. If you find this project helpful, please consider supporting or sponsoring to help maintain the infrastructure.

Install

Select the versions for Python, CUDA, PyTorch, and flash_attn.

flash_attn-[flash_attn Version]+cu[CUDA Version]torch[PyTorch Version]-cp[Python Version]-cp[Python Version]-linux_x86_64.whl

# Example: Python 3.11, CUDA 12.4, PyTorch 2.5, and flash_attn 2.6.3
flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl

Find the corresponding version of a wheel from the below table and releases
Direct Install or Download and Local Install

# Direct Install
pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.0/flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl

# Download and Local Install
wget https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.0/flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl
pip install ./flash_attn-2.6.3+cu124torch2.5-cp312-cp312-linux_x86_64.whl

Packages

v0.0.9

Release

Flash-Attention	Python	PyTorch	CUDA
2.4.3, 2.5.9, 2.6.3	3.10, 3.11, 3.12	2.7.0	12.8.1

v0.0.8

Release

Flash-Attention	Python	PyTorch	CUDA
2.4.3, 2.5.9, 2.6.3, 2.7.4.post1	3.10, 3.11, 3.12	2.4.1, 2.5.1, 2.6.0, 2.7.0	11.8.0, 12.4.1, 12.6.3

v0.0.7

Skip for experimental reasons.

v0.0.6

Release

Flash-Attention	Python	PyTorch	CUDA
2.4.3, 2.5.9, 2.6.3, 2.7.4.post1	3.10, 3.11, 3.12	2.2.2, 2.3.1, 2.4.1, 2.5.1, 2.6.0	12.4.1, 12.6.3

v0.0.5

Release

Flash-Attention	Python	PyTorch	CUDA
2.6.3, 2.7.4.post1	3.10, 3.11, 3.12	2.0.1, 2.1.2, 2.2.2, 2.3.1, 2.4.1, 2.5.1, 2.6.0	12.4.1, 12.6.3

v0.0.4

Release

Flash-Attention	Python	PyTorch	CUDA
2.7.3	3.10, 3.11, 3.12	2.0.1, 2.1.2, 2.2.2, 2.3.1, 2.4.1, 2.5.1	11.8.0, 12.1.1, 12.4.1

v0.0.3

Release

Flash-Attention	Python	PyTorch	CUDA
2.7.2.post1	3.10, 3.11, 3.12	2.0.1, 2.1.2, 2.2.2, 2.3.1, 2.4.1, 2.5.1	11.8.0, 12.1.1, 12.4.1

v0.0.2

Release

Flash-Attention	Python	PyTorch	CUDA
2.4.3, 2.5.6, 2.6.3, 2.7.0.post2	3.10, 3.11, 3.12	2.0.1, 2.1.2, 2.2.2, 2.3.1, 2.4.1, 2.5.1	11.8.0, 12.1.1, 12.4.1

v0.0.1

Release

flash-attention	Python	PyTorch	CUDA
1.0.9, 2.4.3, 2.5.6, 2.5.9, 2.6.3	3.10, 3.11, 3.12	2.0.1, 2.1.2, 2.2.2, 2.3.1, 2.4.1, 2.5.0	11.8.0, 12.1.1, 12.4.1

v0.0.0

Release

flash-attention	Python	PyTorch	CUDA
2.4.3, 2.5.6, 2.5.9, 2.6.3	3.11, 3.12	2.0.1, 2.1.2, 2.2.2, 2.3.1, 2.4.1, 2.5.0	11.8.0, 12.1.1, 12.4.1

Self build

If you want to build the wheels yourself, you can folk this repository and run the build workflow.

Fork this repository
Edit workflow file .github/workflows/build.yml to set the version you want to build.
Add tag v*.*.* to trigger the build workflow.

Self-Hosted Runner Build

In some version combinations, you cannot build wheels on GitHub-hosted runners due to job time limitations. To build the wheels for these versions, you can use self-hosted runners.

git clone https://github.com/mjun0812/flash-attention-prebuild-wheels.git
cd self-hosted-runner
cp env.template env

Edit env file to set the environment variables.

# Edit env
PERSONAL_ACCESS_TOKEN=[Github Personal Access Token]

Edit compose.yml file if you use repository folked from this repository.

services:
  runner:
    privileged: true
    build:
      context: .
      dockerfile: Dockerfile
      args:
        REPOSITORY_URL: [Target Repository URL]
        PERSONAL_ACCESS_TOKEN: $PERSONAL_ACCESS_TOKEN
        GH_RUNNER_VERSION: 2.324.0
        RUNNER_NAME: self-hosted-runner
        RUNNER_GROUP: default
        RUNNER_LABELS: self-hosted
        TARGET_ARCH: x64

Then, build and run the docker container.

# Build and run
docker compose build
docker compose up -d

Original Repository

repo

@inproceedings{dao2022flashattention,
  title={Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
  author={Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022}
}
@inproceedings{dao2023flashattention2,
  title={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
  author={Dao, Tri},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}

7.6 KiB Raw Blame History

flash-attention pre-build wheels

Install

Packages

v0.0.9

v0.0.8

v0.0.7

v0.0.6

v0.0.5

v0.0.4

v0.0.3

v0.0.2

v0.0.1

v0.0.0

Self build

Self-Hosted Runner Build

Original Repository

7.6 KiB

Raw Blame History