pyannote-audio-legacy/tutorials/community/offline_usage_speaker_diarization.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Offline Speaker Diarization (speaker-diarization-3.1)\n",
    "\n",
    "This notebook gives a short introduction to using the [speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) pipeline with local models.\n",
    "\n",
    "To use local models, you first need to download them from Hugging Face and place them in a local folder.\n",
    "Then you need to create a local config file, similar to the one on Hugging Face, but with local model paths.\n",
    "\n",
    "❗ **Naming of the model files is REALLY important! See end of notebook for details.** ❗\n",
    "\n",
    "## Get the models\n",
    "\n",
    "1. Install the `pyannote-audio` package: `!pip install pyannote.audio`\n",
    "2. Create a Hugging Face account: https://huggingface.co/join\n",
    "3. Accept the [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0) user conditions\n",
    "4. Create a local folder `models` and place the downloaded files there under the following names:\n",
    "   1. [wespeaker-voxceleb-resnet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM/blob/main/pytorch_model.bin), to be saved as `models/pyannote_model_wespeaker-voxceleb-resnet34-LM.bin`\n",
    "   2. [segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0/blob/main/pytorch_model.bin), to be saved as `models/pyannote_model_segmentation-3.0.bin`\n",
    "\n",
    "Running `ls models` should show the following files (approximate sizes in parentheses):\n",
    "```\n",
    "pyannote_model_segmentation-3.0.bin (5.7MB)\n",
    "pyannote_model_wespeaker-voxceleb-resnet34-LM.bin (26MB)\n",
    "```\n",
    "\n",
    "❗ **Make sure the 'wespeaker-voxceleb-resnet34-LM' model is named 'pyannote_model_wespeaker-voxceleb-resnet34-LM.bin'** ❗"
   ]
  },
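  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file checklist above can also be verified from Python. The following is a minimal sketch, assuming the `models` folder and the file names used in this notebook; `check_models` and `EXPECTED_FILES` are hypothetical names and not part of pyannote.\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "# File names this notebook expects inside the models folder.\n",
    "EXPECTED_FILES = [\n",
    "    \"pyannote_model_segmentation-3.0.bin\",\n",
    "    \"pyannote_model_wespeaker-voxceleb-resnet34-LM.bin\",\n",
    "]\n",
    "\n",
    "\n",
    "def check_models(models_dir=\"models\"):\n",
    "    # Hypothetical helper: return the expected files missing from models_dir.\n",
    "    models_dir = Path(models_dir)\n",
    "    return [name for name in EXPECTED_FILES if not (models_dir / name).is_file()]\n",
    "\n",
    "\n",
    "missing = check_models(\"models\")\n",
    "print(\"Missing model files:\", missing if missing else \"none\")\n",
    "```\n",
    "\n",
    "An empty list means both files are in place under the expected names."
   ]
  },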
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Config for local models\n",
    "\n",
    "Create a local config similar to the one on Hugging Face, [speaker-diarization-3.1/blob/main/config.yaml](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/config.yaml), but with local model paths.\n",
    "\n",
    "Contents of `models/pyannote_diarization_config.yaml`:\n",
    "\n",
    "```yaml\n",
    "version: 3.1.0\n",
    "\n",
    "pipeline:\n",
    "  name: pyannote.audio.pipelines.SpeakerDiarization\n",
    "  params:\n",
    "    clustering: AgglomerativeClustering\n",
    "    # embedding: pyannote/wespeaker-voxceleb-resnet34-LM  # if you want to use the HF model\n",
    "    embedding: models/pyannote_model_wespeaker-voxceleb-resnet34-LM.bin  # if you want to use the local model\n",
    "    embedding_batch_size: 32\n",
    "    embedding_exclude_overlap: true\n",
    "    # segmentation: pyannote/segmentation-3.0  # if you want to use the HF model\n",
    "    segmentation: models/pyannote_model_segmentation-3.0.bin  # if you want to use the local model\n",
    "    segmentation_batch_size: 32\n",
    "\n",
    "params:\n",
    "  clustering:\n",
    "    method: centroid\n",
    "    min_cluster_size: 12\n",
    "    threshold: 0.7045654963945799\n",
    "  segmentation:\n",
    "    min_duration_off: 0.0\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the local pipeline\n",
    "\n",
    "**Hint**: The paths in the config are relative to the current working directory, not relative to the config file.\n",
    "If you want to start your notebook/script from a different directory, you can use `os.chdir` temporarily, to 'emulate' config-relative paths.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "from pyannote.audio import Pipeline\n",
    "\n",
    "\n",
    "def load_pipeline_from_pretrained(path_to_config: str | Path) -> Pipeline:\n",
    "    # resolve to an absolute path so it stays valid after changing directory\n",
    "    path_to_config = Path(path_to_config).resolve()\n",
    "\n",
    "    print(f\"Loading pyannote pipeline from {path_to_config}...\")\n",
    "    # the paths in the config are relative to the current working directory\n",
    "    # so we need to change the working directory to the model path\n",
    "    # and then change it back\n",
    "\n",
    "    cwd = Path.cwd().resolve()  # store current working directory\n",
    "\n",
    "    # first .parent is the folder of the config, second .parent is the folder containing the 'models' folder\n",
    "    cd_to = path_to_config.parent.parent\n",
    "\n",
    "    print(f\"Changing working directory to {cd_to}\")\n",
    "    os.chdir(cd_to)\n",
    "    try:\n",
    "        pipeline = Pipeline.from_pretrained(path_to_config)\n",
    "    finally:\n",
    "        # restore the working directory even if loading fails\n",
    "        print(f\"Changing working directory back to {cwd}\")\n",
    "        os.chdir(cwd)\n",
    "\n",
    "    return pipeline\n",
    "\n",
    "PATH_TO_CONFIG = \"path/to/your/pyannote_diarization_config.yaml\"\n",
    "pipeline = load_pipeline_from_pretrained(PATH_TO_CONFIG)"
   ]
  },
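  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The temporary directory change in the loader above can also be written as a context manager, which restores the original directory even when loading raises. This is a minimal sketch; `working_directory` is a hypothetical helper name, not a pyannote API. On Python 3.11 and later, the standard library's `contextlib.chdir` provides the same behavior.\n",
    "\n",
    "```python\n",
    "import os\n",
    "from contextlib import contextmanager\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "@contextmanager\n",
    "def working_directory(path):\n",
    "    # Hypothetical helper: switch to `path`, then restore the previous directory.\n",
    "    previous = Path.cwd()\n",
    "    os.chdir(path)\n",
    "    try:\n",
    "        yield\n",
    "    finally:\n",
    "        os.chdir(previous)  # runs even if the body raises\n",
    "\n",
    "\n",
    "# Usage sketch (assumes the Pipeline import and PATH_TO_CONFIG from the cell above):\n",
    "# config = Path(PATH_TO_CONFIG).resolve()\n",
    "# with working_directory(config.parent.parent):\n",
    "#     pipeline = Pipeline.from_pretrained(config)\n",
    "```\n",
    "\n",
    "Resolving the config path before entering the block keeps it valid after the directory change."
   ]
  },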
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notes on file naming (pyannote-audio 3.1.1)\n",
    "\n",
    "Pyannote uses some internal logic to determine the model type.\n",
    "\n",
    "The function `PretrainedSpeakerEmbedding(...)` in [speaker_verification.py](https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/pipelines/speaker_verification.py#L712) uses the file path of the model to infer the model type.\n",
    "\n",
    "```python\n",
    "def PretrainedSpeakerEmbedding(\n",
    "    embedding: PipelineModel,\n",
    "    device: torch.device = None,\n",
    "    use_auth_token: Union[Text, None] = None,\n",
    "):\n",
    "    #...\n",
    "    if isinstance(embedding, str) and \"pyannote\" in embedding:\n",
    "        return PyannoteAudioPretrainedSpeakerEmbedding(\n",
    "            embedding, device=device, use_auth_token=use_auth_token\n",
    "        )\n",
    "\n",
    "    elif isinstance(embedding, str) and \"speechbrain\" in embedding:\n",
    "        return SpeechBrainPretrainedSpeakerEmbedding(\n",
    "            embedding, device=device, use_auth_token=use_auth_token\n",
    "        )\n",
    "\n",
    "    elif isinstance(embedding, str) and \"nvidia\" in embedding:\n",
    "        return NeMoPretrainedSpeakerEmbedding(embedding, device=device)\n",
    "\n",
    "    elif isinstance(embedding, str) and \"wespeaker\" in embedding:\n",
    "        return ONNXWeSpeakerPretrainedSpeakerEmbedding(embedding, device=device)  # <-- this is called, but the wespeaker-voxceleb-resnet34-LM is not an ONNX model\n",
    "\n",
    "    else:\n",
    "        # fallback to pyannote in case we are loading a local model\n",
    "        return PyannoteAudioPretrainedSpeakerEmbedding(\n",
    "            embedding, device=device, use_auth_token=use_auth_token\n",
    "        )\n",
    "```\n",
    "\n",
    "The [wespeaker-voxceleb-resnet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM/blob/main/pytorch_model.bin) model is not an ONNX model, but a `PyannoteAudioPretrainedSpeakerEmbedding`. So if `wespeaker` is in the file name, the code will infer the model type incorrectly. If `pyannote` is somewhere in the file name, the model type will be inferred correctly, as the first if statement will be true..."
   ]
  }
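  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of the check order can be reproduced without loading any model. The sketch below mirrors only the substring tests from the excerpt above; `infer_embedding_backend` and the returned labels are hypothetical names for illustration, and the actual function returns embedding objects rather than strings.\n",
    "\n",
    "```python\n",
    "def infer_embedding_backend(embedding: str) -> str:\n",
    "    # Same order of substring checks as in the excerpt above.\n",
    "    if \"pyannote\" in embedding:\n",
    "        return \"pyannote\"\n",
    "    elif \"speechbrain\" in embedding:\n",
    "        return \"speechbrain\"\n",
    "    elif \"nvidia\" in embedding:\n",
    "        return \"nemo\"\n",
    "    elif \"wespeaker\" in embedding:\n",
    "        return \"onnx-wespeaker\"\n",
    "    return \"pyannote\"  # fallback branch for other local paths\n",
    "\n",
    "\n",
    "# File name recommended in this notebook: the first check matches.\n",
    "print(infer_embedding_backend(\"models/pyannote_model_wespeaker-voxceleb-resnet34-LM.bin\"))  # pyannote\n",
    "\n",
    "# Same model without the 'pyannote' prefix: the ONNX branch is chosen.\n",
    "print(infer_embedding_backend(\"models/wespeaker-voxceleb-resnet34-LM.bin\"))  # onnx-wespeaker\n",
    "```\n",
    "\n",
    "Because the test is applied to the whole path string, a folder name containing `pyannote` would satisfy the first check as well."
   ]
  }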
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}