Multi-modal researcher

2026-07-01 19:55:06 -04:00 · 2025-06-24 18:12:43 -07:00
commit 65ddb019d1
8 changed files with 658 additions and 0 deletions
@@ -0,0 +1 @@
+# GEMINI_API_KEY=
@@ -0,0 +1,152 @@
+# Multi-Modal Research Agent
+
+This project is a simple research and podcast generation system built on Google's Gemini 2.5 model family. You pass a research topic and, optionally, a YouTube video URL. The system then researches the topic using search, analyzes the video, combines the insights, and generates a report with citations as well as a short podcast on the topic. It takes advantage of three of Gemini's native capabilities:
+
+- 🎥 [Video understanding and native YouTube tool](https://developers.googleblog.com/en/gemini-2-5-video-understanding/): Integrated processing of YouTube videos
+- 🔍 [Google search tool](https://developers.googleblog.com/en/gemini-2-5-thinking-model-updates/): Native Google Search tool integration with real-time web results
+- 🎙️ [Multi-speaker text-to-speech](https://ai.google.dev/gemini-api/docs/speech-generation): Generate natural conversations with distinct speaker voices
+
+## Quick Start
+
+### Prerequisites
+
+- Python 3.11+
+- [uv](https://docs.astral.sh/uv/) package manager
+- Google Gemini API key
+
+### Setup
+
+1. **Clone and navigate to the project**:
+```bash
+git clone <repository-url>
+cd multi-modal-researcher
+```
+
+2. **Set up environment variables**:
+```bash
+cp .env.example .env
+```
+Edit `.env` and add your Google Gemini API key:
+```bash
+GEMINI_API_KEY=your_api_key_here
+```
+
+3. **Run the development server**:
+
+```bash
+# Install uv package manager
+curl -LsSf https://astral.sh/uv/install.sh | sh
+# Install dependencies and start the LangGraph server
+uvx --refresh --from "langgraph-cli[inmem]" --with-editable . --python 3.11 langgraph dev --allow-blocking
+```
+
+4. **Access the application**:
+
+LangGraph will open in your browser.
+
+```bash
+╦  ┌─┐┌┐┌┌─┐╔═╗┬─┐┌─┐┌─┐┬ ┬
+║  ├─┤││││ ┬║ ╦├┬┘├─┤├─┘├─┤
+╩═╝┴ ┴┘└┘└─┘╚═╝┴└─┴ ┴┴  ┴ ┴
+
+- 🚀 API: http://127.0.0.1:2024
+- 🎨 Studio UI: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024
+- 📚 API Docs: http://127.0.0.1:2024/docs
+```
+
+5. Pass a `topic` and optionally a `video_url`.
+
+## Architecture
+
+The system implements a LangGraph workflow with the following nodes:
+
+1. **Search Research Node**: Performs web search using Gemini's Google Search integration
+2. **Analyze Video Node**: Analyzes YouTube videos when provided (conditional)
+3. **Create Report Node**: Synthesizes findings into a comprehensive markdown report
+4. **Create Podcast Node**: Generates a 2-speaker podcast discussion with TTS audio
+
+### Workflow
+
+```
+START → search_research → [analyze_video?] → create_report → create_podcast → END
+```
+
+The workflow runs video analysis only when a YouTube URL is provided; otherwise it proceeds directly from search to report generation.
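
The routing described above can be sketched in plain Python, without LangGraph. This is a minimal illustration, not the project's graph definition: `SketchState` and `planned_path` are hypothetical names introduced here, and only `should_analyze_video` mirrors a function in the source.

```python
from typing import List, Optional, TypedDict


class SketchState(TypedDict, total=False):
    """Hypothetical, trimmed-down state holding only the input fields."""

    topic: str
    video_url: Optional[str]


def should_analyze_video(state: SketchState) -> str:
    """Return the name of the node that follows search_research."""
    return "analyze_video" if state.get("video_url") else "create_report"


def planned_path(state: SketchState) -> List[str]:
    """Hypothetical helper: list the nodes a run would visit, in order."""
    path = ["search_research"]
    if should_analyze_video(state) == "analyze_video":
        path.append("analyze_video")
    path += ["create_report", "create_podcast"]
    return path
```

Calling `planned_path({"topic": "solar sails"})` skips `analyze_video`, while adding a non-empty `video_url` to the state includes it between search and report creation.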
+
+### Output
+
+The system generates:
+
+- **Research Report**: Comprehensive markdown report with executive summary and sources
+- **Podcast Script**: Natural dialogue between Dr. Sarah (expert) and Mike (interviewer)  
+- **Audio File**: Multi-speaker TTS audio file (`research_podcast_*.wav`)
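
The `*` in `research_podcast_*.wav` is derived from the topic. The sketch below reproduces the sanitizing logic used in the podcast node; the function name `podcast_filename` is hypothetical and introduced only for illustration.

```python
def podcast_filename(topic: str) -> str:
    """Build a WAV filename from a topic, as the podcast node appears to do."""
    # Keep letters, digits, spaces, hyphens, and underscores; drop everything else.
    safe_topic = "".join(
        c for c in topic if c.isalnum() or c in (" ", "-", "_")
    ).rstrip()
    # Spaces become underscores so the name is shell-friendly.
    return "research_podcast_" + safe_topic.replace(" ", "_") + ".wav"
```

Note that removed punctuation does not collapse the surrounding spaces, so a topic such as `AI & robots` yields a double underscore in the filename.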
+
+## Configuration
+
+The system supports runtime configuration through the `Configuration` class:
+
+### Model Settings
+- `search_model`: Model for web search (default: "gemini-2.5-flash")
+- `synthesis_model`: Model for report synthesis (default: "gemini-2.5-flash")
+- `video_model`: Model for video analysis (default: "gemini-2.5-flash")
+- `tts_model`: Model for text-to-speech (default: "gemini-2.5-flash-preview-tts")
+
+### Temperature Settings
+- `search_temperature`: Factual search queries (default: 0.0)
+- `synthesis_temperature`: Balanced synthesis (default: 0.3)
+- `podcast_script_temperature`: Creative dialogue (default: 0.4)
+
+### TTS Settings
+- `mike_voice`: Voice for interviewer (default: "Kore")
+- `sarah_voice`: Voice for expert (default: "Puck")
+- Audio format settings for output quality
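
As a rough sketch of how these settings might be resolved at runtime, the example below assumes the precedence seen in the project's `Configuration` class: an environment variable named after the upper-cased field, then the `configurable` mapping, then the dataclass default. `SketchConfiguration` and `from_config` are hypothetical names, and the type coercion and explicit `None` check are additions of this sketch, not behavior confirmed by the source.

```python
import os
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class SketchConfiguration:
    """Hypothetical subset of the runtime settings."""

    search_model: str = "gemini-2.5-flash"
    synthesis_temperature: float = 0.3
    mike_voice: str = "Kore"

    @classmethod
    def from_config(cls, config: Optional[dict] = None) -> "SketchConfiguration":
        configurable = (config or {}).get("configurable", {})
        values: Dict[str, Any] = {}
        for f in fields(cls):
            # Environment variable wins over the configurable mapping.
            raw = os.environ.get(f.name.upper(), configurable.get(f.name))
            if raw is None:
                continue  # fall back to the dataclass default
            # Environment values arrive as strings, so coerce to the default's type.
            values[f.name] = type(f.default)(raw)
        return cls(**values)
```

For example, `SketchConfiguration.from_config({"configurable": {"mike_voice": "Puck"}})` overrides only the interviewer voice and leaves the other fields at their defaults.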
+
+## Project Structure
+
+```
+├── src/agent/
+│   ├── state.py           # State definitions (input/output schemas)
+│   ├── configuration.py   # Runtime configuration class
+│   ├── utils.py          # Utility functions (TTS, report generation)
+│   └── graph.py          # LangGraph workflow definition
+├── langgraph.json        # LangGraph deployment configuration
+├── pyproject.toml        # Python package configuration
+└── .env                  # Environment variables
+```
+
+## Key Components
+
+### State Management
+
+- **ResearchStateInput**: Input schema (topic, optional video_url)
+- **ResearchStateOutput**: Output schema (report, podcast_script, podcast_filename)
+- **ResearchState**: Complete state including intermediate results
+
+### Utility Functions
+
+- **display_gemini_response()**: Processes Gemini responses with grounding metadata
+- **create_podcast_discussion()**: Generates scripted dialogue and TTS audio
+- **create_research_report()**: Synthesizes multi-modal research into reports
+- **wave_file()**: Saves audio data to WAV format
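
To make the audio step concrete, here is a small round trip using only the standard library. `wave_file` follows the signature in `utils.py`; `describe_wav`, the silent test buffer, and the temporary path are assumptions added for illustration.

```python
import os
import tempfile
import wave


def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
    """Save raw PCM bytes to a WAV file."""
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)


def describe_wav(filename):
    """Hypothetical helper: return (channels, sample rate, frame count)."""
    with wave.open(filename, "rb") as wf:
        return wf.getnchannels(), wf.getframerate(), wf.getnframes()


# Half a second of 16-bit mono silence at 24 kHz: 12000 frames of 2 bytes each.
silence = b"\x00\x00" * 12000
path = os.path.join(tempfile.mkdtemp(), "sketch.wav")
wave_file(path, silence)
```

The defaults (mono, 24000 Hz, 2 bytes per sample) match the `tts_channels`, `tts_rate`, and `tts_sample_width` values in the configuration class.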
+
+## Deployment
+
+The application is configured for deployment on:
+
+- **Local Development**: Using LangGraph CLI with in-memory storage
+- **LangGraph Platform**: Production deployment with persistent storage
+- **Self-Hosted**: Using Docker containers
+
+## Dependencies
+
+Core dependencies managed via `pyproject.toml`:
+
+- `langgraph>=0.2.6` - Workflow orchestration
+- `google-genai` - Gemini API client
+- `langchain>=0.3.19` - LangChain integrations
+- `rich` - Enhanced terminal output
+- `python-dotenv` - Environment management
+
+## License
+
+MIT License - see LICENSE file for details.
@@ -0,0 +1,9 @@
+{
+    "dependencies": [
+        "."
+    ],
+    "graphs": {
+        "research_agent": "agent.graph:create_compiled_graph"
+    },
+    "env": ".env"
+}
@@ -0,0 +1,60 @@
+[project]
+name = "agent"
+version = "0.0.1"
+description = "Multi-modal researcher with Gemini"
+authors = [
+    { name = "Lance Martin", email = "lance@langchain.dev" },
+]
+readme = "README.md"
+license = { text = "MIT" }
+requires-python = ">=3.11,<4.0"
+dependencies = [
+    "langgraph>=0.2.6",
+    "langchain>=0.3.19",
+    "langchain-google-genai",
+    "python-dotenv>=1.0.1",
+    "langgraph-sdk>=0.1.57",
+    "langgraph-cli",
+    "langgraph-api",
+    "fastapi",
+    "google-genai",
+    "rich",
+]
+
+
+[project.optional-dependencies]
+dev = ["mypy>=1.11.1", "ruff>=0.6.1"]
+
+[build-system]
+requires = ["setuptools>=73.0.0", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[tool.ruff]
+lint.select = [
+    "E",    # pycodestyle
+    "F",    # pyflakes
+    "I",    # isort
+    "D",    # pydocstyle
+    "D401", # First line should be in imperative mood
+    "T201",
+    "UP",
+]
+lint.ignore = [
+    "UP006",
+    "UP007",
+    # We actually do want to import from typing_extensions
+    "UP035",
+    # Relax the convention by _not_ requiring documentation for every function parameter.
+    "D417",
+    "E501",
+]
+[tool.ruff.lint.per-file-ignores]
+"tests/*" = ["D", "UP"]
+[tool.ruff.lint.pydocstyle]
+convention = "google"
+
+[dependency-groups]
+dev = [
+    "langgraph-cli[inmem]>=0.1.71",
+    "pytest>=8.3.5",
+]
@@ -0,0 +1,46 @@
+"""Configuration settings for the research and podcast generation app"""
+
+import os
+from dataclasses import dataclass, fields
+from typing import Optional, Any
+from typing_extensions import TypedDict
+from langchain_core.runnables import RunnableConfig
+
+
+@dataclass(kw_only=True)
+class Configuration:
+    """LangGraph Configuration for the multi-modal research agent."""
+
+    # Model settings
+    search_model: str = "gemini-2.5-flash"  # Web search supported model
+    synthesis_model: str = "gemini-2.5-flash"  # Citations supported model
+    video_model: str = "gemini-2.5-flash"  # Video understanding supported model
+    tts_model: str = "gemini-2.5-flash-preview-tts"
+    
+    # Temperature settings for different use cases
+    search_temperature: float = 0.0           # Factual search queries
+    synthesis_temperature: float = 0.3        # Balanced synthesis
+    podcast_script_temperature: float = 0.4   # Creative dialogue
+    
+    # TTS Configuration
+    mike_voice: str = "Kore"
+    sarah_voice: str = "Puck"
+    tts_channels: int = 1
+    tts_rate: int = 24000
+    tts_sample_width: int = 2
+
+    @classmethod
+    def from_runnable_config(
+        cls, config: Optional[RunnableConfig] = None
+    ) -> "Configuration":
+        """Create a Configuration instance from a RunnableConfig."""
+        configurable = (
+            config["configurable"] if config and "configurable" in config else {}
+        )
+        values: dict[str, Any] = {
+            f.name: os.environ.get(f.name.upper(), configurable.get(f.name))
+            for f in fields(cls)
+            if f.init
+        }
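+        # Keep explicit falsy overrides such as a temperature of 0.0; drop only unset or empty values.
+        return cls(**{k: v for k, v in values.items() if v is not None and v != ""})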
+
@@ -0,0 +1,147 @@
+"""LangGraph implementation of the research and podcast generation workflow"""
+
+from langgraph.graph import StateGraph, START, END
+from langchain_core.runnables import RunnableConfig
+from google.genai import types
+
+from agent.state import ResearchState, ResearchStateInput, ResearchStateOutput
+from agent.utils import display_gemini_response, create_podcast_discussion, create_research_report, genai_client
+from agent.configuration import Configuration
+
+
+def search_research_node(state: ResearchState, config: RunnableConfig) -> dict:
+    """Node that performs web search research on the topic"""
+    configuration = Configuration.from_runnable_config(config)
+    topic = state["topic"]
+    
+    search_response = genai_client.models.generate_content(
+        model=configuration.search_model,
+        contents=f"Research this topic and give me an overview: {topic}",
+        config={
+            "tools": [{"google_search": {}}],
+            "temperature": configuration.search_temperature,
+        },
+    )
+    
+    search_text, search_sources_text = display_gemini_response(search_response)
+    
+    return {
+        "search_text": search_text,
+        "search_sources_text": search_sources_text
+    }
+
+
+def analyze_video_node(state: ResearchState, config: RunnableConfig) -> dict:
+    """Node that analyzes video content if video URL is provided"""
+    configuration = Configuration.from_runnable_config(config)
+    video_url = state.get("video_url")
+    topic = state["topic"]
+    
+    if not video_url:
+        return {"video_text": "No video provided for analysis."}
+    
+    video_response = genai_client.models.generate_content(
+        model=configuration.video_model,
+        contents=types.Content(
+            parts=[
+                types.Part(
+                    file_data=types.FileData(file_uri=video_url)
+                ),
+                types.Part(text=f'Based on the video content, give me an overview of this topic: {topic}')
+            ]
+        )
+    )
+    
+    video_text, _ = display_gemini_response(video_response)
+    
+    return {"video_text": video_text}
+
+
+def create_report_node(state: ResearchState, config: RunnableConfig) -> dict:
+    """Node that creates a comprehensive research report"""
+    configuration = Configuration.from_runnable_config(config)
+    topic = state["topic"]
+    search_text = state.get("search_text", "")
+    video_text = state.get("video_text", "")
+    search_sources_text = state.get("search_sources_text", "")
+    video_url = state.get("video_url", "")
+    
+    report, synthesis_text = create_research_report(
+        topic, search_text, video_text, search_sources_text, video_url, configuration
+    )
+    
+    return {
+        "report": report,
+        "synthesis_text": synthesis_text
+    }
+
+
+def create_podcast_node(state: ResearchState, config: RunnableConfig) -> dict:
+    """Node that creates a podcast discussion"""
+    configuration = Configuration.from_runnable_config(config)
+    topic = state["topic"]
+    search_text = state.get("search_text", "")
+    video_text = state.get("video_text", "")
+    search_sources_text = state.get("search_sources_text", "")
+    video_url = state.get("video_url", "")
+    
+    # Create unique filename based on topic
+    safe_topic = "".join(c for c in topic if c.isalnum() or c in (' ', '-', '_')).rstrip()
+    filename = f"research_podcast_{safe_topic.replace(' ', '_')}.wav"
+    
+    podcast_script, podcast_filename = create_podcast_discussion(
+        topic, search_text, video_text, search_sources_text, video_url, filename, configuration
+    )
+    
+    return {
+        "podcast_script": podcast_script,
+        "podcast_filename": podcast_filename
+    }
+
+
+def should_analyze_video(state: ResearchState) -> str:
+    """Conditional edge to determine if video analysis should be performed"""
+    if state.get("video_url"):
+        return "analyze_video"
+    else:
+        return "create_report"
+
+
+def create_research_graph() -> StateGraph:
+    """Create and return the research workflow graph"""
+    
+    # Create the graph with configuration schema
+    graph = StateGraph(
+        ResearchState, 
+        input=ResearchStateInput, 
+        output=ResearchStateOutput,
+        config_schema=Configuration
+    )
+    
+    # Add nodes
+    graph.add_node("search_research", search_research_node)
+    graph.add_node("analyze_video", analyze_video_node)
+    graph.add_node("create_report", create_report_node)
+    graph.add_node("create_podcast", create_podcast_node)
+    
+    # Add edges
+    graph.add_edge(START, "search_research")
+    graph.add_conditional_edges(
+        "search_research",
+        should_analyze_video,
+        {
+            "analyze_video": "analyze_video",
+            "create_report": "create_report"
+        }
+    )
+    graph.add_edge("analyze_video", "create_report")
+    graph.add_edge("create_report", "create_podcast")
+    graph.add_edge("create_podcast", END)
+    
+    return graph
+
+
+def create_compiled_graph():
+    """Create and compile the research graph"""
+    graph = create_research_graph()
+    return graph.compile()
@@ -0,0 +1,34 @@
+from typing_extensions import TypedDict
+from typing import Optional
+
+
+class ResearchStateInput(TypedDict):
+    """Input schema for the research and podcast generation workflow"""
+    # Input fields
+    topic: str
+    video_url: Optional[str]
+
+class ResearchStateOutput(TypedDict):
+    """Output schema for the research and podcast generation workflow"""
+
+    # Final outputs
+    report: Optional[str]
+    podcast_script: Optional[str]
+    podcast_filename: Optional[str]
+
+class ResearchState(TypedDict):
+    """Complete state for the research and podcast generation workflow, including intermediate results"""
+    # Input fields
+    topic: str
+    video_url: Optional[str]
+    
+    # Intermediate results
+    search_text: Optional[str]
+    search_sources_text: Optional[str]
+    video_text: Optional[str]
+    
+    # Final outputs
+    report: Optional[str]
+    synthesis_text: Optional[str]
+    podcast_script: Optional[str]
+    podcast_filename: Optional[str]
@@ -0,0 +1,209 @@
+import os
+import wave
+from google.genai import Client, types
+from rich.console import Console
+from rich.markdown import Markdown
+from dotenv import load_dotenv
+
+load_dotenv()
+
+# Initialize client
+genai_client = Client(api_key=os.getenv("GEMINI_API_KEY"))
+
+
+def display_gemini_response(response):
+    """Extract text from Gemini response and display as markdown with references"""
+    console = Console()
+    
+    # Extract main content
+    text = response.candidates[0].content.parts[0].text
+    md = Markdown(text)
+    console.print(md)
+    
+    # Get candidate for grounding metadata
+    candidate = response.candidates[0]
+    
+    # Build sources text block
+    sources_text = ""
+    
+    # Display grounding metadata if available
+    if hasattr(candidate, 'grounding_metadata') and candidate.grounding_metadata:
+        console.print("\n" + "="*50)
+        console.print("[bold blue]References & Sources[/bold blue]")
+        console.print("="*50)
+        
+        # Display and collect source URLs
+        if candidate.grounding_metadata.grounding_chunks:
+            console.print(f"\n[bold]Sources ({len(candidate.grounding_metadata.grounding_chunks)}):[/bold]")
+            sources_list = []
+            for i, chunk in enumerate(candidate.grounding_metadata.grounding_chunks, 1):
+                if hasattr(chunk, 'web') and chunk.web:
+                    title = getattr(chunk.web, 'title', 'No title') or "No title"
+                    uri = getattr(chunk.web, 'uri', 'No URI') or "No URI"
+                    console.print(f"{i}. {title}")
+                    console.print(f"   [dim]{uri}[/dim]")
+                    sources_list.append(f"{i}. {title}\n   {uri}")
+            
+            sources_text = "\n".join(sources_list)
+        
+        # Display grounding supports (which text is backed by which sources)
+        if candidate.grounding_metadata.grounding_supports:
+            console.print("\n[bold]Text segments with source backing:[/bold]")
+            for support in candidate.grounding_metadata.grounding_supports[:5]:  # Show first 5
+                if hasattr(support, 'segment') and support.segment:
+                    snippet = support.segment.text[:100] + "..." if len(support.segment.text) > 100 else support.segment.text
+                    source_nums = [str(i+1) for i in support.grounding_chunk_indices]
+                    console.print(f"• \"{snippet}\" [dim](sources: {', '.join(source_nums)})[/dim]")
+    
+    return text, sources_text
+
+
+def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
+    """Save PCM data to a wave file"""
+    with wave.open(filename, "wb") as wf:
+        wf.setnchannels(channels)
+        wf.setsampwidth(sample_width)
+        wf.setframerate(rate)
+        wf.writeframes(pcm)
+
+
+def create_podcast_discussion(topic, search_text, video_text, search_sources_text, video_url, filename="research_podcast.wav", configuration=None):
+    """Create a 2-speaker podcast discussion explaining the research topic"""
+    
+    # Use default values if no configuration provided
+    if configuration is None:
+        from agent.configuration import Configuration
+        configuration = Configuration()
+    
+    # Step 1: Generate podcast script
+    script_prompt = f"""
+    Create a natural, engaging podcast conversation between Dr. Sarah (research expert) and Mike (curious interviewer) about "{topic}".
+    
+    Use this research content:
+    
+    SEARCH FINDINGS:
+    {search_text}
+    
+    VIDEO INSIGHTS:
+    {video_text}
+    
+    Format as a dialogue with:
+    - Mike introducing the topic and asking questions
+    - Dr. Sarah explaining key concepts and insights
+    - Natural back-and-forth discussion (5-7 exchanges)
+    - Mike asking follow-up questions
+    - Dr. Sarah synthesizing the main takeaways
+    - Keep it conversational and accessible (3-4 minutes when spoken)
+    
+    Format exactly like this:
+    Mike: [opening question]
+    Dr. Sarah: [expert response]
+    Mike: [follow-up]
+    Dr. Sarah: [explanation]
+    [continue...]
+    """
+    
+    script_response = genai_client.models.generate_content(
+        model=configuration.synthesis_model,
+        contents=script_prompt,
+        config={"temperature": configuration.podcast_script_temperature}
+    )
+    
+    podcast_script = script_response.candidates[0].content.parts[0].text
+    
+    # Step 2: Generate TTS audio
+    tts_prompt = f"TTS the following conversation between Mike and Dr. Sarah:\n{podcast_script}"
+    
+    response = genai_client.models.generate_content(
+        model=configuration.tts_model,
+        contents=tts_prompt,
+        config=types.GenerateContentConfig(
+            response_modalities=["AUDIO"],
+            speech_config=types.SpeechConfig(
+                multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
+                    speaker_voice_configs=[
+                        types.SpeakerVoiceConfig(
+                            speaker='Mike',
+                            voice_config=types.VoiceConfig(
+                                prebuilt_voice_config=types.PrebuiltVoiceConfig(
+                                    voice_name=configuration.mike_voice,
+                                )
+                            )
+                        ),
+                        types.SpeakerVoiceConfig(
+                            speaker='Dr. Sarah',
+                            voice_config=types.VoiceConfig(
+                                prebuilt_voice_config=types.PrebuiltVoiceConfig(
+                                    voice_name=configuration.sarah_voice,
+                                )
+                            )
+                        ),
+                    ]
+                )
+            )
+        )
+    )
+    
+    # Step 3: Save audio file
+    audio_data = response.candidates[0].content.parts[0].inline_data.data
+    wave_file(filename, audio_data, configuration.tts_channels, configuration.tts_rate, configuration.tts_sample_width)
+    
+    print(f"Podcast saved as: {filename}")
+    return podcast_script, filename
+
+
+def create_research_report(topic, search_text, video_text, search_sources_text, video_url, configuration=None):
+    """Create a comprehensive research report by synthesizing search and video content"""
+    
+    # Use default values if no configuration provided
+    if configuration is None:
+        from agent.configuration import Configuration
+        configuration = Configuration()
+    
+    # Step 1: Create synthesis using Gemini
+    synthesis_prompt = f"""
+    You are a research analyst. I have gathered information about "{topic}" from two sources:
+    
+    SEARCH RESULTS:
+    {search_text}
+    
+    VIDEO CONTENT:
+    {video_text}
+    
+    Please create a comprehensive synthesis that:
+    1. Identifies key themes and insights from both sources
+    2. Highlights any complementary or contrasting perspectives
+    3. Provides an overall analysis of the topic based on this multi-modal research
+    4. Keep it concise but thorough (3-4 paragraphs)
+    
+    Focus on creating a coherent narrative that brings together the best insights from both sources.
+    """
+    
+    synthesis_response = genai_client.models.generate_content(
+        model=configuration.synthesis_model,
+        contents=synthesis_prompt,
+        config={
+            "temperature": configuration.synthesis_temperature,
+        }
+    )
+    
+    synthesis_text = synthesis_response.candidates[0].content.parts[0].text
+    
+    # Step 2: Create markdown report
+    report = f"""# Research Report: {topic}
+
+## Executive Summary
+
+{synthesis_text}
+
+## Video Source
+- **URL**: {video_url}
+
+## Additional Sources
+{search_sources_text}
+
+---
+*Report generated using multi-modal AI research combining web search and video analysis*
+"""
+    
+    return report, synthesis_text