Multi-modal researcher

Lance Martin
2025-06-24 18:12:43 -07:00
commit 65ddb019d1
8 changed files with 658 additions and 0 deletions
+1
@@ -0,0 +1 @@
# GEMINI_API_KEY=
+152
@@ -0,0 +1,152 @@
# Multi-Modal Research Agent
This project is a simple research and podcast generation system built on Google's Gemini 2.5 model family. You pass a research topic and, optionally, a YouTube video URL. The system researches the topic using search, analyzes the video, combines the insights, and generates a report with citations as well as a short podcast on the topic. It takes advantage of three of Gemini's native capabilities:
- 🎥 [Video understanding and native YouTube tool](https://developers.googleblog.com/en/gemini-2-5-video-understanding/): Integrated processing of YouTube videos
- 🔍 [Google search tool](https://developers.googleblog.com/en/gemini-2-5-thinking-model-updates/): Native Google Search tool integration with real-time web results
- 🎙️ [Multi-speaker text-to-speech](https://ai.google.dev/gemini-api/docs/speech-generation): Generate natural conversations with distinct speaker voices
## Quick Start
### Prerequisites
- Python 3.11+
- [uv](https://docs.astral.sh/uv/) package manager
- Google Gemini API key
### Setup
1. **Clone and navigate to the project**:
```bash
git clone <repository-url>
cd multi-modal-researcher
```
2. **Set up environment variables**:
```bash
cp .env.example .env
```
Edit `.env` and add your Google Gemini API key:
```bash
GEMINI_API_KEY=your_api_key_here
```
3. **Run the development server**:
```bash
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies and start the LangGraph server
uvx --refresh --from "langgraph-cli[inmem]" --with-editable . --python 3.11 langgraph dev --allow-blocking
```
4. **Access the application**:
LangGraph will open in your browser.
```
╦ ┌─┐┌┐┌┌─┐╔═╗┬─┐┌─┐┌─┐┬ ┬
║ ├─┤││││ ┬║ ╦├┬┘├─┤├─┘├─┤
╩═╝┴ ┴┘└┘└─┘╚═╝┴└─┴ ┴┴ ┴ ┴
- 🚀 API: http://127.0.0.1:2024
- 🎨 Studio UI: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024
- 📚 API Docs: http://127.0.0.1:2024/docs
```
5. Pass a `topic` and optionally a `video_url`.
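As a rough illustration of step 5, the sketch below builds an input payload with the two fields named in the input schema (`topic` and `video_url`). The helper `build_input` is hypothetical and not part of the project, and omitting `video_url` when none is supplied is an assumption of the sketch.

```python
from typing import Optional


def build_input(topic: str, video_url: Optional[str] = None) -> dict:
    """Build a graph input dict (hypothetical helper, not in the project).

    `video_url` is left out entirely when no URL is supplied.
    """
    cleaned_topic = topic.strip()
    if not cleaned_topic:
        raise ValueError("topic must be a non-empty string")
    payload = {"topic": cleaned_topic}
    if video_url:
        payload["video_url"] = video_url
    return payload
```

The resulting dictionary has the same shape as the input you would enter in the Studio UI.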
## Architecture
The system implements a LangGraph workflow with the following nodes:
1. **Search Research Node**: Performs web search using Gemini's Google Search integration
2. **Analyze Video Node**: Analyzes YouTube videos when provided (conditional)
3. **Create Report Node**: Synthesizes findings into a comprehensive markdown report
4. **Create Podcast Node**: Generates a 2-speaker podcast discussion with TTS audio
### Workflow
```
START → search_research → [analyze_video?] → create_report → create_podcast → END
```
The workflow runs video analysis only when a YouTube URL is provided; otherwise it proceeds directly from search to report generation.
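The conditional step can be sketched in plain Python, without LangGraph. The routing function follows the project's `should_analyze_video`; the helper `planned_path` is hypothetical and only lists the nodes a run would visit.

```python
from typing import Optional


def should_analyze_video(state: dict) -> str:
    """Return the name of the node to run after search_research."""
    return "analyze_video" if state.get("video_url") else "create_report"


def planned_path(topic: str, video_url: Optional[str] = None) -> list[str]:
    """List the nodes a run would visit, in order (hypothetical helper)."""
    state = {"topic": topic, "video_url": video_url}
    path = ["search_research"]
    if should_analyze_video(state) == "analyze_video":
        path.append("analyze_video")
    path += ["create_report", "create_podcast"]
    return path
```

An empty string or `None` for `video_url` both route straight to `create_report`, because the check is a plain truthiness test.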
### Output
The system generates:
- **Research Report**: Comprehensive markdown report with executive summary and sources
- **Podcast Script**: Natural dialogue between Dr. Sarah (expert) and Mike (interviewer)
- **Audio File**: Multi-speaker TTS audio file (`research_podcast_*.wav`)
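The `*` in the audio file name is derived from the topic. The sketch below reproduces the sanitizing logic used in `create_podcast_node`; the wrapper function name `podcast_filename` is an assumption made for illustration.

```python
def podcast_filename(topic: str) -> str:
    """Derive the WAV file name from a topic, as create_podcast_node does."""
    # Keep letters, digits, spaces, hyphens, and underscores; drop the rest.
    safe_topic = "".join(
        c for c in topic if c.isalnum() or c in (" ", "-", "_")
    ).rstrip()
    return f"research_podcast_{safe_topic.replace(' ', '_')}.wav"
```

Note that two topics differing only in punctuation map to the same file name, so a later run would overwrite the earlier audio file.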
## Configuration
The system supports runtime configuration through the `Configuration` class:
### Model Settings
- `search_model`: Model for web search (default: "gemini-2.5-flash")
- `synthesis_model`: Model for report synthesis (default: "gemini-2.5-flash")
- `video_model`: Model for video analysis (default: "gemini-2.5-flash")
- `tts_model`: Model for text-to-speech (default: "gemini-2.5-flash-preview-tts")
### Temperature Settings
- `search_temperature`: Factual search queries (default: 0.0)
- `synthesis_temperature`: Balanced synthesis (default: 0.3)
- `podcast_script_temperature`: Creative dialogue (default: 0.4)
### TTS Settings
- `mike_voice`: Voice for interviewer (default: "Kore")
- `sarah_voice`: Voice for expert (default: "Puck")
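- `tts_channels`: Number of audio channels (default: 1)
- `tts_rate`: Sample rate in Hz (default: 24000)
- `tts_sample_width`: Sample width in bytes (default: 2)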
## Project Structure
```
├── src/agent/
│ ├── state.py # State definitions (input/output schemas)
│ ├── configuration.py # Runtime configuration class
│ ├── utils.py # Utility functions (TTS, report generation)
│ └── graph.py # LangGraph workflow definition
├── langgraph.json # LangGraph deployment configuration
├── pyproject.toml # Python package configuration
└── .env # Environment variables
```
## Key Components
### State Management
- **ResearchStateInput**: Input schema (topic, optional video_url)
- **ResearchStateOutput**: Output schema (report, podcast_script, podcast_filename)
- **ResearchState**: Complete state including intermediate results
### Utility Functions
- **display_gemini_response()**: Processes Gemini responses with grounding metadata
- **create_podcast_discussion()**: Generates scripted dialogue and TTS audio
- **create_research_report()**: Synthesizes multi-modal research into reports
- **wave_file()**: Saves audio data to WAV format
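The WAV step can be tried in isolation with the standard library. The sketch below writes raw PCM with the same defaults as `wave_file()` (1 channel, 24000 Hz, 2-byte samples) but targets an in-memory buffer instead of a file; the function names `pcm_to_wav_bytes` and `duration_seconds` are hypothetical.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, channels: int = 1, rate: int = 24000,
                     sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buffer.getvalue()


def duration_seconds(pcm: bytes, channels: int = 1, rate: int = 24000,
                     sample_width: int = 2) -> float:
    """Playback length implied by the PCM size and the audio format."""
    return len(pcm) / (channels * sample_width * rate)
```

With these defaults, one second of audio is 48,000 bytes of PCM, which is a quick way to sanity-check the length of a generated podcast.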
## Deployment
The application is configured for deployment on:
- **Local Development**: Using LangGraph CLI with in-memory storage
- **LangGraph Platform**: Production deployment with persistent storage
- **Self-Hosted**: Using Docker containers
## Dependencies
Core dependencies managed via `pyproject.toml`:
- `langgraph>=0.2.6` - Workflow orchestration
- `google-genai` - Gemini API client
- `langchain>=0.3.19` - LangChain integrations
- `rich` - Enhanced terminal output
- `python-dotenv` - Environment management
## License
MIT License - see LICENSE file for details.
+9
@@ -0,0 +1,9 @@
{
"dependencies": [
"."
],
"graphs": {
"research_agent": "agent.graph:create_compiled_graph"
},
"env": ".env"
}
+60
@@ -0,0 +1,60 @@
[project]
name = "agent"
version = "0.0.1"
description = "Multi-modal researcher with Gemini"
authors = [
{ name = "Lance Martin", email = "lance@langchain.dev" },
]
readme = "README.md"
license = { text = "MIT" }
requires-python = ">=3.11,<4.0"
dependencies = [
"langgraph>=0.2.6",
"langchain>=0.3.19",
"langchain-google-genai",
"python-dotenv>=1.0.1",
"langgraph-sdk>=0.1.57",
"langgraph-cli",
"langgraph-api",
"fastapi",
"google-genai",
"rich",
]
[project.optional-dependencies]
dev = ["mypy>=1.11.1", "ruff>=0.6.1"]
[build-system]
requires = ["setuptools>=73.0.0", "wheel"]
build-backend = "setuptools.build_meta"
[tool.ruff]
lint.select = [
"E", # pycodestyle
"F", # pyflakes
"I", # isort
"D", # pydocstyle
"D401", # First line should be in imperative mood
"T201",
"UP",
]
lint.ignore = [
"UP006",
"UP007",
# We actually do want to import from typing_extensions
"UP035",
# Relax the convention by _not_ requiring documentation for every function parameter.
"D417",
"E501",
]
[tool.ruff.lint.per-file-ignores]
"tests/*" = ["D", "UP"]
[tool.ruff.lint.pydocstyle]
convention = "google"
[dependency-groups]
dev = [
"langgraph-cli[inmem]>=0.1.71",
"pytest>=8.3.5",
]
+46
@@ -0,0 +1,46 @@
"""Configuration settings for the research and podcast generation app"""
import os
from dataclasses import dataclass, fields
from typing import Any, Optional

from langchain_core.runnables import RunnableConfig


@dataclass(kw_only=True)
class Configuration:
    """LangGraph Configuration for the deep research agent."""

    # Model settings
    search_model: str = "gemini-2.5-flash"  # Web search supported model
    synthesis_model: str = "gemini-2.5-flash"  # Citations supported model
    video_model: str = "gemini-2.5-flash"  # Video understanding supported model
    tts_model: str = "gemini-2.5-flash-preview-tts"

    # Temperature settings for different use cases
    search_temperature: float = 0.0  # Factual search queries
    synthesis_temperature: float = 0.3  # Balanced synthesis
    podcast_script_temperature: float = 0.4  # Creative dialogue

    # TTS Configuration
    mike_voice: str = "Kore"
    sarah_voice: str = "Puck"
    tts_channels: int = 1
    tts_rate: int = 24000
    tts_sample_width: int = 2

    @classmethod
    def from_runnable_config(
        cls, config: Optional[RunnableConfig] = None
    ) -> "Configuration":
        """Create a Configuration instance from a RunnableConfig."""
        configurable = (
            config["configurable"] if config and "configurable" in config else {}
        )
        # Environment variables (upper-cased field names) take precedence over
        # values in `configurable`. Environment values are always strings.
        values: dict[str, Any] = {
            f.name: os.environ.get(f.name.upper(), configurable.get(f.name))
            for f in fields(cls)
            if f.init
        }
        # Drop only unset or empty values, so an explicit falsy override such
        # as a temperature of 0.0 is not silently replaced by the default.
        return cls(**{k: v for k, v in values.items() if v is not None and v != ""})
+147
@@ -0,0 +1,147 @@
"""LangGraph implementation of the research and podcast generation workflow"""
from langgraph.graph import StateGraph, START, END
from langchain_core.runnables import RunnableConfig
from google.genai import types
from agent.state import ResearchState, ResearchStateInput, ResearchStateOutput
from agent.utils import display_gemini_response, create_podcast_discussion, create_research_report, genai_client
from agent.configuration import Configuration

def search_research_node(state: ResearchState, config: RunnableConfig) -> dict:
    """Node that performs web search research on the topic"""
    configuration = Configuration.from_runnable_config(config)
    topic = state["topic"]

    # Ask Gemini for an overview, grounded with the native Google Search tool
    search_response = genai_client.models.generate_content(
        model=configuration.search_model,
        contents=f"Research this topic and give me an overview: {topic}",
        config={
            "tools": [{"google_search": {}}],
            "temperature": configuration.search_temperature,
        },
    )

    search_text, search_sources_text = display_gemini_response(search_response)

    return {
        "search_text": search_text,
        "search_sources_text": search_sources_text,
    }

def analyze_video_node(state: ResearchState, config: RunnableConfig) -> dict:
    """Node that analyzes video content if video URL is provided"""
    configuration = Configuration.from_runnable_config(config)
    video_url = state.get("video_url")
    topic = state["topic"]

    if not video_url:
        return {"video_text": "No video provided for analysis."}

    # Pass the YouTube URL as file data alongside the text instruction
    video_response = genai_client.models.generate_content(
        model=configuration.video_model,
        contents=types.Content(
            parts=[
                types.Part(
                    file_data=types.FileData(file_uri=video_url)
                ),
                types.Part(text=f'Based on the video content, give me an overview of this topic: {topic}'),
            ]
        ),
    )

    video_text, _ = display_gemini_response(video_response)

    return {"video_text": video_text}

def create_report_node(state: ResearchState, config: RunnableConfig) -> dict:
    """Node that creates a comprehensive research report"""
    configuration = Configuration.from_runnable_config(config)
    topic = state["topic"]
    search_text = state.get("search_text", "")
    video_text = state.get("video_text", "")
    search_sources_text = state.get("search_sources_text", "")
    video_url = state.get("video_url", "")

    report, synthesis_text = create_research_report(
        topic, search_text, video_text, search_sources_text, video_url, configuration
    )

    return {
        "report": report,
        "synthesis_text": synthesis_text,
    }

def create_podcast_node(state: ResearchState, config: RunnableConfig) -> dict:
    """Node that creates a podcast discussion"""
    configuration = Configuration.from_runnable_config(config)
    topic = state["topic"]
    search_text = state.get("search_text", "")
    video_text = state.get("video_text", "")
    search_sources_text = state.get("search_sources_text", "")
    video_url = state.get("video_url", "")

    # Create a filename based on the topic, keeping only filesystem-safe characters
    safe_topic = "".join(c for c in topic if c.isalnum() or c in (' ', '-', '_')).rstrip()
    filename = f"research_podcast_{safe_topic.replace(' ', '_')}.wav"

    podcast_script, podcast_filename = create_podcast_discussion(
        topic, search_text, video_text, search_sources_text, video_url, filename, configuration
    )

    return {
        "podcast_script": podcast_script,
        "podcast_filename": podcast_filename,
    }

def should_analyze_video(state: ResearchState) -> str:
    """Conditional edge to determine if video analysis should be performed"""
    if state.get("video_url"):
        return "analyze_video"
    else:
        return "create_report"

def create_research_graph() -> StateGraph:
    """Create and return the research workflow graph"""
    # Create the graph with configuration schema
    graph = StateGraph(
        ResearchState,
        input=ResearchStateInput,
        output=ResearchStateOutput,
        config_schema=Configuration,
    )

    # Add nodes
    graph.add_node("search_research", search_research_node)
    graph.add_node("analyze_video", analyze_video_node)
    graph.add_node("create_report", create_report_node)
    graph.add_node("create_podcast", create_podcast_node)

    # Add edges
    graph.add_edge(START, "search_research")
    graph.add_conditional_edges(
        "search_research",
        should_analyze_video,
        {
            "analyze_video": "analyze_video",
            "create_report": "create_report",
        },
    )
    graph.add_edge("analyze_video", "create_report")
    graph.add_edge("create_report", "create_podcast")
    graph.add_edge("create_podcast", END)

    return graph

def create_compiled_graph():
    """Create and compile the research graph"""
    graph = create_research_graph()
    return graph.compile()
+34
@@ -0,0 +1,34 @@
from typing_extensions import TypedDict
from typing import Optional


class ResearchStateInput(TypedDict):
    """Input schema for the research and podcast generation workflow"""

    # Input fields
    topic: str
    video_url: Optional[str]


class ResearchStateOutput(TypedDict):
    """Output schema for the research and podcast generation workflow"""

    # Final outputs
    report: Optional[str]
    podcast_script: Optional[str]
    podcast_filename: Optional[str]


class ResearchState(TypedDict):
    """Complete state for the research and podcast generation workflow"""

    # Input fields
    topic: str
    video_url: Optional[str]

    # Intermediate results
    search_text: Optional[str]
    search_sources_text: Optional[str]
    video_text: Optional[str]

    # Final outputs
    report: Optional[str]
    synthesis_text: Optional[str]
    podcast_script: Optional[str]
    podcast_filename: Optional[str]
+209
@@ -0,0 +1,209 @@
import os
import wave
from google.genai import Client, types
from rich.console import Console
from rich.markdown import Markdown
from dotenv import load_dotenv
load_dotenv()
# Initialize client
genai_client = Client(api_key=os.getenv("GEMINI_API_KEY"))

def display_gemini_response(response):
    """Extract text from Gemini response and display as markdown with references"""
    console = Console()

    # Extract main content
    text = response.candidates[0].content.parts[0].text
    md = Markdown(text)
    console.print(md)

    # Get candidate for grounding metadata
    candidate = response.candidates[0]

    # Build sources text block
    sources_text = ""

    # Display grounding metadata if available
    if hasattr(candidate, 'grounding_metadata') and candidate.grounding_metadata:
        console.print("\n" + "="*50)
        console.print("[bold blue]References & Sources[/bold blue]")
        console.print("="*50)

        # Display and collect source URLs
        if candidate.grounding_metadata.grounding_chunks:
            console.print(f"\n[bold]Sources ({len(candidate.grounding_metadata.grounding_chunks)}):[/bold]")
            sources_list = []
            for i, chunk in enumerate(candidate.grounding_metadata.grounding_chunks, 1):
                if hasattr(chunk, 'web') and chunk.web:
                    title = getattr(chunk.web, 'title', 'No title') or "No title"
                    uri = getattr(chunk.web, 'uri', 'No URI') or "No URI"
                    console.print(f"{i}. {title}")
                    console.print(f" [dim]{uri}[/dim]")
                    sources_list.append(f"{i}. {title}\n {uri}")
            sources_text = "\n".join(sources_list)

        # Display grounding supports (which text is backed by which sources)
        if candidate.grounding_metadata.grounding_supports:
            console.print("\n[bold]Text segments with source backing:[/bold]")
            for support in candidate.grounding_metadata.grounding_supports[:5]:  # Show first 5
                if hasattr(support, 'segment') and support.segment:
                    # Truncate long segments to 100 characters for display
                    snippet = (
                        support.segment.text[:100] + "..."
                        if len(support.segment.text) > 100
                        else support.segment.text
                    )
                    source_nums = [str(i+1) for i in support.grounding_chunk_indices]
                    console.print(f"\"{snippet}\" [dim](sources: {', '.join(source_nums)})[/dim]")

    return text, sources_text

def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
    """Save PCM data to a wave file"""
    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)

def create_podcast_discussion(topic, search_text, video_text, search_sources_text, video_url, filename="research_podcast.wav", configuration=None):
    """Create a 2-speaker podcast discussion explaining the research topic"""
    # Use default values if no configuration provided
    if configuration is None:
        from agent.configuration import Configuration
        configuration = Configuration()

    # Step 1: Generate podcast script
    script_prompt = f"""
Create a natural, engaging podcast conversation between Dr. Sarah (research expert) and Mike (curious interviewer) about "{topic}".
Use this research content:
SEARCH FINDINGS:
{search_text}
VIDEO INSIGHTS:
{video_text}
Format as a dialogue with:
- Mike introducing the topic and asking questions
- Dr. Sarah explaining key concepts and insights
- Natural back-and-forth discussion (5-7 exchanges)
- Mike asking follow-up questions
- Dr. Sarah synthesizing the main takeaways
- Keep it conversational and accessible (3-4 minutes when spoken)
Format exactly like this:
Mike: [opening question]
Dr. Sarah: [expert response]
Mike: [follow-up]
Dr. Sarah: [explanation]
[continue...]
"""

    script_response = genai_client.models.generate_content(
        model=configuration.synthesis_model,
        contents=script_prompt,
        config={"temperature": configuration.podcast_script_temperature},
    )
    podcast_script = script_response.candidates[0].content.parts[0].text

    # Step 2: Generate TTS audio
    # The speaker names must match the names used in the script
    tts_prompt = f"TTS the following conversation between Mike and Dr. Sarah:\n{podcast_script}"
    response = genai_client.models.generate_content(
        model=configuration.tts_model,
        contents=tts_prompt,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                    speaker_voice_configs=[
                        types.SpeakerVoiceConfig(
                            speaker='Mike',
                            voice_config=types.VoiceConfig(
                                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                    voice_name=configuration.mike_voice,
                                )
                            ),
                        ),
                        types.SpeakerVoiceConfig(
                            speaker='Dr. Sarah',
                            voice_config=types.VoiceConfig(
                                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                    voice_name=configuration.sarah_voice,
                                )
                            ),
                        ),
                    ]
                )
            ),
        ),
    )

    # Step 3: Save audio file
    audio_data = response.candidates[0].content.parts[0].inline_data.data
    wave_file(filename, audio_data, configuration.tts_channels, configuration.tts_rate, configuration.tts_sample_width)
    print(f"Podcast saved as: {filename}")

    return podcast_script, filename

def create_research_report(topic, search_text, video_text, search_sources_text, video_url, configuration=None):
    """Create a comprehensive research report by synthesizing search and video content"""
    # Use default values if no configuration provided
    if configuration is None:
        from agent.configuration import Configuration
        configuration = Configuration()

    # Step 1: Create synthesis using Gemini
    synthesis_prompt = f"""
You are a research analyst. I have gathered information about "{topic}" from two sources:
SEARCH RESULTS:
{search_text}
VIDEO CONTENT:
{video_text}
Please create a comprehensive synthesis that:
1. Identifies key themes and insights from both sources
2. Highlights any complementary or contrasting perspectives
3. Provides an overall analysis of the topic based on this multi-modal research
4. Keep it concise but thorough (3-4 paragraphs)
Focus on creating a coherent narrative that brings together the best insights from both sources.
"""

    synthesis_response = genai_client.models.generate_content(
        model=configuration.synthesis_model,
        contents=synthesis_prompt,
        config={
            "temperature": configuration.synthesis_temperature,
        },
    )
    synthesis_text = synthesis_response.candidates[0].content.parts[0].text

    # Step 2: Create markdown report
    # The video URL is optional, so avoid printing an empty or "None" entry
    video_source = video_url or "No video provided"
    report = f"""# Research Report: {topic}
## Executive Summary
{synthesis_text}
## Video Source
- **URL**: {video_source}
## Additional Sources
{search_sources_text}
---
*Report generated using multi-modal AI research combining web search and video analysis*
"""

    return report, synthesis_text