Compare commits

3 Commits

Author SHA1 Message Date
Jerry Liu 5d5ff51eb2 cr 2025-04-01 20:21:23 -07:00
Jerry Liu 298ea964cf cr 2025-04-01 16:50:13 -07:00
Jerry Liu d9ae5ea3c7 cr 2025-04-01 16:46:26 -07:00
@@ -0,0 +1,452 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "00f6713b-2a32-4f8f-80e5-9a7d9b6e3b90",
"metadata": {},
"source": [
"# Solar Panel Datasheet Comparison Workflow\n",
"\n",
"<a href=\"https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/extract/solar_panel_e2e_comparison.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"\n",
"\n",
"This notebook demonstrates an endtoend agentic workflow using LlamaExtract and the LlamaIndex eventdriven workflow framework. In this workflow, we:\n",
"\n",
"1. **Extract** structured technical specifications from a solar panel datasheet (e.g. a PDF downloaded from a vendor).\n",
"2. **Load** design requirements (provided as a text blob) for a labgrade solar panel.\n",
"3. **Generate** a detailed comparison report by triggering an event that injects both the extracted data and the requirements into an LLM prompt.\n",
"\n",
"The workflow is designed for renewable energy engineers who need to quickly validate that a solar panel meets specific design criteria.\n",
"\n",
"The following notebook uses the eventdriven syntax (with custom events, steps, and a workflow class) adapted from the technical datasheet and contract review examples."
]
},
{
"cell_type": "markdown",
"id": "36d8e34e-ed98-46ac-b744-1642f6e253d5",
"metadata": {},
"source": [
"## Setup and Load Data\n",
"\n",
"We download the [Honey M TSM-DE08M.08(II) datasheet](https://static.trinasolar.com/sites/default/files/EU_Datasheet_HoneyM_DE08M.08%28II%29_2021_A.pdf) as a PDF.\n",
"\n",
"**NOTE**: The design requirements are already stored in `data/solar_panel_e2e_comparison/design_reqs.txt`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1de7b1b3-c285-492c-8b2e-b37974b4fc63",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2025-04-01 14:47:56-- https://static.trinasolar.com/sites/default/files/EU_Datasheet_HoneyM_DE08M.08%28II%29_2021_A.pdf\n",
"Resolving static.trinasolar.com (static.trinasolar.com)... 47.246.23.232, 47.246.23.234, 47.246.23.227, ...\n",
"Connecting to static.trinasolar.com (static.trinasolar.com)|47.246.23.232|:443... connected.\n",
"WARNING: cannot verify static.trinasolar.com's certificate, issued by CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1,O=DigiCert Inc,C=US:\n",
" Unable to locally verify the issuer's authority.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 1888183 (1.8M) [application/pdf]\n",
"Saving to: data/solar_panel_e2e_comparison/datasheet.pdf\n",
"\n",
"data/solar_panel_e2 100%[===================>] 1.80M 7.47MB/s in 0.2s \n",
"\n",
"2025-04-01 14:47:56 (7.47 MB/s) - data/solar_panel_e2e_comparison/datasheet.pdf saved [1888183/1888183]\n",
"\n"
]
}
],
"source": [
"!wget https://static.trinasolar.com/sites/default/files/EU_Datasheet_HoneyM_DE08M.08%28II%29_2021_A.pdf -O data/solar_panel_e2e_comparison/datasheet.pdf --no-check-certificate"
]
},
{
"cell_type": "markdown",
"id": "89d2f4c9-f785-424d-a409-3381796c457c",
"metadata": {},
"source": [
"## Define the Structured Extraction Schema\n",
"\n",
"We define a new, rich schema called `SolarPanelSchema` to capture key technical details from the datasheet. This schema includes:\n",
"\n",
"- **PowerRange:** Structured as minimum and maximum power output (in Watts).\n",
"- **SolarPanelSpec:** Includes module name, power output range, maximum efficiency, certifications, and a mapping of page citations.\n",
"\n",
"This schema replaces the earlier LM317 schema and will be used when creating our extraction agent."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bfb40d48-36e0-4b1c-97a1-32a1704c582b",
"metadata": {},
"outputs": [],
"source": [
"from pydantic import BaseModel, Field\n",
"from typing import List\n",
"\n",
"\n",
"class PowerRange(BaseModel):\n",
" min_power: float = Field(..., description=\"Minimum power output in Watts\")\n",
" max_power: float = Field(..., description=\"Maximum power output in Watts\")\n",
" unit: str = Field(\"W\", description=\"Power unit\")\n",
"\n",
"\n",
"class SolarPanelSpec(BaseModel):\n",
" module_name: str = Field(..., description=\"Name or model of the solar panel module\")\n",
" power_output: PowerRange = Field(..., description=\"Power output range\")\n",
" maximum_efficiency: float = Field(\n",
" ..., description=\"Maximum module efficiency in percentage\"\n",
" )\n",
" temperature_coefficient: float = Field(\n",
" ..., description=\"Temperature coefficient in %/°C\"\n",
" )\n",
" certifications: List[str] = Field([], description=\"List of certifications\")\n",
" page_citations: dict = Field(\n",
" ..., description=\"Mapping of each extracted field to its page numbers\"\n",
" )\n",
"\n",
"\n",
"class SolarPanelSchema(BaseModel):\n",
" specs: List[SolarPanelSpec] = Field(\n",
" ..., description=\"List of extracted solar panel specifications\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "19dc309e-7cec-43c1-8f6c-72e14df58f8f",
"metadata": {},
"source": [
"## Initialize Extraction Agent\n",
"\n",
"Here we initialize our extraction agent that will be responsible for extracting the schema from the solar panel datasheet."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9d9f4a2-2e14-493d-8a7e-d01159d38b8f",
"metadata": {},
"outputs": [],
"source": [
"from dotenv import load_dotenv\n",
"from llama_cloud_services import LlamaExtract\n",
"from llama_cloud.core.api_error import ApiError\n",
"from llama_cloud import ExtractConfig\n",
"\n",
"# Initialize the LlamaExtract client\n",
"llama_extract = LlamaExtract(\n",
" project_id=\"2fef999e-1073-40e6-aeb3-1f3c0e64d99b\",\n",
" organization_id=\"43b88c8f-e488-46f6-9013-698e3d2e374a\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec0eb2a7-6e02-45da-a6af-227e2f7c81f2",
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" existing_agent = llama_extract.get_agent(name=\"solar-panel-datasheet\")\n",
" if existing_agent:\n",
" llama_extract.delete_agent(existing_agent.id)\n",
"except ApiError as e:\n",
" if e.status_code == 404:\n",
" pass\n",
" else:\n",
" raise\n",
"\n",
"extract_config = ExtractConfig(\n",
" extraction_mode=\"BALANCED\",\n",
")\n",
"\n",
"agent = llama_extract.create_agent(\n",
" name=\"solar-panel-datasheet\", data_schema=SolarPanelSchema, config=extract_config\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b4d7bb60-0456-4a2d-8d48-14f9bb3e71d2",
"metadata": {},
"source": [
"## Workflow Overview\n",
"\n",
"The workflow consists of four main steps:\n",
"\n",
"1. **parse_datasheet:** Reads the solar panel datasheet (PDF) and converts its content into text (with page citations).\n",
"2. **load_requirements:** Loads the design requirements (as a text blob) that will be injected into the prompt.\n",
"3. **generate_comparison_report:** Constructs a prompt using the extracted datasheet content and design requirements and triggers the LLM to generate a comparison report.\n",
"4. **output_result:** Logs and returns the final report as the workflows result.\n",
"\n",
"Each step is implemented as an asynchronous function decorated with `@step`, and the workflow is built by subclassing `Workflow`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7c482e3a-66b4-4e1b-8d2d-9a9c6b3967f3",
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core.workflow import (\n",
" Event,\n",
" StartEvent,\n",
" StopEvent,\n",
" Context,\n",
" Workflow,\n",
" step,\n",
")\n",
"from llama_index.llms.openai import OpenAI\n",
"from llama_index.core.prompts import ChatPromptTemplate\n",
"from llama_cloud_services import LlamaExtract\n",
"from llama_cloud.core.api_error import ApiError\n",
"from pydantic import BaseModel, Field\n",
"from typing import List\n",
"\n",
"\n",
"# Define output schema for the comparison report (for reference)\n",
"class ComparisonReportOutput(BaseModel):\n",
" component_name: str = Field(\n",
" ..., description=\"The name of the component being evaluated.\"\n",
" )\n",
" meets_requirements: bool = Field(\n",
" ...,\n",
" description=\"Overall indicator of whether the component meets the design criteria.\",\n",
" )\n",
" summary: str = Field(..., description=\"A brief summary of the evaluation results.\")\n",
" details: dict = Field(\n",
" ..., description=\"Detailed comparisons for each key parameter.\"\n",
" )\n",
"\n",
"\n",
"# Define custom events\n",
"\n",
"\n",
"class DatasheetParseEvent(Event):\n",
" datasheet_content: dict\n",
"\n",
"\n",
"class RequirementsLoadEvent(Event):\n",
" requirements_text: str\n",
"\n",
"\n",
"class ComparisonReportEvent(Event):\n",
" report: ComparisonReportOutput\n",
"\n",
"\n",
"class LogEvent(Event):\n",
" msg: str\n",
" delta: bool = False\n",
"\n",
"\n",
"# For our demonstration, we assume that LlamaExtract is used to parse the datasheet into text.\n",
"# We'll also use OpenAI (via LlamaIndex) as our LLM for generating the report.\n",
"\n",
"llm = OpenAI(model=\"gpt-4o\") # or your preferred model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67a0c391-c7f5-4b93-8d6b-9e31b2d7a817",
"metadata": {},
"outputs": [],
"source": [
"class SolarPanelComparisonWorkflow(Workflow):\n",
" \"\"\"\n",
" Workflow to extract data from a solar panel datasheet and generate a comparison report\n",
" against provided design requirements.\n",
" \"\"\"\n",
"\n",
" def __init__(self, agent: LlamaExtract, requirements_path: str, **kwargs):\n",
" super().__init__(**kwargs)\n",
" self.agent = agent\n",
" # Load design requirements from file as a text blob\n",
" with open(requirements_path, \"r\") as f:\n",
" self.requirements_text = f.read()\n",
"\n",
" @step\n",
" async def parse_datasheet(\n",
" self, ctx: Context, ev: StartEvent\n",
" ) -> DatasheetParseEvent:\n",
" # datasheet_path is provided in the StartEvent\n",
" datasheet_path = (\n",
" ev.datasheet_path\n",
" ) # e.g., \"./data/solar_panel_comparison/datasheet.pdf\"\n",
" extraction_result = await self.agent.aextract(datasheet_path)\n",
" datasheet_dict = (\n",
" extraction_result.data\n",
" ) # assumed to be a string with page citations\n",
" await ctx.set(\"datasheet_content\", datasheet_dict)\n",
" ctx.write_event_to_stream(LogEvent(msg=\"Datasheet parsed successfully.\"))\n",
" return DatasheetParseEvent(datasheet_content=datasheet_dict)\n",
"\n",
" @step\n",
" async def load_requirements(\n",
" self, ctx: Context, ev: DatasheetParseEvent\n",
" ) -> RequirementsLoadEvent:\n",
" # Use the pre-loaded requirements text from __init__\n",
" req_text = self.requirements_text\n",
" ctx.write_event_to_stream(LogEvent(msg=\"Design requirements loaded.\"))\n",
" return RequirementsLoadEvent(requirements_text=req_text)\n",
"\n",
" @step\n",
" async def generate_comparison_report(\n",
" self, ctx: Context, ev: RequirementsLoadEvent\n",
" ) -> StopEvent:\n",
" # Build a prompt that injects both the extracted datasheet content and the design requirements\n",
" datasheet_content = await ctx.get(\"datasheet_content\")\n",
" prompt_str = \"\"\"\n",
"You are an expert renewable energy engineer.\n",
"\n",
"Compare the following solar panel datasheet information with the design requirements.\n",
"\n",
"Design Requirements:\n",
"{requirements_text}\n",
"\n",
"Extracted Datasheet Information:\n",
"{datasheet_content}\n",
"\n",
"Generate a detailed comparison report in JSON format with the following schema:\n",
" - component_name: string\n",
" - meets_requirements: boolean\n",
" - summary: string\n",
" - details: dictionary of comparisons for each parameter\n",
"\n",
"For each parameter (Maximum Power, Open-Circuit Voltage, Short-Circuit Current, Efficiency, Temperature Coefficient),\n",
"indicate PASS or FAIL and provide brief explanations and recommendations.\n",
"\"\"\"\n",
"\n",
" # extract from contract\n",
" prompt = ChatPromptTemplate.from_messages([(\"user\", prompt_str)])\n",
"\n",
" # Call the LLM to generate the report using the prompt\n",
" report_output = await llm.astructured_predict(\n",
" ComparisonReportOutput,\n",
" prompt,\n",
" requirements_text=ev.requirements_text,\n",
" datasheet_content=str(datasheet_content),\n",
" )\n",
" ctx.write_event_to_stream(LogEvent(msg=\"Comparison report generated.\"))\n",
" return StopEvent(\n",
" result={\"report\": report_output, \"datasheet_content\": datasheet_content}\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "d205f532-1a11-4a48-b5a8-87a7f85e9ce7",
"metadata": {},
"source": [
"## Running the Workflow\n",
"\n",
"Below, we instantiate and run the workflow. We inject the design requirements as a text blob (no custom code to load) and pass the path to the solar panel datasheet (the HoneyM datasheet from Trina).\n",
"\n",
"The design requirements are:\n",
"\n",
"```\n",
"Solar Panel Design Requirements:\n",
"- Power Output Range: ≥ 350 W\n",
"- Maximum Efficiency: ≥ 18%\n",
"- Certifications: Must include IEC61215 and UL1703\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6b24fa61-a2f5-4ebb-84eb-1c9b48683b1b",
"metadata": {},
"outputs": [],
"source": [
"import nest_asyncio\n",
"\n",
"nest_asyncio.apply()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a68bdffd-ac3c-4dcc-ba35-65939c2a6bfe",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running step parse_datasheet\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Uploading files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.17s/it]\n",
"Creating extraction jobs: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.07it/s]\n",
"Extracting files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:28<00:00, 88.39s/it]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Step parse_datasheet produced event DatasheetParseEvent\n",
"Running step load_requirements\n",
"Step load_requirements produced event RequirementsLoadEvent\n",
"Running step generate_comparison_report\n",
"Step generate_comparison_report produced event StopEvent\n",
"\n",
"********Final Comparison Report:********\n",
" component_name='TSM-DE08M.08(II)' meets_requirements=True summary='The solar panel TSM-DE08M.08(II) meets all the specified design requirements, making it a suitable choice for the intended application.' details={'Maximum Power Output': \"PASS - The panel's power output ranges from 360 W to 385 W, exceeding the minimum requirement of 350 W.\", 'Open-Circuit Voltage': 'PASS - The datasheet does not specify Voc, but it is assumed to be within the required range based on other compliant parameters.', 'Short-Circuit Current': 'PASS - The datasheet does not specify Isc, but it is assumed to be within the required range based on other compliant parameters.', 'Efficiency': \"PASS - The panel's efficiency is 21.0%, which is above the minimum requirement of 18%.\", 'Temperature Coefficient': 'PASS - The temperature coefficient is -0.34%/°C, which is better than the maximum allowable -0.5%/°C.'}\n",
"\n",
"********Datasheet Content:********\n",
" {'specs': [{'module_name': 'TSM-DE08M.08(II)', 'power_output': {'min_power': 360.0, 'max_power': 385.0, 'unit': 'W'}, 'maximum_efficiency': 21.0, 'temperature_coefficient': -0.34, 'certifications': ['IEC61215/IEC61730/UL1703', 'IEC61701: Salt Mist Corrosion', 'IEC62716: Ammonia Corrosion', 'IEC60068: Blowing Sand', 'ISO9001', 'ISO14001', 'ISO45001', 'ISO14064'], 'page_citations': {}}]}\n"
]
}
],
"source": [
"# Path to design requirements file (e.g., a text file with design criteria for solar panels)\n",
"requirements_path = \"./data/solar_panel_e2e_comparison/design_reqs.txt\"\n",
"\n",
"# Instantiate the workflow\n",
"workflow = SolarPanelComparisonWorkflow(\n",
" agent=agent, requirements_path=requirements_path, verbose=True, timeout=120\n",
")\n",
"\n",
"# Run the workflow; pass the datasheet path in the StartEvent\n",
"result = await workflow.run(\n",
" datasheet_path=\"./data/solar_panel_e2e_comparison/datasheet.pdf\"\n",
")\n",
"print(\"\\n********Final Comparison Report:********\\n\", result[\"report\"])\n",
"print(\"\\n********Datasheet Content:********\\n\", result[\"datasheet_content\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llama_parse",
"language": "python",
"name": "llama_parse"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}