cr

1 changed files with 452 additions and 0 deletions
@@ -0,0 +1,452 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "00f6713b-2a32-4f8f-80e5-9a7d9b6e3b90",
+   "metadata": {},
+   "source": [
+    "# Solar Panel Datasheet Comparison Workflow\n",
+    "\n",
+    "<a href=\"https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/extract/solar_panel_e2e_comparison.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
+    "\n",
+    "\n",
+    "This notebook demonstrates an end‑to‑end agentic workflow using LlamaExtract and the LlamaIndex event‑driven workflow framework. In this workflow, we:\n",
+    "\n",
+    "1. **Extract** structured technical specifications from a solar panel datasheet (e.g. a PDF downloaded from a vendor).\n",
+    "2. **Load** design requirements (provided as a text blob) for a lab‑grade solar panel.\n",
+    "3. **Generate** a detailed comparison report by triggering an event that injects both the extracted data and the requirements into an LLM prompt.\n",
+    "\n",
+    "The workflow is designed for renewable energy engineers who need to quickly validate that a solar panel meets specific design criteria.\n",
+    "\n",
+    "This notebook uses the event‑driven workflow syntax (custom events, steps, and a workflow class) adapted from the technical datasheet and contract review examples."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "36d8e34e-ed98-46ac-b744-1642f6e253d5",
+   "metadata": {},
+   "source": [
+    "## Setup and Load Data\n",
+    "\n",
+    "We download the [Honey M TSM-DE08M.08(II) datasheet](https://static.trinasolar.com/sites/default/files/EU_Datasheet_HoneyM_DE08M.08%28II%29_2021_A.pdf) as a PDF.\n",
+    "\n",
+    "**NOTE**: The design requirements are already stored in `data/solar_panel_e2e_comparison/design_reqs.txt`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1de7b1b3-c285-492c-8b2e-b37974b4fc63",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "--2025-04-01 14:47:56--  https://static.trinasolar.com/sites/default/files/EU_Datasheet_HoneyM_DE08M.08%28II%29_2021_A.pdf\n",
+      "Resolving static.trinasolar.com (static.trinasolar.com)... 47.246.23.232, 47.246.23.234, 47.246.23.227, ...\n",
+      "Connecting to static.trinasolar.com (static.trinasolar.com)|47.246.23.232|:443... connected.\n",
+      "WARNING: cannot verify static.trinasolar.com's certificate, issued by ‘CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1,O=DigiCert Inc,C=US’:\n",
+      "  Unable to locally verify the issuer's authority.\n",
+      "HTTP request sent, awaiting response... 200 OK\n",
+      "Length: 1888183 (1.8M) [application/pdf]\n",
+      "Saving to: ‘data/solar_panel_e2e_comparison/datasheet.pdf’\n",
+      "\n",
+      "data/solar_panel_e2 100%[===================>]   1.80M  7.47MB/s    in 0.2s    \n",
+      "\n",
+      "2025-04-01 14:47:56 (7.47 MB/s) - ‘data/solar_panel_e2e_comparison/datasheet.pdf’ saved [1888183/1888183]\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!wget https://static.trinasolar.com/sites/default/files/EU_Datasheet_HoneyM_DE08M.08%28II%29_2021_A.pdf -O data/solar_panel_e2e_comparison/datasheet.pdf --no-check-certificate"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "89d2f4c9-f785-424d-a409-3381796c457c",
+   "metadata": {},
+   "source": [
+    "## Define the Structured Extraction Schema\n",
+    "\n",
+    "We define a schema called `SolarPanelSchema` to capture key technical details from the datasheet. It is built from two nested models:\n",
+    "\n",
+    "- **PowerRange:** Minimum and maximum power output (in Watts).\n",
+    "- **SolarPanelSpec:** Module name, power output range, maximum efficiency, temperature coefficient, certifications, and a mapping of page citations.\n",
+    "\n",
+    "`SolarPanelSchema` wraps a list of `SolarPanelSpec` entries. It replaces the LM317 schema from the earlier technical datasheet example and is used when creating our extraction agent."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bfb40d48-36e0-4b1c-97a1-32a1704c582b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pydantic import BaseModel, Field\n",
+    "from typing import List\n",
+    "\n",
+    "\n",
+    "class PowerRange(BaseModel):\n",
+    "    min_power: float = Field(..., description=\"Minimum power output in Watts\")\n",
+    "    max_power: float = Field(..., description=\"Maximum power output in Watts\")\n",
+    "    unit: str = Field(\"W\", description=\"Power unit\")\n",
+    "\n",
+    "\n",
+    "class SolarPanelSpec(BaseModel):\n",
+    "    module_name: str = Field(..., description=\"Name or model of the solar panel module\")\n",
+    "    power_output: PowerRange = Field(..., description=\"Power output range\")\n",
+    "    maximum_efficiency: float = Field(\n",
+    "        ..., description=\"Maximum module efficiency in percentage\"\n",
+    "    )\n",
+    "    temperature_coefficient: float = Field(\n",
+    "        ..., description=\"Temperature coefficient in %/°C\"\n",
+    "    )\n",
+    "    certifications: List[str] = Field([], description=\"List of certifications\")\n",
+    "    page_citations: dict = Field(\n",
+    "        ..., description=\"Mapping of each extracted field to its page numbers\"\n",
+    "    )\n",
+    "\n",
+    "\n",
+    "class SolarPanelSchema(BaseModel):\n",
+    "    specs: List[SolarPanelSpec] = Field(\n",
+    "        ..., description=\"List of extracted solar panel specifications\"\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "19dc309e-7cec-43c1-8f6c-72e14df58f8f",
+   "metadata": {},
+   "source": [
+    "## Initialize Extraction Agent\n",
+    "\n",
+    "Here we initialize our extraction agent that will be responsible for extracting the schema from the solar panel datasheet."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c9d9f4a2-2e14-493d-8a7e-d01159d38b8f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dotenv import load_dotenv\n",
+    "from llama_cloud_services import LlamaExtract\n",
+    "from llama_cloud.core.api_error import ApiError\n",
+    "from llama_cloud import ExtractConfig\n",
+    "\n",
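+    "# Load API keys (e.g. LLAMA_CLOUD_API_KEY, OPENAI_API_KEY) from a local .env file, if one exists\n",
+    "load_dotenv()\n",
+    "\n",
+    "# Initialize the LlamaExtract client; replace the IDs below with your own project and organization\n",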
+    "llama_extract = LlamaExtract(\n",
+    "    project_id=\"2fef999e-1073-40e6-aeb3-1f3c0e64d99b\",\n",
+    "    organization_id=\"43b88c8f-e488-46f6-9013-698e3d2e374a\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ec0eb2a7-6e02-45da-a6af-227e2f7c81f2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "try:\n",
+    "    existing_agent = llama_extract.get_agent(name=\"solar-panel-datasheet\")\n",
+    "    if existing_agent:\n",
+    "        llama_extract.delete_agent(existing_agent.id)\n",
+    "except ApiError as e:\n",
+    "    if e.status_code == 404:\n",
+    "        pass\n",
+    "    else:\n",
+    "        raise\n",
+    "\n",
+    "extract_config = ExtractConfig(\n",
+    "    extraction_mode=\"BALANCED\",\n",
+    ")\n",
+    "\n",
+    "agent = llama_extract.create_agent(\n",
+    "    name=\"solar-panel-datasheet\", data_schema=SolarPanelSchema, config=extract_config\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b4d7bb60-0456-4a2d-8d48-14f9bb3e71d2",
+   "metadata": {},
+   "source": [
+    "## Workflow Overview\n",
+    "\n",
+    "The workflow consists of three main steps:\n",
+    "\n",
+    "1. **parse_datasheet:** Sends the solar panel datasheet (PDF) to the extraction agent and stores the structured data it returns (following `SolarPanelSchema`) in the workflow context.\n",
+    "2. **load_requirements:** Passes along the design requirements (read as a text blob when the workflow is constructed) so they can be injected into the prompt.\n",
+    "3. **generate_comparison_report:** Constructs a prompt using the extracted datasheet content and design requirements, calls the LLM to generate a structured comparison report, and returns the report together with the extracted data as the workflow’s result.\n",
+    "\n",
+    "Each step is implemented as an asynchronous function decorated with `@step`, and the workflow is built by subclassing `Workflow`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7c482e3a-66b4-4e1b-8d2d-9a9c6b3967f3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_index.core.workflow import (\n",
+    "    Event,\n",
+    "    StartEvent,\n",
+    "    StopEvent,\n",
+    "    Context,\n",
+    "    Workflow,\n",
+    "    step,\n",
+    ")\n",
+    "from llama_index.llms.openai import OpenAI\n",
+    "from llama_index.core.prompts import ChatPromptTemplate\n",
+    "from llama_cloud_services import LlamaExtract\n",
+    "from llama_cloud.core.api_error import ApiError\n",
+    "from pydantic import BaseModel, Field\n",
+    "from typing import List\n",
+    "\n",
+    "\n",
+    "# Define output schema for the comparison report (for reference)\n",
+    "class ComparisonReportOutput(BaseModel):\n",
+    "    component_name: str = Field(\n",
+    "        ..., description=\"The name of the component being evaluated.\"\n",
+    "    )\n",
+    "    meets_requirements: bool = Field(\n",
+    "        ...,\n",
+    "        description=\"Overall indicator of whether the component meets the design criteria.\",\n",
+    "    )\n",
+    "    summary: str = Field(..., description=\"A brief summary of the evaluation results.\")\n",
+    "    details: dict = Field(\n",
+    "        ..., description=\"Detailed comparisons for each key parameter.\"\n",
+    "    )\n",
+    "\n",
+    "\n",
+    "# Define custom events\n",
+    "\n",
+    "\n",
+    "class DatasheetParseEvent(Event):\n",
+    "    datasheet_content: dict\n",
+    "\n",
+    "\n",
+    "class RequirementsLoadEvent(Event):\n",
+    "    requirements_text: str\n",
+    "\n",
+    "\n",
+    "class ComparisonReportEvent(Event):\n",
+    "    report: ComparisonReportOutput\n",
+    "\n",
+    "\n",
+    "class LogEvent(Event):\n",
+    "    msg: str\n",
+    "    delta: bool = False\n",
+    "\n",
+    "\n",
+    "# LlamaExtract extracts structured data from the datasheet (see SolarPanelSchema above).\n",
+    "# We use OpenAI (via LlamaIndex) as the LLM that generates the report.\n",
+    "\n",
+    "llm = OpenAI(model=\"gpt-4o\")  # or your preferred model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "67a0c391-c7f5-4b93-8d6b-9e31b2d7a817",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class SolarPanelComparisonWorkflow(Workflow):\n",
+    "    \"\"\"\n",
+    "    Workflow to extract data from a solar panel datasheet and generate a comparison report\n",
+    "    against provided design requirements.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    def __init__(self, agent: LlamaExtract, requirements_path: str, **kwargs):\n",
+    "        super().__init__(**kwargs)\n",
+    "        self.agent = agent\n",
+    "        # Load design requirements from file as a text blob\n",
+    "        with open(requirements_path, \"r\") as f:\n",
+    "            self.requirements_text = f.read()\n",
+    "\n",
+    "    @step\n",
+    "    async def parse_datasheet(\n",
+    "        self, ctx: Context, ev: StartEvent\n",
+    "    ) -> DatasheetParseEvent:\n",
+    "        # datasheet_path is provided in the StartEvent\n",
+    "        # e.g., \"./data/solar_panel_e2e_comparison/datasheet.pdf\"\n",
+    "        datasheet_path = ev.datasheet_path\n",
+    "        extraction_result = await self.agent.aextract(datasheet_path)\n",
+    "        # Extracted data as a dict that follows SolarPanelSchema\n",
+    "        datasheet_dict = extraction_result.data\n",
+    "        await ctx.set(\"datasheet_content\", datasheet_dict)\n",
+    "        ctx.write_event_to_stream(LogEvent(msg=\"Datasheet parsed successfully.\"))\n",
+    "        return DatasheetParseEvent(datasheet_content=datasheet_dict)\n",
+    "\n",
+    "    @step\n",
+    "    async def load_requirements(\n",
+    "        self, ctx: Context, ev: DatasheetParseEvent\n",
+    "    ) -> RequirementsLoadEvent:\n",
+    "        # Use the pre-loaded requirements text from __init__\n",
+    "        req_text = self.requirements_text\n",
+    "        ctx.write_event_to_stream(LogEvent(msg=\"Design requirements loaded.\"))\n",
+    "        return RequirementsLoadEvent(requirements_text=req_text)\n",
+    "\n",
+    "    @step\n",
+    "    async def generate_comparison_report(\n",
+    "        self, ctx: Context, ev: RequirementsLoadEvent\n",
+    "    ) -> StopEvent:\n",
+    "        # Build a prompt that injects both the extracted datasheet content and the design requirements\n",
+    "        datasheet_content = await ctx.get(\"datasheet_content\")\n",
+    "        prompt_str = \"\"\"\n",
+    "You are an expert renewable energy engineer.\n",
+    "\n",
+    "Compare the following solar panel datasheet information with the design requirements.\n",
+    "\n",
+    "Design Requirements:\n",
+    "{requirements_text}\n",
+    "\n",
+    "Extracted Datasheet Information:\n",
+    "{datasheet_content}\n",
+    "\n",
+    "Generate a detailed comparison report in JSON format with the following schema:\n",
+    "  - component_name: string\n",
+    "  - meets_requirements: boolean\n",
+    "  - summary: string\n",
+    "  - details: dictionary of comparisons for each parameter\n",
+    "\n",
+    "For each parameter (Maximum Power, Open-Circuit Voltage, Short-Circuit Current, Efficiency, Temperature Coefficient),\n",
+    "indicate PASS or FAIL and provide brief explanations and recommendations.\n",
+    "\"\"\"\n",
+    "\n",
+    "        # Wrap the prompt string in a chat prompt template\n",
+    "        prompt = ChatPromptTemplate.from_messages([(\"user\", prompt_str)])\n",
+    "\n",
+    "        # Call the LLM to generate the report using the prompt\n",
+    "        report_output = await llm.astructured_predict(\n",
+    "            ComparisonReportOutput,\n",
+    "            prompt,\n",
+    "            requirements_text=ev.requirements_text,\n",
+    "            datasheet_content=str(datasheet_content),\n",
+    "        )\n",
+    "        ctx.write_event_to_stream(LogEvent(msg=\"Comparison report generated.\"))\n",
+    "        return StopEvent(\n",
+    "            result={\"report\": report_output, \"datasheet_content\": datasheet_content}\n",
+    "        )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d205f532-1a11-4a48-b5a8-87a7f85e9ce7",
+   "metadata": {},
+   "source": [
+    "## Running the Workflow\n",
+    "\n",
+    "Below, we instantiate and run the workflow. The design requirements are read from a text file as a plain text blob (no custom parsing code), and the path to the solar panel datasheet (the Honey M datasheet from Trina Solar) is passed in when the workflow runs.\n",
+    "\n",
+    "The design requirements are:\n",
+    "\n",
+    "```\n",
+    "Solar Panel Design Requirements:\n",
+    "- Power Output Range: ≥ 350 W\n",
+    "- Maximum Efficiency: ≥ 18%\n",
+    "- Certifications: Must include IEC61215 and UL1703\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6b24fa61-a2f5-4ebb-84eb-1c9b48683b1b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import nest_asyncio\n",
+    "\n",
+    "nest_asyncio.apply()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a68bdffd-ac3c-4dcc-ba35-65939c2a6bfe",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Running step parse_datasheet\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Uploading files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.17s/it]\n",
+      "Creating extraction jobs: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.07it/s]\n",
+      "Extracting files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:28<00:00, 88.39s/it]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Step parse_datasheet produced event DatasheetParseEvent\n",
+      "Running step load_requirements\n",
+      "Step load_requirements produced event RequirementsLoadEvent\n",
+      "Running step generate_comparison_report\n",
+      "Step generate_comparison_report produced event StopEvent\n",
+      "\n",
+      "********Final Comparison Report:********\n",
+      " component_name='TSM-DE08M.08(II)' meets_requirements=True summary='The solar panel TSM-DE08M.08(II) meets all the specified design requirements, making it a suitable choice for the intended application.' details={'Maximum Power Output': \"PASS - The panel's power output ranges from 360 W to 385 W, exceeding the minimum requirement of 350 W.\", 'Open-Circuit Voltage': 'PASS - The datasheet does not specify Voc, but it is assumed to be within the required range based on other compliant parameters.', 'Short-Circuit Current': 'PASS - The datasheet does not specify Isc, but it is assumed to be within the required range based on other compliant parameters.', 'Efficiency': \"PASS - The panel's efficiency is 21.0%, which is above the minimum requirement of 18%.\", 'Temperature Coefficient': 'PASS - The temperature coefficient is -0.34%/°C, which is better than the maximum allowable -0.5%/°C.'}\n",
+      "\n",
+      "********Datasheet Content:********\n",
+      " {'specs': [{'module_name': 'TSM-DE08M.08(II)', 'power_output': {'min_power': 360.0, 'max_power': 385.0, 'unit': 'W'}, 'maximum_efficiency': 21.0, 'temperature_coefficient': -0.34, 'certifications': ['IEC61215/IEC61730/UL1703', 'IEC61701: Salt Mist Corrosion', 'IEC62716: Ammonia Corrosion', 'IEC60068: Blowing Sand', 'ISO9001', 'ISO14001', 'ISO45001', 'ISO14064'], 'page_citations': {}}]}\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Path to design requirements file (e.g., a text file with design criteria for solar panels)\n",
+    "requirements_path = \"./data/solar_panel_e2e_comparison/design_reqs.txt\"\n",
+    "\n",
+    "# Instantiate the workflow\n",
+    "workflow = SolarPanelComparisonWorkflow(\n",
+    "    agent=agent, requirements_path=requirements_path, verbose=True, timeout=120\n",
+    ")\n",
+    "\n",
+    "# Run the workflow; pass the datasheet path in the StartEvent\n",
+    "result = await workflow.run(\n",
+    "    datasheet_path=\"./data/solar_panel_e2e_comparison/datasheet.pdf\"\n",
+    ")\n",
+    "print(\"\\n********Final Comparison Report:********\\n\", result[\"report\"])\n",
+    "print(\"\\n********Datasheet Content:********\\n\", result[\"datasheet_content\"])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "llama_parse",
+   "language": "python",
+   "name": "llama_parse"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
Author	SHA1	Message	Date
Jerry Liu	5d5ff51eb2	cr	2025-04-01 20:21:23 -07:00
Jerry Liu	298ea964cf	cr	2025-04-01 16:50:13 -07:00
Jerry Liu	d9ae5ea3c7	cr	2025-04-01 16:46:26 -07:00